Thursday, 2 February 2017

Data Mining Introduction

Data Mining Introduction


We have been "manually" extracting data in relation to the patterns they form for many years but as the volume of data and the varied sources from which we obtain it grow a more automatic approach is required.

The cause and solution to this increase in data to be processed has been because the increasing power of computer technology has increased data collection and storage. Direct hands-on data analysis has increasingly been supplemented, or even replaced entirely, by indirect, automatic data processing. Data mining is the process uncovering hidden data patterns and has been used by businesses, scientists and governments for years to produce market research reports. A primary use for data mining is to analyse patterns of behaviour.

It can be easily be divided into stages


Once the objective for the data that has been deemed to be useful and able to be interpreted is known, a target data set has to be assembled. Logically data mining can only discover data patterns that already exist in the collected data, therefore the target dataset must be able to contain these patterns but small enough to be able to succeed in its objective within an acceptable time frame.

The target set then has to be cleansed. This removes sources that have noise and missing data.

The clean data is then reduced into feature vectors,(a summarized version of the raw data source) at a rate of one vector per source. The feature vectors are then split into two sets, a "training set" and a "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.

Data mining

Data mining commonly involves four classes of task:

Classification - Arranges the data into predefined groups. For example email could be classified as legitimate or spam.
Clustering - Arranges data in groups defined by algorithms that attempt to group similar items together
Regression - Attempts to find a function which models the data with the least error.
Association rule learning - Searches for relationships between variables. Often used in supermarkets to work out what products are frequently bought together. This information can then be used for marketing purposes.

Validation of Results

The final stage is to verify that the patterns produced by the data mining algorithms occur in the wider data set as not all patterns found by the data mining algorithms are necessarily valid.

If the patterns do not meet the required standards, then the preprocessing and data mining stages have to be re-evaluated. When the patterns meet the required standards then these patterns can be turned into knowledge.

Source :

No comments:

Post a Comment