Introduction to Datamining Concept and Techniques

Introduction to Datamining using Practical View Created : Ngô Tùng Sơn Part 1

Introduction to Datamining using Practical View

Created : Ngô Tùng Sơn

Part 1

Schedule:

1. Example of Datamining

2. What and Where is Datamining in the System

3. Datamining Techniques

• Data preprocessing

• Data Analysis

• Data Visualization

1. Example of Datamining

What does the data look like?

X  Y
3  3
3  1
2  2
4  6
2  3
6  7
7  5
5  6

Can we get something from this?

Each row represents an object, and the columns represent its attributes.

Ex: Can we identify the groups of these objects? YES

Now forget the table and consider each row as a point. Then we have:

[Figure: scatter plot of the points on X and Y axes (0 to 8), with points A, B, and C labeled]

From each data point, we find its neighbors by scanning with a radius r. For example, A has 2 neighbors, B and C, denoted A{B,C}.

[Figure: the scan radius r drawn on the plot, with a further point D labeled]

A and D have the same neighbors, so they are considered neighbors of each other.

The same goes for B{A,B,C,D}, C{A,B,C,D}, and D{B,C}.

Points that share a neighborhood are placed in the same group.
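The radius scan can be sketched in a few lines of Python. This is a simplified variant, not necessarily the exact procedure on the slide: two points are linked when they lie within r of each other, and each group is a connected set of linked points. The function name `radius_groups`, the radius 2.5, and the sample coordinates (an assumed reading of the X/Y table on the first slide) are all illustrative assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

def radius_groups(points, r):
    """Label each point with a group id; points within r of each other are linked."""
    n = len(points)
    # Neighbor list of every point: all other points inside the scan radius.
    neighbors = {
        i: [j for j in range(n) if j != i and dist(points[i], points[j]) <= r]
        for i in range(n)
    }
    labels = [None] * n
    group = 0
    for start in range(n):
        if labels[start] is not None:
            continue  # already reached from an earlier point
        labels[start] = group
        stack = [start]
        while stack:  # spread the group id through chains of neighbors
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] is None:
                    labels[j] = group
                    stack.append(j)
        group += 1
    return labels

# Assumed sample data and radius, for illustration only.
points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
labels = radius_groups(points, r=2.5)
print(labels)
```

With a radius that is too small, some points end up alone; with one that is too large, everything merges into a single group.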

Finally, after considering all points, we have 2 groups.

[Figure: the same scatter plot on X and Y axes (0 to 8), with the points separated into 2 groups]

What do we see here?

The data was not classified into groups beforehand, but we now have the groups.

This is just one example of a technique called CLUSTERING in DATAMINING.

2. What and Where is Datamining in the System

So, what exactly is datamining?

Datamining is a set of tools and techniques for retrieving hidden knowledge/rules from data.

The name "datamining" can be misleading: the data is already there, so we do not need to "mine" it.

For ore mining you need hammers and shovels.

For datamining, however, you need mathematics, statistics and probability, machine learning, computer programming, database techniques, ...

Where is Datamining in the system?

Employee/Staff: Day by day, the staff use the software (Web/Desktop/Mobile application) to generate data by recording all of their business activities (customers, products, order details, contracts, …).

Database (Online transaction processing, OLTP): The data is added to a database.

Integration Service: Data from several data sources (OLTP databases) is collected into a common repository, the data warehouse.

Data Mining: The datamining service accesses the data warehouse to process the data.

3. Datamining Techniques

What are the techniques in Datamining?

Many techniques can be applied in datamining.

Basically, we can classify them into 3 groups/phases:

Data-Preprocessing

Data Analysis

Data Presentation

Data-Preprocessing

The quality of collected data is often poor, so it is necessary to clean, format, and transform it before analysis.

This is a very important process, and it is hard to describe in an abstract way.

Here we will see a few examples of data pre-processing techniques:

• Similarity Measure

• Down Sampling

• Dimension Reduction

• Vectorization

Data-Preprocessing Similarity Measure

How can we know which objects are similar?

[Figure: points A(x1,y1), B(x2,y2), and C(x3,y3); D1 is the distance AB and D2 is the distance AC]

Measure the distances AB (D1) and AC (D2).

We see that D1 < D2 -> A is more similar to B than to C.

Every point can also be represented as a vector. Measure the angle between each pair of vectors: α between A and B, then β between A and C.

We see that α < β -> A is more similar to B than to C.
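Both measures are easy to compute. The sketch below uses made-up coordinates for A, B, and C (they are not taken from the slide) and a hypothetical helper `angle_between`; it only illustrates the comparison of distances and angles.

```python
from math import acos, dist, sqrt

def angle_between(u, v):
    """Angle in radians between two vectors, from the cosine formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against rounding just outside the domain of acos.
    return acos(max(-1.0, min(1.0, cosine)))

# Illustrative coordinates only.
A = (2.0, 2.0)
B = (3.0, 2.5)
C = (1.0, 5.0)

d1 = dist(A, B)              # distance AB
d2 = dist(A, C)              # distance AC
alpha = angle_between(A, B)  # angle between vectors A and B
beta = angle_between(A, C)   # angle between vectors A and C

print(d1 < d2)       # smaller distance: A is closer to B
print(alpha < beta)  # smaller angle: A points in nearly the same direction as B
```

The two measures do not always agree: distance is sensitive to the length of the vectors, while the angle only compares their directions.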

Data-Preprocessing Down Sampling

What if you have so much data that performing data analysis on all of it is unnecessary and hurts performance?

Just pick some of the objects to evaluate.

Example: use a grid with cell size g and keep only one object per cell.

[Figure: original data and down-sampled data on a grid with cell size g]
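A minimal sketch of grid-based down sampling, assuming that one object is kept per cell and that the kept object is simply the first one seen in that cell (the slide does not say which object to keep). The function name `grid_downsample` and the sample points are hypothetical.

```python
def grid_downsample(points, g):
    """Keep the first point that falls into each g-by-g grid cell."""
    kept = {}
    for x, y in points:
        cell = (int(x // g), int(y // g))  # index of the cell containing the point
        kept.setdefault(cell, (x, y))      # only the first point per cell is stored
    return list(kept.values())

# Illustrative data: five points, three occupied cells when g = 1.0.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.1), (1.9, 0.8), (0.5, 1.5)]
sample = grid_downsample(points, g=1.0)
print(sample)
```

A larger g removes more points; a smaller g keeps the sample closer to the original data.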

Data-Preprocessing Dimension Reduction

All the example data presented so far has 2 dimensions, that is, 2 attributes (X, Y). What if each object had ~10,000 attributes?

This could reduce the performance (and/or accuracy) of data-analysis algorithms, so we need to reduce the number of dimensions.

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two of the most effective methods for this.

Data-Preprocessing Dimension Reduction - PCA

[Figure: original data on X and Y axes, and the same data projected onto the principal components P1 and P2]

We only keep the principal components that have the highest eigenvalues. In the example above, if P1 has the larger eigenvalue, we can keep P1 instead of both P1 and P2.

In this way, the number of dimensions is reduced.
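For 2-dimensional data the principal axis can be found in closed form, which makes a small self-contained sketch possible. The code below is only an illustration for the 2-D case; real data with many attributes would normally go through a linear algebra library. The function name and the sample points are assumptions.

```python
from math import atan2, cos, sin

def first_principal_component(points):
    """Return (direction, scores): the unit vector of the principal axis with
    the largest eigenvalue, and the coordinate of every point along that axis."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * atan2(2 * cov_xy, var_x - var_y)
    direction = (cos(theta), sin(theta))
    # Project the centered points onto the axis: 2 attributes become 1.
    scores = [
        (x - mean_x) * direction[0] + (y - mean_y) * direction[1]
        for x, y in points
    ]
    return direction, scores

# Illustrative data lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, scores = first_principal_component(points)
print(direction)
print(scores)
```

Because these sample points lie on a single line, the one kept coordinate preserves all of their variation; with real data some information is lost when the second component is dropped.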

Data-Preprocessing Vectorization

Most data analysis algorithms take a set of vectors as input, so we need to transform the collected data into a set of vectors.

Ex: Given a document: “Mr A has not passed the exam this year. He will do it again next year”

Some important words are extracted, such as “Mr A”, “not”, “pass”, “exam”, “again”, “next”, “year”.

Measuring the frequency of each word, we get the vector that represents the document:

Mr A   not   pass   exam   again   next   year
1      1     1      1      1       1      2
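The word counting can be sketched as follows. The vocabulary is the word list from the example; matching each term as a word prefix (so that "pass" also counts "passed") is a crude stand-in for proper stemming and is an assumption of this sketch, as is the function name `vectorize`.

```python
import re

def vectorize(text, vocabulary):
    """Count how often each vocabulary term starts a word in the text."""
    lowered = text.lower()
    return [
        # \b anchors the term at the start of a word; no anchor at the end,
        # so inflected forms such as "passed" still match "pass".
        len(re.findall(r"\b" + re.escape(term), lowered))
        for term in vocabulary
    ]

document = "Mr A has not passed the exam this year. He will do it again next year"
vocabulary = ["mr a", "not", "pass", "exam", "again", "next", "year"]
vector = vectorize(document, vocabulary)
print(vector)
```

Prefix matching can over-count on other texts (for example, "not" would also match "nothing"), so a real pipeline would use a tokenizer and a stemmer instead.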

Data Analysis

There are many techniques in this phase:

• Clustering

• Classification

• Regression

• Rule Base

• ….

This is the most important phase, where we find the hidden knowledge/rules in the data.

Data Analysis Clustering

Clustering is the process of grouping objects into groups (clusters).

Objects in the same cluster are similar to each other; objects in different clusters are not.

There are 2 types of clustering: Partitional & Hierarchical.

In this presentation we see an example of the most famous clustering method: K-Means.

Data Analysis Clustering – K mean Algorithm

1. Randomly select K centers (centroids) for the K clusters.

2. Calculate the distance from each object to the K centroids.

3. Assign each object to the group of its nearest centroid.

4. Compute the new centroid of each group.

5. Repeat from step 2 until the groups no longer change.
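The five steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation; the function name `kmeans`, its `init` parameter, and the sample points are assumptions. Step 1 is random by default, but the example passes fixed starting centroids so that the run is reproducible.

```python
import random
from math import dist

def kmeans(points, k, init=None, max_iter=100):
    """Return (labels, centroids) after running the K-Means loop."""
    # Step 1: pick K starting centroids (random unless the caller fixes them).
    centroids = list(init) if init is not None else random.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 5: stop when the groups no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its group.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty group keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Illustrative data: two well-separated squares of points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
labels, centroids = kmeans(points, k=2, init=[(1, 1), (9, 9)])
print(labels)
print(centroids)
```

The result depends on the starting centroids: an unlucky random start can converge to a poor grouping, which is why implementations often run the algorithm several times.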

Consider the data below.

Plotting it, we have:

Select K = 2 centroids.

Compute the new positions of the centroids.

Finally, the centroids stop changing.

Each object belongs to the group of its closest centroid.

The key point of the algorithm is selecting a good K.

Data Analysis Classification

How can we identify the group of an unclassified object?

Sure, we can perform clustering to do this.

However, what if we already know some objects that were classified in the past? Can we do better than clustering? YES.

We can construct a prediction model that predicts the group of unclassified objects based on the classified objects.

This process is called CLASSIFICATION.

The process of classification can be described as follows:

[Figure: the classification process, with the labels Learning Algorithm and Model]
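To make the learn-then-predict flow concrete, here is a small sketch that uses a nearest-centroid classifier as the learning algorithm. That choice is made purely for brevity and is not a method named in the slides; the function names `train` and `predict` and the red/blue sample data are assumptions.

```python
from math import dist

def train(samples):
    """Learning algorithm: build a model (label -> centroid) from classified objects."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(sum(v) / len(rows) for v in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(model, features):
    """Use the model: return the label whose centroid is closest to the object."""
    return min(model, key=lambda label: dist(features, model[label]))

# Illustrative objects that were classified in the past.
training_data = [
    ((1, 1), "red"), ((2, 1), "red"),
    ((8, 9), "blue"), ((9, 8), "blue"),
]
model = train(training_data)

# Unclassified objects are assigned a group by the model.
print(predict(model, (2, 2)))
print(predict(model, (7, 8)))
```

The training data is only needed to build the model; afterwards each new object is classified from the model alone.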

Data Analysis Classification - SVM

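Support Vector Machine (SVM) is one of the best-known classification methods. It belongs to the group of linear classifiers.

For example, consider training data classified as red and blue.

[Figure: red and blue training data separated by a line; an unclassified point marked "?" is passed to the classification model]

The classification model is a separating line defined by:

• a normal vector

• a bias (the offset of the line from the origin)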
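Once the normal vector and bias are known, classifying a new point only requires checking on which side of the line it falls. The sketch below shows that decision rule only; it does not train an SVM. The symbol names `w` and `b`, their hand-picked values, and the mapping of sides to colors are assumptions for illustration.

```python
def classify(w, b, x):
    """Linear decision rule: the sign of w.x + b tells the side of the line."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "blue" if score >= 0 else "red"

# Hand-picked (not learned) parameters: the line x + y = 10.
w = (1.0, 1.0)   # normal vector of the separating line
b = -10.0        # bias

print(classify(w, b, (2, 3)))  # 2 + 3 - 10 < 0  -> red side
print(classify(w, b, (8, 7)))  # 8 + 7 - 10 >= 0 -> blue side
```

Training an SVM means choosing the normal vector and bias so that the line separates the two classes with the widest possible margin.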

Data Analysis Regression

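Regression is also used for prediction, but here we predict the missing value of an attribute. For example:

[Figure: data points plotted on X and Y axes, with a value x_i marked on the X axis and the corresponding y_i on the Y axis]

• How do we find y_i if x_i is known?

• We can estimate the line that describes the data.

• Plug x_i into the line equation to find y_i.

• This is just one example of Linear Regression.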
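Estimating the line and plugging in a known x can be sketched with ordinary least squares. The function name `fit_line` and the sample data are assumptions; the data was chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimate of slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Plug a known x into the line equation to predict the missing y.
x_i = 5
y_i = slope * x_i + intercept
print(slope, intercept, y_i)
```

With noisy data the points will not sit exactly on the line, and the prediction is the line's best estimate rather than an exact value.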

Data Analysis Rule Base

Rule Base techniques are used to find hidden patterns in the data.

Examples of rule base techniques:

• Frequent Pattern: customers who normally buy rice always buy vegetables.

• Gradual Pattern: young people want more expensive phones than others.

• Sequential Pattern: people always buy a laptop before buying a cell phone.
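For the frequent-pattern case, a rule such as "rice -> vegetable" is usually judged by its support (how often the items appear together) and its confidence (how often the rule holds when its left side appears). The sketch below computes both on a made-up list of purchases; the function names and the transactions are assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent is bought when the antecedent is bought."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Illustrative purchase records.
transactions = [
    {"rice", "vegetable"},
    {"rice", "vegetable", "egg"},
    {"rice", "vegetable"},
    {"bread"},
]

print(support(transactions, {"rice", "vegetable"}))       # 3 of 4 purchases
print(confidence(transactions, {"rice"}, {"vegetable"}))  # every rice buyer also bought vegetables
```

Gradual and sequential patterns need additional information (an ordering of attribute values or of purchases in time), so they are not covered by this sketch.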

Data Visualization

Techniques for presenting the knowledge you have retrieved to the user.

[Figure: chart of Series 1, Series 2, and Series 3 by category, with a value axis from 0 to 12]

             Series 1   Series 2   Series 3
Category 1   4.3        2.4        2
Category 2   2.5        4.4        2
Category 3   3.5        1.8        3
Category 4   4.5        2.8        5

Thank you for your attention