Data Mining


Page 1: Data Mining

Data Mining

Rajendra Akerkar

Page 2: Data Mining

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Is everything "data mining"?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs
- The aim: build computer programs that sift through databases automatically, seeking regularities or patterns

Page 3: Data Mining

Data Mining — What’s in a Name?

- Data Mining
- Knowledge Mining
- Knowledge Discovery in Databases
- Data Archaeology
- Data Dredging
- Database Mining
- Knowledge Extraction
- Data Pattern Processing
- Information Harvesting
- Siftware

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques.

Page 4: Data Mining

Definition

- Several definitions:
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Page 5: Data Mining

What is (not) Data Mining?

- What is data mining?
  - Noticing that certain surnames are more common in certain Indian states (Joshi, Patil, Kulkarni, … in the Pune area).
  - Grouping together similar documents returned by a search engine according to their context (e.g. Google Scholar, Amazon.com).
- What is not data mining?
  - Looking up a phone number in a phone directory.
  - Querying a Web search engine for information about "Pune".

Page 6: Data Mining

Origins of Data Mining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - enormity of data
  - high dimensionality of data
  - heterogeneous, distributed nature of data

[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.]

Page 7: Data Mining

Data Mining Tasks

- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.

Page 8: Data Mining

Data Mining Tasks...

- Classification [Predictive]: predicting an item's class
- Clustering [Descriptive]: finding clusters in the data
- Association Rule Discovery [Descriptive]: finding frequently co-occurring events
- Deviation/Anomaly Detection [Predictive]: finding changes

Page 9: Data Mining

Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Page 10: Data Mining

Classification Example

Training set (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

Test set (class to be predicted):

    Refund  Marital Status  Taxable Income  Cheat
    No      Single          75K             ?
    Yes     Married         50K             ?
    No      Married         150K            ?
    Yes     Divorced        90K             ?
    No      Single          40K             ?
    No      Married         80K             ?

Learn a classifier (model) from the training set and apply it to the test set.
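The slides do not name a particular implementation, so the following is only an illustrative sketch of how such a training/test split might be used in practice; it assumes pandas and scikit-learn are available, and the choice of a decision tree (as in the following slides) is an assumption.

```python
# Illustrative sketch only: the slides do not prescribe a particular learner.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # taxable income in K
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
    "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
    "Income":  [75, 50, 150, 90, 40, 80],
})

# One-hot encode the categorical attributes; align test columns with training columns.
X_train = pd.get_dummies(train.drop(columns="Cheat"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, train["Cheat"])
print(model.predict(X_test))    # predicted class labels for the six test records
```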

Page 11: Data Mining

Classification: Application 1

- Direct Marketing
  - Goal: reduce the cost of mailing by targeting the set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
    - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 12: Data Mining

Classification: Application 2

- Fraud Detection
  - Goal: predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes (when a customer buys, what they buy, how often they pay on time, etc.).
    - Label past transactions as fraudulent or fair. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Page 13: Data Mining

Classification: Application 3

- Customer Attrition/Churn
  - Goal: predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.).
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 14: Data Mining

Decision Tree

Page 15: Data Mining

Introduction

- A classification scheme which generates a tree and a set of rules from a given data set.
- The set of records available for developing classification methods is divided into two disjoint subsets: a training set and a test set.
- The attributes of the records are categorized into two types:
  - Attributes whose domain is numerical are called numerical attributes.
  - Attributes whose domain is not numerical are called categorical attributes.

Page 16: Data Mining

Introduction

- A decision tree is a tree with the following properties:
  - An inner node represents an attribute.
  - An edge represents a test on the attribute of the parent node.
  - A leaf represents one of the classes.
- Construction of a decision tree
  - Based on the training data
  - Top-down strategy

Page 17: Data Mining

Decision Tree Example

- The data set has five attributes.
- There is a special attribute: the attribute class is the class label.
- The attributes temp (temperature) and humidity are numerical attributes.
- The other attributes are categorical, that is, they cannot be ordered.
- Based on the training data set, we want to find a set of rules that tell us which values of outlook, temperature, humidity and wind determine whether or not to play golf.

Page 18: Data Mining

Decision Tree Example

- We have five leaf nodes.
- In a decision tree, each leaf node represents a rule.
- We have the following rules corresponding to the tree given in the figure:
  - RULE 1: If it is sunny and the humidity is not above 75%, then play.
  - RULE 2: If it is sunny and the humidity is above 75%, then do not play.
  - RULE 3: If it is overcast, then play.
  - RULE 4: If it is rainy and not windy, then play.
  - RULE 5: If it is rainy and windy, then don't play.

Page 19: Data Mining

Classification

- The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
- A record enters the tree at the root node.
- At the root, a test is applied to determine which child node the record will encounter next.
- This process is repeated until the record arrives at a leaf node.
- All the records that end up at a given leaf of the tree are classified in the same way.
- There is a unique path from the root to each leaf.
- The path is a rule which is used to classify the records.

Page 20: Data Mining

- In our tree, we can carry out the classification for an unknown record as follows.
- Let us assume that, for the record, we know the values of the first four attributes (but not the value of the class attribute):
- outlook = rain; temp = 70; humidity = 65; windy = true.

Page 21: Data Mining

- We start from the root node and check the value of the attribute associated with the root node.
- This attribute is the splitting attribute at this node.
- In a decision tree, every node has an associated attribute, called the splitting attribute.
- In our example, outlook is the splitting attribute at the root.
- Since, for the given record, outlook = rain, we move to the right-most child node of the root.
- At this node, the splitting attribute is windy, and for the record we want to classify, windy = true.
- Hence, we move to the left child node and conclude that the class label is "no play".
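As an illustration (not code from the slides), the five rules above can be written directly as a traversal function; the record from this walk-through comes out as "no play", as stated.

```python
# A rule-based traversal of the play-golf tree described by RULES 1-5 above
# (illustrative sketch only).

def classify(record):
    """Follow the path root -> leaf for one record and return its class."""
    if record["outlook"] == "sunny":                  # splitting attribute at the root
        return "play" if record["humidity"] <= 75 else "no play"
    if record["outlook"] == "overcast":
        return "play"
    return "no play" if record["windy"] else "play"   # outlook == "rain"

# The unknown record from this walk-through:
print(classify({"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}))  # no play
```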

Page 22: Data Mining

- The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.
- Consider RULE 1: if it is sunny and the humidity is not above 75%, then play.
- There are two records of the test data set satisfying outlook = sunny and humidity not above 75, and only one of these is correctly classified as play.
- Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is 0.66.

Page 23: Data Mining

Concept of Categorical Attributes

- Consider the following training data set.
- There are three attributes, namely age, pincode and class.
- The attribute class is used as the class label.
- The attribute age is a numeric attribute, whereas pincode is a categorical one.
- Though the domain of pincode is numeric, no ordering can be defined among pincode values: you cannot derive any useful information from the fact that one pincode is greater than another.

Page 24: Data Mining

- The figure gives a decision tree for the training data.
- The splitting attribute at the root is pincode and the splitting criterion here is pincode = 500 046.
- Similarly, for the left child node, the splitting criterion is age < 48 (the splitting attribute is age).
- Although the right child node has the same splitting attribute, the splitting criterion is different.
- At the root level, we have 9 records. The associated splitting criterion is pincode = 500 046.
- As a result, we split the records into two subsets: records 1, 2, 4, 8 and 9 go to the left child node and the rest to the right node.
- The process is repeated at every node.

Page 25: Data Mining

Advantages and Shortcomings of Decision Tree Classification

- A decision tree construction process is concerned with identifying the splitting attribute and splitting criterion at every level of the tree.
- Major strengths:
  - Decision trees are able to generate understandable rules.
  - They are able to handle both numerical and categorical attributes.
  - They provide a clear indication of which fields are most important for prediction or classification.
- Weaknesses:
  - The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be examined before its best split can be found.
  - Some decision tree algorithms can only deal with binary-valued target classes.

Page 26: Data Mining

Iterative Dichotomizer (ID3)

- Quinlan (1986)
- Each node corresponds to a splitting attribute.
- Each arc is a possible value of that attribute.
- At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root.
- Entropy is used to measure how informative a node is.
- The algorithm uses the criterion of information gain to determine the goodness of a split.
  - The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.

Page 27: Data Mining

Training Dataset

    age    income  student  credit_rating  buys_computer
    <=30   high    no       fair           no
    <=30   high    no       excellent      no
    31…40  high    no       fair           yes
    >40    medium  no       fair           yes
    >40    low     yes      fair           yes
    >40    low     yes      excellent      no
    31…40  low     yes      excellent      yes
    <=30   medium  no       fair           no
    <=30   low     yes      fair           yes
    >40    medium  yes      fair           yes
    <=30   medium  yes      excellent      yes
    31…40  medium  no       excellent      yes
    31…40  high    yes      fair           yes
    >40    medium  no       excellent      no

This follows an example from Quinlan's ID3. The class label attribute, buys_computer, has two distinct values, so there are two distinct classes (m = 2). Class C1 corresponds to yes and class C2 corresponds to no. There are 9 samples of class yes and 5 samples of class no.

Page 28: Data Mining

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

What are the rules?

Page 29: Data Mining

Solution (Rules)

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Page 30: Data Mining

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down recursive divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.

Page 31: Data Mining

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- S contains s_i tuples of class C_i for i = 1, …, m.
- The information (encoded in bits) required to classify an arbitrary tuple is

    I(s_1, s_2, …, s_m) = - Σ_{i=1}^{m} (s_i / s) log2(s_i / s)

- The entropy of attribute A with values {a_1, a_2, …, a_v} is

    E(A) = Σ_{j=1}^{v} ((s_{1j} + … + s_{mj}) / s) · I(s_{1j}, …, s_{mj})

- The information gained by branching on attribute A is

    Gain(A) = I(s_1, s_2, …, s_m) - E(A)

Page 32: Data Mining

Entropy

- Entropy measures the homogeneity (purity) of a set of examples.
- It gives the information content of the set in terms of the class labels of the examples.
- Consider a set of examples S with two classes, P and N. Let the set have p instances of class P and n instances of class N.
- The total number of instances is then t = p + n. The pair [p, n] can be seen as the class distribution of S.
- The entropy for S is defined as

    Entropy(S) = - (p/t) log2(p/t) - (n/t) log2(n/t)

- Example: let a set of examples consist of 9 instances of class positive and 5 instances of class negative.
- Answer: p = 9 and n = 5, so

    Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14)
               = -(0.64286)(-0.6375) - (0.35714)(-1.48557)
               = 0.40982 + 0.53056
               = 0.940

Page 33: Data Mining

Entropy

- The entropy of a completely pure set is 0, and it is 1 for a set with equal occurrences of both classes.

    Entropy[14, 0] = - (14/14) log2(14/14) - (0/14) log2(0/14)
                   = -(1)(0) - 0
                   = 0

    Entropy[7, 7] = - (7/14) log2(7/14) - (7/14) log2(7/14)
                  = - (0.5) log2(0.5) - (0.5) log2(0.5)
                  = - (0.5)(-1) - (0.5)(-1)
                  = 0.5 + 0.5 = 1
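A minimal sketch (not from the slides) that reproduces the entropy values used above:

```python
from math import log2

def entropy(p, n):
    """Entropy of a two-class set with p instances of P and n instances of N."""
    t = p + n
    total = 0.0
    for count in (p, n):
        if count:                       # by convention, 0 * log2(0) = 0
            total -= (count / t) * log2(count / t)
    return total

print(round(entropy(9, 5), 3))   # 0.94  (the example above)
print(entropy(14, 0))            # 0.0   (completely pure set)
print(entropy(7, 7))             # 1.0   (evenly mixed set)
```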

Page 34: Data Mining

Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

    age     p_i   n_i   I(p_i, n_i)
    <=30    2     3     0.971
    31…40   4     0     0
    >40     3     2     0.971

- (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

    E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

    Gain(age) = I(p, n) - E(age) = 0.246

- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
- Since age has the highest information gain among the attributes, it is selected as the test attribute.
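A minimal sketch (not from the slides) that recomputes these information gains from the 14-record buys_computer table; the exact values differ slightly from the slides, which round intermediate results:

```python
from collections import Counter
from math import log2

# The 14 records of the buys_computer table: (age, income, student, credit_rating, class).
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """I(s1, ..., sm): expected information of a class distribution."""
    t = len(labels)
    return -sum((c / t) * log2(c / t) for c in Counter(labels).values())

def gain(rows, a):
    """Gain(A) = I(s1, ..., sm) - E(A) for the attribute at column index a."""
    expected = 0.0
    for value in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([r[-1] for r in rows]) - expected

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(data, i), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slides report 0.246 and 0.151 because they round intermediate values)
```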

Page 35: Data Mining

Exercise 1

- The following table consists of training data from an employee database.
- Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data.

Page 36: Data Mining


Solution 1

Page 37: Data Mining

Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued.
  - Assume there exist several possible split values for each attribute.
  - May need other tools, such as clustering, to get the possible split values.
  - Can be modified for categorical attributes.

Page 38: Data Mining

Gini Index (IBM IntelligentMiner)

- If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.

- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2),   where N = N1 + N2

- The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).

Page 39: Data Mining


Exercise 2

Page 40: Data Mining

Solution 2

SPLIT: Age <= 50

                 | High | Low | Total
    S1 (left)    |  8   | 11  |  19
    S2 (right)   | 11   | 10  |  21

    For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58
    For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48
    Gini(S1) = 1 - [0.42x0.42 + 0.58x0.58] = 1 - [0.18 + 0.34] = 1 - 0.52 = 0.48
    Gini(S2) = 1 - [0.52x0.52 + 0.48x0.48] = 1 - [0.27 + 0.23] = 1 - 0.5 = 0.5
    Gini-Split(Age <= 50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49

SPLIT: Salary <= 65K

                 | High | Low | Total
    S1 (top)     |  18  |  5  |  23
    S2 (bottom)  |   1  | 16  |  17

    For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22
    For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94
    Gini(S1) = 1 - [0.78x0.78 + 0.22x0.22] = 1 - [0.61 + 0.05] = 1 - 0.66 = 0.34
    Gini(S2) = 1 - [0.06x0.06 + 0.94x0.94] = 1 - [0.004 + 0.884] = 1 - 0.89 = 0.11
    Gini-Split(Salary <= 65K) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25
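A minimal sketch (not from the slides) that recomputes the two candidate splits directly from their class counts, without the intermediate rounding used above:

```python
def gini(counts):
    """Gini index of a partition given its class counts, e.g. (8, 11)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(*partitions):
    """Weighted Gini index of a split into several partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini_split((8, 11), (11, 10)), 3))   # Age <= 50     -> 0.493 (0.49 in the slides)
print(round(gini_split((18, 5), (1, 16)), 3))    # Salary <= 65K -> 0.243 (0.25 in the slides)
```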

Page 41: Data Mining

Exercise 3

- In the previous exercise, which of the two split points gives the better split of the data? Why?

Page 42: Data Mining

Solution 3

- Intuitively, Salary <= 65K is the better split point, since it produces relatively "pure" partitions, whereas Age <= 50 results in more mixed partitions (just look at the distribution of Highs and Lows in S1 and S2).
- More formally, consider the properties of the Gini index. If a partition is totally pure, i.e., has all elements from the same class, then gini(S) = 1 - [1x1 + 0x0] = 1 - 1 = 0 (for two classes).
- On the other hand, if the classes are totally mixed, i.e., both classes have equal probability, then gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1 - [0.25 + 0.25] = 0.5.
- In other words, the closer the gini value is to 0, the better the partition. Since Salary has the lower gini, it is the better split.

Page 43: Data Mining

Clustering

Page 44: Data Mining

Clustering: Definition

- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - data points in one cluster are more similar to one another;
  - data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.

Page 45: Data Mining

Clustering: Illustration

- Euclidean-distance-based clustering in 3-D space:
  - intracluster distances are minimized;
  - intercluster distances are maximized.

Page 46: Data Mining

Clustering: Application 1

- Market Segmentation
  - Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Page 47: Data Mining

Clustering: Application 2

- Document Clustering
  - Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster.
  - Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Page 48: Data Mining

k-means

Page 49: Data Mining

Clustering

- Clustering is the process of grouping data into clusters so that objects within a cluster are similar to one another, but very dissimilar to objects in other clusters.
- The similarities are assessed based on the attribute values describing these objects.

Page 50: Data Mining

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to step 2; stop when there are no more new assignments.
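A minimal sketch (not from the slides) of these four steps for points in d-dimensional space; the random initialization and the tie-breaking rule are assumptions, and the sample points are the ones used in the 2-dimensional example later in this deck.

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1/2: pick k initial seed points (here: k random objects).
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Step 2: recompute the centroid (mean point) of each cluster.
        new_centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when no assignment (and hence no centroid) changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

print(kmeans([(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)], k=2)[0])
```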

Page 51: Data Mining

The K-Means Clustering Method

- Example (K = 2)

[Figure: a sequence of scatter plots illustrating one k-means run: arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, reassign, and repeat.]

Page 52: Data Mining

K-Means Clustering

- K-means is a partition-based clustering algorithm.
- K-means' goal: partition database D into K parts, with little similarity across groups but great similarity within a group. More specifically, K-means aims to minimize the mean square error of each point in a cluster with respect to its cluster centroid.

Page 53: Data Mining

K-Means Example

- Consider the following one-dimensional database with attribute A1:

    A1: 25, 11, 30, 20, 3, 12, 10, 4, 2

- Let us use the k-means algorithm to partition this database into k = 2 clusters. We begin by choosing two random starting points, which will serve as the centroids of the two clusters:

    µC1 = 2   and   µC2 = 4

Page 54: Data Mining

- To form clusters, we assign each point in the database to the nearest centroid.
- For instance, 10 is closer to c2 than to c1.
- If a point is the same distance from the two centroids, such as point 3 in our example, we make an arbitrary assignment.

    A1   Cluster
    25   C2
    11   C2
    30   C2
    20   C2
    3    C1
    12   C2
    10   C2
    4    C2
    2    C1

Page 55: Data Mining

- Once all points have been assigned, we recompute the means of the clusters:

    µC1 = (2 + 3) / 2 = 2.5

    µC2 = (4 + 10 + 12 + 20 + 30 + 11 + 25) / 7 = 112 / 7 = 16

Page 56: Data Mining

- We then reassign each point to the two clusters based on the new means.
- Remark: point 4 now belongs to cluster C1.
- The steps are repeated until the means converge to their optimal values. In each iteration, the means are recomputed and all points are reassigned.

    A1   Cluster
    25   C2
    11   C2
    30   C2
    20   C2
    3    C1
    12   C2
    10   C2
    4    C1
    2    C1

Page 57: Data Mining

- In this example, only one more iteration is needed before the means converge. We compute the new means:

    µC1 = (2 + 3 + 4) / 3 = 3

    µC2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18

Now if we reassign the points there is no change in the clusters. Hence the means have converged to their optimal values and the algorithm terminates.

Page 58: Data Mining


Visualization of k-means algorithm

Page 59: Data Mining

Exercise

- Apply the k-means algorithm to the following 1-dimensional points (for k = 2): 1, 2, 3, 4, 6, 7, 8, 9.
- Use 1 and 2 as the starting centroids.

Page 60: Data Mining

Solution

    Iteration #1
      centroid 1:    {1}                new mean = 1
      centroid 2:    {2,3,4,6,7,8,9}    new mean = 5.57

    Iteration #2
      centroid 1:    {1,2,3}            new mean = 2
      centroid 5.57: {4,6,7,8,9}        new mean = 6.8

    Iteration #3
      centroid 2:    {1,2,3,4}          new mean = 2.5
      centroid 6.8:  {6,7,8,9}          new mean = 7.5

    Iteration #4
      centroid 2.5:  {1,2,3,4}          new mean = 2.5
      centroid 7.5:  {6,7,8,9}          new mean = 7.5

The means haven't changed, so we stop iterating. The final clusters are {1,2,3,4} and {6,7,8,9}.
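A small sketch (not from the slides) that replays this run and prints the clusters and means at each iteration:

```python
points = [1, 2, 3, 4, 6, 7, 8, 9]
centroids = [1.0, 2.0]                       # the given starting centroids

for it in range(1, 10):
    clusters = [[], []]
    for p in points:
        # assign to the nearest centroid (ties go to the first one)
        clusters[0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1].append(p)
    new_centroids = [sum(c) / len(c) for c in clusters]
    print(f"Iteration #{it}: {clusters[0]} mean={new_centroids[0]:.2f}, "
          f"{clusters[1]} mean={new_centroids[1]:.2f}")
    if new_centroids == centroids:           # means unchanged: stop
        break
    centroids = new_centroids
```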

Page 61: Data Mining

K-Means for a 2-dimensional database

- Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering:
  - x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
- Suppose that the required number of clusters is 2.
- Initially, the clusters are formed from a random distribution of the samples:
  - C1 = {x1, x2, x4} and C2 = {x3, x5}.

Page 62: Data Mining

Centroid Calculation

- Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}.
- Each Ck has nk samples and each sample belongs to exactly one cluster, so Σ nk = N, where k = 1, …, K.
- The mean vector Mk of cluster Ck is defined as the centroid of the cluster:

    M_k = (1/n_k) Σ_{i=1}^{n_k} x_ik

  where x_ik is the i-th sample belonging to cluster Ck.

- In our example, the centroids of the two initial clusters are
  - M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
  - M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}

Page 63: Data Mining

The Square-error of the Cluster

- The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid.
- This error is called the within-cluster variation:

    e_k^2 = Σ_{i=1}^{n_k} (x_ik - M_k)^2

- The within-cluster variations after the initial random distribution of samples are

    e_1^2 = [(0 - 1.66)^2 + (2 - 0.66)^2] + [(0 - 1.66)^2 + (0 - 0.66)^2] + [(5 - 1.66)^2 + (0 - 0.66)^2] = 19.36

    e_2^2 = [(1.5 - 3.25)^2 + (0 - 1)^2] + [(5 - 3.25)^2 + (2 - 1)^2] = 8.12

Page 64: Data Mining

Total Square-error

- The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:

    E^2 = Σ_{k=1}^{K} e_k^2

- The total square-error is

    E^2 = e_1^2 + e_2^2 = 19.36 + 8.12 = 27.48

Page 65: Data Mining

- When we reassign all samples, depending on the minimum distance from the centroids M1 and M2, the new distribution of samples inside the clusters will be:

    d(M1, x1) = (1.66^2 + 1.34^2)^(1/2) = 2.14  and  d(M2, x1) = 3.40  ⇒  x1 ∈ C1
    d(M1, x2) = 1.79  and  d(M2, x2) = 3.40  ⇒  x2 ∈ C1
    d(M1, x3) = 0.83  and  d(M2, x3) = 2.01  ⇒  x3 ∈ C1
    d(M1, x4) = 3.41  and  d(M2, x4) = 2.01  ⇒  x4 ∈ C2
    d(M1, x5) = 3.60  and  d(M2, x5) = 2.01  ⇒  x5 ∈ C2

- The above calculation is based on the Euclidean distance formula:

    d(x_i, x_j) = ( Σ_{k=1}^{m} (x_ik - x_jk)^2 )^(1/2)

Page 66: Data Mining

- The new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
  - M1 = {0.5, 0.67}
  - M2 = {5.0, 1.0}
- The corresponding within-cluster variations and the total square-error are
  - e_1^2 = 4.17
  - e_2^2 = 2.00
  - E^2 = 6.17
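A small sketch (not from the slides) that reproduces the centroids, within-cluster variations and total square-errors of this worked example:

```python
def centroid(cluster):
    return tuple(sum(vals) / len(cluster) for vals in zip(*cluster))

def within_cluster_variation(cluster):
    m = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)

x1, x2, x3, x4, x5 = (0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)

for C1, C2 in [([x1, x2, x4], [x3, x5]),    # initial random partition
               ([x1, x2, x3], [x4, x5])]:   # after the reassignment
    e1, e2 = within_cluster_variation(C1), within_cluster_variation(C2)
    print([round(v, 2) for v in centroid(C1) + centroid(C2)],
          round(e1, 2), round(e2, 2), round(e1 + e2, 2))
# First partition:  e1^2 ≈ 19.33, e2^2 ≈ 8.13, E^2 ≈ 27.46 (the slides get 19.36 / 8.12 /
#                   27.48 by rounding the centroid to (1.66, 0.66) before computing).
# Second partition: e1^2 ≈ 4.17, e2^2 = 2.0, E^2 ≈ 6.17, as above.
```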

Page 67: Data Mining

The cluster membership stabilizes…

- After the first iteration, the total square-error is significantly reduced (from 27.48 to 6.17).
- In this example, if we analyze the distances between the new centroids and the samples, the second iteration assigns the samples to the same clusters.
- Thus there is no further reassignment and the algorithm halts.

Page 68: Data Mining

Variations of the K-Means Method

- A few variants of k-means differ in
  - the selection of the initial k means;
  - the strategies used to calculate cluster means.
- Handling categorical data: k-modes (Huang '98)
  - Replacing means of clusters with modes.
  - Using new dissimilarity measures to deal with categorical objects.
  - Using a frequency-based method to update modes of clusters.
  - A mixture of categorical and numerical data: the k-prototype method.

Page 69: Data Mining

What is the problem of the k-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

[Figure: two scatter plots (axes 0 to 10) contrasting a mean-based cluster center with a medoid-based one.]

Page 70: Data Mining

Exercise 2

- Let the set X consist of the following sample points in 2-dimensional space:
  - X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)}
- Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.
- What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?

Page 71: Data Mining

Solution 2

- For each data point, calculate the distance to each centroid:

    point   x     y     d(xi, c1)   d(xi, c2)
    x1      1     2     0.707107    2.236068
    x2      1.5   2.2   0.3         1.920937
    x3      3     2.3   1.513275    1.3
    x4      2.5   -1    3.640055    2.061553
    x5      0     1.6   1.749286    3.059412
    x6      -1    1.5   2.692582    4.031129

Page 72: Data Mining

- It follows that x1, x2, x5 and x6 are closer to c1 and the other points are closer to c2. Hence replace c1 with the average of x1, x2, x5 and x6, and replace c2 with the average of x3 and x4. This gives:
- c1' = (0.375, 1.825)
- c2' = (2.75, 0.65)
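A small sketch (not from the slides) that carries out this single iteration and prints the revised centroids:

```python
from math import dist   # Euclidean distance, Python 3.8+

X = [(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)]
c1, c2 = (1.5, 2.5), (3, 1)

near_c1 = [p for p in X if dist(p, c1) <= dist(p, c2)]
near_c2 = [p for p in X if dist(p, c1) > dist(p, c2)]

def mean(points):
    return tuple(sum(vals) / len(points) for vals in zip(*points))

print(tuple(round(v, 3) for v in mean(near_c1)))   # (0.375, 1.825)
print(tuple(round(v, 3) for v in mean(near_c2)))   # (2.75, 0.65)
```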

Page 73: Data Mining


Association Rule Discovery

Page 74: Data Mining

Market-basket problem

- We are given a set of items and a large collection of transactions, which are subsets (baskets) of these items.
- Task: find relationships between the presence of various items within these baskets.
- Example: analyze customers' buying habits by finding associations between the different items that customers place in their shopping baskets.

Page 75: Data Mining

Associations discovery

- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Associations discovery uncovers affinities amongst a collection of items.
  - Affinities are represented by association rules.
  - Associations discovery is an unsupervised approach to data mining.

Page 76: Data Mining

Association Rule: Application 2

- Supermarket shelf management
  - Goal: identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So don't be surprised if you find six-packs stacked next to diapers!

Page 77: Data Mining

Association Rule: Application 3

- Inventory Management
  - Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  - Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Page 78: Data Mining

What is a rule?

- The rule in a rule induction system comes in the form "If this and this and this, then this".
- For a rule to be useful, two pieces of information are needed:
  1. Accuracy (the lower the accuracy, the closer the rule comes to random guessing)
  2. Coverage (how often the rule can be used)
- A rule consists of two parts:
  1. The antecedent, or LHS
  2. The consequent, or RHS

Page 79: Data Mining

An example

    Rule                                                                 Accuracy   Coverage
    If breakfast cereal purchased, then milk will be purchased           85%        20%
    If bread purchased, then Swiss cheese will be purchased              15%        6%
    If 42 years old and purchased dry roasted peanuts, beer purchased    95%        0.01%

Page 80: Data Mining


What is association rule mining?

Page 81: Data Mining


Frequent Itemset

Page 82: Data Mining


Support and Confidence

Page 83: Data Mining

What to do with a rule?

- Target the antecedent
- Target the consequent
- Target based on accuracy
- Target based on coverage
- Target based on "interestingness"

- The antecedent can be one or more conditions, all of which must be true in order for the consequent to be true at the given accuracy.
- Generally the consequent is just a simple condition (e.g. purchasing one grocery store item) rather than multiple items.

Page 84: Data Mining

- All rules that have a certain value for the antecedent are gathered and presented to the user.
  - For example, the grocery store may request all rules that have nails, bolts or screws in the antecedent, and try to conclude whether discontinuing sales of these lower-priced items will have any effect on the higher-margin items like hammers.
- All rules that have a certain value for the consequent are gathered. This can be used to understand what affects the consequent.
  - For instance, it might be useful to know which rules have "coffee" in their RHS. The store owner might want to put coffee close to other items in order to increase sales of both items, or a manufacturer may determine in which magazine to place its next coupons.

Page 85: Data Mining

- Sometimes accuracy is most important. Highly accurate rules (80 or 90%) imply strong relationships, even if the coverage is very low.
  - For example, a rule may apply only one time in 1000, but if applying it that one time is very profitable, it can still be worthwhile. This is how most successful data mining applications in the financial markets work: looking for that limited amount of time in which a very confident prediction can be made.
- Sometimes users want to know the rules that are most widely applicable. By looking at rules ranked by coverage, they can get a high-level view of what is happening in the database most of the time.
- Rules are interesting when they have high coverage and high accuracy but deviate from the norm. Eventually a trade-off between coverage and accuracy can be made using a measure of interestingness.

Page 86: Data Mining

Evaluating and using rules

- Look at simple statistics.
- Use conjunctions and disjunctions.
- Define "interestingness".
- Other heuristics.

Page 87: Data Mining

Using conjunctions and disjunctions

- This dramatically increases or decreases the coverage. For example, "if diet soda or regular soda or beer, then potato chips" covers a lot more shopping baskets than just one of the constraints by itself.

Page 88: Data Mining

Defining "interestingness"

- Interestingness must have four basic behaviors:
  1. Interestingness = 0 when the rule's accuracy equals the background accuracy (the a priori probability of the consequent); such rules are discarded.
  2. Interestingness increases as accuracy increases, if coverage is fixed.
  3. Interestingness increases or decreases with coverage, if accuracy stays fixed.
  4. Interestingness decreases with coverage for a fixed number of correct responses.

Page 89: Data Mining

Other Heuristics

- Look at the actual number of records covered, not just a probability or a percentage.
- Compare a given pattern to random chance; this gives an "out of the ordinary" measure.
- Keep it simple.

Page 90: Data Mining


Example

Here t supports items C, DM, and CO. The item DM is supported by 4 out of 6 transactions in T. Thus, the support of DM is 66.6%.

Page 91: Data Mining


Definition

Page 92: Data Mining

Association Rules

- Algorithms that obtain association rules from data usually divide the task into two parts:
  - find the frequent itemsets, and
  - form the rules from them.

Page 93: Data Mining

Association Rules

- The problem of mining association rules can be divided into two subproblems:

Page 94: Data Mining


Definitions

Page 95: Data Mining

a priori algorithm

- Agrawal and Srikant, 1994.
- It is also called the level-wise algorithm.
  - It is the most widely accepted algorithm for finding all the frequent sets.
  - It makes use of the downward closure property.
  - The algorithm is a bottom-up search, progressing upward level-wise in the lattice.
- The interesting fact:
  - before reading the database at every level, it prunes many of the sets which are unlikely to be frequent sets.

Page 96: Data Mining


a priori algorithm

Page 97: Data Mining


a priori candidate-generation method

Page 98: Data Mining


Pruning algorithm

Page 99: Data Mining


a priori Algorithm

Page 100: Data Mining

Exercise 3

Suppose that L3 is the list

    {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}

Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?

Page 101: Data Mining

Solution 3

- At the join step of the Apriori algorithm, each member (set) is compared with every other member.
- If all the elements of the two members are identical except the right-most ones, the union of the two sets is placed into C4.
- For the members of L3 given, the following sets of four elements are placed into C4:

    {a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t} and {p,q,s,t}

Page 102: Data Mining

Solution 3 (continued)

- At the prune step of the algorithm, each member of C4 is checked to see whether all its subsets of 3 elements are members of L3.
- The result in this case is as follows:

Page 103: Data Mining

Solution 3 (continued)

- Therefore {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and {p,q,s,t} are removed by the prune step,
- leaving C4 as {{a,b,c,d}, {p,q,r,s}}.
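A small sketch (not from the slides) of the join and prune steps applied to the L3 of Exercise 3:

```python
from itertools import combinations

L3 = [{'a','b','c'}, {'a','b','d'}, {'a','c','d'}, {'b','c','d'}, {'b','c','w'},
      {'b','c','x'}, {'p','q','r'}, {'p','q','s'}, {'p','q','t'}, {'p','r','s'},
      {'q','r','s'}]

# Join step: merge two 3-itemsets whose first two (sorted) items agree.
sorted_L3 = [sorted(s) for s in L3]
C4 = []
for i in range(len(sorted_L3)):
    for j in range(i + 1, len(sorted_L3)):
        a, b = sorted_L3[i], sorted_L3[j]
        if a[:-1] == b[:-1]:
            C4.append(set(a) | set(b))

# Prune step: keep a candidate only if every 3-item subset is in L3.
pruned = [c for c in C4 if all(set(s) in L3 for s in combinations(c, 3))]

print(C4)       # the 7 joined candidates
print(pruned)   # [{'a','b','c','d'}, {'p','q','r','s'}]
```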

Page 104: Data Mining

Exercise 4

- Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?

Page 105: Data Mining

Solution 4

- Let us assume that attribute w has three values w1, w2 and w3, and similarly for x, y and z.
- If we arbitrarily select attribute w to be on the right-hand side of each rule, there are 3 possible types of rule:
  - IF … THEN w = w1
  - IF … THEN w = w2
  - IF … THEN w = w3
- Now choose one of these rules, say the first, and calculate how many possible left-hand sides there are for such rules.

Page 106: Data Mining

Solution 4 (continued)

- The number of "attribute = value" terms on the LHS can be 1, 2, or 3.
- Case I: one term on the LHS
  - There are 3 possible attributes: x, y, and z. Each has 3 possible values, so there are 3x3 = 9 possible LHS, e.g. IF x = x1.

Page 107: Data Mining

Solution 4 (continued)

- Case II: two terms on the LHS
  - There are 3 ways in which a combination of 2 attributes may appear on the LHS: x and y, y and z, and x and z.
  - Each attribute has 3 values, so for each pair there are 3x3 = 9 possible LHS, e.g. IF x = x1 AND y = y1.
  - There are 3 possible pairs of attributes, so the total number of possible LHS is 3x9 = 27.

Page 108: Data Mining

Solution 4 (continued)

- Case III: three terms on the LHS
  - All 3 attributes x, y and z must be on the LHS.
  - Each has 3 values, so there are 3x3x3 = 27 possible LHS, e.g. IF x = x1 AND y = y1 AND z = z1.
- Thus, for each of the 3 possible "w = value" terms on the RHS, the total number of LHS with 1, 2 or 3 terms is 9 + 27 + 27 = 63.
- So there are 3x63 = 189 possible rules with attribute w on the RHS.
- The attribute on the RHS could be any of the four attributes (not just w). Therefore the total possible number of rules is 4x189 = 756.
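A small sketch (not from the slides) that confirms the count by enumerating the possible left-hand sides:

```python
from itertools import combinations

n_attrs, n_values = 4, 3             # w, x, y, z, each with three values

lhs_count = 0
for k in (1, 2, 3):                  # number of "attribute = value" terms on the LHS
    # choose k of the 3 remaining attributes, each taking any of its 3 values
    lhs_count += len(list(combinations(range(n_attrs - 1), k))) * n_values ** k

total = lhs_count * n_values * n_attrs   # 3 RHS values, RHS attribute can be any of the 4
print(lhs_count, total)                  # 63 756
```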

Page 109: Data Mining

References

- R. Akerkar and P. Lingras. Building an Intelligent Web: Theory & Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009).
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.