Data Mining


Page 1: Data Mining

Data Mining

Rajendra Akerkar

Page 2: Data Mining

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Is everything "data mining"?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs
- The aim: build computer programs that sift through databases automatically, seeking regularities or patterns

Page 3: Data Mining

Data Mining — What’s in a Name?

- Data Mining
- Knowledge Mining
- Knowledge Discovery in Databases
- Data Archaeology
- Data Dredging
- Database Mining
- Knowledge Extraction
- Data Pattern Processing
- Information Harvesting
- Siftware

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques.

Page 4: Data Mining

Definition

- Several definitions:
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Page 5: Data Mining

What is (not) Data Mining?

- What is data mining?
  - Noticing that certain surnames are more common in certain Indian states (Joshi, Patil, Kulkarni, … in the Pune area).
  - Grouping together similar documents returned by a search engine according to their context (e.g. Google Scholar, Amazon.com).
- What is not data mining?
  - Looking up a phone number in a phone directory.
  - Querying a Web search engine for information about "Pune".

Page 6: Data Mining

Origins of Data Mining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - enormity of data
  - high dimensionality of data
  - heterogeneous, distributed nature of data

[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.]

Page 7: Data Mining

Data Mining Tasks

- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.

Page 8: Data Mining

Data Mining Tasks...

- Classification [Predictive]: predicting an item's class
- Clustering [Descriptive]: finding clusters in the data
- Association Rule Discovery [Descriptive]: finding frequently co-occurring events
- Deviation/Anomaly Detection [Predictive]: finding changes

Page 9: Data Mining

Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Page 10: Data Mining

Classification Example

Training set (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

Test set (class to be predicted):

    Refund  Marital Status  Taxable Income  Cheat
    No      Single          75K             ?
    Yes     Married         50K             ?
    No      Married         150K            ?
    Yes     Divorced        90K             ?
    No      Single          40K             ?
    No      Married         80K             ?

Learn a classifier (model) from the training set and apply it to the test set.
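The slides do not name a particular implementation, so the following is only an illustrative sketch of how such a training/test split might be used in practice; it assumes pandas and scikit-learn are available, and the choice of a decision tree (as in the following slides) is an assumption.

```python
# Illustrative sketch only: the slides do not prescribe a particular learner.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # taxable income in K
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
    "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
    "Income":  [75, 50, 150, 90, 40, 80],
})

# One-hot encode the categorical attributes; align test columns with training columns.
X_train = pd.get_dummies(train.drop(columns="Cheat"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, train["Cheat"])
print(model.predict(X_test))    # predicted class labels for the six test records
```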

Page 11: Data Mining

Classification: Application 1

- Direct Marketing
  - Goal: reduce the cost of mailing by targeting the set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
    - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 12: Data Mining

Classification: Application 2

- Fraud Detection
  - Goal: predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes (when a customer buys, what they buy, how often they pay on time, etc.).
    - Label past transactions as fraudulent or fair. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Page 13: Data Mining

Classification: Application 3

- Customer Attrition/Churn
  - Goal: predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.).
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Page 14: Data Mining

Decision Tree

Page 15: Data Mining

Introduction

- A classification scheme which generates a tree and a set of rules from a given data set.
- The set of records available for developing classification methods is divided into two disjoint subsets: a training set and a test set.
- The attributes of the records are categorized into two types:
  - Attributes whose domain is numerical are called numerical attributes.
  - Attributes whose domain is not numerical are called categorical attributes.

Page 16: Data Mining

Introduction

- A decision tree is a tree with the following properties:
  - An inner node represents an attribute.
  - An edge represents a test on the attribute of the parent node.
  - A leaf represents one of the classes.
- Construction of a decision tree
  - Based on the training data
  - Top-down strategy

Page 17: Data Mining

Decision Tree Example

- The data set has five attributes.
- There is a special attribute: the attribute class is the class label.
- The attributes temp (temperature) and humidity are numerical attributes.
- The other attributes are categorical, that is, they cannot be ordered.
- Based on the training data set, we want to find a set of rules that tell us which values of outlook, temperature, humidity and wind determine whether or not to play golf.

Page 18: Data Mining

Decision Tree Example

- We have five leaf nodes.
- In a decision tree, each leaf node represents a rule.
- We have the following rules corresponding to the tree given in the figure:
  - RULE 1: If it is sunny and the humidity is not above 75%, then play.
  - RULE 2: If it is sunny and the humidity is above 75%, then do not play.
  - RULE 3: If it is overcast, then play.
  - RULE 4: If it is rainy and not windy, then play.
  - RULE 5: If it is rainy and windy, then don't play.

Page 19: Data Mining

Classification

- The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
- A record enters the tree at the root node.
- At the root, a test is applied to determine which child node the record will encounter next.
- This process is repeated until the record arrives at a leaf node.
- All the records that end up at a given leaf of the tree are classified in the same way.
- There is a unique path from the root to each leaf.
- The path is a rule which is used to classify the records.

Page 20: Data Mining

- In our tree, we can carry out the classification for an unknown record as follows.
- Let us assume that, for the record, we know the values of the first four attributes (but not the value of the class attribute):
- outlook = rain; temp = 70; humidity = 65; windy = true.

Page 21: Data Mining

- We start from the root node and check the value of the attribute associated with the root node.
- This attribute is the splitting attribute at this node.
- In a decision tree, every node has an associated attribute, called the splitting attribute.
- In our example, outlook is the splitting attribute at the root.
- Since, for the given record, outlook = rain, we move to the right-most child node of the root.
- At this node, the splitting attribute is windy, and for the record we want to classify, windy = true.
- Hence, we move to the left child node and conclude that the class label is "no play".
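As an illustration (not code from the slides), the five rules above can be written directly as a traversal function; the record from this walk-through comes out as "no play", as stated.

```python
# A rule-based traversal of the play-golf tree described by RULES 1-5 above
# (illustrative sketch only).

def classify(record):
    """Follow the path root -> leaf for one record and return its class."""
    if record["outlook"] == "sunny":                  # splitting attribute at the root
        return "play" if record["humidity"] <= 75 else "no play"
    if record["outlook"] == "overcast":
        return "play"
    return "no play" if record["windy"] else "play"   # outlook == "rain"

# The unknown record from this walk-through:
print(classify({"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}))  # no play
```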

Page 22: Data Mining

- The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.
- Consider RULE 1: if it is sunny and the humidity is not above 75%, then play.
- There are two records of the test data set satisfying outlook = sunny and humidity not above 75, and only one of these is correctly classified as play.
- Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is 0.66.

Page 23: Data Mining

Concept of Categorical Attributes

- Consider the following training data set.
- There are three attributes, namely age, pincode and class.
- The attribute class is used as the class label.
- The attribute age is a numeric attribute, whereas pincode is a categorical one.
- Though the domain of pincode is numeric, no ordering can be defined among pincode values: you cannot derive any useful information from the fact that one pincode is greater than another.

Page 24: Data Mining

- The figure gives a decision tree for the training data.
- The splitting attribute at the root is pincode and the splitting criterion here is pincode = 500 046.
- Similarly, for the left child node, the splitting criterion is age < 48 (the splitting attribute is age).
- Although the right child node has the same splitting attribute, the splitting criterion is different.
- At the root level, we have 9 records. The associated splitting criterion is pincode = 500 046.
- As a result, we split the records into two subsets: records 1, 2, 4, 8 and 9 go to the left child node and the rest to the right node.
- The process is repeated at every node.

Page 25: Data Mining

Advantages and Shortcomings of Decision Tree Classification

- A decision tree construction process is concerned with identifying the splitting attribute and splitting criterion at every level of the tree.
- Major strengths:
  - Decision trees are able to generate understandable rules.
  - They are able to handle both numerical and categorical attributes.
  - They provide a clear indication of which fields are most important for prediction or classification.
- Weaknesses:
  - The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be examined before its best split can be found.
  - Some decision tree algorithms can only deal with binary-valued target classes.

Page 26: Data Mining

Iterative Dichotomizer (ID3)

- Quinlan (1986)
- Each node corresponds to a splitting attribute.
- Each arc is a possible value of that attribute.
- At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root.
- Entropy is used to measure how informative a node is.
- The algorithm uses the criterion of information gain to determine the goodness of a split.
  - The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.

Page 27: Data Mining

Training Dataset

    age    income  student  credit_rating  buys_computer
    <=30   high    no       fair           no
    <=30   high    no       excellent      no
    31…40  high    no       fair           yes
    >40    medium  no       fair           yes
    >40    low     yes      fair           yes
    >40    low     yes      excellent      no
    31…40  low     yes      excellent      yes
    <=30   medium  no       fair           no
    <=30   low     yes      fair           yes
    >40    medium  yes      fair           yes
    <=30   medium  yes      excellent      yes
    31…40  medium  no       excellent      yes
    31…40  high    yes      fair           yes
    >40    medium  no       excellent      no

This follows an example from Quinlan's ID3. The class label attribute, buys_computer, has two distinct values, so there are two distinct classes (m = 2). Class C1 corresponds to yes and class C2 corresponds to no. There are 9 samples of class yes and 5 samples of class no.

Page 28: Data Mining

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

What are the rules?

Page 29: Data Mining

Solution (Rules)

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Page 30: Data Mining

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down recursive divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.

Page 31: Data Mining

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- S contains s_i tuples of class C_i for i = 1, …, m.
- The information (encoded in bits) required to classify an arbitrary tuple is

    I(s_1, s_2, …, s_m) = - Σ_{i=1}^{m} (s_i / s) log2(s_i / s)

- The entropy of attribute A with values {a_1, a_2, …, a_v} is

    E(A) = Σ_{j=1}^{v} ((s_{1j} + … + s_{mj}) / s) · I(s_{1j}, …, s_{mj})

- The information gained by branching on attribute A is

    Gain(A) = I(s_1, s_2, …, s_m) - E(A)

Page 32: Data Mining

Entropy

- Entropy measures the homogeneity (purity) of a set of examples.
- It gives the information content of the set in terms of the class labels of the examples.
- Consider a set of examples S with two classes, P and N. Let the set have p instances of class P and n instances of class N.
- The total number of instances is then t = p + n. The pair [p, n] can be seen as the class distribution of S.
- The entropy for S is defined as

    Entropy(S) = - (p/t) log2(p/t) - (n/t) log2(n/t)

- Example: let a set of examples consist of 9 instances of class positive and 5 instances of class negative.
- Answer: p = 9 and n = 5, so

    Entropy(S) = - (9/14) log2(9/14) - (5/14) log2(5/14)
               = -(0.64286)(-0.6375) - (0.35714)(-1.48557)
               = 0.40982 + 0.53056
               = 0.940

Page 33: Data Mining

Entropy

- The entropy of a completely pure set is 0, and it is 1 for a set with equal occurrences of both classes.

    Entropy[14, 0] = - (14/14) log2(14/14) - (0/14) log2(0/14)
                   = -(1)(0) - 0
                   = 0

    Entropy[7, 7] = - (7/14) log2(7/14) - (7/14) log2(7/14)
                  = - (0.5) log2(0.5) - (0.5) log2(0.5)
                  = - (0.5)(-1) - (0.5)(-1)
                  = 0.5 + 0.5 = 1
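A minimal sketch (not from the slides) that reproduces the entropy values used above:

```python
from math import log2

def entropy(p, n):
    """Entropy of a two-class set with p instances of P and n instances of N."""
    t = p + n
    total = 0.0
    for count in (p, n):
        if count:                       # by convention, 0 * log2(0) = 0
            total -= (count / t) * log2(count / t)
    return total

print(round(entropy(9, 5), 3))   # 0.94  (the example above)
print(entropy(14, 0))            # 0.0   (completely pure set)
print(entropy(7, 7))             # 1.0   (evenly mixed set)
```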

Page 34: Data Mining

Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

    age     p_i   n_i   I(p_i, n_i)
    <=30    2     3     0.971
    31…40   4     0     0
    >40     3     2     0.971

- (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

    E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

    Gain(age) = I(p, n) - E(age) = 0.246

- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
- Since age has the highest information gain among the attributes, it is selected as the test attribute.
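A minimal sketch (not from the slides) that recomputes these information gains from the 14-record buys_computer table; the exact values differ slightly from the slides, which round intermediate results:

```python
from collections import Counter
from math import log2

# The 14 records of the buys_computer table: (age, income, student, credit_rating, class).
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """I(s1, ..., sm): expected information of a class distribution."""
    t = len(labels)
    return -sum((c / t) * log2(c / t) for c in Counter(labels).values())

def gain(rows, a):
    """Gain(A) = I(s1, ..., sm) - E(A) for the attribute at column index a."""
    expected = 0.0
    for value in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([r[-1] for r in rows]) - expected

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(data, i), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slides report 0.246 and 0.151 because they round intermediate values)
```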

Page 35: Data Mining

Exercise 1

- The following table consists of training data from an employee database.
- Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data.

Page 36: Data Mining


Solution 1

Page 37: Data Mining

Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued.
  - Assume there exist several possible split values for each attribute.
  - May need other tools, such as clustering, to get the possible split values.
  - Can be modified for categorical attributes.

Page 38: Data Mining

Gini Index (IBM IntelligentMiner)

- If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.

- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2),   where N = N1 + N2

- The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).

Page 39: Data Mining


Exercise 2

Page 40: Data Mining

Solution 2

SPLIT: Age <= 50

                 | High | Low | Total
    S1 (left)    |  8   | 11  |  19
    S2 (right)   | 11   | 10  |  21

    For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58
    For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48
    Gini(S1) = 1 - [0.42x0.42 + 0.58x0.58] = 1 - [0.18 + 0.34] = 1 - 0.52 = 0.48
    Gini(S2) = 1 - [0.52x0.52 + 0.48x0.48] = 1 - [0.27 + 0.23] = 1 - 0.5 = 0.5
    Gini-Split(Age <= 50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49

SPLIT: Salary <= 65K

                 | High | Low | Total
    S1 (top)     |  18  |  5  |  23
    S2 (bottom)  |   1  | 16  |  17

    For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22
    For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94
    Gini(S1) = 1 - [0.78x0.78 + 0.22x0.22] = 1 - [0.61 + 0.05] = 1 - 0.66 = 0.34
    Gini(S2) = 1 - [0.06x0.06 + 0.94x0.94] = 1 - [0.004 + 0.884] = 1 - 0.89 = 0.11
    Gini-Split(Salary <= 65K) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25
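A minimal sketch (not from the slides) that recomputes the two candidate splits directly from their class counts, without the intermediate rounding used above:

```python
def gini(counts):
    """Gini index of a partition given its class counts, e.g. (8, 11)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(*partitions):
    """Weighted Gini index of a split into several partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini_split((8, 11), (11, 10)), 3))   # Age <= 50     -> 0.493 (0.49 in the slides)
print(round(gini_split((18, 5), (1, 16)), 3))    # Salary <= 65K -> 0.243 (0.25 in the slides)
```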

Page 41: Data Mining

Exercise 3

- In the previous exercise, which of the two split points gives the better split of the data? Why?

Page 42: Data Mining

Solution 3

- Intuitively, Salary <= 65K is the better split point, since it produces relatively "pure" partitions, whereas Age <= 50 results in more mixed partitions (just look at the distribution of Highs and Lows in S1 and S2).
- More formally, consider the properties of the Gini index. If a partition is totally pure, i.e., has all elements from the same class, then gini(S) = 1 - [1x1 + 0x0] = 1 - 1 = 0 (for two classes).
- On the other hand, if the classes are totally mixed, i.e., both classes have equal probability, then gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1 - [0.25 + 0.25] = 0.5.
- In other words, the closer the gini value is to 0, the better the partition. Since Salary has the lower gini, it is the better split.

Page 43: Data Mining

Clustering

Page 44: Data Mining

Clustering: Definition

- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - data points in one cluster are more similar to one another;
  - data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.

Page 45: Data Mining

Clustering: Illustration

- Euclidean-distance-based clustering in 3-D space:
  - intracluster distances are minimized;
  - intercluster distances are maximized.

Page 46: Data Mining

Clustering: Application 1

- Market Segmentation
  - Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Page 47: Data Mining

Clustering: Application 2

- Document Clustering
  - Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster.
  - Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Page 48: Data Mining

k-means

Page 49: Data Mining

Clustering

- Clustering is the process of grouping data into clusters so that objects within a cluster are similar to one another, but very dissimilar to objects in other clusters.
- The similarities are assessed based on the attribute values describing these objects.

Page 50: Data Mining

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to step 2; stop when there are no more new assignments.
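A minimal sketch (not from the slides) of these four steps for points in d-dimensional space; the random initialization and the tie-breaking rule are assumptions, and the sample points are the ones used in the 2-dimensional example later in this deck.

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1/2: pick k initial seed points (here: k random objects).
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Step 2: recompute the centroid (mean point) of each cluster.
        new_centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when no assignment (and hence no centroid) changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

print(kmeans([(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)], k=2)[0])
```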

Page 51: Data Mining

The K-Means Clustering Method

- Example (K = 2)

[Figure: a sequence of scatter plots illustrating one k-means run: arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, reassign, and repeat.]

Page 52: Data Mining

K-Means Clustering

- K-means is a partition-based clustering algorithm.
- K-means' goal: partition database D into K parts, with little similarity across groups but great similarity within a group. More specifically, K-means aims to minimize the mean square error of each point in a cluster with respect to its cluster centroid.

Page 53: Data Mining

K-Means Example

- Consider the following one-dimensional database with attribute A1:

    A1: 25, 11, 30, 20, 3, 12, 10, 4, 2

- Let us use the k-means algorithm to partition this database into k = 2 clusters. We begin by choosing two random starting points, which will serve as the centroids of the two clusters:

    µC1 = 2   and   µC2 = 4

Page 54: Data Mining

- To form clusters, we assign each point in the database to the nearest centroid.
- For instance, 10 is closer to c2 than to c1.
- If a point is the same distance from the two centroids, such as point 3 in our example, we make an arbitrary assignment.

    A1   Cluster
    25   C2
    11   C2
    30   C2
    20   C2
    3    C1
    12   C2
    10   C2
    4    C2
    2    C1

Page 55: Data Mining

- Once all points have been assigned, we recompute the means of the clusters:

    µC1 = (2 + 3) / 2 = 2.5

    µC2 = (4 + 10 + 12 + 20 + 30 + 11 + 25) / 7 = 112 / 7 = 16

Page 56: Data Mining

- We then reassign each point to the two clusters based on the new means.
- Remark: point 4 now belongs to cluster C1.
- The steps are repeated until the means converge to their optimal values. In each iteration, the means are recomputed and all points are reassigned.

    A1   Cluster
    25   C2
    11   C2
    30   C2
    20   C2
    3    C1
    12   C2
    10   C2
    4    C1
    2    C1

Page 57: Data Mining

- In this example, only one more iteration is needed before the means converge. We compute the new means:

    µC1 = (2 + 3 + 4) / 3 = 3

    µC2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18

Now if we reassign the points there is no change in the clusters. Hence the means have converged to their optimal values and the algorithm terminates.

Page 58: Data Mining


Visualization of k-means algorithm

Page 59: Data Mining

Exercise

- Apply the k-means algorithm to the following 1-dimensional points (for k = 2): 1, 2, 3, 4, 6, 7, 8, 9.
- Use 1 and 2 as the starting centroids.

Page 60: Data Mining

Solution

    Iteration #1
      centroid 1:    {1}                new mean = 1
      centroid 2:    {2,3,4,6,7,8,9}    new mean = 5.57

    Iteration #2
      centroid 1:    {1,2,3}            new mean = 2
      centroid 5.57: {4,6,7,8,9}        new mean = 6.8

    Iteration #3
      centroid 2:    {1,2,3,4}          new mean = 2.5
      centroid 6.8:  {6,7,8,9}          new mean = 7.5

    Iteration #4
      centroid 2.5:  {1,2,3,4}          new mean = 2.5
      centroid 7.5:  {6,7,8,9}          new mean = 7.5

The means haven't changed, so we stop iterating. The final clusters are {1,2,3,4} and {6,7,8,9}.
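A small sketch (not from the slides) that replays this run and prints the clusters and means at each iteration:

```python
points = [1, 2, 3, 4, 6, 7, 8, 9]
centroids = [1.0, 2.0]                       # the given starting centroids

for it in range(1, 10):
    clusters = [[], []]
    for p in points:
        # assign to the nearest centroid (ties go to the first one)
        clusters[0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1].append(p)
    new_centroids = [sum(c) / len(c) for c in clusters]
    print(f"Iteration #{it}: {clusters[0]} mean={new_centroids[0]:.2f}, "
          f"{clusters[1]} mean={new_centroids[1]:.2f}")
    if new_centroids == centroids:           # means unchanged: stop
        break
    centroids = new_centroids
```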

Page 61: Data Mining

K-Means for a 2-dimensional database

- Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering:
  - x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
- Suppose that the required number of clusters is 2.
- Initially, the clusters are formed from a random distribution of the samples:
  - C1 = {x1, x2, x4} and C2 = {x3, x5}.

Page 62: Data Mining

Centroid Calculation

- Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}.
- Each Ck has nk samples and each sample belongs to exactly one cluster, so Σ nk = N, where k = 1, …, K.
- The mean vector Mk of cluster Ck is defined as the centroid of the cluster:

    M_k = (1/n_k) Σ_{i=1}^{n_k} x_ik

  where x_ik is the i-th sample belonging to cluster Ck.

- In our example, the centroids of the two initial clusters are
  - M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
  - M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}

Page 63: Data Mining

The Square-error of the Cluster

- The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid.
- This error is called the within-cluster variation:

    e_k^2 = Σ_{i=1}^{n_k} (x_ik - M_k)^2

- The within-cluster variations after the initial random distribution of samples are

    e_1^2 = [(0 - 1.66)^2 + (2 - 0.66)^2] + [(0 - 1.66)^2 + (0 - 0.66)^2] + [(5 - 1.66)^2 + (0 - 0.66)^2] = 19.36

    e_2^2 = [(1.5 - 3.25)^2 + (0 - 1)^2] + [(5 - 3.25)^2 + (2 - 1)^2] = 8.12

Page 64: Data Mining

Total Square-error

- The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:

    E^2 = Σ_{k=1}^{K} e_k^2

- The total square-error is

    E^2 = e_1^2 + e_2^2 = 19.36 + 8.12 = 27.48

Page 65: Data Mining

- When we reassign all samples, depending on the minimum distance from the centroids M1 and M2, the new distribution of samples inside the clusters will be:

    d(M1, x1) = (1.66^2 + 1.34^2)^(1/2) = 2.14  and  d(M2, x1) = 3.40  ⇒  x1 ∈ C1
    d(M1, x2) = 1.79  and  d(M2, x2) = 3.40  ⇒  x2 ∈ C1
    d(M1, x3) = 0.83  and  d(M2, x3) = 2.01  ⇒  x3 ∈ C1
    d(M1, x4) = 3.41  and  d(M2, x4) = 2.01  ⇒  x4 ∈ C2
    d(M1, x5) = 3.60  and  d(M2, x5) = 2.01  ⇒  x5 ∈ C2

- The above calculation is based on the Euclidean distance formula:

    d(x_i, x_j) = ( Σ_{k=1}^{m} (x_ik - x_jk)^2 )^(1/2)

Page 66: Data Mining

- The new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
  - M1 = {0.5, 0.67}
  - M2 = {5.0, 1.0}
- The corresponding within-cluster variations and the total square-error are
  - e_1^2 = 4.17
  - e_2^2 = 2.00
  - E^2 = 6.17
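A small sketch (not from the slides) that reproduces the centroids, within-cluster variations and total square-errors of this worked example:

```python
def centroid(cluster):
    return tuple(sum(vals) / len(cluster) for vals in zip(*cluster))

def within_cluster_variation(cluster):
    m = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)

x1, x2, x3, x4, x5 = (0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)

for C1, C2 in [([x1, x2, x4], [x3, x5]),    # initial random partition
               ([x1, x2, x3], [x4, x5])]:   # after the reassignment
    e1, e2 = within_cluster_variation(C1), within_cluster_variation(C2)
    print([round(v, 2) for v in centroid(C1) + centroid(C2)],
          round(e1, 2), round(e2, 2), round(e1 + e2, 2))
# First partition:  e1^2 ≈ 19.33, e2^2 ≈ 8.13, E^2 ≈ 27.46 (the slides get 19.36 / 8.12 /
#                   27.48 by rounding the centroid to (1.66, 0.66) before computing).
# Second partition: e1^2 ≈ 4.17, e2^2 = 2.0, E^2 ≈ 6.17, as above.
```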

Page 67: Data Mining

The cluster membership stabilizes…

- After the first iteration, the total square-error is significantly reduced (from 27.48 to 6.17).
- In this example, if we analyze the distances between the new centroids and the samples, the second iteration assigns the samples to the same clusters.
- Thus there is no further reassignment and the algorithm halts.

Page 68: Data Mining

Variations of the K-Means Method

- A few variants of k-means differ in
  - the selection of the initial k means;
  - the strategies used to calculate cluster means.
- Handling categorical data: k-modes (Huang '98)
  - Replacing means of clusters with modes.
  - Using new dissimilarity measures to deal with categorical objects.
  - Using a frequency-based method to update modes of clusters.
  - A mixture of categorical and numerical data: the k-prototype method.

Page 69: Data Mining

What is the problem of the k-Means Method?

- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

[Figure: two scatter plots (axes 0 to 10) contrasting a mean-based cluster center with a medoid-based one.]

Page 70: Data Mining

Exercise 2

- Let the set X consist of the following sample points in 2-dimensional space:
  - X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)}
- Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.
- What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?

Page 71: Data Mining

Solution 2

- For each data point, calculate the distance to each centroid:

    point   x     y     d(xi, c1)   d(xi, c2)
    x1      1     2     0.707107    2.236068
    x2      1.5   2.2   0.3         1.920937
    x3      3     2.3   1.513275    1.3
    x4      2.5   -1    3.640055    2.061553
    x5      0     1.6   1.749286    3.059412
    x6      -1    1.5   2.692582    4.031129

Page 72: Data Mining

- It follows that x1, x2, x5 and x6 are closer to c1 and the other points are closer to c2. Hence replace c1 with the average of x1, x2, x5 and x6, and replace c2 with the average of x3 and x4. This gives:
- c1' = (0.375, 1.825)
- c2' = (2.75, 0.65)
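A small sketch (not from the slides) that carries out this single iteration and prints the revised centroids:

```python
from math import dist   # Euclidean distance, Python 3.8+

X = [(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)]
c1, c2 = (1.5, 2.5), (3, 1)

near_c1 = [p for p in X if dist(p, c1) <= dist(p, c2)]
near_c2 = [p for p in X if dist(p, c1) > dist(p, c2)]

def mean(points):
    return tuple(sum(vals) / len(points) for vals in zip(*points))

print(tuple(round(v, 3) for v in mean(near_c1)))   # (0.375, 1.825)
print(tuple(round(v, 3) for v in mean(near_c2)))   # (2.75, 0.65)
```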

Page 73: Data Mining


Association Rule Discovery

Page 74: Data Mining

Market-basket problem

- We are given a set of items and a large collection of transactions, which are subsets (baskets) of these items.
- Task: find relationships between the presence of various items within these baskets.
- Example: analyze customers' buying habits by finding associations between the different items that customers place in their shopping baskets.

Page 75: Data Mining

Associations discovery

- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Associations discovery uncovers affinities amongst a collection of items.
  - Affinities are represented by association rules.
  - Associations discovery is an unsupervised approach to data mining.

Page 76: Data Mining

Association Rule: Application 2

- Supermarket shelf management
  - Goal: identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So don't be surprised if you find six-packs stacked next to diapers!

Page 77: Data Mining

Association Rule: Application 3

- Inventory Management
  - Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  - Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Page 78: Data Mining

What is a rule?

- The rule in a rule induction system comes in the form "If this and this and this, then this".
- For a rule to be useful, two pieces of information are needed:
  1. Accuracy (the lower the accuracy, the closer the rule comes to random guessing)
  2. Coverage (how often the rule can be used)
- A rule consists of two parts:
  1. The antecedent, or LHS
  2. The consequent, or RHS

Page 79: Data Mining

An example

    Rule                                                                 Accuracy   Coverage
    If breakfast cereal purchased, then milk will be purchased           85%        20%
    If bread purchased, then Swiss cheese will be purchased              15%        6%
    If 42 years old and purchased dry roasted peanuts, beer purchased    95%        0.01%

Page 80: Data Mining


What is association rule mining?

Page 81: Data Mining


Frequent Itemset

Page 82: Data Mining


Support and Confidence

Page 83: Data Mining

What to do with a rule?

- Target the antecedent
- Target the consequent
- Target based on accuracy
- Target based on coverage
- Target based on "interestingness"

- The antecedent can be one or more conditions, all of which must be true in order for the consequent to be true at the given accuracy.
- Generally the consequent is just a simple condition (e.g. purchasing one grocery store item) rather than multiple items.

Page 84: Data Mining

- All rules that have a certain value for the antecedent are gathered and presented to the user.
  - For example, the grocery store may request all rules that have nails, bolts or screws in the antecedent, and try to conclude whether discontinuing sales of these lower-priced items will have any effect on the higher-margin items like hammers.
- All rules that have a certain value for the consequent are gathered. This can be used to understand what affects the consequent.
  - For instance, it might be useful to know which rules have "coffee" in their RHS. The store owner might want to put coffee close to other items in order to increase sales of both items, or a manufacturer may determine in which magazine to place its next coupons.

Page 85: Data Mining

- Sometimes accuracy is most important. Highly accurate rules (80 or 90%) imply strong relationships, even if the coverage is very low.
  - For example, a rule may apply only one time in 1000, but if applying it that one time is very profitable, it can still be worthwhile. This is how most successful data mining applications in the financial markets work: looking for that limited amount of time in which a very confident prediction can be made.
- Sometimes users want to know the rules that are most widely applicable. By looking at rules ranked by coverage, they can get a high-level view of what is happening in the database most of the time.
- Rules are interesting when they have high coverage and high accuracy but deviate from the norm. Eventually a trade-off between coverage and accuracy can be made using a measure of interestingness.

Page 86: Data Mining

Evaluating and using rules

- Look at simple statistics.
- Use conjunctions and disjunctions.
- Define "interestingness".
- Other heuristics.

Page 87: Data Mining

Using conjunctions and disjunctions

- This dramatically increases or decreases the coverage. For example, "if diet soda or regular soda or beer, then potato chips" covers a lot more shopping baskets than just one of the constraints by itself.

Page 88: Data Mining

Defining "interestingness"

- Interestingness must have four basic behaviors:
  1. Interestingness = 0 when the rule's accuracy equals the background accuracy (the a priori probability of the consequent); such rules are discarded.
  2. Interestingness increases as accuracy increases, if coverage is fixed.
  3. Interestingness increases or decreases with coverage, if accuracy stays fixed.
  4. Interestingness decreases with coverage for a fixed number of correct responses.

Page 89: Data Mining

Other Heuristics

- Look at the actual number of records covered, not just a probability or a percentage.
- Compare a given pattern to random chance; this gives an "out of the ordinary" measure.
- Keep it simple.

Page 90: Data Mining


Example

Here t supports items C, DM, and CO. The item DM is supported by 4 out of 6 transactions in T. Thus, the support of DM is 66.6%.

Page 91: Data Mining


Definition

Page 92: Data Mining

Association Rules

- Algorithms that obtain association rules from data usually divide the task into two parts:
  - find the frequent itemsets, and
  - form the rules from them.

Page 93: Data Mining

Association Rules

- The problem of mining association rules can be divided into two subproblems:

Page 94: Data Mining


Definitions

Page 95: Data Mining

a priori algorithm

- Agrawal and Srikant, 1994.
- It is also called the level-wise algorithm.
  - It is the most widely accepted algorithm for finding all the frequent sets.
  - It makes use of the downward closure property.
  - The algorithm is a bottom-up search, progressing upward level-wise in the lattice.
- The interesting fact:
  - before reading the database at every level, it prunes many of the sets which are unlikely to be frequent sets.

Page 96: Data Mining


a priori algorithm

Page 97: Data Mining


a priori candidate-generation method

Page 98: Data Mining


Pruning algorithm

Page 99: Data Mining


a priori Algorithm

Page 100: Data Mining

Exercise 3

Suppose that L3 is the list

    {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}

Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?

Page 101: Data Mining

Solution 3

- At the join step of the Apriori algorithm, each member (set) is compared with every other member.
- If all the elements of the two members are identical except the right-most ones, the union of the two sets is placed into C4.
- For the members of L3 given, the following sets of four elements are placed into C4:

    {a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t} and {p,q,s,t}

Page 102: Data Mining

Solution 3 (continued)

- At the prune step of the algorithm, each member of C4 is checked to see whether all its subsets of 3 elements are members of L3.
- The result in this case is as follows:

Page 103: Data Mining

Solution 3 (continued)

- Therefore {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and {p,q,s,t} are removed by the prune step,
- leaving C4 as {{a,b,c,d}, {p,q,r,s}}.
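A small sketch (not from the slides) of the join and prune steps applied to the L3 of Exercise 3:

```python
from itertools import combinations

L3 = [{'a','b','c'}, {'a','b','d'}, {'a','c','d'}, {'b','c','d'}, {'b','c','w'},
      {'b','c','x'}, {'p','q','r'}, {'p','q','s'}, {'p','q','t'}, {'p','r','s'},
      {'q','r','s'}]

# Join step: merge two 3-itemsets whose first two (sorted) items agree.
sorted_L3 = [sorted(s) for s in L3]
C4 = []
for i in range(len(sorted_L3)):
    for j in range(i + 1, len(sorted_L3)):
        a, b = sorted_L3[i], sorted_L3[j]
        if a[:-1] == b[:-1]:
            C4.append(set(a) | set(b))

# Prune step: keep a candidate only if every 3-item subset is in L3.
pruned = [c for c in C4 if all(set(s) in L3 for s in combinations(c, 3))]

print(C4)       # the 7 joined candidates
print(pruned)   # [{'a','b','c','d'}, {'p','q','r','s'}]
```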

Page 104: Data Mining

Exercise 4

- Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?

Page 105: Data Mining

Solution 4

- Let us assume that attribute w has three values w1, w2 and w3, and similarly for x, y and z.
- If we arbitrarily select attribute w to be on the right-hand side of each rule, there are 3 possible types of rule:
  - IF … THEN w = w1
  - IF … THEN w = w2
  - IF … THEN w = w3
- Now choose one of these rules, say the first, and calculate how many possible left-hand sides there are for such rules.

Page 106: Data Mining

Solution 4 (continued)

- The number of "attribute = value" terms on the LHS can be 1, 2, or 3.
- Case I: one term on the LHS
  - There are 3 possible attributes: x, y, and z. Each has 3 possible values, so there are 3x3 = 9 possible LHS, e.g. IF x = x1.

Page 107: Data Mining

Solution 4 (continued)

- Case II: two terms on the LHS
  - There are 3 ways in which a combination of 2 attributes may appear on the LHS: x and y, y and z, and x and z.
  - Each attribute has 3 values, so for each pair there are 3x3 = 9 possible LHS, e.g. IF x = x1 AND y = y1.
  - There are 3 possible pairs of attributes, so the total number of possible LHS is 3x9 = 27.

Page 108: Data Mining

Solution 4 (continued)

- Case III: three terms on the LHS
  - All 3 attributes x, y and z must be on the LHS.
  - Each has 3 values, so there are 3x3x3 = 27 possible LHS, e.g. IF x = x1 AND y = y1 AND z = z1.
- Thus, for each of the 3 possible "w = value" terms on the RHS, the total number of LHS with 1, 2 or 3 terms is 9 + 27 + 27 = 63.
- So there are 3x63 = 189 possible rules with attribute w on the RHS.
- The attribute on the RHS could be any of the four attributes (not just w). Therefore the total possible number of rules is 4x189 = 756.
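A small sketch (not from the slides) that confirms the count by enumerating the possible left-hand sides:

```python
from itertools import combinations

n_attrs, n_values = 4, 3             # w, x, y, z, each with three values

lhs_count = 0
for k in (1, 2, 3):                  # number of "attribute = value" terms on the LHS
    # choose k of the 3 remaining attributes, each taking any of its 3 values
    lhs_count += len(list(combinations(range(n_attrs - 1), k))) * n_values ** k

total = lhs_count * n_values * n_attrs   # 3 RHS values, RHS attribute can be any of the 4
print(lhs_count, total)                  # 63 756
```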

Page 109: Data Mining

References

- R. Akerkar and P. Lingras. Building an Intelligent Web: Theory & Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009).
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.