Data Mining: Introduction

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Data Mining: Introduction

Part of Lecture Notes for

Introduction to Data Miningby

Tan, Steinbach, Kumar


Lots of data is being collected and warehoused

– Web data, e-commerce

– purchases at department/grocery stores

– Bank/Credit Card transactions

Computers have become cheaper and more powerful

Competitive Pressure is Strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Commercial Viewpoint

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data

– scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data Data mining may help scientists

– in classifying and segmenting data

– in Hypothesis Formation


Mining Large Data Sets - Motivation

There is often information “hidden” in the data that is not readily evident

Human analysts may take weeks to discover useful information

Much of the data is never analyzed at all

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

1995 1996 1997 1998 1999

The Data Gap

Total new disk (TB) since 1995

Number of analysts

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”


What is Data Mining?

Many Definitions– Non-trivial extraction of implicit, previously

unknown and potentially useful information from data

– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”


Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional Techniquesmay be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

Origins of Data Mining

Machine Learning/Pattern

Recognition

Statistics/AI

Data Mining

Database systems


Data Mining Tasks

Prediction Methods

– Use some variables to predict unknown or future values of other variables.

Description Methods

– Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996


Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]


Classification: Definition

Given a collection of records (training set )– Each record contains a set of attributes, one of the

attributes is the class. Find a model for class attribute as a function

of the values of other attributes. Goal: previously unseen records should be

assigned a class as accurately as possible.– A test set is used to determine the accuracy of the

model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.


Classification Example

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

categoric

al

categoric

al

continuous

class

Refund MaritalStatus

TaxableIncome Cheat

No Single 75K ?

Yes Married 50K ?

No Married 150K ?

Yes Divorced 90K ?

No Single 40K ?

No Married 80K ?10

TestSet

Training Set

ModelLearn

Classifier


Examples of Classification Task

Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc


Classification: Application 1

Direct Marketing

– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

– Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which

decided otherwise. This {buy, don’t buy} decision forms the class attribute.

Collect various demographic, lifestyle, and company-interaction related information about all such customers.

– Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier

model.From [Berry & Linoff] Data Mining Techniques, 1997



Fraud Detection– Goal: Predict fraudulent cases in credit card

transactions.– Approach:

Use credit card transactions and the information on its account-holder as attributes.

– When does a customer buy, what does he buy, how often he pays on time, etc

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card

transactions on an account.



Customer Attrition/Churn:

– Goal: To predict whether a customer is likely to be lost to a competitor.

– Approach:Use detailed record of transactions with each of the

past and present customers, to find attributes.– How often the customer calls, where he calls, what time-of-the

day he calls most, his financial status, marital status, etc. Label the customers as loyal or disloyal.Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997



Sky Survey Cataloging

– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).

– 3000 images with 23,040 x 23,040 pixels per image.

– Approach: Segment the image. Measure image attributes (features) - 40 of them per object. Model the class based on these features. Success Story: Could find 16 new high red-shift quasars,

some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996


Classifying Galaxies

Early

Intermediate

Late

Data Size: • 72 million stars, 20 million galaxies• Object Catalog: 9 GB• Image Database: 150 GB

Class: • Stages of

Formation

Attributes:• Image features, • Characteristics of

light waves received, etc.

Courtesy: http://aps.umn.edu


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that– Data points in one cluster are more similar to

one another.– Data points in separate clusters are less

similar to one another. Similarity Measures:

– Euclidean Distance if attributes are continuous.

– Other Problem-specific Measures.


What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized


Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters Two Clusters

Six Clusters


Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering– A division data objects into non-overlapping subsets (clusters)

such that each data object is in exactly one subset

Hierarchical clustering– A set of nested clusters organized as a hierarchical tree


Partitional Clustering

Original Points A Partitional Clustering


Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2

p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram


Types of Clusters

Well-separated clusters

Center-based clusters

Contiguous clusters

Density-based clusters

Property or Conceptual

Described by an Objective Function


Types of Clusters: Well-Separated

Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is

closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters


Types of Clusters: Center-Based

Center-based– A cluster is a set of objects such that an object in a cluster is

closer (more similar) to the “center” of a cluster, than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters


Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)– A cluster is a set of points such that a point in a cluster is

closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters


Types of Clusters: Density-Based

Density-based– A cluster is a dense region of points, which is separated by

low-density regions, from other regions of high density.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters


Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters– Finds clusters that share some common property or represent

a particular concept.

.

2 Overlapping Circles


K-means Clustering

Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple


K-means Clustering – Details Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations.

– Often the stopping condition is changed to ‘Until relatively few points change clusters’

Complexity is O( n * K * I * d )– n = number of points, K = number of clusters,

I = number of iterations, d = number of attributes


Two different K-means Clusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Sub-optimal Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Optimal Clustering

Original Points


Importance of Choosing Initial Centroids

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6


Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)– For each point, the error is the distance to the nearest cluster– To get SSE, we square these errors and sum them.

– x is a data point in cluster Ci and mi is the representative point for cluster Ci can show that mi corresponds to the center (mean) of the cluster

– Given two clusters, we can choose the one with the smallest error

– One easy way to reduce SSE is to increase K, the number of clusters A good clustering with smaller K can have a lower SSE than a poor clustering with higher K

K

i Cxi

i

xmdistSSE1

2 ),(


Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5


Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

xy

Iteration 5


Problems with Selecting Initial Points

If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small. – Chance is relatively small when K is large

– If clusters are the same size, n, then

– For example, if K = 10, then probability = 10!/1010 = 0.00036

– Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t

– Consider an example of five pairs of clusters


10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters


10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters


10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4


10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4


Solutions to Initial Centroids Problem

Multiple runs– Helps, but probability is not on your side

Sample and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids– Select most widely separated

Postprocessing Bisecting K-means

– Not as susceptible to initialization issues



Produces a set of nested clusters organized as a hierarchical tree

Can be visualized as a dendrogram

– A tree like diagram that records the sequences of merges or splits

1 3 2 5 4 60

0.05

0.1

0.15

0.2

1

2

3

4

5

6

1

23 4

5


Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters– Any desired number of clusters can be

obtained by ‘cutting’ the dendogram at the proper level

They may correspond to meaningful taxonomies– Example in biological sciences (e.g., animal

kingdom, phylogeny reconstruction, …)



Two main types of hierarchical clustering– Agglomerative:

Start with the points as individual clusters At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

– Divisive: Start with one, all-inclusive cluster At each step, split a cluster until each cluster contains a point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix

– Merge or split one cluster at a time


Agglomerative Clustering Algorithm

More popular hierarchical clustering technique

Basic algorithm is straightforward1. Compute the proximity matrix

2. Let each data point be a cluster

3. Repeat

4. Merge the two closest clusters

5. Update the proximity matrix

6. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters

– Different approaches to defining the distance between clusters distinguish the different algorithms


Starting Situation

Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

. Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12


Intermediate Situation

After some merging steps, we have some clusters

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12


Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12


After Merging

The question is “How do we update the proximity matrix?”

C1

C4

C2 U C5

C3? ? ? ?

?

?

?

C2 U C5C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12


How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.

Similarity?

MIN MAX Group Average Distance Between Centroids Other methods driven by an objective

function– Ward’s Method uses squared error

Proximity Matrix



p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix




Graph-Based Clustering

Graph-Based clustering uses the proximity graph

– Start with the proximity matrix

– Consider each point as a node in a graph

– Each edge between two nodes has a weight which is the proximity between the two points

– Initially the proximity graph is fully connected

– MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph

In the simplest case, clusters are connected components in the graph.


MST: Divisive Hierarchical Clustering

Build MST (Minimum Spanning Tree)– Start with a tree that consists of any point

– In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not

– Add q to the tree and put an edge between p and q


MST: Divisive Hierarchical Clustering

Use MST for constructing hierarchy of clusters


Clustering: Application 1

Market Segmentation:– Goal: subdivide a market into distinct subsets of

customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

– Approach: Collect different attributes of customers based on their

geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns

of customers in same cluster vs. those from different clusters.


Clustering: Application 2

Document Clustering:

– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.


Illustrating Document Clustering

Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in

these documents (after some word filtering).

Category TotalArticles

CorrectlyPlaced

Financial 555 364

Foreign 341 260

National 273 36

Metro 943 746

Sports 738 573

Entertainment 354 278


Clustering of S&P 500 Stock Data

Discovered Clusters Industry Group

1Applied-Matl-DOW N,Bay-Network-Down,3-COM-DOWN,

Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,DSC-Comm-DOW N,INTEL-DOWN,LSI-Logic-DOWN,

Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOW N,

Sun-DOW N

Technology1-DOWN

2Apple-Comp-DOW N,Autodesk-DOWN,DEC-DOWN,

ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,Computer-Assoc-DOWN,Circuit-City-DOWN,

Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,Motorola-DOW N,Microsoft-DOWN,Scientific-Atl-DOWN

Technology2-DOWN

3Fannie-Mae-DOWN,Fed-Home-Loan-DOW N,MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

4Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,

Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,Schlumberger-UP

Oil-UP

Observe Stock Movements every day. Clustering points: Stock-{UP/DOWN} Similarity Measure: Two points are more similar if the

events described by them frequently happen together on the same day.

We used association rules to quantify a similarity measure.


Association Rule Mining

Given a set of transactions, i.e. a set of records each of which contain some number of items from a given collection, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} {Beer},{Milk, Bread} {Eggs,Coke},{Beer, Bread} {Milk},

Implication means co-occurrence, not causality!


Definition: Frequent Itemset

Itemset– A collection of one or more items

Example: {Milk, Bread, Diaper}

– k-itemset An itemset that contains k items

Support count ()– Frequency of occurrence of an itemset

– E.g. ({Milk, Bread,Diaper}) = 2

Support– Fraction of transactions that contain an

itemset

– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset– An itemset whose support is greater

than or equal to a minsup threshold

TID Items

1 Bread, Milk






Definition: Association Rule

Example:Beer}Diaper,Milk{

4.052

|T|)BeerDiaper,,Milk(

s

67.032

)Diaper,Milk()BeerDiaper,Milk,(

c

Association Rule– An implication expression of the form

X Y, where X and Y are itemsets

– Example: {Milk, Diaper} {Beer}

Rule Evaluation Metrics– Support (s)

Fraction of transactions that contain both X and Y

– Confidence (c) Measures how often items in Y

appear in transactions thatcontain X

TID Items

1 Bread, Milk






Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having – support ≥ minsup threshold

– confidence ≥ minconf threshold

Brute-force approach:– List all possible association rules

– Compute support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!


Mining Association Rules

Example of Rules:

{Milk,Diaper} {Beer} (s=0.4, c=0.67){Milk,Beer} {Diaper} (s=0.4, c=1.0){Diaper,Beer} {Milk} (s=0.4, c=0.67){Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk





Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements


Mining Association Rules

Two-step approach:

1. Frequent Itemset Generation– Generate all itemsets whose support minsup

2. Rule Generation– Generate high confidence rules from each

frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive


Frequent Itemset Generation

null

AB AC AD AE BC BD BE CD CE DE

A B C D E

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are 2d possible candidate itemsets


Frequent Itemset Generation

Brute-force approach:

– Each itemset in the lattice is a candidate frequent itemset

– Count the support of each candidate by scanning the database

– Match each transaction against every candidate

– Complexity ~ O(NMw) => Expensive since M = 2d !!!

TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

N

Transactions List ofCandidates

M

w


Computational Complexity

Given d unique items:– Total number of itemsets = 2d

– Total number of possible association rules:

123 1

1

1 1

dd

d

k

kd

j j

kd

k

dR

If d=6, R = 602 rules


Frequent Itemset Generation Strategies

Reduce the number of candidates (M)– Complete search: M=2d

– Use pruning techniques to reduce M

Reduce the number of transactions (N)– Reduce size of N as the size of itemset increases– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)– Use efficient data structures to store the candidates or

transactions– No need to match every candidate against every

transaction


Reducing Number of Candidates

Apriori principle:– If an itemset is frequent, then all of its subsets must also

be frequent

Apriori principle holds due to the following property of the support measure:

– Support of an itemset never exceeds the support of its subsets

– This is known as the anti-monotone property of support

)()()(:, YsXsYXYX


Found to be Infrequent

null


A B C D E



ABCDE

Illustrating Apriori Principle

null


A B C D E



ABCDEPruned supersets


Illustrating Apriori Principle

Item CountBread 4Coke 2Milk 4Beer 3Diaper 4Eggs 1

Itemset Count{Bread,Milk} 3{Bread,Beer} 2{Bread,Diaper} 3{Milk,Beer} 2{Milk,Diaper} 3{Beer,Diaper} 3

Itemset Count {Bread,Milk,Diaper} 3

Items (1-itemsets)

Pairs (2-itemsets)

(No need to generatecandidates involving Cokeor Eggs)

Triplets (3-itemsets)Minimum Support = 3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41

With support-based pruning,6 + 6 + 1 = 13


Apriori Algorithm

Method:

– Let k=1– Generate frequent itemsets of length 1– Repeat until no new frequent itemsets are identified

Generate length (k+1) candidate itemsets from length k frequent itemsets

Prune candidate itemsets containing subsets of length k that are infrequent

Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those

that are frequent


Rule Generation

Given a frequent itemset L, find all non-empty subsets f L such that f L – f satisfies the minimum confidence requirement

– If {A,B,C,D} is a frequent itemset, candidate rules:

ABC D, ABD C, ACD B, BCD A, A BCD, B ACD, C ABD, D ABCAB CD, AC BD, AD BC, BC AD, BD AC, CD AB,

If |L| = k, then there are 2k – 2 candidate association rules (ignoring L and L)


Rule Generation

How to efficiently generate rules from frequent itemsets?– In general, confidence does not have an anti-monotone

propertyc(ABC D) can be larger or smaller than c(AB

D)

– But confidence of rules generated from the same itemset has an anti-monotone property

– e.g., L = {A,B,C,D}:

c(ABC D) c(AB CD) c(A BCD) Confidence is anti-monotone w.r.t. number of items on the RHS of the rule


Rule Generation for Apriori Algorithm

ABCD=>{ }

BCD=>A ACD=>B ABD=>C ABC=>D

BC=>ADBD=>ACCD=>AB AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD

Lattice of rulesABCD=>{ }

BCD=>A ACD=>B ABD=>C ABC=>D

BC=>ADBD=>ACCD=>AB AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD

Pruned Rules

Low Confidence Rule


Rule Generation for Apriori Algorithm

Candidate rule is generated by merging two rules that share the same prefixin the rule consequent

join(CD=>AB,BD=>AC)would produce the candidaterule D => ABC

Prune rule D=>ABC if itssubset AD=>BC does not havehigh confidence

BD=>ACCD=>AB

D=>ABC


Association Rule Discovery: Application 1

Marketing and Sales Promotion:– Let the rule discovered be {Bagels, … } --> {Potato Chips}– Potato Chips as consequent => Can be used to

determine what should be done to boost its sales.– Bagels in the antecedent => Can be used to see

which products would be affected if the store discontinues selling bagels.

– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!



Supermarket shelf management.

– Goal: To identify items that are bought together by sufficiently many customers.

– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

– A classic rule --If a customer buys diaper and milk, then he is very

likely to buy beer.So, don’t be surprised if you find six-packs stacked

next to diapers!



Inventory Management:

– Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.

– Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


Sequential Pattern Discovery: Definition

Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints.

(A B) (C) (D E)

<= ms

<= xg >ng <= ws

(A B) (C) (D E)


Sequential Pattern Discovery: Examples

In telecommunications alarm logs,

– (Inverter_Problem Excessive_Line_Current)

(Rectifier_Alarm) --> (Fire_Alarm) In point-of-sale transaction sequences,

– Computer Bookstore:

(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)

– Athletic Apparel Store:

(Shoes) (Racket, Racketball) --> (Sports_Jacket)


Regression

Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Greatly studied in statistics, neural network fields. Examples:

– Predicting sales amounts of new product based on advetising expenditure.

– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

– Time series prediction of stock market indices.


Deviation/Anomaly Detection

Detect significant deviations from normal behavior Applications:

– Credit Card Fraud Detection

– Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day


Challenges of Data Mining

Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data

Data Mining: Introduction

Documents

Transcript of Data Mining: Introduction