Information Systems Data Analysis – Association Mining Prof. Les Sztandera.

25
Information Systems Data Analysis – Association Mining Prof. Les Sztandera

Transcript of Information Systems Data Analysis – Association Mining Prof. Les Sztandera.

Information Systems

Data Analysis – Association Mining

Prof. Les Sztandera

Information Systems

Data Analysis - Association Mining

• Association rule mining:– Finding frequent patterns, associations, correlations, or

causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories.

• Applications:– Basket data analysis, cross-marketing, catalog design,

loss-leader analysis, clustering, classification.• Examples.

– Rule form: “Body ead [support, confidence]”.– buys(x, “diapers”) buys(x, “beer”) [0.5%, 60%]– major(x, “Business”) ^ takes(x, “MIS”) grade(x, “A”)

[1%, 75%]

Information Systems

Association Rule: Basic Concepts

• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

• Find: All rules that correlate the presence of one set of items with that of another set of items– E.g., 98% of people who purchase tires and auto accessories also

get automotive services done (confidence = 98%)

• Applications Maintenance Agreement (What the store should do to boost

Maintenance Agreement sales)– Home Electronics (What other products should the store stock

up?)– Attached mailing in direct marketing– Detecting “ping-ponging” of patients, faulty “collisions”

Information Systems

Loss-Leader Strategy (1)

• A business strategy in which a business offers a product or service at a price that is not profitable for the sake of offering another product/service at a greater profit or to attract new customers. This is a common practice when a business first enters a market.

• A loss leader introduces new customers to a service or product in the hope of building a customer base and securing future recurring revenue.    The loss leader strategy is more than just a nifty business trick - it is a successful strategy if executed properly.  

Information Systems

Loss-Leader Strategy (2)

• A classic example is that of razor blades. Companies like Gillette essentially give their razor units away for free, knowing that customers will have to buy their replacement blades, which is where the company makes all of its profit. 

• Another example is Microsoft's Xbox video game system, which was sold at a loss of more than $100 per unit to create more potential to profit from the sale of higher-margin video games. 

Information Systems

Cross-marketing (1)

• A cross-marketing or marketing cooperation is a partnership of at least two companies on the value chain level of marketing with the objective to tap the full potential of a market by bundling specific competences or resources.

Information Systems

Cross-marketing (2)

• An example of cross-marketing:

• Apple Inc. and Nike Inc. have formed a long term partnership to jointly develop and sell “Nike+iPod” products. The "Nike + iPod Sport Kit" links Nike+ products with Apples MP3-Player iPod nano, so that performance data such as distance, pace or burned calories can be displayed on the MP3-Player’s interface.

Information Systems

Classification and Clustering

• Classification – a process of finding models that describe classes or concepts, for the purpose of predicting a class of objects.

• Clustering - a process of finding models that describe clusters or concepts when the label of the class is unknown.

Information Systems

Basket Data Analysis (1)

Information Systems

Basket Analysis (2)

• Every extracted rule has Support and Confidence coefficients associated with it.

• Support (A => B) = (# of cases containing both A and B) / (total # of cases)

• Confidence (A => B) = (# of cases containing both A and B) / (# of cases containing A)

Information Systems

Rule Measures: Support and Confidence

Transaction ID Items Bought2000 A,B,C1000 A,C4000 A,D5000 B,E,F

Let the minimum support be 50%, and the minimum

confidence 50%, then we have

– A C (50%, 66.6%)– C A (50%, 100%)

Customerbuys diapers

Customerbuys both

Customerbuys beer

Information Systems

Association Rule Mining• Boolean vs. quantitative associations (Based on the types of

values handled)– buys(x, “SQLServer”) ^ buys(x, “DMBook”) buys(x, “DBMiner”) [0.2%,

60%]– age(x, “30..39”) ^ income(x, “42..48K”) buys(x, “PC”) [1%, 75%]

• Single dimension vs. multiple dimensional associations (see ex. above)

• Single level vs. multiple-level analysis– What brands of beers are associated with what brands of diapers?

• Various extensions– Correlation, causality analysis

• Association does not necessarily imply correlation or causality

– Maxpatterns and closed itemsets– Constraints enforced

• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Information Systems

Mining Association Rules

For rule A C:support = support({A C}) = 50%

confidence = support({A C})/support({A}) = 66.6%

The A priori principle:Any subset of a frequent itemset must be frequent

Transaction ID Items Bought2000 A,B,C1000 A,C4000 A,D5000 B,E,F

Frequent Itemset Support{A} 75%{B} 50%{C} 50%{A,C} 50%

Min. support 50%Min. confidence 50%

Information Systems

The a priori Algorithm — Example

TID Items100 1 3 4200 2 3 5300 1 2 3 5400 2 5

Database D itemset sup.{1} 2{2} 3{3} 3{4} 1{5} 3

itemset sup.{1} 2{2} 3{3} 3{5} 3

Scan D

C1L1

itemset{1 2}{1 3}{1 5}{2 3}{2 5}{3 5}

itemset sup{1 2} 1{1 3} 2{1 5} 1{2 3} 2{2 5} 3{3 5} 2

itemset sup{1 3} 2{2 3} 2{2 5} 3{3 5} 2

L2

C2 C2Scan D

C3 L3itemset{2 3 5}

Scan D itemset sup{2 3 5} 2

Information Systems

Is a priori Fast Enough? — Performance Bottlenecks

• The core of the A priori algorithm:– Use frequent (k – 1)-item sets to generate candidate frequent k-

item sets– Use database scan and pattern matching to collect counts for the

candidate item sets

• The bottleneck of Apriori: candidate generation– Huge candidate sets:

• 104 frequent 1-itemset will generate 107 candidate 2-item sets

• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2100 1030 candidates.

– Multiple scans of database: • Needs (n +1 ) scans, n is the length of the longest pattern

Information Systems

Presentation of Association Rules (Table Form )

Information Systems

Visualization of Association Rule Using Plane Graph

Information Systems

Visualization of Association Rule Using Rule Graph

Information Systems

Multiple-Level Association Rules

• Items often form hierarchy

• Items at the lower level are expected to have lower support.

• Rules regarding itemsets at appropriate levels could be quite useful.

• Transaction database can be encoded based on dimensions and levels

• We can explore shared multi-level mining

Food

breadmilk

skim

GiantAcme

2% whitewheat

TID ItemsT1 {111, 121, 211, 221}T2 {111, 211, 222, 323}T3 {112, 122, 221, 411}T4 {111, 121}T5 {111, 122, 211, 221, 413}

Information Systems

Mining Multi-Level Associations

• A top_down, progressive deepening approach:– First find high-level strong rules:

milk bread [20%, 60%].– Then find their lower-level “weaker” rules:

2% milk wheat bread [6%, 50%].

• Variations at mining multiple-level association rules.– Level-crossed association rules:

2% milk Wonder wheat bread

– Association rules with multiple, alternative hierarchies:

2% milk Wonder bread

Information Systems

Multi-level Association: Uniform Support vs. Reduced Support

• Uniform Support: the same minimum support for all levels– + One minimum support threshold. No need to examine itemsets

containing any item whose ancestors do not have minimum support.

– – Lower level items do not occur as frequently. If support threshold • too high miss low level associations• too low generate too many high level associations

• Reduced Support: reduced minimum support at lower levels– There are 4 search strategies:

• Level-by-level independent• Level-cross filtering by k-itemset• Level-cross filtering by single item• Controlled level-cross filtering by single item

Information Systems

Uniform Support

Multi-level mining with uniform support

Milk

[support = 10%]

2% Milk

[support = 6%]

Skim Milk

[support = 4%]

Level 1min_sup = 5%

Level 2min_sup = 5%

Back

Information Systems

Reduced Support

Multi-level mining with reduced support

2% Milk

[support = 6%]

Skim Milk

[support = 4%]

Level 1min_sup = 5%

Level 2min_sup = 3%

Back

Milk

[support = 10%]

Information Systems

Multi-level Association: Redundancy Filtering

• Some rules may be redundant due to “ancestor” relationships between items.

• Example– milk wheat bread [support = 8%, confidence = 70%]

– 2% milk wheat bread [support = 2%, confidence = 72%]

• We say the first rule is an ancestor of the second rule.

• A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor.

Information Systems

Criticism to Support and Confidence

• Example: Among 5000 students• 3000 play basketball• 3750 eat cereal• 2000 both play basket ball and eat cereal

– play basketball eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75% which is higher than 66.7%.

– play basketball not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

basketball not basketball sum(row)cereal 2000 1750 3750not cereal 1000 250 1250sum(col.) 3000 2000 5000