
CSI 3117/3517 Data Mining

S. Matwin, 1999

Data Mining

• What is data mining?
• Motivating example
• Why now?
• Technological foundations
• Tasks
• Architectures and processes
  • data warehouse, data mart
  • middleware
  • OLAP
• Conclusion

Definition

Technology that finds implicit, unexpected relationships in the data

the K-mart example

Why now?

• Bar codes
• networks/connectivity
• IT-maturity of management

Technological foundations

• Databases
• machine learning
• visualization
• statistics

Tasks

• Associations / market basket analysis (MBA)
• estimation
• classification
• clustering
• ...

Associations

Given:
I = {i_1, …, i_m}, a set of items
D, a set of transactions (a database); each transaction T is a set of items, T ⊆ I

Association rule: X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅
confidence c: ratio of the # of transactions that contain both X and Y to the # of transactions that contain X
support s: ratio of the # of transactions that contain both X and Y to the # of transactions in D
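
The two measures can be illustrated with a minimal Python sketch. The toy database `D` and the function names `support` and `confidence` are assumptions made for illustration; they do not come from the slides.

```python
def support(itemset, transactions):
    """Fraction of all transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Among the transactions that contain X, the fraction that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y <= t) / len(with_x)


# Hypothetical toy database D (not from the slides).
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
]

s = support({"diapers", "beer"}, D)        # support of the rule diapers => beer
c = confidence({"diapers"}, {"beer"}, D)   # confidence of the rule diapers => beer
print(s, c)
```

Note that the support of a rule X ⇒ Y is the support of the itemset X ∪ Y, which is why `support` takes a single itemset.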

Associations - mining

Given D, generate all association rules whose confidence c and support s exceed the thresholds minc and mins

(items are ordered, e.g. by barcode)

Idea: find all itemsets with transaction support > mins; these are the large itemsets

Associations - mining

To do that: start with the individual items that have large support.

In each subsequent step k:
• use the large itemsets from step k-1 to generate the new candidate itemsets C_k
• count the support of the candidates in C_k (for each transaction t, count the candidates in C_k that are contained in t)
• prune the candidates that are not large
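
The level-wise search can be sketched in Python as follows. This is an illustrative sketch rather than a tuned implementation: the function name `large_itemsets`, the toy database `D`, and the threshold 0.4 are assumptions, and candidates are generated by a naive pairwise union instead of an ordered join.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all large itemsets."""
    n = len(transactions)

    def count_and_prune(candidates):
        # For each transaction t, count the candidates contained in t,
        # then keep only the candidates whose support exceeds the threshold.
        hits = dict.fromkeys(candidates, 0)
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    hits[cand] += 1
        return {c: h / n for c, h in hits.items() if h / n > min_support}

    # Step 1: individual items with large support.
    items = {frozenset([i]) for t in transactions for i in t}
    current = count_and_prune(items)
    result = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Generate size-k candidates from the large (k-1)-itemsets ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... and drop any candidate with a (k-1)-subset that is not large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count_and_prune(candidates)
        result.update(current)
        k += 1
    return result


# Hypothetical toy database (not from the slides).
D = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1, 2, 4}), frozenset({3, 4})]
large = large_itemsets(D, min_support=0.4)
print(sorted(tuple(sorted(s)) for s in large))
```

The search stops at the first level that yields no large itemset, because every superset of a non-large itemset is also non-large.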

Associations - mining

Only keep those that are contained in some transaction

Candidate generation

C_k = apriori-gen(L_{k-1})

Example

L_3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
C_4 = {{1 2 3 4}, {1 3 4 5}} after the join step
Pruning deletes {1 3 4 5} because {1 4 5} is not in L_3.

See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details
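
The example can be reproduced with a short Python sketch of candidate generation as a join step followed by a prune step. The name `apriori_gen` mirrors apriori-gen from the slide, but the body is an illustrative reconstruction that assumes itemsets are kept as sorted tuples; it is not code from the referenced papers.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Candidate k-itemsets built from the large (k-1)-itemsets."""
    large_prev = sorted(tuple(sorted(s)) for s in large_prev)
    known = set(large_prev)
    # Join step: merge two itemsets that agree on all but their last item.
    joined = [
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Prune step: every (k-1)-subset of a candidate must itself be large.
    pruned = [
        c for c in joined
        if all(sub in known for sub in combinations(c, len(c) - 1))
    ]
    return joined, pruned


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, C4 = apriori_gen(L3)
print(joined)  # candidates after the join step
print(C4)      # candidates that survive pruning
```

Relying on the item order (e.g. by barcode) means each candidate is produced by exactly one pair of itemsets, so the join creates no duplicates.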

Classification (supervised learning)

Given:
• a concept with k classes C_1, …, C_k (but the definition of the concept is NOT known)
• a set of training instances T = {⟨e, t⟩}, where e is an example and t is its class label: one of the classes C_1, …, C_k

Find:
• a description for each class which will perform well in determining (predicting) class membership for unseen instances

Decision Trees

A decision tree as a concept representation:

[Figure: decision tree. The root tests wage incr. 1st yr (≤2.5 / >2.5); the other decision nodes test working hrs (≤36 / >36), statutory holidays (>10 / ≤10), contribution to hlth plan, and wage incr. 1st yr again (≤4 / >4). Each leaf is labelled good or bad.]

Building decision trees

Building a univariate (a single attribute is tested at each node) decision tree from a set T of training cases for a concept C with classes C_1, …, C_k

Consider three possibilities:

• T contains one or more cases, all belonging to the same class C_j. The decision tree for T is a leaf identifying class C_j.

• T contains no cases. The tree is a leaf, but the label is assigned heuristically, e.g. the majority class in the parent of this node.

• T contains cases from different classes. T is divided into subsets that seem to lead towards single-class collections of cases. A test t based on a single attribute is chosen, and it partitions T into subsets {T_1, …, T_n}. The decision tree consists of a decision node identifying the tested attribute, and one branch for each outcome of the test. Then the same process is applied recursively to each T_i.
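
The three possibilities map directly onto a recursive procedure. The Python sketch below assumes categorical attributes with known value lists, and it leaves the choice of test to a caller-supplied function `choose_test`. All names, the nested-dict tree representation, and the toy data are hypothetical. Stopping when no attribute is left to test is an added assumption needed for termination.

```python
from collections import Counter


def build_tree(cases, attributes, choose_test, parent_majority=None):
    """cases: list of (example_dict, class_label) pairs.

    attributes: dict mapping each attribute to its list of possible values.
    choose_test: function (cases, attribute_names) -> attribute to test.
    """
    # T contains no cases: a leaf labelled with the parent's majority class.
    if not cases:
        return {"leaf": parent_majority}
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    # All cases in one class: a leaf identifying that class.
    # (Assumption: also stop when no attribute is left to test.)
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": majority}
    # Mixed classes: choose a single-attribute test and partition T.
    attr = choose_test(cases, list(attributes))
    remaining = {a: v for a, v in attributes.items() if a != attr}
    branches = {}
    for value in attributes[attr]:
        subset = [(e, c) for e, c in cases if e[attr] == value]
        branches[value] = build_tree(subset, remaining, choose_test, majority)
    return {"test": attr, "branches": branches}


def classify(tree, example):
    """Follow the branches matching the example until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["branches"][example[tree["test"]]]
    return tree["leaf"]


# Hypothetical training set; the placeholder test just takes the first attribute.
attributes = {"outlook": ["sunny", "rain", "overcast"], "windy": [True, False]}
T = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True}, "stay"),
    ({"outlook": "rain", "windy": True}, "stay"),
    ({"outlook": "rain", "windy": False}, "stay"),
]
tree = build_tree(T, attributes, lambda cases, names: names[0])
print(classify(tree, {"outlook": "sunny", "windy": False}))
```

No training case has outlook "overcast", so that branch exercises the empty-T case and receives the majority class of its parent.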

Choosing the test

• Why not explore all possible trees and choose the simplest (Occam’s razor)? Because this is an NP-complete problem. E.g. in the ‘union’ example there are millions of trees consistent with the data.

• Idea: maximize the difference between the information needed to identify the class of an example in T and the same information after T has been partitioned in accordance with a test X.
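
One common way to make this idea concrete is the entropy-based gain criterion. The sketch below assumes that reading of "info" (average number of bits needed to identify a class); the function names and the four-case toy set are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Average information (in bits) needed to identify the class of a case."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def gain(cases, attr):
    """info(T) minus the weighted info of the subsets produced by testing attr."""
    labels = [c for _, c in cases]
    partitions = {}
    for example, c in cases:
        partitions.setdefault(example[attr], []).append(c)
    info_after = sum(len(p) / len(cases) * info(p) for p in partitions.values())
    return info(labels) - info_after


# Hypothetical cases: attribute "a" separates the classes, attribute "b" does not.
T = [
    ({"a": 0, "b": 0}, "good"),
    ({"a": 0, "b": 1}, "good"),
    ({"a": 1, "b": 0}, "bad"),
    ({"a": 1, "b": 1}, "bad"),
]
best = max(["a", "b"], key=lambda attr: gain(T, attr))
print(best, gain(T, "a"), gain(T, "b"))
```

Choosing the test with the largest gain is greedy: it optimizes one node at a time instead of searching the space of all trees.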

Evaluation of the mining results

• Predictive accuracy
• interpretability
• lift charts and ROI

Predictive accuracy

• How can we predict the error rate? Either put aside part of the training set for that purpose, or
• apply cross-validation: divide the training data into C equal-sized blocks; for each block, a tree is constructed from the training examples in the C-1 remaining blocks and tested on the ‘reserved’ block.
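
Cross-validation can be sketched in Python as below. To keep the example self-contained, a stand-in learner that always predicts the majority class takes the place of tree construction; that learner, the function names, and the toy data are assumptions. The sketch also skips the usual random shuffle before blocking so that the run is deterministic.

```python
from collections import Counter


def cross_validate(cases, n_blocks, train, predict):
    """Estimate the error rate by cross-validation over n_blocks blocks.

    cases: list of (example, label); train(cases) -> model;
    predict(model, example) -> label.
    """
    # Block i takes every n_blocks-th case, so block sizes differ by at most one.
    blocks = [cases[i::n_blocks] for i in range(n_blocks)]
    errors = 0
    for i, reserved in enumerate(blocks):
        # Train on the other blocks, test on the reserved one.
        training = [c for j, b in enumerate(blocks) if j != i for c in b]
        model = train(training)
        errors += sum(1 for e, label in reserved if predict(model, e) != label)
    return errors / len(cases)


# Stand-in learner (hypothetical): always predict the majority training class.
def train_majority(cases):
    return Counter(label for _, label in cases).most_common(1)[0][0]


def predict_majority(model, example):
    return model


data = [(i, "good") for i in range(8)] + [(i, "bad") for i in range(2)]
error_rate = cross_validate(data, 5, train_majority, predict_majority)
print(error_rate)
```

Every case is tested exactly once, by a model that never saw it during training, which is what makes the pooled error an estimate for unseen instances.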

Lift chart

• population: 100
• response rate: 5%
• contacting the 10 best chances, we obtain 20% of the 5% who respond, so 1 person. Without a model, 0.5 person. The lift is 2.

Oftentimes, cost has to be taken into account for samples of small and large size
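
The arithmetic of the example can be written out as a small Python function. The function and parameter names are hypothetical; the numbers passed in are the ones from the slide.

```python
def lift(population, response_rate, contacted, share_captured):
    """Lift of a model over random contact, for one point of a lift chart."""
    responders = population * response_rate
    with_model = share_captured * responders    # responders among those contacted
    without_model = contacted * response_rate   # expected for a random sample
    return with_model / without_model


example_lift = lift(population=100, response_rate=0.05, contacted=10, share_captured=0.20)
print(example_lift)
```

With these inputs the model reaches 1 responder where a random sample of 10 would reach 0.5 on average, matching the lift of 2 stated above. A full lift chart repeats this calculation for increasing numbers of contacts.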

Architectures

• data warehouse
• metadata
• middleware
• data mart
• data cube

Architecture - defs

• data warehouse: several heterogeneous databases that contain data relevant to a given problem (e.g. transactions, customer info, …)

• metadata = data about the data. Describes the hierarchy of attributes and the logical organization of the data (e.g. customer data consists of the number, name, accounts, …; accounts is …)
  • the database schema is an example of metadata
  • metadata describes the data in the data warehouse from the business perspective

Architecture - defs

• middleware: software protocol providing a single interface to a distributed data warehouse (DW), e.g. standard APIs such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC)
  • problems (efficiency) when querying

• multitiered approach:

• datamarts: the data warehouse needed for a given department

Processes: OLAP (On-Line Analytical Processing)

• from de-normalized data (source system, e.g. transactions) to a star topology
• analyzing the reports that are likely to be needed
• the star and the reports define the dimensions
• the dimensions define the cube

Example

moviegoers database (de-normalized):

name   sex  age  source          movie name
Amy    f    27   Oberlin         Independence day
Andy   m    34   Oberlin         The Birdcage
Bob    m    51   Pinewoods       Schindler’s list
Cathy  f    39   124 Mt. Auburn  The Birdcage
Curt   m    30   MRJ             Judgement day
David  m    40   MRJ             Independence day
Erica  f    23   124 Mt. Auburn  Trainspotting

central fact table

Person  Source  Movie
1       1       5
2       1       12
3       2       4
4       4       12
5       3       19
6       3       5
7       4       31

dimension tables

Source  Name
1       Oberlin
2       Pinewoods
3       MRJ
4       124 Mt. Auburn

Person  Gender
1       F
2       M
3       M
4       F
5       M
6       M
7       F

Person  Age
1       27
2       34
3       51
4       39
5       30
6       40
7       23

Person  Name
1       Amy
2       Andy
3       Bob
4       Cathy
5       Curt
6       David
7       Erica

Movie  Name
4      Schindler’s list
5      Independence day
12     The Birdcage
19     Judgement day
31     Trainspotting

Typical reports

• # of times each movie was seen, for movies seen > 5 times
• for what movies is the average age of viewers > 30?
• the # of people and their ages by source
• the # of people from each source by gender
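
These reports can be sketched in Python as simple aggregations over the de-normalized moviegoers records. The function names are hypothetical. Because the example has only seven records, the usage line applies a threshold of 1 instead of 5 to the first report.

```python
from collections import Counter, defaultdict

# The moviegoers example: (name, sex, age, source, movie).
rows = [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]


def times_seen(rows, more_than):
    """# of times each movie was seen, for movies seen more than `more_than` times."""
    counts = Counter(movie for _name, _sex, _age, _source, movie in rows)
    return {m: n for m, n in counts.items() if n > more_than}


def movies_with_avg_age_over(rows, limit):
    """Movies whose viewers have an average age above `limit`."""
    ages = defaultdict(list)
    for _name, _sex, age, _source, movie in rows:
        ages[movie].append(age)
    return sorted(m for m, a in ages.items() if sum(a) / len(a) > limit)


def ages_by_source(rows):
    """The ages (and hence the # of people) for each source."""
    result = defaultdict(list)
    for _name, _sex, age, source, _movie in rows:
        result[source].append(age)
    return dict(result)


def people_by_source_and_sex(rows):
    """# of people from each source, by gender."""
    return Counter((source, sex) for _name, sex, _age, source, _movie in rows)


print(times_seen(rows, more_than=1))
print(movies_with_avg_age_over(rows, 30))
print(ages_by_source(rows))
print(people_by_source_and_sex(rows))
```

Each report groups by one or two attributes and aggregates the rest, which is why the attributes used for grouping are natural candidates for dimensions.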

Cube

• is formed by representing the whole database (denormalized) by the dimensions

• the size of the cube does not depend on the number of people

• the cube has subcubes, each containing the “key” info that identifies it plus the summary aggregate info

• cube = MDD (multi-dimensional database)
• real cubes have more than three dimensions
• each record belongs to exactly one subcube
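
A minimal Python sketch of these properties, using the moviegoers records, is given below. Representing the cube as a dictionary keyed by dimension values, the choice of sex and source as dimensions, and the aggregates kept per subcube are all assumptions made for illustration.

```python
from collections import defaultdict

FIELDS = ("name", "sex", "age", "source", "movie")
rows = [dict(zip(FIELDS, r)) for r in [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]]


def build_cube(rows, dimensions):
    """Group records into subcubes keyed by their values on the dimensions."""
    cube = defaultdict(lambda: {"count": 0, "avg_age": 0.0, "records": []})
    for row in rows:
        # The key is determined by the record, so it lands in exactly one subcube.
        cell = cube[tuple(row[d] for d in dimensions)]
        cell["records"].append(row)   # kept so the underlying records can be inspected
        cell["count"] = len(cell["records"])
        cell["avg_age"] = sum(r["age"] for r in cell["records"]) / cell["count"]
    return dict(cube)


cube = build_cube(rows, ("sex", "source"))
cell = cube[("m", "MRJ")]
print(len(cube), cell["count"], cell["avg_age"])
print([r["name"] for r in cell["records"]])
```

Adding another male moviegoer from MRJ would change the aggregates of that subcube but not the number of subcubes, which illustrates why the size of the cube depends on the dimensions and not on the number of people.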

Tasks

• drilling: looking inside a subcube at the records (of the original database) that are represented in that subcube

• churning/attrition: losing customers. Can be cast as a classification problem on historical data (two classes: customers who have churned and those who have not); then a classification system (e.g. decision tree induction) can induce the classifiers

• fraud detection: learning regular patterns, watching for discrepancies

Data mining - conclusion

• Treats historical data as an organizational asset, rather than a burden
• tries to
  • find out the unknown
  • predict the unknown
• applies to
  • marketing
  • internet mining
  • E-commerce
  • ...