Post on 28-Sep-2020
CSI 3117/3517 Data Mining
S. Matwin, 1999
Data Mining
• What is data mining?
• Motivating example
• Why now?
• Technological foundations
• Tasks
• Architectures and processes
• data warehouse, data mart
• middleware
• OLAP
• Conclusion
Definition
Technology that finds implicit, unexpected relationships in the data
Motivating example: the K-mart example
Why now?
• Bar codes
• networks/connectivity
• IT-maturity of management
Technological foundations
• Databases
• machine learning
• visualization
• statistics
Tasks
• Associations/MBA (market basket analysis)
• estimation
• classification
• clustering
• ...
Associations
Given:
I = {i1, …, im}, a set of items
D, a set of transactions (a database); each transaction T is a set of items, T ⊆ I
Association rule: X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅
confidence c: ratio of # of transactions that contain both X and Y to # of all transactions that contain X
support s: ratio of # of transactions that contain both X and Y to # of transactions in D
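A minimal sketch of the two ratios for a rule X ⇒ Y. Support is computed over all of D. Confidence is computed among the transactions that contain X, counting those that also contain Y. The toy database and its item names are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0
    return sum(1 for t in containing_x if y <= t) / len(containing_x)


# Hypothetical toy database D (item names are invented for this sketch).
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
)]

s = support({"diapers", "beer"}, D)       # 3 of the 5 transactions
c = confidence({"diapers"}, {"beer"}, D)  # 3 of the 4 transactions with diapers
```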
Associations - mining
Given D, generate all association rules with c and s above the thresholds minc and mins
(items are ordered, e.g. by barcode)
Idea: find all itemsets that have transaction support > mins; these are the large itemsets
Associations - mining
To do that: start with individual items with large support.
In each subsequent step k:
• use the large itemsets Lk-1 from step k-1 to generate the new candidate itemsets Ck,
• count the support of the candidates in Ck (for each transaction t, count the candidates that are contained in t),
• prune the candidates that are not large
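The level-wise search can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: candidates are generated naively, as unions of two large (k-1)-itemsets that have exactly k items, and the function name and toy database are hypothetical.

```python
from collections import Counter


def large_itemsets(transactions, mins):
    """Level-wise search for itemsets whose support exceeds `mins`.

    Returns a dict mapping each large itemset (a frozenset) to its support.
    """
    n = len(transactions)
    # Step 1: individual items with large support.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    large = {s: c / n for s, c in counts.items() if c / n > mins}
    result = dict(large)
    k = 2
    while large:
        # Naive candidate generation (an assumption of this sketch):
        # unions of two large (k-1)-itemsets that have exactly k items.
        prev = list(large)
        candidates = {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
                      if len(a | b) == k}
        # Count support: for each transaction t, count the candidates in t.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        # Prune the candidates that are not large.
        large = {c: cnt / n for c, cnt in counts.items() if cnt / n > mins}
        result.update(large)
        k += 1
    return result


# Hypothetical toy database: four transactions over items 1..5.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = large_itemsets(D, mins=0.4)
```

With mins = 0.4, an itemset must occur in at least two of the four transactions to be large. The search stops as soon as a step produces no large itemsets.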
Associations - mining
Only keep those that are contained in some transaction
Candidate generation
Ck = apriori-gen(Lk-1)
Example
L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
C4 = {{1 2 3 4}, {1 3 4 5}}
Pruning deletes {1 3 4 5} because {1 4 5} is not in L3.
See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details
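A minimal sketch of candidate generation on this example, assuming the usual join-then-prune formulation. Two (k-1)-itemsets that agree on all but their last item are joined, and a candidate is then dropped if any of its (k-1)-subsets is not large. Itemsets are represented as sorted tuples, which is a choice made for this sketch.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a collection of sorted tuples of equal length k-1.
    """
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    # Join step: merge two itemsets that agree on their first k-2 items.
    joined = {p + (q[-1],)
              for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: every (k-1)-subset of a candidate must itself be large.
    return {c for c in joined
            if all(sub in large_prev for sub in combinations(c, k - 1))}


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)  # the join yields two candidates; pruning drops (1, 3, 4, 5)
```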
Classification (supervised learning)
Given:
• a concept with k classes C1, …, Ck (but the definition of the concept is NOT known)
• a set of training instances T = {⟨e, t⟩}, where each e is an example and each t is a class label: one of the classes C1, …, Ck
Find:
• a description for each class which will perform well in determining (predicting) class membership for unseen instances
Decision Trees
A decision tree as a concept representation:
[Figure: decision tree whose internal nodes test wage incr. 1st yr, working hrs, statutory holidays, and contribution to hlth plan; branches are labeled ≤2.5 / >2.5, ≤36 / >36, >10 / ≤10, and ≤4 / >4; leaves are labeled good or bad.]
Building decision trees
Building a univariate decision tree (a single attribute is tested at each node) from a set T of training cases for a concept C with classes C1, …, Ck.
Consider three possibilities:
• T contains 1 or more cases, all belonging to the same class Cj. The decision tree for T is a leaf identifying class Cj.
• T contains no cases. The tree is a leaf, but the label is assigned heuristically, e.g. the majority class in the parent of this node.
• T contains cases from different classes. T is divided into subsets that seem to lead towards single-class collections of cases. A test t based on a single attribute is chosen, and it partitions T into subsets {T1, …, Tn}. The decision tree consists of a decision node identifying the tested attribute, and one branch for each outcome of the test. Then the same process is applied recursively to each Ti.
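The three cases can be sketched as a recursive procedure. This is an illustration only: the representation of a tree as nested tuples, the `choose_test` parameter, and the toy data are assumptions of the sketch, and test selection is left to the caller.

```python
from collections import Counter


def build_tree(cases, attributes, choose_test, parent_majority=None):
    """Recursively build a univariate decision tree.

    `cases` is a list of (example, label) pairs, where each example is a dict
    mapping attribute names to discrete values.
    """
    if not cases:
        # No cases: a leaf labeled heuristically with the parent's majority.
        return parent_majority
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        # All cases in one class (or nothing left to test): a leaf.
        return majority
    # Mixed classes: choose a single-attribute test and partition T.
    attr = choose_test(cases, attributes)
    remaining = [a for a in attributes if a != attr]
    subsets = {}
    for example, label in cases:
        subsets.setdefault(example[attr], []).append((example, label))
    # A decision node: the tested attribute and one subtree per outcome.
    return (attr, {value: build_tree(subset, remaining, choose_test, majority)
                   for value, subset in subsets.items()})


def first_attribute(cases, attributes):
    # Placeholder test selection (hypothetical): take the first attribute.
    return attributes[0]


# Hypothetical training set.
T = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(T, ["outlook", "windy"], first_attribute)
```

Because this sketch only branches on attribute values that actually occur in T, the empty-T case arises only when the function is called with no cases at all.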
Choosing the test
• Why not explore all possible trees and choose the simplest (Occam’s razor)? But this is an NP-complete problem. E.g. in the ‘union’ example there are millions of trees consistent with the data.
• Idea: maximize the difference between the info needed to identify the class of an example in T, and the same info after T has been partitioned in accordance with a test X
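A minimal sketch of this criterion, assuming that "info" is measured as Shannon entropy in bits, as in the usual information-gain formulation. The function names and the toy training set are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Average information (in bits) needed to identify the class of a case."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(cases, attr):
    """info(T) minus the weighted info of the subsets produced by testing `attr`."""
    labels = [label for _, label in cases]
    subsets = {}
    for example, label in cases:
        subsets.setdefault(example[attr], []).append(label)
    info_after = sum(len(s) / len(cases) * info(s) for s in subsets.values())
    return info(labels) - info_after


# Hypothetical training set: `outlook` separates the classes, `windy` does not.
T = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
best = max(["outlook", "windy"], key=lambda a: gain(T, a))
```

Here the test on `outlook` removes all uncertainty about the class, while the test on `windy` removes none, so `outlook` is chosen.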
Evaluation of the mining results
• Predictive accuracy
• interpretability
• lift charts and ROI
Predictive accuracy
• How can we predict the error rate? Either put aside part of the training set for that purpose, or
• apply cross-validation: divide the training data into C equal-sized blocks; for each block, a tree is constructed from the training examples in the C-1 remaining blocks and tested on the ‘reserved’ block
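Cross-validation can be sketched as below. The `train` and `predict` callables stand in for any learner, such as decision tree induction. They are hypothetical names, as is the trivial majority-class learner used to exercise the function.

```python
from collections import Counter


def cross_validation_error(cases, C, train, predict):
    """Estimate the error rate by C-fold cross-validation.

    `train(cases)` returns a model and `predict(model, example)` returns a
    label; both are supplied by the caller.
    """
    # Block i takes every C-th case, so block sizes differ by at most one.
    blocks = [cases[i::C] for i in range(C)]
    errors = 0
    for i, reserved in enumerate(blocks):
        training = [case for j, block in enumerate(blocks) if j != i
                    for case in block]
        model = train(training)
        errors += sum(1 for example, label in reserved
                      if predict(model, example) != label)
    return errors / len(cases)


def train_majority(cases):
    # Stand-in learner: remember the most frequent class.
    return Counter(label for _, label in cases).most_common(1)[0][0]


def predict_majority(model, example):
    return model


# Hypothetical data: 8 cases labeled "good" and 2 labeled "bad".
cases = [(i, "good") for i in range(8)] + [(i, "bad") for i in range(2)]
rate = cross_validation_error(cases, 5, train_majority, predict_majority)
```

Every case is tested exactly once, by a model that never saw it during training.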
Lift chart
population 100
5% response rate
Contacting the 10 best chances, we obtain 20% of the 5% who respond, so 1 person. Without a model, 0.5 person. The lift is 2.
Oftentimes, cost has to be taken into account for samples of small and large size
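The arithmetic of the lift example, written out as a small function. The function and parameter names are hypothetical. The numbers are the ones from the slide.

```python
def lift(population, response_rate, contacted, captured_fraction):
    """Responders reached with the model divided by responders reached at random."""
    responders = population * response_rate
    with_model = captured_fraction * responders
    without_model = contacted / population * responders
    return with_model / without_model


value = lift(population=100, response_rate=0.05, contacted=10,
             captured_fraction=0.20)
```

The ratio does not depend on the response rate itself: it compares the share of responders captured (20%) with the share of the population contacted (10%).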
Architectures
• data warehouse
• metadata
• middleware
• data mart
• data cube
Architecture - defs
• data warehouse: several heterogeneous databases that contain data relevant to a given problem (e.g. transactions, customer info, …)
• metadata = data about the data. Describes the hierarchy of attributes and the logical organization of the data (e.g. customer data consists of the number, name, accounts, …; accounts is …)
  the database scheme is an example of metadata
  metadata describes the data in the data warehouse from the business perspective
Architecture - defs
• middleware: software protocol providing a single interface to a distributed DW, e.g. standard APIs such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC)
  problems (efficiency) when querying
• multitiered approach:
• datamarts: data warehouse needed for a given dept.
Processes: OLAP (On-Line Analytical Processing)
• from de-normalized data (source system, e.g. transactions) to a star topology
• analyzing the reports that are likely to be needed
• the star and the reports define the dimensions
• the dimensions define the cube
Example
moviegoers database (de-normalized):
name   sex  age  source          movie name
Amy    f    27   Oberlin         Independence day
Andy   m    34   Oberlin         The Birdcage
Bob    m    51   Pinewoods       Schindler’s list
Cathy  f    39   124 Mt. Auburn  The Birdcage
Curt   m    30   MRJ             Judgement day
David  m    40   MRJ             Independence day
Erica  f    23   124 Mt. Auburn  Trainspotting
central fact table
Person  Source  Movie
1       1       5
2       1       12
3       2       4
4       4       12
5       3       19
6       3       5
7       4       21

dimension tables
Source  Name
1       Oberlin
2       Pinewoods
3       MRJ
4       12 Mt. Auburn

Person  Gender
1       F
2       M
3       M
4       F
5       M
6       M
7       F

Person  Age
1       27
2       34
3       51
4       39
5       30
6       40
7       23

Person  Name
1       Amy
2       Andy
3       Bob
4       Cathy
5       Curt
6       David
7       Erica

Movie  Name
4      Schindler’s list
5      Independence day
12     Birdcage
19     Judgement day
31     Trainspotting
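A minimal sketch of the step from the de-normalized moviegoers rows to a star: one central fact table of key triples plus dimension tables. The surrogate keys are assigned sequentially in order of first appearance, which is an assumption of this sketch, so they do not match the IDs shown on the slide. All function and variable names are hypothetical.

```python
from collections import Counter

rows = [  # (name, sex, age, source, movie) from the moviegoers example
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler’s list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]


def dimension(values):
    """Assign a surrogate key 1, 2, ... to each distinct value, in order seen."""
    keys = {}
    for v in values:
        keys.setdefault(v, len(keys) + 1)
    return keys


person_key = dimension(r[0] for r in rows)
source_key = dimension(r[3] for r in rows)
movie_key = dimension(r[4] for r in rows)

# Central fact table: one (person, source, movie) key triple per viewing.
fact = [(person_key[n], source_key[s], movie_key[m]) for n, _, _, s, m in rows]

# Dimension tables keyed by person.
gender = {person_key[n]: sex for n, sex, _, _, _ in rows}
age = {person_key[n]: a for n, _, a, _, _ in rows}

# A report answered from the star: number of people from each source by gender.
source_name = {k: v for v, k in source_key.items()}
by_source_gender = Counter((source_name[s], gender[p]) for p, s, _ in fact)
```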
Typical reports
• # of times each movie was seen, for movies seen > 5 times
• for what movies is the avg age of viewers > 30?
• the # of people and their ages by source
• the # of people from each source by gender
Cube
• is formed by representing the whole database (denormalized) by the dimensions
• the size of the cube does not depend on the number of people
• the cube has subcubes, each containing the “key” info that identifies it plus the summary aggregate info
• cube = MDD (multi-dimensional database)
• real cubes have more than three dimensions
• each record belongs to exactly one subcube
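A minimal sketch of a cube over the moviegoers records. Each subcube is identified by its key (the values on the chosen dimensions) and holds summary aggregates. The choice of aggregates (a count and a total age) and of dimensions is an assumption made for illustration, as are the names.

```python
from collections import defaultdict

FIELDS = ("name", "sex", "age", "source", "movie")
records = [dict(zip(FIELDS, row)) for row in (
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler’s list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
)]


def build_cube(records, dimensions):
    """Group records into subcubes keyed by their values on `dimensions`.

    Each subcube keeps summary aggregates (count and total age).
    """
    cube = defaultdict(lambda: {"count": 0, "age_sum": 0})
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        cube[key]["count"] += 1
        cube[key]["age_sum"] += rec["age"]
    return dict(cube)


cube = build_cube(records, ("source", "sex", "movie"))
by_source_sex = build_cube(records, ("source", "sex"))
cell = by_source_sex[("MRJ", "m")]
avg_age = cell["age_sum"] / cell["count"]  # average age of male viewers from MRJ
```

The number of possible subcubes is bounded by the number of distinct values on each dimension, not by the number of records, and each record contributes to exactly one subcube.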
Tasks
• drilling: looking inside a subcube at the records (of the original database) that are represented in that subcube
• churning/attrition: losing customers. Can be cast as a classification problem on historical data (two classes: customers who have churned and those who have not); then a classification system (e.g. decision tree induction) can induce the classifiers
• fraud detection: learning regular patterns, watching for discrepancies
Data mining - conclusion
• Treats historical data as an organizational asset, rather than a burden
• tries to
  • find out the unknown
  • predict the unknown
• applies to
  • marketing
  • internet mining
  • E-commerce
  • ...