
Data Mining Processes

Identify actionable results

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

CRISP-DM

• Cross-Industry Standard Process for Data Mining

– One of the first comprehensive attempts at a standard process model for data mining

– Independent of industry sector & technology


CRISP-DM Phases

1. Business (or problem) understanding
2. Data understanding
3. Data preparation
   • Transform & create the data set for modeling
4. Modeling
5. Evaluation
   • Check that the models are good; evaluate to be sure nothing is missing
6. Deployment


Business Understanding

• Solve a specific problem
• A clear definition helps
  – Measurable success criteria
• Convert business objectives into a set of data-mining goals
  – What to achieve in technical terms


Data Understanding

• Related data can come from many sources
  – Internal
    • ERP (or MIS)
    • Data warehouse
  – External
    • Government data
    • Commercial data
  – Created
    • Research


Data Preparation

• Clean data
  – Formats, gaps, filters, outliers & redundancies
• Unified numerical scales
  – Nominal data: code
  – Ordinal data: nominal code or scale
  – Cardinal data


Types of Data

Type         Features     Synonyms
Numerical    Continuous   Range
             Integer      Range
Binary       Yes/No       Flag
Categorical  Finite       Set
Date/Time                 Range
String                    Typeless
Text


Modeling

• Data treatment
  – Training set
  – Test set
  – Maybe others
• Techniques
  – Association
  – Classification
  – Clustering
  – Predictions
  – Sequential patterns
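The data-treatment step (separate training and test sets) can be sketched as follows. The 70/30 split ratio, the fixed seed, and the case labels are illustrative assumptions; the slides only require that the sets be separate.

```python
import random

def split_data(records, test_fraction=0.3, seed=42):
    """Randomly partition records into a training set and a test set.

    The 70/30 ratio and fixed seed are illustrative choices, not from
    the slides; any disjoint partition would serve.
    """
    rng = random.Random(seed)
    shuffled = records[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, test)

# 16 case labels, mirroring the computer-purchase example later in the deck
cases = [f"A{i}" for i in range(1, 17)]
train, test = split_data(cases)
print(len(train), len(test))  # 12 training cases, 4 test cases
```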


Evaluation

• Does model meet business objectives?

• Any important business objectives not addressed?

• Does model make sense?
• Is model actionable?


Deployment

• Ongoing monitoring & maintenance
  – Evaluate performance against success criteria
  – Market reaction & competitor changes


Example

• Training set for computer purchase
  – 16 records
  – 5 attributes
• Goal
  – Find a classifier for consumer behavior


Database (1st half)

Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    ≤30    Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    ≤30    Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes


Database (2nd half)

Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No


Data Selection

• Gender has a weak relationship with purchase
  – Based on correlation
  – Drop gender
• Selected attribute set: {Age, Income, Student, Credit}


Data Preprocessing

• Income unknown in Case 15

• Credit not available in Case 16

• Drop these noisy cases
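This cleaning step (discard cases A15 and A16, whose income and credit values are missing) can be sketched as follows; the dict layout is an illustrative representation of the slide's table, not a format the slides prescribe.

```python
# Values that mark a missing attribute in the example database
MISSING = {"Unknown", "N/A"}

def drop_incomplete(cases):
    """Keep only cases in which every attribute has a known value."""
    return [c for c in cases if not (set(c.values()) & MISSING)]

# Three rows from the database, as dicts (illustrative layout)
sample = [
    {"case": "A14", "income": "Medium",  "credit": "Excellent"},
    {"case": "A15", "income": "Unknown", "credit": "Fair"},      # income unknown
    {"case": "A16", "income": "Medium",  "credit": "N/A"},       # credit not available
]
clean = drop_incomplete(sample)
print([c["case"] for c in clean])  # ['A14']
```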


Data Transformation

• Assign numerical values to each attribute
  – Age: ≤30 = 3, 31-40 = 2, >40 = 1
  – Income: High = 3, Medium = 2, Low = 1
  – Student: Yes = 2, No = 1
  – Credit: Excellent = 2, Fair = 1
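The coding above can be applied mechanically; this sketch uses the slide's code values and case A1 from the database as the example.

```python
# Numeric codes for each attribute, exactly as listed on the slide
CODES = {
    "Age":     {"≤30": 3, "31-40": 2, ">40": 1},
    "Income":  {"High": 3, "Medium": 2, "Low": 1},
    "Student": {"Yes": 2, "No": 1},
    "Credit":  {"Excellent": 2, "Fair": 1},
}

def encode(case):
    """Map a case's nominal attribute values to the slide's numeric codes."""
    return {attr: CODES[attr][value] for attr, value in case.items()}

# Case A1 from the database: age 31-40, high income, not a student, fair credit
a1 = {"Age": "31-40", "Income": "High", "Student": "No", "Credit": "Fair"}
print(encode(a1))  # {'Age': 2, 'Income': 3, 'Student': 1, 'Credit': 1}
```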


Data Mining

• Categorize output
  – Buys = C1, Doesn't buy = C2
• Conduct analysis
  – Model says A8 and A12 don't buy; the rest do
  – Of the actual yes cases, 8 classified correctly and 1 not
  – Of the actual no cases, 1 classified correctly and 4 not


Data Interpretation

• Test on independent data


Test Data Set

Case  Actual  Model
B1    Yes     Yes
B2    Yes     Yes
B3    Yes     Yes
B4    Yes     Yes
B5    Yes     Yes
B6    Yes     Yes
B7    No      No
B8    No      Yes
B9    No      No
B10   No      No


Confusion Matrix

            Model Buy  Model Not  Totals
Actual Buy      6          0         6
Actual Not      1          3         4
Totals          7          3        10


Measures

• Correct classification rate: 9/10 = 0.90
• Cost function
  – Cost of error:
    • Model says buy, actual no: $20
    • Model says no, actual buy: $200
  – 1 × $20 + 0 × $200 = $20
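Both measures follow directly from the test-set table; this sketch tallies the confusion-matrix cells from the (actual, model) pairs for B1-B10 and applies the slide's error costs.

```python
# (actual, model) outcomes for test cases B1-B10, from the Test Data Set slide
results = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"),            # B1-B3
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"),            # B4-B6
    ("No", "No"), ("No", "Yes"), ("No", "No"), ("No", "No"),   # B7-B10
]

# Confusion-matrix cells
buy_buy = sum(1 for a, m in results if a == "Yes" and m == "Yes")  # 6
buy_not = sum(1 for a, m in results if a == "Yes" and m == "No")   # 0
not_buy = sum(1 for a, m in results if a == "No" and m == "Yes")   # 1
not_not = sum(1 for a, m in results if a == "No" and m == "No")    # 3

accuracy = (buy_buy + not_not) / len(results)  # 9/10 = 0.90
cost = not_buy * 20 + buy_not * 200            # 1 × $20 + 0 × $200 = $20
print(accuracy, cost)  # 0.9 20
```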


Goals

• Avoid broad concepts
  – "Gain insight," "discover meaningful patterns," "learn interesting things"
  – Can't measure attainment
• Narrow and specify
  – "Identify customers likely to renew," "reduce churn"
  – "Rank order by propensity to…"


Goals

• Description: what is
  – Understand
  – Explain
  – Discover knowledge
• Prescription: what should be done
  – Classify
  – Predict


Goal

• Method A: four rules, explains 70%
• Method B: fifty rules, explains 72%
• Which is best? It depends on the goal:
  – To gain understanding: Method A, by minimum description length (MDL)
  – To reduce the cost of a mailing: Method B


Measurement

• Accuracy
  – How well does the model describe the observed data?
• Confidence levels
  – The proportion of the time the true value falls between the lower and upper limits
• Comprehensibility
• Whole or parts?


Measuring Predictive

• Classification & prediction
  – Error rate = incorrect / total
  – Requires that the evaluation set be representative
• Estimators, based on predicted - actual
  – MAD: Mean Absolute Deviation
  – MSE: Mean Squared Error
  – MAPE: Mean Absolute Percent Error
  – Variance = sum of (predicted - actual)²
  – Standard deviation = square root of variance
  – Distance: how far off
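The three estimators can be computed as below. The formulas follow the usual mean-based definitions (the slide's "variance" is written as a raw sum of squared errors); the predicted and actual values are hypothetical illustrations, not from the slides.

```python
def error_measures(predicted, actual):
    """MAD, MSE, and MAPE for a set of predictions.

    MAD  = mean of |predicted - actual|
    MSE  = mean of (predicted - actual)²
    MAPE = mean of |predicted - actual| / |actual|, as a percentage
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100
    return mad, mse, mape

# Hypothetical predictions vs. actuals (illustration only)
predicted = [10.0, 12.0, 9.0, 11.0]
actual = [10.0, 10.0, 10.0, 10.0]
mad, mse, mape = error_measures(predicted, actual)
print(mad, mse, mape)  # MAD 1.0, MSE 1.5, MAPE ≈ 10%
```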


Statistics

• Population: entire group studied
• Sample: subset from population
• Bias: difference between sample average & population average
• Mean, median, mode
• Distribution
• Significance
• Correlation, regression


Classification Models

• LIFT = probability in class by sample divided by probability in class by population
  – If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
• Best lift is not necessarily best overall
  – Need sufficient sample size
  – As confidence increases: longer list, but lower lift
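The lift ratio is a one-line calculation; this sketch reproduces the slide's 20%/30% example (rates are given as percentages here so the division is exact).

```python
def lift(sample_pct, population_pct):
    """Lift = class probability in the targeted sample divided by the
    class probability in the whole population."""
    return sample_pct / population_pct

# Slide example: 30% response in the sample vs. 20% in the population
print(lift(30, 20))  # 1.5
```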


Lift Chart

[Figure: lift chart with % mailed (0-100) on the horizontal axis and % responded (0-100) on the vertical axis]


Measuring Impact

• Ideal: dollars (NPV, net present value) gained because of the expenditure
• Mass mailing may be better
• Depends on:
  – Fixed cost
  – Cost per recipient
  – Cost per respondent
  – Value of a positive response
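The four cost drivers combine into a simple net-value calculation; all numbers below (recipient counts, response counts, costs, response value) are hypothetical illustrations, not from the slides.

```python
def campaign_value(n_recipients, n_respondents, fixed_cost,
                   cost_per_recipient, cost_per_respondent,
                   value_per_response):
    """Net dollar value of a mailing, using the four cost drivers
    listed on the slide. All inputs here are hypothetical."""
    cost = (fixed_cost
            + n_recipients * cost_per_recipient
            + n_respondents * cost_per_respondent)
    return n_respondents * value_per_response - cost

# Hypothetical comparison: a small targeted mailing vs. a mass mailing
targeted = campaign_value(10_000, 500, fixed_cost=5_000,
                          cost_per_recipient=1.0,
                          cost_per_respondent=2.0,
                          value_per_response=60.0)
mass = campaign_value(100_000, 2_000, fixed_cost=5_000,
                      cost_per_recipient=1.0,
                      cost_per_respondent=2.0,
                      value_per_response=60.0)
print(targeted, mass)  # 14000.0 11000.0 — here the targeted mailing wins
```

With different per-recipient costs or response rates the mass mailing can come out ahead, which is the slide's point: the answer depends on the four parameters, not on lift alone.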


Bottom Line

• Return on investment


Example Application

• Telephone industry
• Problem: unpaid bills
• Data mining used to develop models to predict nonpayment as early as possible


Knowledge Discovery Process

1. Data Selection: learning the application domain; creating the target data set
2. Data Preprocessing: data cleaning & preprocessing
3. Data Transformation: data reduction & projection
4. Data Mining: choosing the function; choosing the algorithms; data mining
5. Data Interpretation: interpretation; using the discovered knowledge


Telephone Bill Study

• Billing period sequence analyzed
  – Use two months, receive bill, payment due in the month of billing; disconnect if unpaid within a given period
• Hypothesis: insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period


1: Business Understanding

• Predict which customers would be insolvent
  – In time for the firm to take preventive measures (and avert losing good customers)
• Hypothesis
  – Insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period


2: Data Understanding

• Static customer information available in files
  – Bills, payments, usage
• Used a data warehouse to gather & organize data
  – Coded to protect customer privacy


Creating Target Data Set

• Customer files
  – Customer information
  – Disconnects
  – Reconnections
• Time-dependent data
  – Bills
  – Payments
  – Usage
• 100,000 customers over a 17-month period
• Stratified sampling to assure all groups are appropriately represented


3: Data Preparation

• Filtered out incomplete data

• Deleted inexpensive calls
  – Reduced data volume by about 50%

• Low number of fraudulent cases

• Cross-checked with phone disconnects

• Lagged data made synchronization necessary


Data Reduction & Projection

• Information grouped by account
• Customer data aggregated by 2-week periods
• Discriminant analysis on 23 categories
• Calculated average owed by category (significant)
• Identified extra charges (significant)
• Investigated payment by installments (not significant)


Choosing Data Mining Function

• Classes
  – Most possibly solvent (99.3%)
  – Most possibly insolvent (0.7%)
• Costs of error widely different
• New data set created through stratified sampling
  – Retained all insolvent cases
  – Altered the distribution to 90% solvent
  – Used 2,066 cases total
• Critical period identified
  – Last 15 two-week periods before service interruption
• Variables defined by counting measures in two-week periods
  – 46 variables as candidate discriminant factors


4: Modeling

• Discriminant analysis
  – Linear model
  – SPSS, stepwise forward selection
• Decision trees
  – Rule-based classifier
• Neural networks
  – Nonlinear model


Data Mining

• Training set: about two-thirds of the cases; the rest used for testing
• Discriminant analysis
  – Used 17 variables
  – Equal costs: 0.875 correct
  – Unequal costs: 0.930 correct
• Rule-based classifier: 0.952 correct
• Neural network: 0.929 correct


5: Evaluation

• 1st objective: maximize accuracy of predicting insolvent customers
  – Decision tree classifier best
• 2nd objective: minimize error rate for solvent customers
  – Neural network model close to decision tree
• Used all 3 on a case-by-case basis


Coincidence Matrix – Combined Models

                  Model insolvent  Model solvent  Unclassified  Totals
Actual insolvent        19               17            28          64
Actual solvent           1              626            27         654
Totals                  20              643            55         718


6: Implementation

• Every customer examined using all 3 algorithms
  – If all 3 agreed, used that classification
  – If they disagreed, categorized as unclassified
• Correct on test data: 0.898
  – Only 1 actually solvent customer would have been disconnected
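The unanimous-agreement rule described above can be sketched as a small combining function; the class labels are taken from the study, while the function name is an illustrative choice.

```python
def combined_class(predictions):
    """Combining rule from the study: if all three models agree,
    use that class; otherwise leave the customer unclassified."""
    return predictions[0] if len(set(predictions)) == 1 else "unclassified"

# All three models agree → use the shared classification
print(combined_class(["insolvent", "insolvent", "insolvent"]))  # insolvent
# Any disagreement → leave the customer unclassified
print(combined_class(["insolvent", "solvent", "insolvent"]))    # unclassified
```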