Targeted Marketing, KDD Cup and Customer Modeling.

Post on 19-Dec-2015



Outline

Direct Marketing

Review: Evaluation: Lift, Gains

KDD Cup 1997

Lift and Benefit estimation

Privacy and Data Mining


Direct Marketing Paradigm

Find most likely prospects to contact

Not everybody needs to be contacted

Number of targets is usually much smaller than number of prospects

Typical Applications: retailers, catalogues, direct mail (and e-mail)

customer acquisition, cross-sell, attrition

...


Direct Marketing Evaluation

Accuracy on the entire dataset is not the right measure

Approach: develop a target model

score all prospects and rank them by decreasing score

select top P% of prospects for action

Evaluate Performance on top P% using Gains and Lift

CPH (Gains): Random List vs Model-Ranked List

[Chart: Cumulative % Hits (0 to 100) vs. Pct of list (5 to 95), comparing Random and Model curves]

5% of the random list have 5% of the targets, but 5% of the model-ranked list have 21% of the targets: CPH(5%, model) = 21%.

Lift Curve

[Chart: Lift (0 to 5) vs. Pct of list (5 to 95)]

Lift(P) = CPH(P) / P

P -- percent of the list

Lift (at 5%) = 21% / 5% = 4.2, i.e. 4.2 times better than random
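As a sketch of the definitions above, CPH and lift can be computed directly from a model-scored list; the scores and labels below are made-up illustration data, not from the slides:

```python
# Sketch: CPH (gains) and lift at a cutoff P, computed from model scores
# and binary target labels. The data here is invented for illustration.

def cph_and_lift(scores, targets, p):
    """Return (CPH(p), Lift(p)) for the top fraction p of the score-ranked list."""
    ranked = sorted(zip(scores, targets), key=lambda st: -st[0])
    n_select = max(1, round(p * len(ranked)))
    hits_in_selection = sum(t for _, t in ranked[:n_select])
    cph = hits_in_selection / sum(targets)  # cumulative fraction of all targets found
    return cph, cph / p                     # Lift(p) = CPH(p) / p

# Toy list: 20 prospects, 5 targets; a good model ranks targets near the top.
scores = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5] + [0.1] * 14
targets = [1, 1, 0, 1, 0, 1] + [0] * 13 + [1]
cph, lift = cph_and_lift(scores, targets, 0.20)  # top 20% holds 3 of 5 targets
```

Here CPH(20%) = 60% and Lift(20%) = 3: the model-ranked selection finds three times as many targets as a random 20% sample would.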

KDD-CUP 1997

Task: given data on past responders to fund-raising, predict most likely responders for new campaign

Population of 750K prospects; 10K responded to a broad campaign mailing (1.4% response rate)

The analysis file included a stratified (non-random) sample of 10K responders and 26K non-responders (28.7% response rate)

75% used for learning; 25% used for validation

target variable removed from the validation data set

KDD-CUP 1997 Data Set

321 fields/variables with ‘sanitized’ names and labels:

Demographic information

Credit history

Promotion history

Significant effort on data preprocessing: leaker detection and removal

KDD-CUP Participant Statistics

45 companies/institutions participated (23 research prototypes, 22 commercial tools)

16 contestants turned in their results (9 research prototypes, 7 commercial tools)

KDD-CUP Algorithm Statistics

Algorithm         # of Entries   Ave. Score
Rules             2              87
k-NN              1              85
Bayesian          3              83
Multiple/Hybrid   4              79
Other             2              68
Decision Tree     4              44

(Of the 16 software/tools; score as % of best)


KDD Cup 97 Evaluation

Best Gains at 40%: Urban Science, BNB, MineSet

Best Gains at 10%: BNB, Urban Science, MineSet

KDD-CUP 1997 Awards

The GOLD MINER award is jointly shared by two contestants this year:

1) Charles Elkan, Ph.D., from the University of California, San Diego, with his software BNB, Boosted Naive Bayesian Classifier

1) Urban Science Applications, Inc., with their software gain, Direct Marketing Selection System

The BRONZE MINER award went to the runner-up:

3) Silicon Graphics, Inc., with their software MineSet

KDD-CUP Results Discussion

Top finishers very close

Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and MineSet)

BNB and MineSet did little data preprocessing

MineSet used a total of 6 variables in their final model

Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results


KDD Cup 1997: Top 3 results

Top 3 finishers are very close

KDD Cup 1997 – worst results

Note that the worst result (C6) was actually worse than random.

Competitor names were kept anonymous, apart from the top 3 winners.

Better Model Evaluation?

Comparing gains at 10% and 40% is ad hoc

Are there more principled methods?

Area Under the Curve (AUC) of the gains chart

Lift Quality

Ultimately, financial measures: campaign benefits

Model Evaluation: AUC

Area Under the Curve (AUC) is defined as the area between the Gains curve and the Random curve.

[Chart: Cumulative % Hits vs. Selection, showing the area between the model gains curve and the random diagonal]

Model Evaluation: Lift Quality

See Measuring Lift Quality in Database Marketing, Piatetsky-Shapiro and Steingold, SIGKDD Explorations, December 2000.

LQ = (AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))

Lift Quality (Lquality)

For a perfect model, Lquality = 100%

For a random model, Lquality = 0

For KDD Cup 97, Lquality(Urban Science) = 43.3%

Lquality(Elkan) = 42.7%

However, small differences in Lquality are not significant
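The LQ formula can be sketched in a few lines; the AUC values passed in below are illustrative assumptions, not the actual areas for the KDD Cup entries:

```python
# Sketch: lift quality LQ = (AUC(model) - AUC(random)) / (AUC(perfect) - AUC(random)).
# AUC here is the area under the gains curve with the x-axis as the selected
# fraction; a random model's gains curve is the diagonal, so its area is 0.5.

def lift_quality(auc_model, auc_perfect, auc_random=0.5):
    return (auc_model - auc_random) / (auc_perfect - auc_random)

# Illustrative (assumed) areas, not the real KDD Cup 97 values:
lq = lift_quality(auc_model=0.71, auc_perfect=0.985)  # about 0.433, i.e. 43.3%
```

By construction LQ is 0 for a random model and 1 (100%) for a perfect one, which makes models comparable across problems with different target rates.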


Estimating Profit: Campaign Parameters

Direct mail example:

N -- number of prospects, e.g. 750,000

T -- fraction of targets, e.g. 0.014

B -- benefit of hitting a target, e.g. $20 (a simplification -- the actual benefit will vary)

C -- cost of contacting a prospect, e.g. $0.68

P -- percentage selected for contact, e.g. 10%

Lift(P) -- model lift at P, e.g. 3

Contacting Top P of Model-Sorted List

Using the previous example, let the selection be P = 10% and Lift(P) = 3

Selection size = N·P, e.g. 75,000

A random selection has N·P·T targets in the first P of the list, e.g. 1,050

Q: How many targets are in the model P-selection?

The model has more by a factor of Lift(P), i.e. N·P·T·Lift(P) targets in the selection, e.g. 3,150

Benefit of contacting the selection is N·P·T·Lift(P)·B, e.g. $63,000

Cost of contacting N·P prospects is N·P·C, e.g. $51,000

Profit of Contacting Top P

Profit(P) = Benefit(P) - Cost(P)

= N·P·T·Lift(P)·B - N·P·C

= N·P·(T·Lift(P)·B - C), e.g. $12,000

Q: When is profit positive?

When T·Lift(P)·B > C, i.e.

Lift(P) > C / (T·B), e.g. 2.4
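Plugging the slide's example numbers (N = 750,000, T = 0.014, B = $20, C = $0.68, P = 10%, Lift(P) = 3) into these formulas can be sketched as:

```python
# Sketch: campaign profit with the example parameters from the slides.
N = 750_000    # number of prospects
T = 0.014      # fraction of targets
B = 20.0       # benefit of hitting a target, $
C = 0.68       # cost of contacting a prospect, $
P = 0.10       # fraction selected for contact
lift_at_p = 3.0

benefit = N * P * T * lift_at_p * B   # $63,000
cost = N * P * C                      # $51,000
profit = benefit - cost               # $12,000

# Profit is positive exactly when Lift(P) exceeds C / (T * B):
breakeven_lift = C / (T * B)          # about 2.43
```

With a lift of 3 against a breakeven of about 2.43, the campaign clears roughly $12,000 on the selected 75,000 prospects.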

Finding Optimal Cutoff

[Chart: estimated payoff vs. selection percentage P (10% to 100%); y-axis: Est. Payoff, roughly -60 to +60]

Use the formula to estimate the benefit for each P, then find the optimal P.

*Feasibility Assessment

Expected Profit(P) depends on the known cost C, benefit B, and target rate T -- and on the unknown Lift(P)

To compute Lift(P) we need to get all the data, load it, clean it, request corrected data, build models, ...


*Can Expected Lift be Estimated Only from N and T?

In theory -- no, but in many practical applications -- surprisingly, yes!

*Empirical Observations about Lift

For good models, Lift(P) is usually monotonically decreasing with P

Lift at fixed P (e.g. 0.05) is usually higher for lower T

Special point P = T

for a perfect predictor, all targets are in the first T of the list, for a maximum lift of 1/T

What can we expect compared to 1/T ?


*Meta Analysis of Lift

26 attrition & cross-sell problems from finance and telecom domains

N ranges from 1,000 to 150,000

T ranges from 1% to 22%

No clear relation to N, but there is dependence on T


*Results: Lift(T) vs 1/T

Best model (R² = 0.86):

log10(Lift(T)) = -0.05 + 0.52 · log10(1/T)

Approximately

Lift(T) ~ T^(-0.5) = sqrt(1/T)

Tried several linear and log-linear fits
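The fitted regression and its square-root approximation can be sketched as follows; the coefficients come from the slide above, while the sample target rates are arbitrary points inside the observed 1%-22% range:

```python
import math

# Sketch: the fitted model log10(Lift(T)) = -0.05 + 0.52 * log10(1/T),
# and the approximation Lift(T) ~ T^(-0.5) = sqrt(1/T).

def lift_fit(t):
    return 10 ** (-0.05 + 0.52 * math.log10(1.0 / t))

def lift_approx(t):
    return math.sqrt(1.0 / t)

# Arbitrary target rates inside the observed range:
for t in (0.01, 0.05, 0.22):
    print(f"T={t:.2f}  fit={lift_fit(t):.2f}  approx={lift_approx(t):.2f}")
```

At T = 1% the approximation predicts a lift of sqrt(100) = 10, close to the fitted value of about 9.8.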


*Actual Lift(T) vs sqrt(1/T) for All Problems

[Chart: actual Lift(T) and estimated Lift(T) vs. 100·T%; y-axis: Lift, 0 to 14]

Error = Actual Lift - sqrt(1/T)

Avg(Error) = -0.08

St. Dev(Error) = 1.0

*GPS Lift(T) Rule of Thumb

For targeted marketing campaigns, where 0.01 < T < 0.25,

Lift(T) ≈ sqrt(1/T) ± 1

Exceptions:

truly predictable or random behaviors

poor models

information leakers

*Estimating Entire Curve

Cumulative Percent Hits

CPH(P) = Lift(P) * P

CPH is easier to model than Lift

Several regressions for all CPH curves

Best results with regression

log10(CPH(P)) = a + b log10(P)

Average R² = 0.97


*CPH Curve Estimate

Approximately

CPH(P) ~ sqrt(P)

bounds:

P^0.6 < CPH(P) < P^0.4

*Lift Curve Estimate

Since Lift(P) = CPH(P)/P

Lift(P) ~ 1/sqrt(P)

bounds:

(1/P)^0.4 < Lift(P) < (1/P)^0.6
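These curve estimates and their bounds can be sketched in code; the evaluation point P = 0.25 is an arbitrary choice:

```python
import math

# Sketch: empirical curve estimates CPH(P) ~ sqrt(P) with bounds
# P^0.6 < CPH(P) < P^0.4, and hence Lift(P) = CPH(P)/P ~ 1/sqrt(P)
# with bounds (1/P)^0.4 < Lift(P) < (1/P)^0.6.

def cph_bounds(p):
    return p ** 0.6, math.sqrt(p), p ** 0.4   # (lower, estimate, upper)

def lift_bounds(p):
    lo, est, hi = cph_bounds(p)
    return lo / p, est / p, hi / p            # divide through by p

lo, est, hi = lift_bounds(0.25)               # estimate: 1/sqrt(0.25) = 2
```

Since 0 < P < 1, dividing the CPH bounds by P keeps their order, so the lift estimate always lies between its lower and upper bounds.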


*More on Estimating Lift and Profitability

G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM. www.KDnuggets.com/gpspubs/


KDD Cup 1998

Data from Paralyzed Veterans of America (charity)

Goal: select mailing with the highest profit

Winners: Urban Science, SAS, Quadstone

See full results and winners’ presentations at www.kdnuggets.com/meetings/kdd98


KDD-CUP-98 Analysis Universe

Paralyzed Veterans of America (PVA), a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease, generously provided the data set.

PVA’s June 1997 fund-raising mailing, sent to 3.5 million donors, was selected as the competition data.

Within this universe, a group of 200K “Lapsed” donors was of particular interest to PVA. “Lapsed” donors are individuals who made their last donation to PVA 13 to 24 months prior to the mailing.

KDD Cup-98 Example

Evaluation: Expected profit maximization with a mailing cost of $0.68

Sum of (actual donation - $0.68) over all records with predicted/expected donation > $0.68

Participant with the highest actual sum wins


KDD Cup Cost Matrix

                          Predicted Donation
                          Yes                   No
Actual Donation   Yes     DonationAmt - 0.68    0
                  No      -0.68                 0
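The scoring rule this cost matrix implies (mail a record when the predicted donation exceeds the $0.68 cost, then sum actual donation minus cost over the mailed records) can be sketched with made-up records:

```python
# Sketch: KDD Cup 98 evaluation under the cost matrix above.
# The (predicted, actual) donation pairs are invented for illustration.
MAIL_COST = 0.68

def campaign_profit(records):
    """records: list of (predicted_donation, actual_donation) pairs."""
    return sum(actual - MAIL_COST
               for predicted, actual in records
               if predicted > MAIL_COST)   # mail only if expected donation > cost

records = [(5.00, 10.00), (0.50, 0.00), (1.20, 0.00), (0.10, 3.00), (2.00, 1.00)]
profit = campaign_profit(records)          # (10 - .68) + (0 - .68) + (1 - .68) = 8.96
```

Note the fourth record: skipping an actual donor forgoes the donation but costs nothing in postage, matching the zeros in the "No" prediction column of the matrix.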


KDD Cup 1998 Results

Model        Selected   Result    Rank
GainSmarts   56,330     $14,712   1
SAS          55,838     $14,662   2
Quadstone    57,836     $13,954   3
…            …          …         …
*ALL*        96,367     $10,560   13
…            …          …         …
#20          42,270     $1,706    20
#21          1,551      $-54      21

Selected: how many were selected by the model

Result: the total profit (donations - cost) of the model

*ALL*: selecting all prospects

Summary

KDD Cup 1997 case study

Model Evaluation: AUC and Lift Quality

Estimating Campaign Profit

*Feasibility Assessment

GPS Rule of Thumb for the Typical Lift Curve

KDD Cup 1998