KDD-09 Tutorial
Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University
Claudia Perlich, IBM Research

Predictive Modeling in the Wild, Saharon Rosset & Claudia Perlich
Predictive modeling

Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes.
Primary example: supervised learning
- Get training data (x1,y1), (x2,y2), ..., (xn,yn), drawn i.i.d. from a joint distribution on (X, y)
- Build a model f(x) to describe the relationship between x and y
- Use it to predict y when only x is observed in the "future"
Other cases may relax some of the supervised learning assumptions
- For example, in KDD Cup 2007 we did not see any yi's and had to extrapolate them based on the training xi's (see later in the tutorial)
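A minimal sketch of this setup, on hypothetical data, with plain least squares standing in for the model f:

```python
import numpy as np

# Hypothetical training data (x_i, y_i) drawn i.i.d. from a joint distribution
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Build a model f(x) describing the relationship between x and y
# (here: a least-squares line, f(x) = a*x + b)
a, b = np.polyfit(x, y, deg=1)

def f(x_new):
    """Predict y for "future" x where y is unobserved."""
    return a * x_new + b

# Use the model where only x is observed
print(f(5.0))  # should be close to 2*5 + 1 = 11
```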
Predictive Modeling Competitions

- Competitions like the KDD Cup extract "core" predictive modeling challenges from their application environment
- They are usually supposed to represent real-life predictive modeling challenges
- Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems
- We will see this in examples
The Goals of this Tutorial

- Understand the two modes of predictive modeling, their similarities and differences:
  - Real-life projects
  - Data mining competitions
- Describe the main factors for success in the two modes of predictive modeling
- Discuss some of the recurring challenges that come up in determining success
These goals will be addressed and demonstrated through a series of case studies.
Credentials in Data Mining Competitions

Saharon Rosset
- Winner, KDD Cup 99~
- Winner, KDD Cup 00+
Claudia Perlich
- Runner-up, KDD Cup 03*
- Winner, ILP Challenge 05
- Winner, KDD Cup 09@
Jointly
- Winners, KDD Cup 2007@
- Winners, KDD Cup 2008@
- Winners, INFORMS Data Mining Challenge 08@
Collaborators: @Prem Melville, @Yan Liu, @Grzegorz Swirszcz, *Foster Provost, *Sofus Macskassy, +~Aron Inger, +Nurit Vatnik, +Einat Neuman, @Alexandru Niculescu-Mizil
Experience with Real-Life Projects

- 2004-2009: collaboration on Business Intelligence projects at IBM Research
- Total of >10 publications on real-life projects
- Total of 4 IBM Outstanding Technical Achievement awards
- IBM Accomplishment and Major Accomplishment
- Finalists in this year's INFORMS Edelman Prize for real-life applications of Operations Research and Statistics
One of the successful projects will be discussed here as a case study.
Outline

1. Introduction and overview (SR)
   - Differences between competitions and real life
   - Success factors
   - Recurrent challenges in competitions and real projects
2. Case studies
   - KDD Cup 2007 (SR)
   - KDD Cup 2008 (CP)
   - Business Intelligence example: Market Alignment Program (MAP) (CP)
3. Conclusions and summary (CP)
Differences between competitions and projects

Task
- Competition: clearly defined tasks; clear evaluation metrics
- Project: 'improve marketing effectiveness', 'identify underperforming stores', 'at what R2 can I fire people?'

Data
- Competition: clean and available, with (some) documentation
- Project: don't know what data they have; don't know what the data mean

Objective
- Competition: prediction
- Project: insight, decision support, weapon in political battlefields, prediction

Deliverable
- Competition: ASCII file with numbers
- Project: endless conference calls, PowerPoint slides, a prototype/predictions (bi-monthly to annual refresh)

Duration
- Competition: weeks/months; you know when it is over
- Project: some projects just fail to die (3+ years); most die before being born

In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start with a well-defined predictive task and ignore most of the difficulties of getting to that point.
Real-life project evolution and our focus

[Figure: pipeline of real-life project stages, annotated with examples from the wallet-estimation work: business/modeling problem definition (sales force mgmt., wallet est.) -> statistical problem definition (quantile est., latent variable est.) -> modeling methodology design (quantile est., graphical model) -> model generation & validation (programming, simulation, IBM Wallets) -> implementation & application development (OnTarget, MAP), all supported by data preparation & integration (IBM relationships, firmographics). The slide marks each stage as "not our focus", "loosely related", or "our focus".]
Two types of competitions

Real
- Raw data
- Set up the model yourself
- Task-specific evaluation
- Simulate real-life mode
- Examples: KDD Cup 2007, KDD Cup 2008
- Approach: understand the domain, analyze the data, build the model
- Challenges: too numerous

Sterile
- Clean data matrix
- Standard error measure
- Often anonymized features
- Pure machine learning
- Examples: KDD Cup 2009, PKDD Challenge 2007
- Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
- Challenges: size, missing values, # of features
Factors of Success in Competitions and Real Life

1. Data and domain understanding
   - Generation of data and task
   - Cleaning and representation/transformation
2. Statistical insights
   - Statistical properties
   - Test validity of assumptions
   - Performance measure
3. Modeling and learning approach
   - The most "publishable" part
   - Choice or development of the most suitable algorithm
(The slide arranges these factors along the real-to-sterile spectrum.)
Recurring challenges

We emphasize three recurring challenges in predictive modeling that often get overlooked:
1. Data leakage: impact, avoidance and detection
   - Leakage: use of "illegitimate" data for modeling
   - "Legitimate" data: data that will be available when the model is applied
   - In competitions, the definition of leakage is unclear
2. Adapting learning to real-life performance measures
   - Could move well beyond standard measures like MSE, error rate, or AUC
   - We will see this in two of our case studies
3. Relational learning / feature construction
   - Real data is rarely flat, and good, practical solutions for this problem remain a challenge
1. Leakage in Predictive Modeling

Leakage is the introduction of predictive information about the target by the data generation, collection, and preparation process.
- Trivial example: a binary target was created using a cutoff on a continuous variable and, by accident, the continuous variable was not removed
- Reversal of cause and effect, when information from the future becomes available
It produces models that do not generalize: true model performance is much lower than the 'out-of-sample' (but leakage-contaminated) estimate.
It commonly occurs when combining data from multiple sources or multiple time points, and often manifests in the ordering of data files.
Leakage is surprisingly pervasive in competitions and real life
- KDD Cup 2007 and KDD Cup 2008 had leakage, as we will see in the case studies
- The INFORMS competition had leakage due to partial removal of information for only the positive cases
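The trivial example above can be demonstrated in a few lines on synthetic data: with the leaked continuous variable still in the feature set, evaluation looks perfect, yet the model has learned nothing deployable.

```python
import numpy as np

rng = np.random.default_rng(1)
score = rng.normal(size=1000)          # continuous variable
y = (score > 0).astype(int)            # binary target defined by a cutoff on it

# By accident the continuous variable stays in the feature set:
# any model can recover the cutoff and score perfectly "out of sample".
pred = (score > 0).astype(int)
leaked_accuracy = (pred == y).mean()
print(leaked_accuracy)  # 1.0 -- too good to be true

# A legitimate feature only weakly related to y gives a realistic baseline.
noisy = score + rng.normal(scale=3.0, size=1000)
honest_accuracy = ((noisy > 0).astype(int) == y).mean()
```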
Real-life leakage example

P. Melville, S. Rosset, R. Lawrence (2008). Customer Targeting Models Using Actively-Selected Web Content. KDD-08.
We built models for identifying new customers for IBM products, based on:
- IBM internal databases
- Companies' websites
Example pattern: companies with the word "Websphere" on their website are likely to be good customers for IBM Websphere products
- Ahem, a slight cause-and-effect problem
Source of the problem: we only have the current view of a company's website, not its view when it was an IBM prospect (i.e., prior to buying)
Ad-hoc solution: remove all obvious leakage words
- This does not solve the fundamental problem
General leakage solution: "predict the future"

Niels Bohr is quoted as saying: "Prediction is difficult, especially about the future"
Flipping this around, if:
- the true prediction task is "about the future" (it usually is),
- we can make sure that our model only has access to information "at the present", and
- we can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage,
then we are guaranteed (intuitively and mathematically) that we can prevent leakage.
For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since.
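The time-based cutoff can be sketched as follows (hypothetical event log and event names): features are built only from events stamped before the cutoff, and the target only from events at or after it, so target-revealing events cannot leak into the features.

```python
from datetime import date

# Hypothetical event log: (customer, event, date)
events = [
    ("acme", "visited_site", date(2006, 3, 1)),
    ("acme", "bought_websphere", date(2008, 5, 2)),
    ("globex", "visited_site", date(2007, 7, 9)),
]

CUTOFF = date(2007, 1, 1)  # "the present" for modeling purposes

# Model inputs: only what was known at the cutoff
features = {}
for cust, ev, d in events:
    if d < CUTOFF:
        features.setdefault(cust, []).append(ev)

# Target: what happened after the cutoff (who bought since)
target = {cust: any(ev == "bought_websphere" and d >= CUTOFF
                    for c, ev, d in events if c == cust)
          for cust in {c for c, _, _ in events}}

# "bought_websphere" never appears in the features, so it cannot leak
assert all("bought_websphere" not in evs for evs in features.values())
```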
2. Real-life performance measures

Real-life prediction models should be constructed and judged for performance on real-life measures:
- Address the real problem at hand: optimize $$$, life span, etc.
- At the same time, we need to maintain statistical soundness:
  - Can we optimize these measures directly?
  - Are we better off just building good models in general?
Example: breast cancer detection (KDD Cup 2008)
- At first sight, a standard classification problem (malignant or benign?)
- Obvious extension: a cost-sensitive objective
  - Much better to do a biopsy on a healthy subject than to send a malignant patient home!
- Competition objective: optimize effective use of radiologists' time
  - A complex measure called FROC
  - See the case study in Claudia's part
Optimizing real-life measures

It is a common approach to use the prediction objective to motivate an empirical loss function for modeling:
- If the prediction objective is the expected value of Y given x, then squared-error loss (e.g., linear regression or CART) is appropriate
- If we want to predict the median of Y instead, then absolute loss is appropriate
- More generally, quantile loss can be used (cf. the MAP case study)
We will see successful examples of this approach in two case studies (KDD Cup 07 and MAP).
What do we do with complex measures like FROC? There is really no way to build a good model directly. A less ambitious approach:
- Build a model using standard approaches (e.g., logistic regression)
- Post-process the model to do well on the specific measure
We will see a successful example of this approach in KDD Cup 08.
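A sketch of the quantile (pinball) loss mentioned above: it generalizes absolute loss, and the constant prediction that minimizes it is the tau-quantile of Y.

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Pinball loss: tau-weighted penalty for under- vs. over-prediction."""
    r = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

y = np.array([1.0, 2.0, 3.0, 10.0])

# tau = 0.5 recovers half the absolute loss, minimized by the median
assert np.isclose(quantile_loss(y, 2.5, 0.5), 0.5 * np.mean(np.abs(y - 2.5)))

# The empirical tau-quantile minimizes the loss over constant predictions:
# for tau = 0.9 on this sample, the minimizer is the largest value, 10
grid = np.linspace(0, 12, 1000)
best = grid[np.argmin([quantile_loss(y, c, 0.9) for c in grid])]
```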
3. Relational and Multi-Level Data

Real-life databases are rarely flat!
Example: INFORMS Challenge 08, medical records, four tables linked by m:n relationships:
- Hospital (39K): Event ID, Patient ID, Diagnosis, Hospital Stay, ..., Accounting
- Conditions (210K): Patient ID, Diagnosis, Year
- Medication (629K): Event ID, Patient ID, Diagnosis, Medication, ..., Accounting
- Demographics (68K): Patient ID, Demographics, ..., Year
Approaches for dealing with relational data

- Modeling approaches that use relational data directly
  - There has been a lot of research, but there is a scarcity of practically useful methods that take this approach
- Flattening the relational structure into a standard (X, y) setup
  - The key to this approach is the generation of useful features from the relational tables
  - This is the approach we took in the INFORMS 08 challenge
- Ad-hoc approaches
  - Based on specific properties of the data and modeling problem, it may be possible to "divide and conquer" the relational setup
  - See the example in the KDD Cup 08 case study
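A sketch of the flattening approach, on hypothetical rows shaped like the INFORMS Medication table: each m:n child table is aggregated into fixed-length per-patient features.

```python
from collections import defaultdict

# Hypothetical rows from a child table: (patient_id, medication)
medication = [
    (1, "aspirin"), (1, "statin"), (1, "aspirin"),
    (2, "insulin"),
]

# Flatten the m:n table into one row per patient via aggregates
counts = defaultdict(int)
distinct = defaultdict(set)
for pid, med in medication:
    counts[pid] += 1
    distinct[pid].add(med)

# Fixed-length feature vector per patient: (# prescriptions, # distinct drugs)
X = {pid: (counts[pid], len(distinct[pid])) for pid in counts}
```

The same pattern (counts, distinct counts, sums, recency) applied to every child table yields a flat X that any standard learner can consume.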
Modeler's best friend: exploratory data analysis

Exploratory data analysis (EDA) is a general name for a class of techniques aimed at:
- Examining data
- Validating data
- Forming hypotheses about data
The techniques are often graphical or intuitive, but can also be statistical:
- Testing very simple hypotheses as a way of getting at more complex ones
- E.g., test each variable separately against the response and look for strong correlations
The most important proponent of EDA was the great, late statistician John Tukey.
The beauty and value of exploratory data analysis

EDA is a critical step in creating successful predictive modeling solutions:
- Expose leakage
- AVOID PRECONCEPTIONS about: what matters, what would work, etc.
Example: identifying the KDD Cup 08 leakage through EDA
- A graphical display of identifier vs. malignant/benign (see the case study slide)
- It could also be discovered via a statistical variable-by-variable examination of significant correlations with the response
- Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier
Elements of EDA for predictive modeling

1. Examine data variable by variable
   - Outliers? Missing data patterns?
2. Examine relationships with the response
   - Strong correlations? Unexpected correlations?
3. Compare to other similar datasets/problems
   - Are variable distributions consistent? Are correlations consistent?
4. Stare: at raw data, at graphs, at correlations/results
Unexpected answers to any of these questions may change the course of the predictive modeling process.
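Step 2 above can be automated as a simple screen; a sketch on synthetic data follows. It deliberately includes an "irrelevant" identifier column, which is exactly how a KDD Cup 08-style leak surfaces (here we simulate an ID that encodes which source file a record came from; the construction is illustrative, not the actual competition data).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
y = rng.integers(0, 2, size=n)                    # binary response

columns = {
    "x1": rng.normal(size=n),                     # pure noise
    "x2": y + rng.normal(scale=2.0, size=n),      # weak real signal
    # "Patient ID": records from two source files were numbered in different
    # ranges, so the ID secretly encodes the response -- a leak
    "identifier": y * 100000 + rng.integers(0, 100000, size=n),
}

# Screen every variable against the response, preconceptions excluded
corrs = {name: abs(np.corrcoef(v, y)[0, 1]) for name, v in columns.items()}
suspicious = max(corrs, key=corrs.get)
```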
Case study #1: Netflix / KDD Cup 2007
October 2006: Announcement of the NETFLIX Competition

USA Today headline: "Netflix offers $1 million prize for better movie recommendations"
Details:
- Beat the RMSE of NETFLIX's current recommender, 'Cinematch', by 10% prior to 2011
- $50,000 annual progress prize
  - The first two were awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!)
- The data contain a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies
- Performance is evaluated on holdout movie-user pairs
- The NETFLIX competition has attracted ~50K contestants on ~40K teams from >150 different countries
  - ~40K valid submissions from ~5K different teams
[Figure: the NETFLIX competition data as a users-by-movies rating matrix. Out of all ~80K movies and ~6.8M users, the competition data cover 17K movies (selection unclear) and 480K users (each with at least 20 ratings by the end of 2005), for a total of 100M ratings. The NETFLIX data can be joined with the Internet Movie Database, whose fields include Title, Year, Actors, Awards, Revenue, ...]
NETFLIX data generation process

[Figure: the rating matrix (17K movies) grows over time (1998-2005) as new movies and users arrive. The NETFLIX training data cover this period, with a 3M-pair Qualifier dataset held out. The KDD Cup tasks (Task 1 and Task 2) are defined on 2006, in which no new users or movies arrive.]
KDD Cup 2007 based on the NETFLIX data

- Training: NETFLIX competition data from 1998-2005
- Test: 2006 ratings, randomly split by movie into two tasks
Task 1: Who rated what in 2006
- Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
- Result: the IBM Research team was second runner-up, No. 3 out of 39 teams
Task 2: Number of ratings per movie in 2006
- Given a list of 8,863 movies, predict the number of additional reviews that all existing users will give in 2006
- Result: the IBM Research team was the winner, No. 1 out of 34 teams
Test sets from 2006 for Task 1 and Task 2

[Figure: construction of the two test sets from the 2006 ratings. From the movies-by-users rating matrix, per-movie rating totals are computed on a log(n+1) scale; these totals are the marginal 2006 distribution of ratings and form the Task 2 test set (8.8K movies). The Task 1 test set (100K pairs) is built by sampling (movie, user) pairs according to the product of the marginals, removing pairs that were rated prior to 2006, and labeling each remaining pair 1 if the user rated the movie in 2006 and 0 otherwise.]
Task 1: Did user A review movie B in 2006?

A standard classification task: will "existing" users review "existing" movies?
- More in line with the "synthetic" mode of competitions than the "real" mode
Challenges:
- Huge amount of data
  - How to sample the data so that any learning algorithm can be applied is critical
- Complex affecting factors
  - Decreasing interest in old movies; the growing tendency of Netflix users to watch (and review) more movies
Key solutions:
- Effective sampling strategies to keep as much information as possible
- Careful feature extraction from multiple sources
Task 2: How many reviews in 2006?

Task formulation:
- A regression task to predict the total count of reviews from "existing" users for 8,863 "existing" movies
- Evaluation is by RMSE on the log scale
Challenges:
- Movie dynamics and life-cycle: interest in movies changes over time
- User dynamics and life-cycle: no new users are added to the database
Key solutions:
- Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
- Build a set of quarterly lagged models to determine the overall scalar
- Use Poisson regression
Some data observations
1. Task 1 test set is a potential response for training a model for Task 2
- Was sampled according to the marginal (= # reviews for movie in 06 / total # reviews in 06), which is proportional to the Task 2 response (= # reviews for movie in 06)
- BIG advantage: we get a view of 2006 behavior for half the movies
  - Build a model on this half, apply it to the other half (the Task 2 test set)
- Caveats:
  - Proportional sampling implies there is a scaling parameter left, which we don't know
  - Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set
  - Correcting for this is an inverse rejection sampling problem
Leakage Alert!
Test sets from 2006 for Task 1 and Task 2
[Diagram: From the 2006 data, estimate the marginal distribution of ratings; sample (movie, user) pairs according to the product of marginals to form the Task 1 test set (100K rows of movie, user, rating), removing pairs that were rated prior to 2006. The Task 2 test set (8.8K movies) holds each movie's 2006 rating total on the log(n+1) scale. A surrogate learning problem.]
Some data observations (ctd.)
2. No new movies and reviewers in 2006
- Need to emphasize modeling the life-cycle of movies (and reviewers)
  - How are older movies reviewed relative to newer movies?
  - Does this depend on other features (like the movie's genre)?
- This is especially critical when we consider the scaling caveat above
Some statistical perspectives
1. The Poisson distribution is very appropriate for counts
- Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
- The right modeling approach for true counts is Poisson regression:
  n_i ~ Pois(λ_i · t)
  log(λ_i) = Σ_j β_j x_ij
  β* = argmax_β l(n; X, β)   (maximum likelihood)
What does this imply for the model evaluation approach?
- The variance stabilizing transformation for Poisson is the square root: √n_i has roughly constant variance
- RMSE on log scale emphasizes performance on unpopular movies (small Poisson parameter → larger log-scale variance)
- We still assumed that if we do well in a likelihood formulation, we will do well under any evaluation approach
Adapting to evaluation measures!
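The Poisson regression sketched above can be fit in a few lines. This is a minimal illustration on synthetic data (the features, coefficients, and sizes are made up), maximizing the Poisson log-likelihood by Newton's method:

```python
import numpy as np

# Synthetic design matrix and rates (illustrative only)
rng = np.random.default_rng(0)
n, p = 5000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
lam = np.exp(X @ beta_true)      # log(lambda_i) = sum_j beta_j * x_ij
y = rng.poisson(lam)             # n_i ~ Pois(lambda_i), with t = 1

# Maximize the Poisson log-likelihood l(n; X, beta) by Newton's method
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    grad = X.T @ (y - mu)             # score vector
    hess = X.T @ (X * mu[:, None])    # Fisher information
    beta += np.linalg.solve(hess, grad)
```

With enough data the recovered coefficients land close to the generating ones; in practice a GLM library would be used instead of hand-rolled Newton steps.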
Some statistical perspectives (ctd.)
2. Can we invert the rejection sampling mechanism?
This can be viewed as a missing data problem:
- n_i, m_j are the counts for movie i and reviewer j, respectively
- p_i, q_j are the true marginals for movie i and reviewer j, respectively
- N is the total number of pairs rejected due to review prior to 2006
- U_i, P_j are the users who reviewed movie i prior to 2006 and the movies reviewed by user j prior to 2006, respectively

E(n_i | p, q, N) = (100000 + N) · p_i · (1 − Σ_{j ∈ U_i} q_j)
E(m_j | p, q, N) = (100000 + N) · q_j · (1 − Σ_{i ∈ P_j} p_i)

Can we design a practical EM algorithm with our huge data size? Interesting research problem…
We implemented an ad-hoc inversion algorithm:
- Iterate until convergence between:
  - assuming movie marginals are fixed, adjusting reviewer marginals
  - assuming reviewer marginals are fixed, adjusting movie marginals
We verified that it indeed improved our data, since it increased correlation with 4Q2005 counts
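The alternating adjustment can be sketched on a small synthetic instance (all data below are made up; the rejection matrix R marks pairs reviewed prior to 2006). Each pass rescales one set of marginals so that expected post-rejection counts match the observed ones, holding the other set fixed:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, n_users = 50, 80
# R[i, j] = 1 if user j reviewed movie i prior to 2006 (pair gets rejected)
R = (rng.random((n_movies, n_users)) < 0.1).astype(float)
p_true = rng.dirichlet(np.ones(n_movies))   # true movie marginals
q_true = rng.dirichlet(np.ones(n_users))    # true reviewer marginals

total = 100_000
# Expected observed counts after rejection (used in place of real data)
n_obs = total * p_true * (1 - R @ q_true)
m_obs = total * q_true * (1 - R.T @ p_true)

# Ad-hoc inversion: alternate between adjusting movie and reviewer marginals
p = np.full(n_movies, 1.0 / n_movies)
q = np.full(n_users, 1.0 / n_users)
for _ in range(200):
    p = n_obs / (total * (1 - R @ q)); p /= p.sum()
    q = m_obs / (total * (1 - R.T @ p)); q /= q.sum()
```

On this toy instance the iteration recovers marginals that correlate almost perfectly with the generating ones; the real problem is harder because the observed counts are noisy and the matrices are huge.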
Modeling Approach Schema
[Flowchart with two paths:
- Utilizing leakage: count ratings by movie from the Task 1 "who reviewed" test set (100K) → inverse rejection sampling → estimate Poisson regression M1 and predict on Task 1 movies → use M1 to predict Task 2 movies → scale predictions to the total.
- Standard approach: from the Netflix challenge data and IMDB, construct movie features and lagged features for Q1-Q4 2005 → estimate 4 Poisson regressions G1…G4 and predict for 2006 → find the optimal scalar → estimate 2006 total ratings for the Task 2 test set.]
Some observations on modeling approach
1. Lagged datasets are meant to simulate forward prediction to 2006
- Select a quarter (e.g., Q105), remove all movies & reviewers that "started" later
- Build a model on this data with, e.g., Q305 as the response
- Apply the model to our full dataset, which is naturally cropped at Q405; this gives a prediction for Q206
- With several models like this, predict all of 2006
- Two potential uses:
  - Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  - Use only the sum of their predictions, for scaling the Task 1 model
2. We evaluated models on the Task 1 test set
- Used a holdout when also building them on this set
- How can we evaluate the models built on lagged datasets? We are missing a scaling parameter between the 2006 prediction and the sampled set
- Solution: select the optimal scaling based on Task 1 test set performance
- Since the other model was still better, we knew we should use it!
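The lagged construction can be sketched on made-up toy data: pick a training cutoff quarter, drop movies that first appear after it, and use a later quarter's count as the response (quarter indexing and movie IDs below are hypothetical):

```python
from collections import defaultdict

# Toy review log: (movie_id, quarter index), 0 = Q1'04 ... 7 = Q4'05
reviews = [
    ("M1", 0), ("M1", 1), ("M1", 4), ("M1", 6), ("M1", 7),
    ("M2", 3), ("M2", 4), ("M2", 5), ("M2", 6),
    ("M3", 6), ("M3", 7),          # M3 first appears after the cutoff
]

def lagged_dataset(reviews, cutoff_q, target_q):
    counts = defaultdict(lambda: defaultdict(int))
    for movie, q in reviews:
        counts[movie][q] += 1
    rows = []
    for movie, per_q in counts.items():
        if min(per_q) > cutoff_q:  # movie "started" after the cutoff: drop
            continue
        history = [per_q.get(q, 0) for q in range(cutoff_q + 1)]
        rows.append((movie, history, per_q.get(target_q, 0)))
    return rows

# Features up to Q1'05 (index 4), response = Q3'05 (index 6) counts
train = lagged_dataset(reviews, cutoff_q=4, target_q=6)
```

Applying the same feature extraction to the full log (cropped at Q4'05) then yields a two-quarters-ahead prediction, i.e., Q2'06.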
Some details on our models and submission
All models at the movie level. Features we used:
- Historical reviews in previous months/quarters/years (on log scale)
- Movie's age since premiere, movie's age in Netflix (since first review)
  - Also consider log, square, etc. → flexibility in the form of functional dependence
- Movie's genre
  - Include interactions between genre and age: the "life cycle" seems to differ by genre!
Models we considered (MSE on log scale on Task 1 holdout):
- Poisson regression on Task 1 test set (0.24)
- Log-scale linear regression model on Task 1 test set (0.25)
- Sum of lagged models built on 2005 quarters + best scaling (0.31)
Scaling based on lagged models:
- Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
- Implied scaling parameter for predictions: about 90
- Total of our submitted predictions for the Task 2 test set was 9.3M
Competition evaluation
First we were informed that we won with RMSE of ~770
- They mistakenly evaluated on the non-log scale
- Strong emphasis on the most popular movies
- We won by a large margin: our model did well on the most popular movies!
Then they re-evaluated on log scale, and we still won
- On log scale the least popular movies are emphasized
- Recall that the variance stabilizing transformation is in-between (square root)
- So our predictions did well on unpopular movies too!
Interesting question: would we win on square-root scale (or similarly, a Poisson likelihood-based evaluation)? Sure hope so!
Competition evaluation (ctd.)
Results of the competition (log-scale evaluation):
Components of our model's MSE:
- The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
- Additional error from an incorrect scaling factor
Scaling numbers:
- True total reviews: 8.7M
- Sum of our predictions: 9.3M
Interesting question: what would be the best scaling?
- For log-scale evaluation? Conjecture: need to under-estimate the true total
- For square-root evaluation? Conjecture: need to estimate about right
Effect of scaling on the two evaluation approaches

Scaling | Total reviews (M) | Log-scale MSE | Square-root scale MSE | Comment
0.7     | 6.55              | 0.222         | 40.28                 |
0.8     | 7.48              | 0.208         | 29.80                 | Best log performance
0.9     | 8.42              | 0.225         | 26.38                 | Best sqrt performance
0.93    | 8.70              | 0.234         | 26.55                 | Correct scaling
1       | 9.35              | 0.263         | 28.86                 | Our solution
1.1     | 10.29             | 0.316         | 36.37                 |
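The table's qualitative pattern, that log-scale evaluation rewards under-estimation, can be reproduced in a small simulation (all numbers synthetic): generate Poisson counts, scale otherwise-correct rate predictions by a factor s, and score on the log(n+1) scale. Because log is concave, E[log(n+1)] < log(λ+1), so the best scaling is below 1:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = rng.lognormal(mean=0.0, sigma=1.5, size=5_000)  # true per-movie rates
n = rng.poisson(lam)                                  # observed counts

# Score scaled predictions s * lam on the log(n+1) scale
scalings = np.arange(0.4, 1.21, 0.02)
log_mse = [np.mean((np.log(n + 1) - np.log(s * lam + 1)) ** 2)
           for s in scalings]
best_log = scalings[int(np.argmin(log_mse))]
```

On this synthetic data the log-scale optimum lands strictly below a scaling of 1, mirroring the 0.8 optimum in the table.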
Effect of scaling on the two evaluation approaches
KDD CUP 2007: Summary
Keys to our success:
- Identify the subtle leakage
  - Is it formally leakage? Depends on the intentions of the organizers…
- Appropriate statistical approach
  - Poisson regression
  - Inverting the rejection sampling in the leakage
  - Careful handling of time-series aspects
Not keys to our success:
- Fancy machine learning algorithms
Case Study #2: KDD CUP 2008 - Siemens Medical Breast Cancer Identification
- 1712 patients
- 6816 images
- 105,000 candidates
- Candidate feature vector: [x1, x2, …, x117, class]
[Figure: mammography images, MLO and CC views for each breast; is a given candidate malignant?]
KDD-CUP 2008 based on Mammography
- Training: labeled candidates from 1300 patients, plus the association of each candidate to location, image, and patient
- Test: candidates from a separate set of 1300 patients
- Task 1: Rank all candidates by the likelihood of being cancerous
  - Results: IBM Research team was the winner out of 246
- Task 2: Identify a list of healthy patients
  - Results: IBM Research team was the winner out of 205
Task 1: Candidate Likelihood of Cancer
- Almost a standard probability estimation/ranking task at the candidate level
- Somewhat synthetic, as the meaning of the features is unknown
Challenges
- Low positive rate: 7% of patients and 0.6% of candidates
  - Beware of overfitting; consider sampling
- Unfamiliar evaluation measure: FROC, related to AUC; non-robust
- Hint at locality
Key solutions
- Simple linear model
- Post-processing of scores
- Leakage in identifiers
[Figure: FROC curve, true positive patient rate vs. false positive candidate rate per image]
Adapting to evaluation measures!
Task 2: Classify patients
Derivative of the previous task:
- A patient is healthy if all her candidates are benign
- The probability that a patient is healthy is the product of the (benign) probabilities of her candidates
Challenges
- Extremely non-robust performance measure: including any patient with cancer in the list disqualified the entry
- Risk tradeoff: need to anticipate the solutions of the other participants
Key solution
- Pick a model with high sensitivity to false negatives
- Leakage in identifiers: EDA at work
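The healthy-patient rule above reduces to a product over candidates. A minimal sketch (candidate malignancy scores are made up):

```python
import math

# A patient is healthy iff all her candidates are benign, so
# P(healthy) = product over candidates of (1 - p_malignant)
patients = {
    "P1": [0.01, 0.02, 0.005],   # hypothetical malignancy probabilities
    "P2": [0.30, 0.01],
}

def p_healthy(malignant_probs):
    return math.prod(1.0 - p for p in malignant_probs)

# Rank patients from most likely healthy to least likely healthy
ranked = sorted(patients, key=lambda pid: p_healthy(patients[pid]),
                reverse=True)
```

A submitted "healthy list" would then be a prefix of this ranking, with its length chosen by the risk tradeoff discussed above.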
EDA on the Breast Cancer Domain
Console output of sorted 'patient_ID patient_label':
144484 1
148717 0
168975 0
169638 1
171985 0
177389 1
182498 0
185266 0
193561 1
194771 0
198716 1
199814 1
1030694 0
1123030 0
1171864 0
1175742 0
1177150 0
1194527 0
1232036 0
1280544 0
1328709 0
1373028 0
1387320 0
1420306 0
---more---
Base rate of 7%????
What about 200K to 999K?
Mystery of the Data Generation: Identifier Leakage in the Breast Cancer data
- The distribution of patient identifiers has a strong natural grouping: 3 natural buckets
- The three groups have VERY different base rates of cancer prevalence
- The last group seems to be sorted (cancer first)
- Total of 4 groups with very different patient probabilities of cancer
- Organizers admitted to having combined data from different years in order to increase the positive rate
[Scatter plot: model score vs. log of patient ID; every point is a candidate. Four groups: 18 patients, 85% cancer; 245 patients, 36% cancer; 414 patients, 1% cancer; 1027 patients, 0% cancer]
Leakage
Building the classification model
- For evaluation we created a stratified 50% training/test split by patient
  - Given the few positives (~300), results may exhibit high variance
- We explored various learning algorithms, including neural networks, logistic regression, and various SVMs
- Linear models (logistic regression or linear SVMs) yielded the most promising results: FROC 0.0834
- Down-sampling the negative class? Keep only 25% of all healthy patients
  - Helped in some cases, but not a reliable improvement
- Add the identifier category (1, 2, 3, 4) as an additional feature
Modeling Neighborhood Dependence
Candidates are not really i.i.d. but actually relational:
- Stacking
  - Build an initial model and score all candidates
  - Use the labels of neighbors in a second round
- Formulate as an EM problem
  - Treat the labels of the neighbors as unobserved in EM
- Pair-wise constraints based on location adjacency
  - Calculate the Euclidean distance between candidates within the same picture, and the distance to the nipple in both views for each breast
  - Select the pairs of candidates with distance difference less than a threshold
  - Constraints: selected pairs of examples (x_i,MLO, x_i,CC) should have the same predicted labels, i.e., f(x_i,MLO) = f(x_i,CC)
Results
- Seems to improve the probability estimates in terms of AUC
- Did not improve FROC
Relational Data
Outlier Treatment
- Many of the 117 numeric features have large outliers
- They incur a huge penalty in terms of likelihood:
  - Large bias
  - Badly calibrated probabilities
  - Extreme (wrong) values in the prediction
[Histogram of Feature 10: 142 observations > 5]
Statistics
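One simple treatment, sketched here on synthetic data, is to cap each feature at a high percentile before fitting, so a handful of extreme values (like the 142 observations above 5 in the histogram) cannot dominate a likelihood-based linear model:

```python
import numpy as np

# Synthetic feature: mostly standard normal, plus a heavy right tail
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 9_858), rng.uniform(5, 20, 142)])

# Winsorize: cap at the 99.5th percentile
cap = np.quantile(x, 0.995)
x_capped = np.minimum(x, cap)
```

Other options include log transforms or robust loss functions; the point is to limit the leverage of the tail before it reaches the model.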
ROC vs. FROC optimization: Post-processing of model scores?
- In ROC all rows are independent, and both true positives and false positives are counted by row
- FROC has true patients and false positive candidates
- A higher TP rate for candidates does not improve FROC unless it comes from a new patient, e.g.:
  - It's better to have 2 correctly identified candidates from different patients than 5 from the same patient
  - It's best to re-order candidates based on model scores so as to ensure many different patients up front
[Figures: ROC curve (true positive rate vs. false positive rate) and FROC curve (true positive patient rate vs. false positive candidate rate)]
Adapting to evaluation
Probabilistic Approach
At any point we want to maximize the expected gradient of the FROC at that point. Define, for each candidate c of patient i:
- p_c: probability that candidate c is malignant
- np_i: probability that patient i has not yet been identified
3 cases:
- Candidate is positive but the patient is already identified, with probability p_c · (1 − np_i)
- Candidate is positive and the patient is new, with probability p_c · np_i
- Candidate is negative, with probability 1 − p_c
Pick the candidate with the largest expected gain: p_c · np_i / (1 − p_c)
Theorem: The expected value of FROC for this order is higher than for any other order.
Problem: Our probability estimates are not good enough for this to work well.
Theory of Post-processing / Adapting to evaluation
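The greedy expected-gain ordering can be sketched in a few lines (candidate scores below are made up): repeatedly pick the candidate maximizing p_c · np_i / (1 − p_c), then discount its patient's "not yet identified" probability:

```python
# (patient_id, p_c) pairs with hypothetical malignancy probabilities
candidates = [
    ("A", 0.90), ("A", 0.85), ("B", 0.60), ("B", 0.55), ("C", 0.40),
]

not_identified = {pid: 1.0 for pid, _ in candidates}  # np_i, initially 1
order = []
pool = list(candidates)
while pool:
    # Expected gain of placing candidate c next: p_c * np_i / (1 - p_c)
    best = max(pool, key=lambda c: c[1] * not_identified[c[0]] / (1 - c[1]))
    pool.remove(best)
    order.append(best)
    pid, p = best
    not_identified[pid] *= (1 - p)  # patient stays unidentified only if benign
```

Note how the discounting pushes a second candidate from an already-likely-identified patient down the list, so distinct patients surface early.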
Bad Calibration!
- We consistently over-predict the probability of cancer for the most likely candidates:
  - Linear bias of the method
  - High class skew
  - Outliers in the 117 numeric features lead to extreme predictions on holdout
[Calibration plot: predicted probability vs. true probability]
Re-calibration?
- We tried a number of methods; no improvement
- Some resulted in better calibration but hurt the ranking
Statistics
Post-Processing Heuristic
Re-ordering model scores significantly improves the FROC with no additional modeling.
Ad-hoc approach:
- Take the top n ranked candidates, where n is approximately the number of positive candidates
- Select the candidate with the highest score for each patient from this list and put them at the top of the list
- Iterate until all top n candidates are re-ordered
[Figure: true positive patient rate vs. false positive rate per image]
Adapting to evaluation
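The heuristic can be sketched as follows (scores are made up): within the top-n candidates, repeatedly promote each patient's best remaining candidate, so many distinct patients appear early in the ranking:

```python
# (patient_id, score) pairs, already sorted by score descending
scored = [
    ("A", 0.9), ("A", 0.8), ("B", 0.7), ("A", 0.6), ("C", 0.5), ("B", 0.4),
]

def reorder_top(scored, n):
    top, rest = list(scored[:n]), list(scored[n:])
    front = []
    while top:
        seen, keep = set(), []
        for cand in top:
            if cand[0] not in seen:   # best remaining candidate per patient
                seen.add(cand[0])
                front.append(cand)
            else:
                keep.append(cand)     # deferred to a later pass
        top = keep
    return front + rest

reordered = reorder_top(scored, n=5)
```

The result is a permutation of the original ranking in which the first few positions cover as many distinct patients as possible, which is what FROC rewards.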
Submissions and Results
Task 1
- Bagged linear SVMs with hinge loss and heuristic post-processing
- This approach scored the winning result of 0.0933 on FROC, out of 246 submissions from 110 unique participants
- Second place scored 0.0895
- Some rumors that other participants also found the ID leakage
Task 2
- The logistic model performs better than the SVM models, probably because likelihood is more sensitive to extreme errors (the first false negative)
- The first false negative typically occurs around 1100 patients in the training set
- We submitted the first 1020 patients ranked by a logistic model that included the ID feature + the original 117 features
- Scored a specificity of 0.682 on the test set with no false negatives
- Only 24 out of 203 submissions had no false negatives
- Second place scored 0.17 specificity
Summary in terms of success factors
- Leakage in the identifier provides information about the likelihood of a patient having cancer
  - Caused by the organizers' effort to increase the positive rate by adding 'old' patients who developed cancer
- Post-processing for FROC optimization
- Awareness of the impact of feature outliers
  - Interacts with the statistical properties of the data and the model
  - Log-likelihood is more sensitive than hinge loss
- Otherwise a simple model to avoid overfitting: linear models
- Relational information is not helpful for the given evaluation
KDD CUP 2009
Data: customer database of Orange with 100K observations and 15K variables
Three different tasks and 2.5 versions:
- Prediction: churn, appetency, upselling
- Versions: fast (5 days) & slow (1 month); large and small versions
Interesting characteristics:
- Highly 'sterile': nothing known about anything
- Leaderboard: it was possible to match the large and small versions and receive feedback on 20% of the test set
KDD CUP 2000
Data: online store history for Gazelle.com
Five different tasks, including:
- Prediction: Who will continue in session? Who will buy?
- Insights: characterize heavy spenders
Interesting characteristics:
- "Leakage": internal testing sessions were left in the data
  - Deterministic behavior; if identified, gives 100% accuracy in prediction for part of the data
- Evaluation in terms of "real" business objectives? Sort of: handled by defining a set of "standard" questions, each covering a different aspect of the business objective
- Relational data? Yes: customers had different numbers of sessions, of different lengths, with different stages
KDD CUP 2003
Data: citation rates of physics papers
Two tasks:
- Predict the change in the number of citations during the next 3 months
- Write an interesting paper about it
Interesting characteristics:
- Highly relational: links between papers and authors
- Feature construction left up to participants
- Leakage impossible, since the truth was really in the future
- Evaluation on SSE against integer values (Poisson)
ILP Challenge 2003
Data: yeast genome, including protein sequence, alignment similarity scores with other proteins, and additional protein information from a relational DB
Task: identify (potentially multiple) functional classes for each gene
Interesting characteristics:
- 420 possible classes, very subjective assignment
- Purely relational, no 'features' available:
  - Distances (supposedly p-values) of gene alignment
  - Secondary structure (of the amino acid sequence)
  - Protein DB with keywords, etc.
- 'Leakage' in the identifier: contains a letter for the labeling research group
- Highly unsatisfactory evaluation: precision of the prediction
INFORMS Data Mining Contest 2008
Data: 2 years of hospital records with accounting information (cost, reimbursement, …), patient demographics, medication history
Tasks:
- Identify pneumonia patients
- Design an optimization setting for preventive treatment
Interesting characteristics:
- Relational setting (4 tables linked through the patient identifier)
- Leakage: removal of the pneumonia code left hidden traces
- 'Dirty' data with plenty of missing values, contradicting demographics, and changing patient IDs
Data Mining in the Wild: Project Work
Similarities with competitions (compared to DM research):
- Single dataset
- Algorithms can be existing and simple
- No real need for baselines (although useful)
- The absolute performance matters
Differences from competitions:
- You need to decide what the analytical problem is
- You need to define the evaluation rather than optimize it
- You need to avoid leakage rather than use it
- You need to FIND all relevant data rather than use what is there (often leads to relational settings)
- You need to deliver it somehow to have impact
Case Study #3: Market Alignment Program
Wallet: the total amount of money a customer can spend in a certain product category in a given period
Why are we interested in wallet?
- Customer targeting
  - Focus on acquiring customers with high wallet
  - For existing customers, focus on high-wallet, low share-of-wallet customers
- Sales force management
  - Use wallet as a sales force allocation target and make resource assignment decisions based on it
  - Evaluate the success of sales personnel by attained share-of-wallet
Wallet Modeling Challenge
- The customer wallet is never observed
  - Nothing to "fit a model" to
  - Even if you have a model, how do you evaluate it?
- We would like a predictive approach from available data:
  - Firmographics (sales, industry, employees)
  - IBM sales and transaction history
Define Wallet/Opportunity?
- TOTAL: total customer available budget for IT
  - Can we really hope to attain all of it?
- SERVED: total customer spending on IT products offered by IBM
  - A better definition for our marketing purposes
- REALISTIC: IBM spending of the "best similar customers"
[Diagram: nested view, Company Revenue ⊇ TOTAL ⊇ SERVED ⊇ REALISTIC ⊇ IBM Sales]
REALISTIC Wallets as Quantiles
Motivation:
- Imagine 100 identical firms with identical IT needs
- Consider the distribution of IBM sales to these firms
- The bottom 95% of firms should spend as much as the top 5%
Define the REALISTIC wallet as a high percentile of spending conditional on the customer attributes
- This implies that a few customers are spending their full wallet with us; however, we do not know which ones
Formally: Percentile of Conditional
Distribution of IBM sales s to the customer given customer attributes x: s|x ~ f_x
Two obvious ways to get at the p-th percentile:
- Estimate the conditional by integrating over a neighborhood of similar customers
  - Take the p-th percentile of spending in the neighborhood
- Create a global model for the p-th percentile
  - Build global regression models, e.g., s|x ~ Exp(αx + β)
REALISTIC
Estimation: the Quantile Loss Function
- The mean minimizes a sum of squared residuals:
  min_μ Σ_{i=1}^n (y_i − μ)²
- The median minimizes a sum of absolute residuals:
  min_m Σ_{i=1}^n |y_i − m|
- The p-th quantile minimizes an asymmetrically weighted sum of absolute residuals:
  min_{ŷ} Σ_{i=1}^n L_p(y_i, ŷ_i), where
  L_p(y, ŷ) = p · (y − ŷ) if y ≥ ŷ
  L_p(y, ŷ) = (1 − p) · (ŷ − y) if y < ŷ
[Plot: the asymmetric quantile loss for p=0.8 vs. the absolute loss (p=0.5)]
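The quantile-loss property is easy to check numerically: minimizing the 0.9 pinball loss over a constant recovers the empirical 0.9 quantile. A minimal sketch on skewed synthetic "spending" data:

```python
import numpy as np

def quantile_loss(y, yhat, p):
    """Pinball loss: p*(y - yhat) if y >= yhat, else (1 - p)*(yhat - y)."""
    d = y - yhat
    return np.where(d >= 0, p * d, (p - 1) * d).mean()

rng = np.random.default_rng(2)
y = rng.lognormal(0.0, 1.0, size=10_000)   # skewed synthetic data

# Search for the constant prediction minimizing the 0.9 pinball loss
grid = np.linspace(y.min(), y.max(), 2_000)
losses = [quantile_loss(y, c, 0.9) for c in grid]
best = grid[int(np.argmin(losses))]
```

In a regression setting the constant is replaced by a model ŷ(x) and the same loss is minimized over its parameters, as in linear quantile regression.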
Overview of analytical approaches
[Diagram of approaches:
- 'Ad hoc': kNN by industry and size
- General kNN: choices of k, distance, features
- Optimization: quantile regression, with model forms linear, decision tree, quanting
- Decomposition: linear model with adjustment]
Data generation process

- Need to combine data on revenue with customer properties
- Complicated matching between IBM's internal customer view (accounts) and the external sources (Dun & Bradstreet)
  - A probabilistic process with plenty of heuristics
  - Huge danger of introducing data bias
  - Tradeoff between data quality and coverage
- Leakage potential: we can only get current customer information, which may already be tainted by the customer's interaction with IBM
  - The problem is amplified when we try to augment the data with home-page information
Evaluating measures for wallet

We still don't know the truth, so we use a combined approach:
- Quantile loss to assess the relevant predictive ability and for feature selection
- Expert feedback to select a suitable model class
- Business impact to identify overall effectiveness

Quantile loss (available, but not that relevant):
- Missing a parameter; sensitive to skew; which scale, original or log?

Expert feedback (relevant):
- Similar to a survey; unclear incentives; potentially biased; hard to come by at large scale

Business impact (very relevant):
- Highly aggregated; long lag; convoluted with the impact of other things; requires intense tracking
Empirical evaluation I: quantile loss

Setup:
- Four domains with relevant quantile modeling problems: direct mailing, housing prices, income data, IBM sales
- Performance on a test set in terms of 0.9 quantile loss

Approaches:
- Linear quantile regression
- Q-kNN (kNN with quantile prediction from the neighbors)
- Quantile trees (quantile prediction in the leaf)
- Bagged quantile trees
- Quanting (Langford et al. 2006; reduces quantile estimation to averaged classification using trees)

Baselines:
- Best constant model
- Traditional regression models for expected values, adjusted under a Gaussian assumption (+1.28 standard deviations for the 0.9 quantile)
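To make the contrast between the quantile approaches and the Gaussian-adjusted baseline concrete, here is a sketch of linear quantile regression fit by subgradient descent on the pinball loss, next to the "+1.28" baseline. The solver, learning rate, and toy data are my assumptions; the tutorial does not prescribe an optimizer:

```python
import numpy as np

def fit_linear_quantile(X, y, p=0.9, lr=0.05, epochs=4000):
    """Fit a linear model for the p-th conditional quantile by (sub)gradient
    descent on the pinball loss. A sketch, not a production solver."""
    Xb = np.column_stack([X, np.ones(len(y))])   # append an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        r = y - Xb @ w
        # subgradient of the pinball loss w.r.t. the prediction:
        # -p where the residual is positive, (1 - p) where it is not
        g = np.where(r > 0, -p, 1.0 - p)
        w -= lr * (Xb.T @ g) / len(y)
    return w

# toy data with skewed, heteroscedastic noise, so the 0.9 quantile
# is NOT just the conditional mean shifted by a constant
rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=5000)
y = 1.0 + 2.0 * x + rng.exponential(0.5 + x)

w = fit_linear_quantile(x[:, None], y, p=0.9)
pred = np.column_stack([x, np.ones(len(y))]) @ w
print("fraction of points below the fitted 0.9 line:", np.mean(y <= pred))

# the "+1.28" baseline: least-squares fit plus 1.28 residual standard
# deviations (the 0.9 z-score under a Gaussian assumption)
A = np.column_stack([x, np.ones(len(y))])
w_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
base = A @ w_ls + 1.28 * (y - A @ w_ls).std()
print("fraction below the Gaussian-adjusted baseline:", np.mean(y <= base))
```

When the residuals are not Gaussian, as here, the quantile fit calibrates its coverage directly, which is the point of the slide's conclusion below that standard regression is not competitive.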
Performance on quantile loss (smaller is better)

Conclusions:
- Standard regression is not competitive (because the residuals are not normal)
- If there is a time-lagged variable, the linear quantile model is best
- The splitting criterion is irrelevant in the tree models
- Quanting (using decision trees) and quantile trees perform comparably
- Generalized kNN is not competitive
Evaluation II: MAP workshops overview

- Calculated 2005 opportunity using the naive Q-kNN approach
- 2005 MAP workshops: displayed the opportunity by brand; the expert can accept or alter it
- Selected 3 brands for evaluation: DB2, Rational, Tivoli
- Built ~100 models for each brand using different approaches
- Compared expert opportunity to the model predictions
  - Error measures: absolute, squared; scales: original, log, root; a total of 6 measures
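The six measures are simple to compute once expert and model opportunities are aligned. A sketch; the function name, the log1p guard against zero opportunities, and the toy numbers are my assumptions:

```python
import numpy as np

def six_measures(expert, model):
    """Mean absolute and mean squared error on the original, log, and
    square-root scales: the 6 measures described on the slide."""
    scales = {
        "original": (expert, model),
        "log":  (np.log1p(expert), np.log1p(model)),   # log1p tolerates zeros
        "root": (np.sqrt(expert),  np.sqrt(model)),
    }
    out = {}
    for name, (e, m) in scales.items():
        out[f"abs_{name}"] = np.mean(np.abs(e - m))
        out[f"sq_{name}"]  = np.mean((e - m) ** 2)
    return out

expert = np.array([0.0, 2.0, 5.0, 10.0])   # expert-validated opportunity ($M)
model  = np.array([1.0, 2.0, 4.0, 12.0])   # model prediction ($M)
for name, val in six_measures(expert, model).items():
    print(f"{name}: {val:.3f}")
```

Running all 6 measures per brand gives the per-model top-10 / top-20 counts tallied in the comparison table below.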
[Figure: scatter plot of expert feedback vs. model opportunity (MODEL_OPPTY), both on a 0-20 $M scale]

Expert feedback to the original model:
- Experts accept the opportunity: 45%
- Experts change the opportunity: 40% (increase: 17%, decrease: 23%)
- Experts reduce the opportunity to 0: 15%
Observations

- Many accounts are set to zero for external reasons; exclude these from the evaluation, since no model can predict the competitive environment
- Opportunities follow an exponential distribution, so evaluation on the original (non-log) scale is subject to large outliers
- Experts seem to make percentage adjustments, so consider log-scale evaluation in addition to the original scale, with the root scale as an intermediate
- We suspect a strong "anchoring" bias: 45% of opportunities were not touched
Model comparison results

We count how often a model scores within the top 10 / top 20 for each of the 6 measures:

Model                               | Rational | DB2   | Tivoli
Displayed model (kNN) (anchoring)   | 6 / 6    | 4 / 5 | 6 / 6
Max 03-05 revenue                   | 1 / 1    | 0 / 3 | 1 / 4
Linear quantile 0.8 (best)          | 5 / 6    | 2 / 4 | 3 / 5
Regression tree                     | 1 / 3    | 2 / 4 | 1 / 2
Q-kNN 50 + flooring                 | 2 / 3    | 6 / 6 | 4 / 6
Decomposition center                | 0 / 0    | 3 / 5 | 0 / 4
Quantile tree 0.8                   | 0 / 1    | 2 / 4 | 1 / 4
MAP experiments: conclusions

- Q-kNN performs very well after flooring, but is typically inferior before flooring
- Linear quantile regression at the 80th percentile performs consistently well (flooring has only a minor effect)
- Experts are strongly influenced by the displayed opportunity (and the displayed revenue of previous years)
- Models without last year's revenue don't perform well
- Decision: use linear quantile regression with q = 0.8 in MAP 06
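The "flooring" step is not spelled out on the slides. One plausible reading, consistent with the wallet hierarchy earlier in this section (REALISTIC wallet >= current IBM sales), is to clip predictions from below at the customer's observed revenue. A hypothetical sketch under that assumption:

```python
import numpy as np

def floor_opportunity(wallet_pred, observed_revenue):
    """Floor wallet predictions at what the customer already spends with us.
    This is a guess at the slide's 'flooring' step: by definition the
    REALISTIC wallet cannot be below current IBM sales."""
    return np.maximum(wallet_pred, observed_revenue)

print(floor_opportunity(np.array([3.0, 1.0, 7.0]),
                        np.array([2.0, 4.0, 5.0])))   # -> [3. 4. 7.]
```

Whatever the exact rule, the conclusions above say it matters a lot for Q-kNN and little for the linear quantile model.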
MAP business impact

- MAP launched in 2005
- In 2006, 420 workshops were held worldwide, with teams responsible for most of IBM's revenue
- Recognized as a 2006 IBM Research Accomplishment, awarded based on "proven" business impact
- Runner-up for the Case Study Award at KDD 2007; Edelman finalist 2009
- Most important use is segmentation of the customer base: shift resources into "invest" segments with low wallet share
Business impact

- For 2006, 270 resource shifts were made into 268 "Invest" accounts
- We examine the performance of these accounts relative to the background
[Figure: scatter of 2005 actual revenue ($M) vs. validated revenue opportunity ("revenue aspiration", $M), with accounts segmented into INVEST, EXAMINE, CORE (Growth / Optimize); the 270 shifts target the INVEST segment]
- Revenue: 9% growth in INVEST accounts vs. 4% growth in all other accounts
- Quota attainment: 45% for MAP-shifted resources vs. 36% for non-MAP shifts
- Pipeline (relative to 2005): 17% growth in INVEST accounts vs. 3% growth in all other accounts
Summary in terms of success factors

1. Data and domain understanding
   - Matching the business objective to the modeling approach made a previously unsolvable business problem solvable with predictive modeling
2. Statistical insight
   - Minimizing quantile loss estimates the correct quantity
   - A single evaluation metric is not enough in real life
   - Autocorrelation helps the linear model
3. Modeling
   - Extension to tree induction; comparative study; in the end, linear it is
Identify potential causes for chip failure

- Data: 5K machines, of which 18 failed in the last year
- Task: identify a (short) list of other machines that are likely to fail, so they can be preemptively fixed
- Characteristics:
  - Relational: tool ID; multiple chips per machine (only the first failure is detected)
  - Leakage: the database was clearly augmented after failure: all failures have an associated customer, but the customer is missing for most non-failures
  - Statistical observation: this is really a survival analysis problem; failures do not occur before a runtime of 180 days
- Accuracy, and even AUC, is NOT relevant; the insight sought is the cause of failure, so lift and the false positive rate in the top k are more important
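Lift and false positives in the top k are straightforward to compute from a risk score per machine; a sketch with names and toy data of my own choosing:

```python
import numpy as np

def topk_metrics(scores, labels, k):
    """Lift and false-positive count among the k highest-scored machines:
    the quantities the slide argues matter here, rather than accuracy or AUC."""
    order = np.argsort(-scores)            # highest risk first
    top = labels[order[:k]]
    precision_at_k = top.mean()
    base_rate = labels.mean()
    return precision_at_k / base_rate, int((top == 0).sum())

labels = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])   # 2 failures among 10 machines
scores = np.array([.9, .8, .1, .7, .6, .2, .3, .1, .05, .4])
lift, fp = topk_metrics(scores, labels, k=3)
print(lift, fp)   # catches both failures in the top 3, at the cost of 1 false alarm
```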
Threats in competitions and projects

Competitions:
- Mistakes under time pressure
- Accidental use of the target (kernel SVM)
- Complexity
- Overfitting

Projects:
- Unavailability of data
- Data generation problems
- The model is not good enough to be useful
- Model results are not accessible to the user
  - If the user has to understand the model, you need to keep it simple
  - Web delivery of predictions
Overfitting

Even if you think you know this one, you probably still overdo it!
- KDD Cup results have shown that a large number of entries overfit: in 2003, 90% of entries did worse than the best constant prediction
- Corollary: don't overdo the search
- Having a holdout does NOT make you immune to overfitting; you just overfit on the holdout
- 10-fold cross-validation does NOT make you immune either
- Leaderboards based on 10% of the test set are VERY deceptive
  - KDD Cup 2009: the winner of the fast challenge, after only 5 days, was indeed the leader of the board
  - The winner of the slow challenge, after 1 more month, was NOT the leader of the board
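A tiny simulation (mine, not from the tutorial) of why a 10% public leaderboard is deceptive: among many models of identical true quality, the model that tops the leaderboard does so largely by luck, and its public score substantially overstates its full-test score:

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_test = 200, 10_000
true_acc = 0.70                                   # 200 models, all equally good

# simulate per-example correctness, then score on a 10% "public" split
correct = rng.random((n_models, n_test)) < true_acc
board = correct[:, : n_test // 10].mean(axis=1)   # public 10% of the test set
full = correct.mean(axis=1)                       # full (private) test set

winner = int(np.argmax(board))                    # who leads the leaderboard?
print(f"leaderboard score of the winner: {board[winner]:.3f}")
print(f"full-test score of the winner:   {full[winner]:.3f}")
```

Selecting on the same small sample you report on inflates the winner's score; the private-test ranking then reshuffles, exactly as in KDD Cup 2009's slow challenge.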
Overfitting example: KDD Cup 2008

- Data: 105,000 candidates, 117 numeric features; sounds good, right?
- Overfitting is NOT just about training size and model complexity: linear models overfit too!
- How robust is the evaluation measure?
  - AUC vs. FROC; the number of healthy patients
- What is the base rate? Only 600 positives
Factors of success in competitions and real life

1. Data and domain understanding
   - Generation of data and task
   - Cleaning and representation/transformation
2. Statistical insights
   - Statistical properties; test validity of assumptions; performance measure
3. Modeling and learning approach
   - The most "publishable" part
   - Choice or development of the most suitable algorithm

(The slide annotates these factors along an axis running from "real" (1) to "sterile" (3).)
Success factor 1: data and domain understanding

Task and data generation: formulate the analytical problem (MAP), EDA, check for leakage.

KDD 07: Netflix                  | KDD 08: Cancer                    | MAP
Adjust for decreasing population;| Combined sources lead to leakage  | Wallet definition and design of
Task 1 target leakage            |                                   | the analytical solution
Success factor 2: statistical insights

Properties of evaluation measures: does it measure what you care about? Robustness; invariance to transformations; the linkage between model optimization, the statistic, and performance.

KDD 07: Netflix                   | KDD 08: Cancer                | MAP
Poisson regression; log transform,| Post-processing; robust       | Multiple measures
downscale; highly non-robust,     | evaluation                    |
beware of overfitting             |                               |
Success factor 3: models and approach

- How much complexity do you need? Often linear does just fine with correctly constructed features (actually, my wins have been with linear models)
- Feature selection
- Can you optimize what you want to optimize? How does the model relate to your evaluation metric?
  - Regression approaches predict the conditional mean
  - Accuracy vs. AUC vs. log-likelihood
- Does it scale to your problem? Some cool methods just do not run on 100K examples

Netflix                      | KDD Cup 08                       | MAP
Linear Poisson, log transform| Logistic regression, linear SVM  | Linear quantile regression
Summary: comparison of case studies

                         | KDD Cup 07, Task 2       | KDD Cup 08, Task 1        | MAP
Ultimate modeling goal   | Demand forecasting       | Breast cancer detection   | Customer wallet estimation
Evaluation objective     | Log-scale RMSE           | FROC in 0.2-0.3           | Quantile loss / expert feedback
Key data/domain insight  | Leakage from Task 1      | Leakage in patient IDs    | Duality quantile-wallet
Key statistical insight  | Poisson distribution     | FROC post-processing      | Optimizer of quantile loss
Best modeling approach   | Maximum likelihood       | Machine learning          | Empirical risk minimization
                         | (Poisson reg.)           | (linear SVM)              | (quantile reg.)
Invitation: please join us in another data mining competition!

- INFORMS Data Mining Contest on health care data; register at www.informsdmcontest2009.org
- Real data of hospital visits for patients with severe heart disease
- "Real" tasks from an ongoing project: transfer to specialized hospitals; severity / death
- Relational (multiple hospital stays per patient)
- Evaluation: AUC
- Publication and workshop at INFORMS 2009