KDD-09 Tutorial
Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University
Claudia Perlich, IBM Research

Predictive Modeling in the Wild, Saharon Rosset & Claudia Perlich
Predictive modeling

Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes.
Primary example: supervised learning
- Get training data (x1,y1), (x2,y2), ..., (xn,yn), drawn i.i.d. from a joint distribution on (X, y)
- Build a model f(x) to describe the relationship between x and y
- Use it to predict y when only x is observed in the "future"
Other cases may relax some of the supervised learning assumptions
- For example, in KDD Cup 2007 we did not see any yi's and had to extrapolate them based on the training xi's (see later in the tutorial)
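A minimal sketch of this setup, on hypothetical data, with plain least squares standing in for the model f:

```python
import numpy as np

# Hypothetical training data (x_i, y_i) drawn i.i.d. from a joint distribution
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Build a model f(x) describing the relationship between x and y
# (here: a least-squares line, f(x) = a*x + b)
a, b = np.polyfit(x, y, deg=1)

def f(x_new):
    """Predict y for "future" x where y is unobserved."""
    return a * x_new + b

# Use the model where only x is observed
print(f(5.0))  # should be close to 2*5 + 1 = 11
```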
Predictive Modeling Competitions

- Competitions like the KDD Cup extract "core" predictive modeling challenges from their application environment
- They are usually supposed to represent real-life predictive modeling challenges
- Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems
- We will see this in examples
The Goals of this Tutorial

- Understand the two modes of predictive modeling, their similarities and differences:
  - Real-life projects
  - Data mining competitions
- Describe the main factors for success in the two modes of predictive modeling
- Discuss some of the recurring challenges that come up in determining success
These goals will be addressed and demonstrated through a series of case studies.
Credentials in Data Mining Competitions

Saharon Rosset
- Winner, KDD Cup 99~
- Winner, KDD Cup 00+
Claudia Perlich
- Runner-up, KDD Cup 03*
- Winner, ILP Challenge 05
- Winner, KDD Cup 09@
Jointly
- Winners, KDD Cup 2007@
- Winners, KDD Cup 2008@
- Winners, INFORMS Data Mining Challenge 08@
Collaborators: @Prem Melville, @Yan Liu, @Grzegorz Swirszcz, *Foster Provost, *Sofus Macskassy, +~Aron Inger, +Nurit Vatnik, +Einat Neuman, @Alexandru Niculescu-Mizil
Experience with Real-Life Projects

- 2004-2009: collaboration on Business Intelligence projects at IBM Research
- Total of >10 publications on real-life projects
- Total of 4 IBM Outstanding Technical Achievement awards
- IBM Accomplishment and Major Accomplishment
- Finalists in this year's INFORMS Edelman Prize for real-life applications of Operations Research and Statistics
One of the successful projects will be discussed here as a case study.
Outline

1. Introduction and overview (SR)
   - Differences between competitions and real life
   - Success factors
   - Recurrent challenges in competitions and real projects
2. Case studies
   - KDD Cup 2007 (SR)
   - KDD Cup 2008 (CP)
   - Business Intelligence example: Market Alignment Program (MAP) (CP)
3. Conclusions and summary (CP)
Differences between competitions and projects

Task
- Competition: clearly defined tasks; clear evaluation metrics
- Project: 'improve marketing effectiveness', 'identify underperforming stores', 'at what R2 can I fire people?'

Data
- Competition: clean and available, with (some) documentation
- Project: don't know what data they have; don't know what the data mean

Objective
- Competition: prediction
- Project: insight, decision support, weapon in political battlefields, prediction

Deliverable
- Competition: ASCII file with numbers
- Project: endless conference calls, PowerPoint slides, a prototype/predictions (bi-monthly to annual refresh)

Duration
- Competition: weeks/months; you know when it is over
- Project: some projects just fail to die (3+ years); most die before being born

In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start with a well-defined predictive task and ignore most of the difficulties of getting to that point.
Real-life project evolution and our focus

[Figure: pipeline of real-life project stages, annotated with examples from the wallet-estimation work: business/modeling problem definition (sales force mgmt., wallet est.) -> statistical problem definition (quantile est., latent variable est.) -> modeling methodology design (quantile est., graphical model) -> model generation & validation (programming, simulation, IBM Wallets) -> implementation & application development (OnTarget, MAP), all supported by data preparation & integration (IBM relationships, firmographics). The slide marks each stage as "not our focus", "loosely related", or "our focus".]
Two types of competitions

Real
- Raw data
- Set up the model yourself
- Task-specific evaluation
- Simulate real-life mode
- Examples: KDD Cup 2007, KDD Cup 2008
- Approach: understand the domain, analyze the data, build the model
- Challenges: too numerous

Sterile
- Clean data matrix
- Standard error measure
- Often anonymized features
- Pure machine learning
- Examples: KDD Cup 2009, PKDD Challenge 2007
- Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
- Challenges: size, missing values, # of features
Factors of Success in Competitions and Real Life

1. Data and domain understanding
   - Generation of data and task
   - Cleaning and representation/transformation
2. Statistical insights
   - Statistical properties
   - Test validity of assumptions
   - Performance measure
3. Modeling and learning approach
   - The most "publishable" part
   - Choice or development of the most suitable algorithm
(The slide arranges these factors along the real-to-sterile spectrum.)
Recurring challenges

We emphasize three recurring challenges in predictive modeling that often get overlooked:
1. Data leakage: impact, avoidance and detection
   - Leakage: use of "illegitimate" data for modeling
   - "Legitimate" data: data that will be available when the model is applied
   - In competitions, the definition of leakage is unclear
2. Adapting learning to real-life performance measures
   - Could move well beyond standard measures like MSE, error rate, or AUC
   - We will see this in two of our case studies
3. Relational learning / feature construction
   - Real data is rarely flat, and good, practical solutions for this problem remain a challenge
1. Leakage in Predictive Modeling

Leakage is the introduction of predictive information about the target by the data generation, collection, and preparation process.
- Trivial example: a binary target was created using a cutoff on a continuous variable and, by accident, the continuous variable was not removed
- Reversal of cause and effect, when information from the future becomes available
It produces models that do not generalize: true model performance is much lower than the 'out-of-sample' (but leakage-contaminated) estimate.
It commonly occurs when combining data from multiple sources or multiple time points, and often manifests in the ordering of data files.
Leakage is surprisingly pervasive in competitions and real life
- KDD Cup 2007 and KDD Cup 2008 had leakage, as we will see in the case studies
- The INFORMS competition had leakage due to partial removal of information for only the positive cases
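The trivial example above can be demonstrated in a few lines on synthetic data: with the leaked continuous variable still in the feature set, evaluation looks perfect, yet the model has learned nothing deployable.

```python
import numpy as np

rng = np.random.default_rng(1)
score = rng.normal(size=1000)          # continuous variable
y = (score > 0).astype(int)            # binary target defined by a cutoff on it

# By accident the continuous variable stays in the feature set:
# any model can recover the cutoff and score perfectly "out of sample".
pred = (score > 0).astype(int)
leaked_accuracy = (pred == y).mean()
print(leaked_accuracy)  # 1.0 -- too good to be true

# A legitimate feature only weakly related to y gives a realistic baseline.
noisy = score + rng.normal(scale=3.0, size=1000)
honest_accuracy = ((noisy > 0).astype(int) == y).mean()
```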
Real-life leakage example

P. Melville, S. Rosset, R. Lawrence (2008). Customer Targeting Models Using Actively-Selected Web Content. KDD-08.
We built models for identifying new customers for IBM products, based on:
- IBM internal databases
- Companies' websites
Example pattern: companies with the word "Websphere" on their website are likely to be good customers for IBM Websphere products
- Ahem, a slight cause-and-effect problem
Source of the problem: we only have the current view of a company's website, not its view when it was an IBM prospect (i.e., prior to buying)
Ad-hoc solution: remove all obvious leakage words
- This does not solve the fundamental problem
General leakage solution: "predict the future"

Niels Bohr is quoted as saying: "Prediction is difficult, especially about the future"
Flipping this around, if:
- the true prediction task is "about the future" (it usually is),
- we can make sure that our model only has access to information "at the present", and
- we can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage,
then we are guaranteed (intuitively and mathematically) that we can prevent leakage.
For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since.
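The time-based cutoff can be sketched as follows (hypothetical event log and event names): features are built only from events stamped before the cutoff, and the target only from events at or after it, so target-revealing events cannot leak into the features.

```python
from datetime import date

# Hypothetical event log: (customer, event, date)
events = [
    ("acme", "visited_site", date(2006, 3, 1)),
    ("acme", "bought_websphere", date(2008, 5, 2)),
    ("globex", "visited_site", date(2007, 7, 9)),
]

CUTOFF = date(2007, 1, 1)  # "the present" for modeling purposes

# Model inputs: only what was known at the cutoff
features = {}
for cust, ev, d in events:
    if d < CUTOFF:
        features.setdefault(cust, []).append(ev)

# Target: what happened after the cutoff (who bought since)
target = {cust: any(ev == "bought_websphere" and d >= CUTOFF
                    for c, ev, d in events if c == cust)
          for cust in {c for c, _, _ in events}}

# "bought_websphere" never appears in the features, so it cannot leak
assert all("bought_websphere" not in evs for evs in features.values())
```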
2. Real-life performance measures

Real-life prediction models should be constructed and judged for performance on real-life measures:
- Address the real problem at hand: optimize $$$, life span, etc.
- At the same time, we need to maintain statistical soundness:
  - Can we optimize these measures directly?
  - Are we better off just building good models in general?
Example: breast cancer detection (KDD Cup 2008)
- At first sight, a standard classification problem (malignant or benign?)
- Obvious extension: a cost-sensitive objective
  - Much better to do a biopsy on a healthy subject than to send a malignant patient home!
- Competition objective: optimize effective use of radiologists' time
  - A complex measure called FROC
  - See the case study in Claudia's part
Optimizing real-life measures

It is a common approach to use the prediction objective to motivate an empirical loss function for modeling:
- If the prediction objective is the expected value of Y given x, then squared-error loss (e.g., linear regression or CART) is appropriate
- If we want to predict the median of Y instead, then absolute loss is appropriate
- More generally, quantile loss can be used (cf. the MAP case study)
We will see successful examples of this approach in two case studies (KDD Cup 07 and MAP).
What do we do with complex measures like FROC? There is really no way to build a good model directly. A less ambitious approach:
- Build a model using standard approaches (e.g., logistic regression)
- Post-process the model to do well on the specific measure
We will see a successful example of this approach in KDD Cup 08.
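A sketch of the quantile (pinball) loss mentioned above: it generalizes absolute loss, and the constant prediction that minimizes it is the tau-quantile of Y.

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Pinball loss: tau-weighted penalty for under- vs. over-prediction."""
    r = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

y = np.array([1.0, 2.0, 3.0, 10.0])

# tau = 0.5 recovers half the absolute loss, minimized by the median
assert np.isclose(quantile_loss(y, 2.5, 0.5), 0.5 * np.mean(np.abs(y - 2.5)))

# The empirical tau-quantile minimizes the loss over constant predictions:
# for tau = 0.9 on this sample, the minimizer is the largest value, 10
grid = np.linspace(0, 12, 1000)
best = grid[np.argmin([quantile_loss(y, c, 0.9) for c in grid])]
```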
3. Relational and Multi-Level Data

Real-life databases are rarely flat!
Example: INFORMS Challenge 08, medical records, four tables linked by m:n relationships:
- Hospital (39K): Event ID, Patient ID, Diagnosis, Hospital Stay, ..., Accounting
- Conditions (210K): Patient ID, Diagnosis, Year
- Medication (629K): Event ID, Patient ID, Diagnosis, Medication, ..., Accounting
- Demographics (68K): Patient ID, Demographics, ..., Year
Approaches for dealing with relational data

- Modeling approaches that use relational data directly
  - There has been a lot of research, but there is a scarcity of practically useful methods that take this approach
- Flattening the relational structure into a standard (X, y) setup
  - The key to this approach is the generation of useful features from the relational tables
  - This is the approach we took in the INFORMS 08 challenge
- Ad-hoc approaches
  - Based on specific properties of the data and modeling problem, it may be possible to "divide and conquer" the relational setup
  - See the example in the KDD Cup 08 case study
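A sketch of the flattening approach, on hypothetical rows shaped like the INFORMS Medication table: each m:n child table is aggregated into fixed-length per-patient features.

```python
from collections import defaultdict

# Hypothetical rows from a child table: (patient_id, medication)
medication = [
    (1, "aspirin"), (1, "statin"), (1, "aspirin"),
    (2, "insulin"),
]

# Flatten the m:n table into one row per patient via aggregates
counts = defaultdict(int)
distinct = defaultdict(set)
for pid, med in medication:
    counts[pid] += 1
    distinct[pid].add(med)

# Fixed-length feature vector per patient: (# prescriptions, # distinct drugs)
X = {pid: (counts[pid], len(distinct[pid])) for pid in counts}
```

The same pattern (counts, distinct counts, sums, recency) applied to every child table yields a flat X that any standard learner can consume.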
Modeler's best friend: exploratory data analysis

Exploratory data analysis (EDA) is a general name for a class of techniques aimed at:
- Examining data
- Validating data
- Forming hypotheses about data
The techniques are often graphical or intuitive, but can also be statistical:
- Testing very simple hypotheses as a way of getting at more complex ones
- E.g., test each variable separately against the response and look for strong correlations
The most important proponent of EDA was the great, late statistician John Tukey.
The beauty and value of exploratory data analysis

EDA is a critical step in creating successful predictive modeling solutions:
- Expose leakage
- AVOID PRECONCEPTIONS about: what matters, what would work, etc.
Example: identifying the KDD Cup 08 leakage through EDA
- A graphical display of identifier vs. malignant/benign (see the case study slide)
- It could also be discovered via a statistical variable-by-variable examination of significant correlations with the response
- Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier
Elements of EDA for predictive modeling

1. Examine data variable by variable
   - Outliers? Missing data patterns?
2. Examine relationships with the response
   - Strong correlations? Unexpected correlations?
3. Compare to other similar datasets/problems
   - Are variable distributions consistent? Are correlations consistent?
4. Stare: at raw data, at graphs, at correlations/results
Unexpected answers to any of these questions may change the course of the predictive modeling process.
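Step 2 above can be automated as a simple screen; a sketch on synthetic data follows. It deliberately includes an "irrelevant" identifier column, which is exactly how a KDD Cup 08-style leak surfaces (here we simulate an ID that encodes which source file a record came from; the construction is illustrative, not the actual competition data).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
y = rng.integers(0, 2, size=n)                    # binary response

columns = {
    "x1": rng.normal(size=n),                     # pure noise
    "x2": y + rng.normal(scale=2.0, size=n),      # weak real signal
    # "Patient ID": records from two source files were numbered in different
    # ranges, so the ID secretly encodes the response -- a leak
    "identifier": y * 100000 + rng.integers(0, 100000, size=n),
}

# Screen every variable against the response, preconceptions excluded
corrs = {name: abs(np.corrcoef(v, y)[0, 1]) for name, v in columns.items()}
suspicious = max(corrs, key=corrs.get)
```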
Case study #1: Netflix / KDD Cup 2007
October 2006: Announcement of the NETFLIX Competition

USA Today headline: "Netflix offers $1 million prize for better movie recommendations"
Details:
- Beat the RMSE of NETFLIX's current recommender, 'Cinematch', by 10% prior to 2011
- $50,000 annual progress prize
  - The first two were awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!)
- The data contain a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies
- Performance is evaluated on holdout movie-user pairs
- The NETFLIX competition has attracted ~50K contestants on ~40K teams from >150 different countries
  - ~40K valid submissions from ~5K different teams
[Figure: the NETFLIX competition data as a users-by-movies rating matrix. Out of all ~80K movies and ~6.8M users, the competition data cover 17K movies (selection unclear) and 480K users (each with at least 20 ratings by the end of 2005), for a total of 100M ratings. The NETFLIX data can be joined with the Internet Movie Database, whose fields include Title, Year, Actors, Awards, Revenue, ...]
NETFLIX data generation process

[Figure: the rating matrix (17K movies) grows over time (1998-2005) as new movies and users arrive. The NETFLIX training data cover this period, with a 3M-pair Qualifier dataset held out. The KDD Cup tasks (Task 1 and Task 2) are defined on 2006, in which no new users or movies arrive.]
KDD Cup 2007 based on the NETFLIX data

- Training: NETFLIX competition data from 1998-2005
- Test: 2006 ratings, randomly split by movie into two tasks
Task 1: Who rated what in 2006
- Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
- Result: the IBM Research team was second runner-up, No. 3 out of 39 teams
Task 2: Number of ratings per movie in 2006
- Given a list of 8,863 movies, predict the number of additional reviews that all existing users will give in 2006
- Result: the IBM Research team was the winner, No. 1 out of 34 teams
Test sets from 2006 for Task 1 and Task 2

[Figure: construction of the two test sets from the 2006 ratings. From the movies-by-users rating matrix, per-movie rating totals are computed on a log(n+1) scale; these totals are the marginal 2006 distribution of ratings and form the Task 2 test set (8.8K movies). The Task 1 test set (100K pairs) is built by sampling (movie, user) pairs according to the product of the marginals, removing pairs that were rated prior to 2006, and labeling each remaining pair 1 if the user rated the movie in 2006 and 0 otherwise.]
Task 1: Did user A review movie B in 2006?

A standard classification task: will "existing" users review "existing" movies?
- More in line with the "synthetic" mode of competitions than the "real" mode
Challenges:
- Huge amount of data
  - How to sample the data so that any learning algorithm can be applied is critical
- Complex affecting factors
  - Decreasing interest in old movies; the growing tendency of Netflix users to watch (and review) more movies
Key solutions:
- Effective sampling strategies to keep as much information as possible
- Careful feature extraction from multiple sources
Task 2: How many reviews in 2006?

Task formulation:
- A regression task to predict the total count of reviews from "existing" users for 8,863 "existing" movies
- Evaluation is by RMSE on the log scale
Challenges:
- Movie dynamics and life-cycle: interest in movies changes over time
- User dynamics and life-cycle: no new users are added to the database
Key solutions:
- Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
- Build a set of quarterly lagged models to determine the overall scalar
- Use Poisson regression
Some data observations
1. Task 1 test set is a potential response for training a model for Task 2
- Was sampled according to the marginal (= # reviews for movie in 06 / total # reviews in 06), which is proportional to the Task 2 response (= # reviews for movie in 06)
- BIG advantage: we get a view of 2006 behavior for half the movies
  - Build a model on this half, apply it to the other half (the Task 2 test set)
- Caveats:
  - Proportional sampling implies there is a scaling parameter left, which we don't know
  - Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set
  - Correcting for this is an inverse rejection sampling problem
Leakage Alert!
Test sets from 2006 for Task 1 and Task 2
[Diagram: From the 2006 data, estimate the marginal distribution of ratings; sample (movie, user) pairs according to the product of marginals to form the Task 1 test set (100K rows of movie, user, rating), removing pairs that were rated prior to 2006. The Task 2 test set (8.8K movies) holds each movie's 2006 rating total on the log(n+1) scale. A surrogate learning problem.]
Some data observations (ctd.)
2. No new movies and reviewers in 2006
- Need to emphasize modeling the life-cycle of movies (and reviewers)
  - How are older movies reviewed relative to newer movies?
  - Does this depend on other features (like the movie's genre)?
- This is especially critical when we consider the scaling caveat above
Some statistical perspectives
1. The Poisson distribution is very appropriate for counts
- Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
- The right modeling approach for true counts is Poisson regression:
  n_i ~ Pois(λ_i · t)
  log(λ_i) = Σ_j β_j x_ij
  β* = argmax_β l(n; X, β)   (maximum likelihood)
What does this imply for the model evaluation approach?
- The variance stabilizing transformation for Poisson is the square root: √n_i has roughly constant variance
- RMSE on log scale emphasizes performance on unpopular movies (small Poisson parameter → larger log-scale variance)
- We still assumed that if we do well in a likelihood formulation, we will do well under any evaluation approach
Adapting to evaluation measures!
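The Poisson regression sketched above can be fit in a few lines. This is a minimal illustration on synthetic data (the features, coefficients, and sizes are made up), maximizing the Poisson log-likelihood by Newton's method:

```python
import numpy as np

# Synthetic design matrix and rates (illustrative only)
rng = np.random.default_rng(0)
n, p = 5000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
lam = np.exp(X @ beta_true)      # log(lambda_i) = sum_j beta_j * x_ij
y = rng.poisson(lam)             # n_i ~ Pois(lambda_i), with t = 1

# Maximize the Poisson log-likelihood l(n; X, beta) by Newton's method
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    grad = X.T @ (y - mu)             # score vector
    hess = X.T @ (X * mu[:, None])    # Fisher information
    beta += np.linalg.solve(hess, grad)
```

With enough data the recovered coefficients land close to the generating ones; in practice a GLM library would be used instead of hand-rolled Newton steps.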
Some statistical perspectives (ctd.)
2. Can we invert the rejection sampling mechanism?
This can be viewed as a missing data problem:
- n_i, m_j are the counts for movie i and reviewer j, respectively
- p_i, q_j are the true marginals for movie i and reviewer j, respectively
- N is the total number of pairs rejected due to review prior to 2006
- U_i, P_j are the users who reviewed movie i prior to 2006 and the movies reviewed by user j prior to 2006, respectively

E(n_i | p, q, N) = (100000 + N) · p_i · (1 − Σ_{j ∈ U_i} q_j)
E(m_j | p, q, N) = (100000 + N) · q_j · (1 − Σ_{i ∈ P_j} p_i)

Can we design a practical EM algorithm with our huge data size? Interesting research problem…
We implemented an ad-hoc inversion algorithm:
- Iterate until convergence between:
  - assuming movie marginals are fixed, adjusting reviewer marginals
  - assuming reviewer marginals are fixed, adjusting movie marginals
We verified that it indeed improved our data, since it increased correlation with 4Q2005 counts
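The alternating adjustment can be sketched on a small synthetic instance (all data below are made up; the rejection matrix R marks pairs reviewed prior to 2006). Each pass rescales one set of marginals so that expected post-rejection counts match the observed ones, holding the other set fixed:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, n_users = 50, 80
# R[i, j] = 1 if user j reviewed movie i prior to 2006 (pair gets rejected)
R = (rng.random((n_movies, n_users)) < 0.1).astype(float)
p_true = rng.dirichlet(np.ones(n_movies))   # true movie marginals
q_true = rng.dirichlet(np.ones(n_users))    # true reviewer marginals

total = 100_000
# Expected observed counts after rejection (used in place of real data)
n_obs = total * p_true * (1 - R @ q_true)
m_obs = total * q_true * (1 - R.T @ p_true)

# Ad-hoc inversion: alternate between adjusting movie and reviewer marginals
p = np.full(n_movies, 1.0 / n_movies)
q = np.full(n_users, 1.0 / n_users)
for _ in range(200):
    p = n_obs / (total * (1 - R @ q)); p /= p.sum()
    q = m_obs / (total * (1 - R.T @ p)); q /= q.sum()
```

On this toy instance the iteration recovers marginals that correlate almost perfectly with the generating ones; the real problem is harder because the observed counts are noisy and the matrices are huge.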
Modeling Approach Schema
[Flowchart with two paths:
- Utilizing leakage: count ratings by movie from the Task 1 "who reviewed" test set (100K) → inverse rejection sampling → estimate Poisson regression M1 and predict on Task 1 movies → use M1 to predict Task 2 movies → scale predictions to the total.
- Standard approach: from the Netflix challenge data and IMDB, construct movie features and lagged features for Q1-Q4 2005 → estimate 4 Poisson regressions G1…G4 and predict for 2006 → find the optimal scalar → estimate 2006 total ratings for the Task 2 test set.]
Some observations on modeling approach
1. Lagged datasets are meant to simulate forward prediction to 2006
- Select a quarter (e.g., Q105), remove all movies & reviewers that "started" later
- Build a model on this data with, e.g., Q305 as the response
- Apply the model to our full dataset, which is naturally cropped at Q405; this gives a prediction for Q206
- With several models like this, predict all of 2006
- Two potential uses:
  - Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  - Use only the sum of their predictions, for scaling the Task 1 model
2. We evaluated models on the Task 1 test set
- Used a holdout when also building them on this set
- How can we evaluate the models built on lagged datasets? We are missing a scaling parameter between the 2006 prediction and the sampled set
- Solution: select the optimal scaling based on Task 1 test set performance
- Since the other model was still better, we knew we should use it!
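The lagged construction can be sketched on made-up toy data: pick a training cutoff quarter, drop movies that first appear after it, and use a later quarter's count as the response (quarter indexing and movie IDs below are hypothetical):

```python
from collections import defaultdict

# Toy review log: (movie_id, quarter index), 0 = Q1'04 ... 7 = Q4'05
reviews = [
    ("M1", 0), ("M1", 1), ("M1", 4), ("M1", 6), ("M1", 7),
    ("M2", 3), ("M2", 4), ("M2", 5), ("M2", 6),
    ("M3", 6), ("M3", 7),          # M3 first appears after the cutoff
]

def lagged_dataset(reviews, cutoff_q, target_q):
    counts = defaultdict(lambda: defaultdict(int))
    for movie, q in reviews:
        counts[movie][q] += 1
    rows = []
    for movie, per_q in counts.items():
        if min(per_q) > cutoff_q:  # movie "started" after the cutoff: drop
            continue
        history = [per_q.get(q, 0) for q in range(cutoff_q + 1)]
        rows.append((movie, history, per_q.get(target_q, 0)))
    return rows

# Features up to Q1'05 (index 4), response = Q3'05 (index 6) counts
train = lagged_dataset(reviews, cutoff_q=4, target_q=6)
```

Applying the same feature extraction to the full log (cropped at Q4'05) then yields a two-quarters-ahead prediction, i.e., Q2'06.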
Some details on our models and submission
All models at the movie level. Features we used:
- Historical reviews in previous months/quarters/years (on log scale)
- Movie's age since premiere, movie's age in Netflix (since first review)
  - Also consider log, square, etc. → flexibility in the form of functional dependence
- Movie's genre
  - Include interactions between genre and age: the "life cycle" seems to differ by genre!
Models we considered (MSE on log scale on Task 1 holdout):
- Poisson regression on Task 1 test set (0.24)
- Log-scale linear regression model on Task 1 test set (0.25)
- Sum of lagged models built on 2005 quarters + best scaling (0.31)
Scaling based on lagged models:
- Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
- Implied scaling parameter for predictions: about 90
- Total of our submitted predictions for the Task 2 test set was 9.3M
Competition evaluation
First we were informed that we won with RMSE of ~770
- They mistakenly evaluated on the non-log scale
- Strong emphasis on the most popular movies
- We won by a large margin: our model did well on the most popular movies!
Then they re-evaluated on log scale, and we still won
- On log scale the least popular movies are emphasized
- Recall that the variance stabilizing transformation is in-between (square root)
- So our predictions did well on unpopular movies too!
Interesting question: would we win on square-root scale (or similarly, a Poisson likelihood-based evaluation)? Sure hope so!
Competition evaluation (ctd.)
Results of the competition (log-scale evaluation):
Components of our model's MSE:
- The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
- Additional error from an incorrect scaling factor
Scaling numbers:
- True total reviews: 8.7M
- Sum of our predictions: 9.3M
Interesting question: what would be the best scaling?
- For log-scale evaluation? Conjecture: need to under-estimate the true total
- For square-root evaluation? Conjecture: need to estimate about right
Effect of scaling on the two evaluation approaches

Scaling | Total reviews (M) | Log-scale MSE | Square-root scale MSE | Comment
0.7     | 6.55              | 0.222         | 40.28                 |
0.8     | 7.48              | 0.208         | 29.80                 | Best log performance
0.9     | 8.42              | 0.225         | 26.38                 | Best sqrt performance
0.93    | 8.70              | 0.234         | 26.55                 | Correct scaling
1       | 9.35              | 0.263         | 28.86                 | Our solution
1.1     | 10.29             | 0.316         | 36.37                 |
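The table's qualitative pattern, that log-scale evaluation rewards under-estimation, can be reproduced in a small simulation (all numbers synthetic): generate Poisson counts, scale otherwise-correct rate predictions by a factor s, and score on the log(n+1) scale. Because log is concave, E[log(n+1)] < log(λ+1), so the best scaling is below 1:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = rng.lognormal(mean=0.0, sigma=1.5, size=5_000)  # true per-movie rates
n = rng.poisson(lam)                                  # observed counts

# Score scaled predictions s * lam on the log(n+1) scale
scalings = np.arange(0.4, 1.21, 0.02)
log_mse = [np.mean((np.log(n + 1) - np.log(s * lam + 1)) ** 2)
           for s in scalings]
best_log = scalings[int(np.argmin(log_mse))]
```

On this synthetic data the log-scale optimum lands strictly below a scaling of 1, mirroring the 0.8 optimum in the table.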
Effect of scaling on the two evaluation approaches
KDD CUP 2007: Summary
Keys to our success:
- Identify the subtle leakage
  - Is it formally leakage? Depends on the intentions of the organizers…
- Appropriate statistical approach
  - Poisson regression
  - Inverting the rejection sampling in the leakage
  - Careful handling of time-series aspects
Not keys to our success:
- Fancy machine learning algorithms
Case Study #2: KDD CUP 2008 - Siemens Medical Breast Cancer Identification
- 1712 patients
- 6816 images
- 105,000 candidates
- Candidate feature vector: [x1, x2, …, x117, class]
[Figure: mammography images, MLO and CC views for each breast; is a given candidate malignant?]
KDD-CUP 2008 based on Mammography
- Training: labeled candidates from 1300 patients, plus the association of each candidate to location, image, and patient
- Test: candidates from a separate set of 1300 patients
- Task 1: Rank all candidates by the likelihood of being cancerous
  - Results: IBM Research team was the winner out of 246
- Task 2: Identify a list of healthy patients
  - Results: IBM Research team was the winner out of 205
Task 1: Candidate Likelihood of Cancer
- Almost a standard probability estimation/ranking task at the candidate level
- Somewhat synthetic, as the meaning of the features is unknown
Challenges
- Low positive rate: 7% of patients and 0.6% of candidates
  - Beware of overfitting; consider sampling
- Unfamiliar evaluation measure: FROC, related to AUC; non-robust
- Hint at locality
Key solutions
- Simple linear model
- Post-processing of scores
- Leakage in identifiers
[Figure: FROC curve, true positive patient rate vs. false positive candidate rate per image]
Adapting to evaluation measures!
Task 2: Classify patients
Derivative of the previous task:
- A patient is healthy if all her candidates are benign
- The probability that a patient is healthy is the product of the (benign) probabilities of her candidates
Challenges
- Extremely non-robust performance measure: including any patient with cancer in the list disqualified the entry
- Risk tradeoff: need to anticipate the solutions of the other participants
Key solution
- Pick a model with high sensitivity to false negatives
- Leakage in identifiers: EDA at work
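The healthy-patient rule above reduces to a product over candidates. A minimal sketch (candidate malignancy scores are made up):

```python
import math

# A patient is healthy iff all her candidates are benign, so
# P(healthy) = product over candidates of (1 - p_malignant)
patients = {
    "P1": [0.01, 0.02, 0.005],   # hypothetical malignancy probabilities
    "P2": [0.30, 0.01],
}

def p_healthy(malignant_probs):
    return math.prod(1.0 - p for p in malignant_probs)

# Rank patients from most likely healthy to least likely healthy
ranked = sorted(patients, key=lambda pid: p_healthy(patients[pid]),
                reverse=True)
```

A submitted "healthy list" would then be a prefix of this ranking, with its length chosen by the risk tradeoff discussed above.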
EDA on the Breast Cancer Domain
Console output of sorted 'patient_ID patient_label':
144484 1
148717 0
168975 0
169638 1
171985 0
177389 1
182498 0
185266 0
193561 1
194771 0
198716 1
199814 1
1030694 0
1123030 0
1171864 0
1175742 0
1177150 0
1194527 0
1232036 0
1280544 0
1328709 0
1373028 0
1387320 0
1420306 0
---more---
Base rate of 7%????
What about 200K to 999K?
Mystery of the Data Generation: Identifier Leakage in the Breast Cancer data
- The distribution of patient identifiers has a strong natural grouping: 3 natural buckets
- The three groups have VERY different base rates of cancer prevalence
- The last group seems to be sorted (cancer first)
- Total of 4 groups with very different patient probabilities of cancer
- Organizers admitted to having combined data from different years in order to increase the positive rate
[Scatter plot: model score vs. log of patient ID; every point is a candidate. Four groups: 18 patients, 85% cancer; 245 patients, 36% cancer; 414 patients, 1% cancer; 1027 patients, 0% cancer]
Leakage
Building the classification model
- For evaluation we created a stratified 50% training/test split by patient
  - Given the few positives (~300), results may exhibit high variance
- We explored various learning algorithms, including neural networks, logistic regression, and various SVMs
- Linear models (logistic regression or linear SVMs) yielded the most promising results: FROC 0.0834
- Down-sampling the negative class? Keep only 25% of all healthy patients
  - Helped in some cases, but not a reliable improvement
- Add the identifier category (1, 2, 3, 4) as an additional feature
Modeling Neighborhood Dependence
Candidates are not really i.i.d. but actually relational:
- Stacking
  - Build an initial model and score all candidates
  - Use the labels of neighbors in a second round
- Formulate as an EM problem
  - Treat the labels of the neighbors as unobserved in EM
- Pair-wise constraints based on location adjacency
  - Calculate the Euclidean distance between candidates within the same picture, and the distance to the nipple in both views for each breast
  - Select the pairs of candidates with distance difference less than a threshold
  - Constraints: selected pairs of examples (x_i,MLO, x_i,CC) should have the same predicted labels, i.e., f(x_i,MLO) = f(x_i,CC)
Results
- Seems to improve the probability estimates in terms of AUC
- Did not improve FROC
Relational Data
Outlier Treatment
- Many of the 117 numeric features have large outliers
- They incur a huge penalty in terms of likelihood:
  - Large bias
  - Badly calibrated probabilities
  - Extreme (wrong) values in the prediction
[Histogram of Feature 10: 142 observations > 5]
Statistics
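One simple treatment, sketched here on synthetic data, is to cap each feature at a high percentile before fitting, so a handful of extreme values (like the 142 observations above 5 in the histogram) cannot dominate a likelihood-based linear model:

```python
import numpy as np

# Synthetic feature: mostly standard normal, plus a heavy right tail
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 9_858), rng.uniform(5, 20, 142)])

# Winsorize: cap at the 99.5th percentile
cap = np.quantile(x, 0.995)
x_capped = np.minimum(x, cap)
```

Other options include log transforms or robust loss functions; the point is to limit the leverage of the tail before it reaches the model.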
ROC vs. FROC optimization: Post-processing of model scores?
- In ROC all rows are independent, and both true positives and false positives are counted by row
- FROC has true patients and false positive candidates
- A higher TP rate for candidates does not improve FROC unless it comes from a new patient, e.g.:
  - It's better to have 2 correctly identified candidates from different patients than 5 from the same patient
  - It's best to re-order candidates based on model scores so as to ensure many different patients up front
[Figures: ROC curve (true positive rate vs. false positive rate) and FROC curve (true positive patient rate vs. false positive candidate rate)]
Adapting to evaluation
Probabilistic Approach
At any point we want to maximize the expected gradient of the FROC at that point. Define, for each candidate c of patient i:
- p_c: probability that candidate c is malignant
- np_i: probability that patient i has not yet been identified
3 cases:
- Candidate is positive but the patient is already identified, with probability p_c · (1 − np_i)
- Candidate is positive and the patient is new, with probability p_c · np_i
- Candidate is negative, with probability 1 − p_c
Pick the candidate with the largest expected gain: p_c · np_i / (1 − p_c)
Theorem: The expected value of FROC for this order is higher than for any other order.
Problem: Our probability estimates are not good enough for this to work well.
Theory of Post-processing / Adapting to evaluation
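The greedy expected-gain ordering can be sketched in a few lines (candidate scores below are made up): repeatedly pick the candidate maximizing p_c · np_i / (1 − p_c), then discount its patient's "not yet identified" probability:

```python
# (patient_id, p_c) pairs with hypothetical malignancy probabilities
candidates = [
    ("A", 0.90), ("A", 0.85), ("B", 0.60), ("B", 0.55), ("C", 0.40),
]

not_identified = {pid: 1.0 for pid, _ in candidates}  # np_i, initially 1
order = []
pool = list(candidates)
while pool:
    # Expected gain of placing candidate c next: p_c * np_i / (1 - p_c)
    best = max(pool, key=lambda c: c[1] * not_identified[c[0]] / (1 - c[1]))
    pool.remove(best)
    order.append(best)
    pid, p = best
    not_identified[pid] *= (1 - p)  # patient stays unidentified only if benign
```

Note how the discounting pushes a second candidate from an already-likely-identified patient down the list, so distinct patients surface early.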
Bad Calibration!
- We consistently over-predict the probability of cancer for the most likely candidates:
  - Linear bias of the method
  - High class skew
  - Outliers in the 117 numeric features lead to extreme predictions on holdout
[Calibration plot: predicted probability vs. true probability]
Re-calibration?
- We tried a number of methods; no improvement
- Some resulted in better calibration but hurt the ranking
Statistics
Post-Processing Heuristic
Re-ordering model scores significantly improves the FROC with no additional modeling.
Ad-hoc approach:
- Take the top n ranked candidates, where n is approximately the number of positive candidates
- Select the candidate with the highest score for each patient from this list and put them at the top of the list
- Iterate until all top n candidates are re-ordered
[Figure: true positive patient rate vs. false positive rate per image]
Adapting to evaluation
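The heuristic can be sketched as follows (scores are made up): within the top-n candidates, repeatedly promote each patient's best remaining candidate, so many distinct patients appear early in the ranking:

```python
# (patient_id, score) pairs, already sorted by score descending
scored = [
    ("A", 0.9), ("A", 0.8), ("B", 0.7), ("A", 0.6), ("C", 0.5), ("B", 0.4),
]

def reorder_top(scored, n):
    top, rest = list(scored[:n]), list(scored[n:])
    front = []
    while top:
        seen, keep = set(), []
        for cand in top:
            if cand[0] not in seen:   # best remaining candidate per patient
                seen.add(cand[0])
                front.append(cand)
            else:
                keep.append(cand)     # deferred to a later pass
        top = keep
    return front + rest

reordered = reorder_top(scored, n=5)
```

The result is a permutation of the original ranking in which the first few positions cover as many distinct patients as possible, which is what FROC rewards.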
Submissions and Results
Task 1
- Bagged linear SVMs with hinge loss and heuristic post-processing
- This approach scored the winning result of 0.0933 on FROC, out of 246 submissions from 110 unique participants
- Second place scored 0.0895
- Some rumors that other participants also found the ID leakage
Task 2
- The logistic model performs better than the SVM models, probably because likelihood is more sensitive to extreme errors (the first false negative)
- The first false negative typically occurs around 1100 patients in the training set
- We submitted the first 1020 patients ranked by a logistic model that included the ID feature + the original 117 features
- Scored a specificity of 0.682 on the test set with no false negatives
- Only 24 out of 203 submissions had no false negatives
- Second place scored 0.17 specificity
Summary in terms of success factors
- Leakage in the identifier provides information about the likelihood of a patient having cancer
  - Caused by the organizers' effort to increase the positive rate by adding 'old' patients who developed cancer
- Post-processing for FROC optimization
- Awareness of the impact of feature outliers
  - Interacts with the statistical properties of the data and the model
  - Log-likelihood is more sensitive than hinge loss
- Otherwise a simple model to avoid overfitting: linear models
- Relational information is not helpful for the given evaluation
KDD CUP 2009
Data: customer database of Orange with 100K observations and 15K variables
Three different tasks and 2.5 versions:
- Prediction: churn, appetency, upselling
- Versions: fast (5 days) & slow (1 month); large and small versions
Interesting characteristics:
- Highly 'sterile': nothing known about anything
- Leaderboard: it was possible to match the large and small versions and receive feedback on 20% of the test set
KDD CUP 2000
Data: online store history for Gazelle.com
Five different tasks, including:
- Prediction: Who will continue in session? Who will buy?
- Insights: characterize heavy spenders
Interesting characteristics:
- "Leakage": internal testing sessions were left in the data
  - Deterministic behavior; if identified, gives 100% accuracy in prediction for part of the data
- Evaluation in terms of "real" business objectives? Sort of: handled by defining a set of "standard" questions, each covering a different aspect of the business objective
- Relational data? Yes: customers had different numbers of sessions, of different lengths, with different stages
KDD CUP 2003
Data: citation rates of physics papers
Two tasks:
- Predict the change in the number of citations during the next 3 months
- Write an interesting paper about it
Interesting characteristics:
- Highly relational: links between papers and authors
- Feature construction left up to participants
- Leakage impossible, since the truth was really in the future
- Evaluation on SSE against integer values (Poisson)
ILP Challenge 2003
Data: yeast genome, including protein sequence, alignment similarity scores with other proteins, and additional protein information from a relational DB
Task: identify (potentially multiple) functional classes for each gene
Interesting characteristics:
- 420 possible classes, very subjective assignment
- Purely relational, no 'features' available:
  - Distances (supposedly p-values) of gene alignment
  - Secondary structure (of the amino acid sequence)
  - Protein DB with keywords, etc.
- 'Leakage' in the identifier: contains a letter for the labeling research group
- Highly unsatisfactory evaluation: precision of the prediction
INFORMS Data Mining Contest 2008
Data: 2 years of hospital records with accounting information (cost, reimbursement, …), patient demographics, medication history
Tasks:
- Identify pneumonia patients
- Design an optimization setting for preventive treatment
Interesting characteristics:
- Relational setting (4 tables linked through the patient identifier)
- Leakage: removal of the pneumonia code left hidden traces
- 'Dirty' data with plenty of missing values, contradicting demographics, and changing patient IDs
Data Mining in the Wild: Project Work
Similarities with competitions (compared to DM research):
- Single dataset
- Algorithms can be existing and simple
- No real need for baselines (although useful)
- The absolute performance matters
Differences from competitions:
- You need to decide what the analytical problem is
- You need to define the evaluation rather than optimize it
- You need to avoid leakage rather than use it
- You need to FIND all relevant data rather than use what is there (often leads to relational settings)
- You need to deliver it somehow to have impact
Case Study #3: Market Alignment Program
Wallet: the total amount of money a customer can spend in a certain product category in a given period
Why are we interested in wallet?
- Customer targeting
  - Focus on acquiring customers with high wallet
  - For existing customers, focus on high-wallet, low share-of-wallet customers
- Sales force management
  - Use wallet as a sales force allocation target and make resource assignment decisions based on it
  - Evaluate the success of sales personnel by attained share-of-wallet
Wallet Modeling Challenge
- The customer wallet is never observed
  - Nothing to "fit a model" to
  - Even if you have a model, how do you evaluate it?
- We would like a predictive approach from available data:
  - Firmographics (sales, industry, employees)
  - IBM sales and transaction history
Define Wallet/Opportunity?
- TOTAL: total customer available budget for IT
  - Can we really hope to attain all of it?
- SERVED: total customer spending on IT products offered by IBM
  - A better definition for our marketing purposes
- REALISTIC: IBM spending of the "best similar customers"
[Diagram: nested view, Company Revenue ⊇ TOTAL ⊇ SERVED ⊇ REALISTIC ⊇ IBM Sales]
REALISTIC Wallets as Quantiles
Motivation:
- Imagine 100 identical firms with identical IT needs
- Consider the distribution of IBM sales to these firms
- The bottom 95% of firms should spend as much as the top 5%
Define the REALISTIC wallet as a high percentile of spending conditional on the customer attributes
- This implies that a few customers are spending their full wallet with us; however, we do not know which ones
Formally: Percentile of Conditional
Distribution of IBM sales s to the customer given customer attributes x: s|x ~ f_x
Two obvious ways to get at the p-th percentile:
- Estimate the conditional by integrating over a neighborhood of similar customers
  - Take the p-th percentile of spending in the neighborhood
- Create a global model for the p-th percentile
  - Build global regression models, e.g., s|x ~ Exp(αx + β)
REALISTIC
Estimation: the Quantile Loss Function
- The mean minimizes a sum of squared residuals:
  min_μ Σ_{i=1}^n (y_i − μ)²
- The median minimizes a sum of absolute residuals:
  min_m Σ_{i=1}^n |y_i − m|
- The p-th quantile minimizes an asymmetrically weighted sum of absolute residuals:
  min_{ŷ} Σ_{i=1}^n L_p(y_i, ŷ_i), where
  L_p(y, ŷ) = p · (y − ŷ) if y ≥ ŷ
  L_p(y, ŷ) = (1 − p) · (ŷ − y) if y < ŷ
[Plot: the asymmetric quantile loss for p=0.8 vs. the absolute loss (p=0.5)]
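The quantile-loss property is easy to check numerically: minimizing the 0.9 pinball loss over a constant recovers the empirical 0.9 quantile. A minimal sketch on skewed synthetic "spending" data:

```python
import numpy as np

def quantile_loss(y, yhat, p):
    """Pinball loss: p*(y - yhat) if y >= yhat, else (1 - p)*(yhat - y)."""
    d = y - yhat
    return np.where(d >= 0, p * d, (p - 1) * d).mean()

rng = np.random.default_rng(2)
y = rng.lognormal(0.0, 1.0, size=10_000)   # skewed synthetic data

# Search for the constant prediction minimizing the 0.9 pinball loss
grid = np.linspace(y.min(), y.max(), 2_000)
losses = [quantile_loss(y, c, 0.9) for c in grid]
best = grid[int(np.argmin(losses))]
```

In a regression setting the constant is replaced by a model ŷ(x) and the same loss is minimized over its parameters, as in linear quantile regression.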
Overview of analytical approaches
[Diagram of approaches:
- 'Ad hoc': kNN by industry and size
- General kNN: choices of k, distance, features
- Optimization: quantile regression, with model forms linear, decision tree, quanting
- Decomposition: linear model with adjustment]
Data generation process

- Need to combine data on revenue with customer properties
- Complicated matching between IBM's internal customer view (accounts) and the external sources (Dun & Bradstreet)
  - A probabilistic process with plenty of heuristics
  - Huge danger of introducing data bias
  - Tradeoff between data quality and coverage
- Leakage potential: we can only get current customer information, which may already be tainted by the customer's interaction with IBM
  - The problem is amplified when we try to augment the data with home-page information
Evaluating measures for wallet

We still don't know the truth, so we use a combined approach:
- Quantile loss to assess the relevant predictive ability and for feature selection
- Expert feedback to select a suitable model class
- Business impact to identify overall effectiveness

Quantile loss (available, but not that relevant):
- Missing a parameter; sensitive to skew; which scale, original or log?

Expert feedback (relevant):
- Similar to a survey; unclear incentives; potentially biased; hard to come by at large scale

Business impact (very relevant):
- Highly aggregated; long lag; convoluted with the impact of other things; requires intense tracking
Empirical evaluation I: quantile loss

Setup:
- Four domains with relevant quantile modeling problems: direct mailing, housing prices, income data, IBM sales
- Performance on a test set in terms of 0.9 quantile loss

Approaches:
- Linear quantile regression
- Q-kNN (kNN with quantile prediction from the neighbors)
- Quantile trees (quantile prediction in the leaf)
- Bagged quantile trees
- Quanting (Langford et al. 2006; reduces quantile estimation to averaged classification using trees)

Baselines:
- Best constant model
- Traditional regression models for expected values, adjusted under a Gaussian assumption (+1.28 standard deviations for the 0.9 quantile)
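To make the contrast between the quantile approaches and the Gaussian-adjusted baseline concrete, here is a sketch of linear quantile regression fit by subgradient descent on the pinball loss, next to the "+1.28" baseline. The solver, learning rate, and toy data are my assumptions; the tutorial does not prescribe an optimizer:

```python
import numpy as np

def fit_linear_quantile(X, y, p=0.9, lr=0.05, epochs=4000):
    """Fit a linear model for the p-th conditional quantile by (sub)gradient
    descent on the pinball loss. A sketch, not a production solver."""
    Xb = np.column_stack([X, np.ones(len(y))])   # append an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        r = y - Xb @ w
        # subgradient of the pinball loss w.r.t. the prediction:
        # -p where the residual is positive, (1 - p) where it is not
        g = np.where(r > 0, -p, 1.0 - p)
        w -= lr * (Xb.T @ g) / len(y)
    return w

# toy data with skewed, heteroscedastic noise, so the 0.9 quantile
# is NOT just the conditional mean shifted by a constant
rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=5000)
y = 1.0 + 2.0 * x + rng.exponential(0.5 + x)

w = fit_linear_quantile(x[:, None], y, p=0.9)
pred = np.column_stack([x, np.ones(len(y))]) @ w
print("fraction of points below the fitted 0.9 line:", np.mean(y <= pred))

# the "+1.28" baseline: least-squares fit plus 1.28 residual standard
# deviations (the 0.9 z-score under a Gaussian assumption)
A = np.column_stack([x, np.ones(len(y))])
w_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
base = A @ w_ls + 1.28 * (y - A @ w_ls).std()
print("fraction below the Gaussian-adjusted baseline:", np.mean(y <= base))
```

When the residuals are not Gaussian, as here, the quantile fit calibrates its coverage directly, which is the point of the slide's conclusion below that standard regression is not competitive.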
Performance on quantile loss (smaller is better)

Conclusions:
- Standard regression is not competitive (because the residuals are not normal)
- If there is a time-lagged variable, the linear quantile model is best
- The splitting criterion is irrelevant in the tree models
- Quanting (using decision trees) and quantile trees perform comparably
- Generalized kNN is not competitive
Evaluation II: MAP workshops overview

- Calculated 2005 opportunity using the naive Q-kNN approach
- 2005 MAP workshops: displayed the opportunity by brand; the expert can accept or alter it
- Selected 3 brands for evaluation: DB2, Rational, Tivoli
- Built ~100 models for each brand using different approaches
- Compared expert opportunity to the model predictions
  - Error measures: absolute, squared; scales: original, log, root; a total of 6 measures
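The six measures are simple to compute once expert and model opportunities are aligned. A sketch; the function name, the log1p guard against zero opportunities, and the toy numbers are my assumptions:

```python
import numpy as np

def six_measures(expert, model):
    """Mean absolute and mean squared error on the original, log, and
    square-root scales: the 6 measures described on the slide."""
    scales = {
        "original": (expert, model),
        "log":  (np.log1p(expert), np.log1p(model)),   # log1p tolerates zeros
        "root": (np.sqrt(expert),  np.sqrt(model)),
    }
    out = {}
    for name, (e, m) in scales.items():
        out[f"abs_{name}"] = np.mean(np.abs(e - m))
        out[f"sq_{name}"]  = np.mean((e - m) ** 2)
    return out

expert = np.array([0.0, 2.0, 5.0, 10.0])   # expert-validated opportunity ($M)
model  = np.array([1.0, 2.0, 4.0, 12.0])   # model prediction ($M)
for name, val in six_measures(expert, model).items():
    print(f"{name}: {val:.3f}")
```

Running all 6 measures per brand gives the per-model top-10 / top-20 counts tallied in the comparison table below.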
[Figure: scatter plot of expert feedback vs. model opportunity (MODEL_OPPTY), both on a 0-20 $M scale]

Expert feedback to the original model:
- Experts accept the opportunity: 45%
- Experts change the opportunity: 40% (increase: 17%, decrease: 23%)
- Experts reduce the opportunity to 0: 15%
Observations

- Many accounts are set to zero for external reasons; exclude these from the evaluation, since no model can predict the competitive environment
- Opportunities follow an exponential distribution, so evaluation on the original (non-log) scale is subject to large outliers
- Experts seem to make percentage adjustments, so consider log-scale evaluation in addition to the original scale, with the root scale as an intermediate
- We suspect a strong "anchoring" bias: 45% of opportunities were not touched
Model comparison results

We count how often a model scores within the top 10 / top 20 for each of the 6 measures:

Model                               | Rational | DB2   | Tivoli
Displayed model (kNN) (anchoring)   | 6 / 6    | 4 / 5 | 6 / 6
Max 03-05 revenue                   | 1 / 1    | 0 / 3 | 1 / 4
Linear quantile 0.8 (best)          | 5 / 6    | 2 / 4 | 3 / 5
Regression tree                     | 1 / 3    | 2 / 4 | 1 / 2
Q-kNN 50 + flooring                 | 2 / 3    | 6 / 6 | 4 / 6
Decomposition center                | 0 / 0    | 3 / 5 | 0 / 4
Quantile tree 0.8                   | 0 / 1    | 2 / 4 | 1 / 4
MAP experiments: conclusions

- Q-kNN performs very well after flooring, but is typically inferior before flooring
- Linear quantile regression at the 80th percentile performs consistently well (flooring has only a minor effect)
- Experts are strongly influenced by the displayed opportunity (and the displayed revenue of previous years)
- Models without last year's revenue don't perform well
- Decision: use linear quantile regression with q = 0.8 in MAP 06
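The "flooring" step is not spelled out on the slides. One plausible reading, consistent with the wallet hierarchy earlier in this section (REALISTIC wallet >= current IBM sales), is to clip predictions from below at the customer's observed revenue. A hypothetical sketch under that assumption:

```python
import numpy as np

def floor_opportunity(wallet_pred, observed_revenue):
    """Floor wallet predictions at what the customer already spends with us.
    This is a guess at the slide's 'flooring' step: by definition the
    REALISTIC wallet cannot be below current IBM sales."""
    return np.maximum(wallet_pred, observed_revenue)

print(floor_opportunity(np.array([3.0, 1.0, 7.0]),
                        np.array([2.0, 4.0, 5.0])))   # -> [3. 4. 7.]
```

Whatever the exact rule, the conclusions above say it matters a lot for Q-kNN and little for the linear quantile model.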
MAP business impact

- MAP launched in 2005
- In 2006, 420 workshops were held worldwide, with teams responsible for most of IBM's revenue
- Recognized as a 2006 IBM Research Accomplishment, awarded based on "proven" business impact
- Runner-up for the Case Study Award at KDD 2007; Edelman finalist 2009
- Most important use is segmentation of the customer base: shift resources into "invest" segments with low wallet share
Business impact

- For 2006, 270 resource shifts were made into 268 "Invest" accounts
- We examine the performance of these accounts relative to the background
[Figure: scatter of 2005 actual revenue ($M) vs. validated revenue opportunity ("revenue aspiration", $M), with accounts segmented into INVEST, EXAMINE, CORE (Growth / Optimize); the 270 shifts target the INVEST segment]
- Revenue: 9% growth in INVEST accounts vs. 4% growth in all other accounts
- Quota attainment: 45% for MAP-shifted resources vs. 36% for non-MAP shifts
- Pipeline (relative to 2005): 17% growth in INVEST accounts vs. 3% growth in all other accounts
Summary in terms of success factors

1. Data and domain understanding
   - Matching the business objective to the modeling approach made a previously unsolvable business problem solvable with predictive modeling
2. Statistical insight
   - Minimizing quantile loss estimates the correct quantity
   - A single evaluation metric is not enough in real life
   - Autocorrelation helps the linear model
3. Modeling
   - Extension to tree induction; comparative study; in the end, linear it is
Identify potential causes for chip failure

- Data: 5K machines, of which 18 failed in the last year
- Task: identify a (short) list of other machines that are likely to fail, so they can be preemptively fixed
- Characteristics:
  - Relational: tool ID; multiple chips per machine (only the first failure is detected)
  - Leakage: the database was clearly augmented after failure: all failures have an associated customer, but the customer is missing for most non-failures
  - Statistical observation: this is really a survival analysis problem; failures do not occur before a runtime of 180 days
- Accuracy, and even AUC, is NOT relevant; the insight sought is the cause of failure, so lift and the false positive rate in the top k are more important
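Lift and false positives in the top k are straightforward to compute from a risk score per machine; a sketch with names and toy data of my own choosing:

```python
import numpy as np

def topk_metrics(scores, labels, k):
    """Lift and false-positive count among the k highest-scored machines:
    the quantities the slide argues matter here, rather than accuracy or AUC."""
    order = np.argsort(-scores)            # highest risk first
    top = labels[order[:k]]
    precision_at_k = top.mean()
    base_rate = labels.mean()
    return precision_at_k / base_rate, int((top == 0).sum())

labels = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])   # 2 failures among 10 machines
scores = np.array([.9, .8, .1, .7, .6, .2, .3, .1, .05, .4])
lift, fp = topk_metrics(scores, labels, k=3)
print(lift, fp)   # catches both failures in the top 3, at the cost of 1 false alarm
```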
Threats in competitions and projects

Competitions:
- Mistakes under time pressure
- Accidental use of the target (kernel SVM)
- Complexity
- Overfitting

Projects:
- Unavailability of data
- Data generation problems
- The model is not good enough to be useful
- Model results are not accessible to the user
  - If the user has to understand the model, you need to keep it simple
  - Web delivery of predictions
Overfitting

Even if you think you know this one, you probably still overdo it!
- KDD Cup results have shown that a large number of entries overfit: in 2003, 90% of entries did worse than the best constant prediction
- Corollary: don't overdo the search
- Having a holdout does NOT make you immune to overfitting; you just overfit on the holdout
- 10-fold cross-validation does NOT make you immune either
- Leaderboards based on 10% of the test set are VERY deceptive
  - KDD Cup 2009: the winner of the fast challenge, after only 5 days, was indeed the leader of the board
  - The winner of the slow challenge, after 1 more month, was NOT the leader of the board
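A tiny simulation (mine, not from the tutorial) of why a 10% public leaderboard is deceptive: among many models of identical true quality, the model that tops the leaderboard does so largely by luck, and its public score substantially overstates its full-test score:

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_test = 200, 10_000
true_acc = 0.70                                   # 200 models, all equally good

# simulate per-example correctness, then score on a 10% "public" split
correct = rng.random((n_models, n_test)) < true_acc
board = correct[:, : n_test // 10].mean(axis=1)   # public 10% of the test set
full = correct.mean(axis=1)                       # full (private) test set

winner = int(np.argmax(board))                    # who leads the leaderboard?
print(f"leaderboard score of the winner: {board[winner]:.3f}")
print(f"full-test score of the winner:   {full[winner]:.3f}")
```

Selecting on the same small sample you report on inflates the winner's score; the private-test ranking then reshuffles, exactly as in KDD Cup 2009's slow challenge.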
Overfitting example: KDD Cup 2008

- Data: 105,000 candidates, 117 numeric features; sounds good, right?
- Overfitting is NOT just about training size and model complexity: linear models overfit too!
- How robust is the evaluation measure?
  - AUC vs. FROC; the number of healthy patients
- What is the base rate? Only 600 positives
Factors of success in competitions and real life

1. Data and domain understanding
   - Generation of data and task
   - Cleaning and representation/transformation
2. Statistical insights
   - Statistical properties; test validity of assumptions; performance measure
3. Modeling and learning approach
   - The most "publishable" part
   - Choice or development of the most suitable algorithm

(The slide annotates these factors along an axis running from "real" (1) to "sterile" (3).)
Success factor 1: data and domain understanding

Task and data generation: formulate the analytical problem (MAP), EDA, check for leakage.

KDD 07: Netflix                  | KDD 08: Cancer                    | MAP
Adjust for decreasing population;| Combined sources lead to leakage  | Wallet definition and design of
Task 1 target leakage            |                                   | the analytical solution
Success factor 2: statistical insights

Properties of evaluation measures: does it measure what you care about? Robustness; invariance to transformations; the linkage between model optimization, the statistic, and performance.

KDD 07: Netflix                   | KDD 08: Cancer                | MAP
Poisson regression; log transform,| Post-processing; robust       | Multiple measures
downscale; highly non-robust,     | evaluation                    |
beware of overfitting             |                               |
Success factor 3: models and approach

- How much complexity do you need? Often linear does just fine with correctly constructed features (actually, my wins have been with linear models)
- Feature selection
- Can you optimize what you want to optimize? How does the model relate to your evaluation metric?
  - Regression approaches predict the conditional mean
  - Accuracy vs. AUC vs. log-likelihood
- Does it scale to your problem? Some cool methods just do not run on 100K examples

Netflix                      | KDD Cup 08                       | MAP
Linear Poisson, log transform| Logistic regression, linear SVM  | Linear quantile regression
Summary: comparison of case studies

                         | KDD Cup 07, Task 2       | KDD Cup 08, Task 1        | MAP
Ultimate modeling goal   | Demand forecasting       | Breast cancer detection   | Customer wallet estimation
Evaluation objective     | Log-scale RMSE           | FROC in 0.2-0.3           | Quantile loss / expert feedback
Key data/domain insight  | Leakage from Task 1      | Leakage in patient IDs    | Duality quantile-wallet
Key statistical insight  | Poisson distribution     | FROC post-processing      | Optimizer of quantile loss
Best modeling approach   | Maximum likelihood       | Machine learning          | Empirical risk minimization
                         | (Poisson reg.)           | (linear SVM)              | (quantile reg.)
Invitation: please join us in another data mining competition!

- INFORMS Data Mining Contest on health care data; register at www.informsdmcontest2009.org
- Real data of hospital visits for patients with severe heart disease
- "Real" tasks from an ongoing project: transfer to specialized hospitals; severity / death
- Relational (multiple hospital stays per patient)
- Evaluation: AUC
- Publication and workshop at INFORMS 2009