KDD Cup 2009


Page 1: KDD Cup 2009

KDD Cup 2009 – Fast Scoring on a Large Database. Presentation of the Results at the KDD Cup Workshop, June 28, 2009. The Organizing Team.

Page 2: KDD Cup 2009

KDD Cup 2009 Organizing Team

Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien

Beta testing and proceedings editor:
• Gideon Dror

Web site design:
• Olivier Guyon (MisterP.net, France)

Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel

Page 3: KDD Cup 2009

Thanks to our sponsors…

Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS

Page 4: KDD Cup 2009

KDD Cup Participation By Year

[Bar chart: number of participating teams per KDD Cup, 1997–2009; values tabulated below]

Year   # Teams
1997      45
1998      57
1999      24
2000      31
2001     136
2002      18
2003      57
2004     102
2005      37
2006      68
2007      95
2008     128
2009     453

Record KDD Cup Participation

Page 5: KDD Cup 2009

Participation Statistics
• 1299 registered teams
• 7865 entries
• 46 countries:

Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay

Page 6: KDD Cup 2009

A worldwide operator

One of the main telecommunication operators in the world

Providing services to more than 170 million customers on five continents

Including 120 million under the Orange brand

Page 7: KDD Cup 2009

KDD Cup 2009 organized by Orange – Customer Relationship Management (CRM)

Three marketing tasks: predict the propensity of customers

– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling

Objective: improve the return on investment (ROI) of marketing campaigns

– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective

Better prediction leads to better ROI

Page 8: KDD Cup 2009

Train and deploy requirements

– About one hundred models per month

– Fast data preparation and modeling

– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution

Train data
– Hundreds of thousands of instances
– Tens of thousands of variables

Deployment
– Tens of millions of instances

Data, constraints and requirements

Page 9: KDD Cup 2009

In-house system – From raw data to scoring models

[Figure: entity-relationship diagram, a French conceptual data model ("Modèle Conceptuel de Données", model MCD PAC_v4, diagram "Tiers Services", author claudebe, 14/06/2005). It links customers (Tiers), households (Foyer), addresses (Adresse), socio-professional categories (CSP), risk classes, payer data, relational circles, commercial offers (Offre, Offre commerciale, Offre composée), product ranges (Gamme), products & services, usage functions (Fonction Usage), installed-base elements (Elément De Parc), billing accounts (Compte Facturation), invoices and invoice lines (Facture, Ligne Facture), usage records (Compte Rendu Usage), media, and operators, with their cardinalities (0,n / 1,1 / …).]

Data warehouse
– Relational database: customer, services, products, call details

Data mart
– Star schema

Feature construction
– PAC technology
– Generates tens of thousands of variables (e.g. Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month × weekday × hours × service, …)

Data preparation and modeling
– Khiops technology

Pipeline: data feeding → PAC → Khiops → scoring model
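PAC is an Orange-internal technology and not public; as a rough illustration of the kind of relational-to-tabular feature construction it performs, here is a minimal pandas sketch. The table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical call-detail table; PAC is proprietary, this only sketches
# the flattening of relational data into one feature row per customer.
calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "service":     ["voice", "sms", "voice", "voice", "data"],
    "month":       [1, 1, 1, 2, 2],
})

# One aggregate per dimension: calls per customer per month, and calls
# per customer per service ("Nb calls/month", "Nb calls/service", ...).
per_month = (calls.groupby(["customer_id", "month"]).size()
                  .unstack(fill_value=0).add_prefix("n_calls_month_"))
per_service = (calls.groupby(["customer_id", "service"]).size()
                    .unstack(fill_value=0).add_prefix("n_calls_service_"))

# Join the aggregates into the tabular representation used for modeling.
features = per_month.join(per_service)
print(features)
```

Crossing more dimensions (weekday × hour × service, etc.) is how such a generator reaches tens of thousands of variables.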

Page 10: KDD Cup 2009

Design of the challenge

Orange business objective

– Benchmark the in-house system against state of the art techniques

Data
– Data store: not an option
– Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
– Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization

Tasks
– Three representative marketing tasks

Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable

Page 11: KDD Cup 2009

Data sets extraction and preparation

Input data

– 10 relational tables
– A few hundred fields
– One million customers

Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions

Variable construction
– Using PAC technology
– 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge

Anonymization (a minimal code sketch follows this list)
– Discard variable names, discard identifiers
– Randomize order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names
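A minimal pandas sketch of those four anonymization steps; the rescaling range and the naming scheme are assumptions for illustration, not the actual Orange procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Rescale each numerical variable by a random factor
            # (the [0.5, 2.0] range is an assumption).
            out[col] = out[col] * rng.uniform(0.5, 2.0)
        else:
            # Recode each categorical variable using random category names.
            cats = out[col].dropna().unique()
            out[col] = out[col].map(
                {c: f"cat_{i}" for i, c in enumerate(rng.permutation(cats))})
    # Randomize the order of variables and discard their names.
    out = out[rng.permutation(out.columns)]
    out.columns = [f"Var{i + 1}" for i in range(out.shape[1])]
    return out
```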

Data samples
– 50 000 train and test instances sampled randomly
– 5000 validation instances sampled randomly from the test set

Page 12: KDD Cup 2009

Scientific and technical challenge

Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data

– Numerical with missing values
– Categorical with hundreds of values
– Heavily unbalanced distribution

KDD social meeting objective
– Attract as many participants as possible

– Additional small track and slow track
– Online feedback on validation dataset
– Toy problem (only one informative input variable)

– Reduce challenge protocol overhead
– One month to explore descriptive data and test the submission protocol

– Attractive conditions
– No intellectual property conditions
– Money prizes

Page 13: KDD Cup 2009

Business impact of the challenge

Bring Orange datasets to the data mining community
– Benefit for the community: access to challenging data
– Benefit for Orange: benchmark of numerous competing techniques; drive the research efforts towards Orange needs

Evaluate the Orange in-house system
– High number of participants and high quality of the results
– Orange in-house results:
– Improved by a significant margin when leveraging all business requirements
– Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness and understandability)
– Need to study the best challenge methods to get more insights

Page 14: KDD Cup 2009

KDD Cup 2009: Result Analysis

Best Result (period considered in the figure)

In-house system (downloadable: www.khiops.com)

Baseline (Naïve Bayes)
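The baseline in the figure is a Naïve Bayes classifier scored by test AUC, the challenge metric. As a minimal sketch of such a baseline with scikit-learn, with synthetic data standing in for the Orange tables:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: an unbalanced binary target, as in the churn task.
X, y = make_classification(n_samples=100_000, n_features=50,
                           weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # AUC on held-out test set
print(f"Naive Bayes baseline test AUC: {auc:.4f}")
```

On the real challenge data the baseline AUC would of course differ; the point is only the model family and the metric.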

Page 15: KDD Cup 2009

Overall – Test AUC – Fast

Legend: good result very quickly; best results (on each dataset); submissions

Page 16: KDD Cup 2009

Overall – Test AUC – Fast

Legend: good result very quickly; best results (on each dataset); submissions

In-house (Orange) system:
• No parameters
• On one standard laptop (single processor)
• Treating the three tasks as three different problems

Page 17: KDD Cup 2009

Overall – Test AUC – Fast

Very fast good result. Small improvement after the first day
(83.85 → 84.93)

Page 18: KDD Cup 2009

Overall – Test AUC – Slow

Very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?

Page 19: KDD Cup 2009

Overall – Test AUC – Submissions

Of the submissions scoring above 0.5:
– 23.24% were below the baseline
– 15.25% were above the in-house system
– 84.75% were below the in-house system

Page 20: KDD Cup 2009

Overall – Test AUC – 'Correlation' Test / Valid

Page 21: KDD Cup 2009

Overall – Test AUC – 'Correlation' Test / Train

Annotations on the scatter plot: random values submitted; boosting method or train target submitted (overfitting?)

Page 22: KDD Cup 2009

Overall – Test AUC

Panels: Test AUC at 12 hours, 24 hours, 5 days, and 36 days

Page 23: KDD Cup 2009

Overall – Test AUC

Panels: Test AUC at 12 hours vs. at 36 days

What does the extra time buy?
• time to adjust model parameters?
• time to train ensemble methods?
• time to find more processors?
• time to test more methods?
• time to unscramble?
• …

Difference between:
• the best result at the end of the first day, and
• the best result at the end of the 36 days
= 1.35%

Page 24: KDD Cup 2009

Test AUC = f(time)

Easier? Harder?

Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36]

Page 25: KDD Cup 2009

Test AUC = f(time)

Easier? Harder?

Difference between:
• the best result at the end of the first day, and
• the best result at the end of the 36 days
Churn = 1.84%, Appetency = 1.38%, Up-selling = 0.11%

Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36]

Page 26: KDD Cup 2009

Correlation: Test AUC / Valid AUC (5 days)

Easier? Harder?

Panels: Churn, Appetency, Up-selling – Test/Valid – days [0:5]

Page 27: KDD Cup 2009

Correlation: Train AUC / Valid AUC (36 days)

Difficult to draw any conclusion…

Panels: Churn, Appetency, Up-selling – Test/Train – days [0:36]

Page 28: KDD Cup 2009

Histogram: Test AUC / Valid AUC ([0:5] or ]5:36] days)

Knowledge (parameters?) found during the 5 days helps afterwards…?

Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36]

Page 29: KDD Cup 2009

Knowledge (parameters?) found during the 5 days helps afterwards…?

Histogram: Test AUC / Valid AUC ([0:5] or ]5:36] days)

YES!

Panels: Churn, Appetency, Up-selling – Test AUC – days [0:36] (top) and days ]5:36] (bottom)

Page 30: KDD Cup 2009

Fact Sheets: Preprocessing & Feature Selection

PREPROCESSING (overall usage = 95%), percent of participants:
– Principal Component Analysis
– Other preprocessing
– Grouping modalities
– Normalizations
– Discretization
– Replacement of the missing values

FEATURE SELECTION (overall usage = 85%), percent of participants:
– Wrapper with search (forward / backward wrapper)
– Embedded method
– Other FS
– Filter method
– Feature ranking
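To make the most common of these choices concrete (missing-value replacement, discretization, feature ranking with a filter), here is a minimal scikit-learn pipeline; the components and settings are generic stand-ins, not what any particular team used:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

pipe = Pipeline([
    # Replacement of the missing values (mean imputation).
    ("impute", SimpleImputer(strategy="mean")),
    # Discretization of numerical variables into 10 quantile bins.
    ("discretize", KBinsDiscretizer(n_bins=10, encode="ordinal")),
    # Feature ranking: keep the 200 variables with the highest mutual
    # information with the target (assumes >= 200 input variables).
    ("select", SelectKBest(mutual_info_classif, k=200)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Usage: pipe.fit(X_train, y_train); pipe.predict_proba(X_test)
```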

Page 31: KDD Cup 2009

Fact Sheets: Classifier

CLASSIFIER (overall usage = 93%), percent of participants:
– Bayesian Neural Network
– Bayesian Network
– Nearest neighbors
– Naïve Bayes
– Neural Network
– Other classifiers
– Non-linear kernel
– Linear classifier
– Decision tree …

– About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
– Less than 50% used regularization (20% 2-norm, 10% 1-norm).
– Only 13% used unlabeled data.
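These losses map directly onto off-the-shelf models; a minimal sketch with scikit-learn (assuming a recent version, >= 1.1, for the loss names). SGDClassifier does not implement the exponential loss, so AdaBoost stands in for it:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier

# One linear classifier per loss family from the fact sheet; the
# penalties illustrate the 2-norm / 1-norm regularization split.
models = {
    "logistic loss, 2-norm": SGDClassifier(loss="log_loss", penalty="l2"),
    "squared loss, 2-norm":  SGDClassifier(loss="squared_error", penalty="l2"),
    "hinge loss, 1-norm":    SGDClassifier(loss="hinge", penalty="l1"),
    # AdaBoost minimizes an exponential loss via stagewise additive modeling.
    "exponential loss":      AdaBoostClassifier(),
}
# Usage: models["hinge loss, 1-norm"].fit(X_train, y_train)
```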

Page 32: KDD Cup 2009

Fact Sheets: Model Selection

MODEL SELECTION (overall usage = 90%), percent of participants:
– Bayesian
– Bi-level
– Penalty-based
– Virtual leave-one-out
– Other cross-validation
– Other model selection
– Bootstrap estimate
– Out-of-bag estimate
– K-fold or leave-one-out
– 10% test set

– About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
– About 10% used unscrambling.
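A minimal sketch of the dominant pattern in these fact sheets: K-fold cross-validation used to choose between a bagging and a boosting ensemble, scored by AUC. Synthetic data and default settings, purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# 5-fold cross-validation, AUC as the model-selection criterion.
for name, model in [("bagging", BaggingClassifier(random_state=0)),
                    ("boosting", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean CV AUC = {scores.mean():.4f}")
```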

Page 33: KDD Cup 2009

Fact Sheets: Implementation

Parallelism: none, multi-processor, run in parallel
Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
Software platform: Java, Matlab, C, C++, Other (R, SAS)
Operating system: Windows, Linux, Unix, Mac OS

Page 34: KDD Cup 2009

Winning methods

Fast track:
– IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by means, extra features constructed, etc.).
– ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
– David Slate & Peter Frey, USA: Grouping of modalities / discretization, filter feature selection, ensemble of decision trees.

Slow track:
– University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
– Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
– National Taiwan University (+): Average of 3 classifiers: (1) the joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes.

(+: small-dataset unscrambling)
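The winners' exact systems (TreeNet, custom ensembles) are not available as code; in the same spirit, here is a minimal sketch of boosted decision trees that tolerate missing values without imputation, using scikit-learn's HistGradientBoostingClassifier (a modern stand-in, not any winner's method) on synthetic data:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))
X[rng.random(X.shape) < 0.2] = np.nan            # heavy missingness, as in the Orange data
y = (np.nan_to_num(X[:, 0]) > 0.8).astype(int)   # synthetic unbalanced target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Boosted trees route missing values down a learned default branch,
# one reason tree ensembles dominated the winning entries.
clf = HistGradientBoostingClassifier(max_iter=200).fit(X_tr, y_tr)
print(f"test AUC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.4f}")
```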

Page 35: KDD Cup 2009

Conclusion

Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:

– A problem of real industrial interest with challenging scientific and technical aspects

– Prizes.

Lessons learned:
– Do not underestimate the participants: five days were given for the fast challenge; a few hours sufficed for some participants.

– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and lots of missing values.