KDD Cup Survey


Page 1: KDD Cup Survey

KDD Cup Survey

Xinyue Liu

Page 2: KDD Cup Survey

Outline

- Nuts and Bolts of KDD Cup
- KDD Cup 97-99
- KDD Cup 2000
- Summary

Page 3: KDD Cup Survey

About KDD Cup

A knowledge discovery and data mining tools competition held in conjunction with KDD conferences. It aims at:

- Showcasing the best methods for discovering higher-level knowledge from data
- Helping to close the gap between research and industry
- Stimulating further KDD research and development

Page 4: KDD Cup Survey

Statistics

[Bar chart: Access and Final Participation. Count (axis 0 to 200) of NDA requests (access to data) and participants for Cup 97, Cup 98, Cup 99, and Cup 2000.]

- Participation in KDD Cup grew steadily, especially requests to access the data
- Average person-hours per submission: 204
- Maximum person-hours per submission: 910
- Use of commercial software grew from 44% (Cup 97) to 52% (Cup 98) to 77% (Cup 2000)

Page 5: KDD Cup Survey

Algorithms

[Bar chart: Algorithms Tried vs Submitted. Number of entries (axis 0 to 20) per algorithm, tried versus submitted.]

- Decision trees were the most widely tried and by far the most commonly submitted

Page 6: KDD Cup Survey

KDD Cup 97

- A classification task: predict financial services industry direct mail response
- Winners:
  - Charles Elkan, a PhD from UC-San Diego, with his Boosted Naive Bayesian (BNB)
  - Silicon Graphics, Inc. with their software MineSet
  - Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

Page 7: KDD Cup Survey

BNB

- Boosting: learn a series of classifiers in which each classifier pays more attention to the examples misclassified by its predecessor. Repeated for T rounds.
- BNB: representationally equivalent to a multilayer perceptron with a single hidden layer.
- Complexity: O(ef), where e is the number of examples and f is the number of attributes.
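
The boosting loop described on this slide can be sketched in a few lines. This is a minimal AdaBoost-style illustration over a weighted naive Bayes learner on binary attributes; it is not Elkan's implementation. The function names, the smoothing constants, and the stopping rule are assumptions made for the example. Each round makes one pass over the examples and attributes, which is where the O(ef) cost per round comes from.

```python
import math

def train_weighted_nb(X, y, w):
    """Fit naive Bayes on binary features, counting each example with weight w[i]."""
    total = sum(w)
    model = {}
    for c in (0, 1):
        wc = sum(wi for wi, yi in zip(w, y) if yi == c)
        log_prior = math.log((wc + 1e-3) / (total + 2e-3))
        # Smoothed, weighted estimate of P(x_j = 1 | class c) for each attribute j.
        probs = []
        for j in range(len(X[0])):
            wj = sum(wi for wi, xi, yi in zip(w, X, y) if yi == c and xi[j] == 1)
            probs.append((wj + 1e-3) / (wc + 2e-3))
        model[c] = (log_prior, probs)
    return model

def nb_predict(model, x):
    """Return the class (0 or 1) with the larger log-posterior score."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        s = log_prior
        for xj, p in zip(x, probs):
            s += math.log(p if xj == 1 else 1.0 - p)
        scores[c] = s
    return 1 if scores[1] > scores[0] else 0

def boost_nb(X, y, rounds):
    """Train up to `rounds` classifiers, reweighting misclassified examples upward."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # list of (vote weight, model)
    for _ in range(rounds):
        model = train_weighted_nb(X, y, w)
        preds = [nb_predict(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        # Assumed stopping rule: stop on a perfect or no-better-than-chance round.
        if err <= 0 or err >= 0.5:
            if not ensemble:
                ensemble.append((1.0, model))
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        # Misclassified examples gain weight, so the next classifier attends to them.
        w = [wi * math.exp(alpha if p != yi else -alpha)
             for wi, p, yi in zip(w, preds, y)]
        norm = sum(w)
        w = [wi / norm for wi in w]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all classifiers in the series."""
    vote = sum(alpha * (1 if nb_predict(m, x) == 1 else -1) for alpha, m in ensemble)
    return 1 if vote > 0 else 0
```

Calling boost_nb(X, y, rounds=T) with a list of 0/1 attribute vectors and 0/1 labels returns the series of classifiers; boosted_predict then combines them by weighted vote.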

Page 8: KDD Cup Survey

MineSet

- A KDD tool that combines data access, transformation, classification, and visualization.

Page 9: KDD Cup Survey

KDD Cup 98

- URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
- A classification task: analyze fund-raising mail responses to a non-profit organization
- Winners:
  - Urban Science Applications, Inc. with their software GainSmarts
  - SAS Institute, Inc. with their software Enterprise Miner
  - Quadstone Limited with their software Decisionhouse

Page 10: KDD Cup Survey

GainSmarts

- GainSmarts: a feature selection expert system
- First step: used logistic regression to assign each prospect a probability of donation (Pi)
- Second step: used linear regression to estimate the conditional donation amount of responding donors (Ai)
- Result (<1% error): Prediction = Pi * Ai
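
The two-step prediction can be illustrated with a short sketch. It multiplies a response probability Pi by a conditional amount Ai, as the slide states. The mailing rule (mail only when the expected donation exceeds the cost of one mailing), the cost_per_mail parameter, and the sample numbers are assumptions added for illustration and are not taken from the slides.

```python
def expected_donation(p_response, amount_if_response):
    """Prediction = Pi * Ai: probability of donating times the conditional amount."""
    return p_response * amount_if_response

def select_for_mailing(prospects, cost_per_mail):
    """Return ids whose expected donation exceeds the (hypothetical) mailing cost.

    prospects is a list of (prospect_id, Pi, Ai) tuples.
    """
    return [pid for pid, p, a in prospects
            if expected_donation(p, a) > cost_per_mail]

# Illustrative values only.
prospects = [("a", 0.05, 20.0), ("b", 0.01, 10.0), ("c", 0.5, 1.0)]
chosen = select_for_mailing(prospects, cost_per_mail=0.68)
```

With these made-up numbers only prospect "a" is selected, since 0.05 * 20.0 = 1.0 is the only expected donation above the assumed cost.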

Page 11: KDD Cup Survey

Enterprise Miner

- A data mining solution that addresses the entire data mining process
- SEMMA process: Sample, Explore, Modify, Model, Assess
- Algorithms: decision tree, neural network, regression

Page 12: KDD Cup Survey

Decisionhouse

- Decisionhouse: an integrated modelling software suite by Quadstone
- Data exploration using visualization modules
- Uses decision trees and scorecards to model more complex tasks
- Chooses the final model by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve)

Page 13: KDD Cup Survey

Results

[Chart: a dollar axis ($0 to $70,000) against a percentage axis (0% to 100%), with curves for GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse.]

- Maximum Possible Profit Line: $72,776 in profits with 4,873 mailed
- Mail to Everyone Solution: $10,560 in profits with 96,367 mailed
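
To put the two reference lines in perspective, the figures on this slide can be turned into profit per mailed piece. The sketch below only divides the numbers given above; the function and variable names are illustrative.

```python
def profit_per_piece(total_profit, pieces_mailed):
    """Average profit in dollars for each piece of mail sent."""
    return total_profit / pieces_mailed

# Figures taken from the slide.
mail_to_everyone = profit_per_piece(10_560, 96_367)   # roughly $0.11 per piece
maximum_possible = profit_per_piece(72_776, 4_873)    # roughly $14.93 per piece
```

The gap between about eleven cents and about fifteen dollars per piece is the room a targeting model has to work with.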

Page 14: KDD Cup Survey

KDD Cup 99

- URL: www.cse.ucsd.edu/users/elkan/kdresults.html
- Problem: same data set as KDD Cup 98
- Winners:
  - SAS Institute Inc. with their software Enterprise Miner
  - Amdocs with their Information Analysis Environment

Page 15: KDD Cup Survey

Software

- SAS: used a two-stage model built from two multilayer perceptron (MLP) neural network models.
- Amdocs: used its own Information Analysis Environment, which allows modeling of the value and the class membership simultaneously. The algorithm used is a hybrid logistic regression model.

Page 17: KDD Cup Survey

Data Set

- Data collected from Gazelle.com, a legwear and legcare web retailer
- Pre-processed
- Training set: two months
- Test sets: one month
- Data collected includes:
  - Click streams
  - Order information
  - Registration form

Page 18: KDD Cup Survey

Problems

- The goal: design models to support web-site personalization and to improve the profitability of the site by increasing customer response.
- Questions: given a set of page views,
  1. Will the visitor view another page on the site or leave?
  2. Which product brand will the visitor view in the remainder of the session?
  3. Characterize heavy spenders.
  4. Characterize killer pages.
  5. Characterize which product brand a visitor will view in the remainder of the session.

Page 19: KDD Cup Survey

Evaluation

- Accuracy/score was measured for the two questions that had test sets
- Insight questions were judged with the help of retail experts from Gazelle and Blue Martini
  - Created a list of insights from all participants
  - Each insight was given a weight
  - Each participant was scored on all insights
- Additional factors:
  - Presentation quality
  - Correctness
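
One way to picture the insight scoring is as a weighted sum. The slides say only that each insight was weighted and each participant was scored on all insights; the weighted-sum rule, the partial-credit scale from 0 to 1, and the insight labels below are assumptions for illustration.

```python
def insight_score(weights, credits):
    """Sum of weight * credit over the master list of insights.

    weights: insight name -> importance weight assigned by the judges
    credits: insight name -> credit in [0, 1] earned by one participant;
             insights the participant did not report count as 0.
    """
    return sum(weight * credits.get(name, 0.0) for name, weight in weights.items())

# Hypothetical master list and one participant's credits.
weights = {"bots": 3.0, "referrers": 2.0, "aol_users": 1.0}
credits = {"bots": 1.0, "referrers": 0.5}
score = insight_score(weights, credits)
```

Here the participant earns full credit for one insight and half credit for another, giving 3.0 + 1.0 = 4.0 out of a possible 6.0.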

Page 20: KDD Cup Survey

The Winners

- Question 1 & 5 Winner: Amdocs
- Question 2 & 3 Winner: Salford Systems
- Question 4 Winner: e-steam poster

Page 21: KDD Cup Survey

Software (Amdocs)

- Exploratory Data Analysis: SAS
- Classification Tree: Amdocs Business Insight Tool
- Decision tree
- Rules extraction
- Modeling
- Combining models

Page 22: KDD Cup Survey

Scheme

[Flowchart. Node and branch labels recoverable from the slide: All Data; Bot/Crawler? (Yes/No); Partial Testing Pattern? (Yes/No); one-click; multi-click; Score Model: 0 for all; Score Model: 1 for all; Score Model: 5 rules (0 or 1); Main Model (Best Rule, Merged, Hybrid); Score Model: Ensemble; Final Prediction Model.]

Page 23: KDD Cup Survey

Main Model

[Diagram. Components recoverable from the slide: Decision Tree (5 trees built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); sub-models Best Rule, Hybrid Model, and Merged Rules.]

Page 24: KDD Cup Survey

Sub-models

- Best Rule: chooses the most accurate rule satisfied by each record
- Hybrid Model: logistic regression on the rule set plus raw field values, combined to define a score for each record
- Merged Rules: logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies

Each model captures a different aspect of the overall behavior in the data. Combining or ensembling the models provides the best prediction results.
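
The Best Rule idea, picking the most accurate rule that a record satisfies, can be sketched as follows. This is an illustration under stated assumptions, not the Amdocs implementation: the Rule structure, the field names page_views and is_returning, the default prediction, and the plain averaging in ensemble_score are all hypothetical, since the slides do not say how the sub-model scores were combined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    condition: Callable[[Dict[str, float]], bool]  # does the record satisfy the rule?
    prediction: int                                # 1 = continues, 0 = leaves
    accuracy: float                                # accuracy measured on training data

def best_rule_score(record: Dict[str, float], rules: List[Rule], default: int = 0) -> int:
    """Return the prediction of the most accurate rule the record satisfies."""
    matching = [rule for rule in rules if rule.condition(record)]
    if not matching:
        return default  # assumed fallback when no rule fires
    return max(matching, key=lambda rule: rule.accuracy).prediction

def ensemble_score(scores: List[float]) -> float:
    """Assumed combination rule: unweighted average of the sub-model scores."""
    return sum(scores) / len(scores)

# Hypothetical rules over hypothetical session fields.
rules = [
    Rule(lambda r: r["page_views"] > 3, prediction=1, accuracy=0.7),
    Rule(lambda r: r["is_returning"] == 1, prediction=1, accuracy=0.8),
    Rule(lambda r: r["page_views"] <= 1, prediction=0, accuracy=0.9),
]
```

A returning visitor with a single page view matches two of these rules; the more accurate one (0.9) decides, so the record is scored 0.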

Page 25: KDD Cup Survey

Software (Salford)

- CART: a decision tree tool that automatically searches for and isolates significant patterns and relationships
- MARS: a multivariate non-parametric regression procedure
- HotSpotDetector
- TreeNet

Page 26: KDD Cup Survey

CART

- Binary recursive partitioning.
- Key elements:
  - Splitting rules: brute-force search of all possible splits for all variables; rank each splitting rule on the basis of a quality-of-split criterion (default: Gini).
  - Recursion: split until further splitting is impossible or stopped.
  - Class assignment: plurality rule; assign a class to every node, whether it is terminal or not.
  - Pruning trees: growth does not stop in the middle; the tree is pruned back afterward.
  - Testing: the best sub-tree is the one with the lowest error.
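
The splitting step, a brute-force search over every variable and every candidate threshold ranked by Gini impurity, can be sketched as below. This is a minimal illustration for numeric features, not Salford's CART code; the function names and the choice of observed values as thresholds are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and every observed value as a threshold.

    Returns (weighted_gini, feature_index, threshold) for the split
    `row[feature_index] <= threshold` with the lowest weighted impurity,
    or None if no split separates the rows.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send records to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best
```

Applying best_split recursively to the left and right subsets until no split is possible gives the maximal tree that the pruning step then cuts back.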

Page 27: KDD Cup Survey

MARS

- Automatic variable search
- Automatic variable transformation
- Automatic limited interaction searches
- Variable nesting
- Built-in testing regimens
- Model selection parameters

Page 28: KDD Cup Survey

Insights (Heavy Spenders)

Some of the good insights:

- Referrers: establish ad policy based on conversion rates, not click-throughs
- Not an AOL user: browser window too small for layout
- Referring site traffic changed dramatically over time
- Came to site from print ad or news, not friends and family
- Very high and very low income
- Geographic: Northeast U.S. states
- Repeat visitors

Page 29: KDD Cup Survey

Insights (Who leaves?)

Some of the good insights:

- Crawlers and bots accounted for 16% of sessions
- Long processing time (> 12 seconds) implies high abandonment
- Referring sites: visitors from mycoupons have long sessions; visitors from shopnow.com are prone to exit quickly
- Returning visitors' probability of continuing is double
- Views of specific products (Oroblue, Levante) cause abandonment
- Probability of leaving decreases with page views
- Free Gift and Welcome templates on the first three pages encouraged visitors to stay at the site

Page 30: KDD Cup Survey

Insights (Brand view)

Some good insights:

- Referrer URL is a great predictor:
  - Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan
  - mycoupons.com, tripod, and deal-finder are referrers for American Essentials
- Previous views of a product imply later views

Page 31: KDD Cup Survey

Summary

- Data mining requires background knowledge and access to business users
- Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight
- Web mining is challenging: crawlers/bots, frequent site changes, etc.
- KDD Cup is an excellent source for learning state-of-the-art KDD techniques
- KDD Cup data is available for research and education

Page 32: KDD Cup Survey

References

- Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD.
- Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html
- Urban Science (1998). Urban Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html
- Georges, J. & Milley, A. (1999). KDD'99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf
- Rosset, S. & Inger A. (1999). KDD-Cup 99: Knowledge Discovery In a Charitable Organization's Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc

Page 33: KDD Cup Survey

References (Cont.)

- Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf
- Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner's Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt
- Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner's Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt
- Salford Systems white papers: http://www.salford-systems.com/whitepaper.html
- Summary talk presented at KDD (2000): http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt