April 11, 2008
Data Mining Competition 2008
The 4th Annual Business Intelligence Symposium
Hualin Wang
Manager of Advanced Analytics, Retail Marketing Insights
Alliance Data, Columbus, Ohio
April 11, 2008 – Data Mining Competition 2008 Presentation 2
About Alliance Data
Alliance Data develops data-driven solutions that help partners build lasting relationships with their customers. As one of the largest providers of retail and co-brand card services, loyalty and marketing solutions, payment processing, and business process outsourcing, we serve the retail, petroleum, utility, financial services and hospitality markets.
Approach Summary
Exploratory Data Analysis
• Identify data issues
• Re-code variables
• Transform variables
• Frequency, UNIVARIATE, BIVARIATE, ANOVA analysis, etc.
Modeling Methodology
• LOGISTIC & PROBIT regression models
• Develop a set of regression models of both types on bootstrap samples with a range of weights for responders and non-responders.
Ensemble Models
• Ensemble the set of LOGISTIC & PROBIT models
Exploratory Data Analysis
Missing Imputation – Substitute missing values with the mean, median, mode, ‘logical’ values, or other values suggested by the bivariate results. Note: Twenty variables are formatted differently in the training and test datasets. For example, some variables have the value ‘YE’ in one dataset and ‘YES’ in the other, and X2 has the value HILLSBOROUGH in one set and HILLSBOROUG in the other.
Univariate / Bivariate – Check distributions, extreme values, trend and other patterns.
Significance Investigation – Conduct contingency table analysis to understand whether character variables and their levels are significant in predicting response.
Information Value – Compute information values.
Clustering Analysis – Reveal correlation among numerical variables.
Play the MUSIC gracefully or face it!
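The formatting mismatch noted under Missing Imputation (‘YE’ vs. ‘YES’, HILLSBOROUG vs. HILLSBOROUGH) looks like truncation in one of the datasets. The following is a minimal sketch of one way such levels could be reconciled, assuming truncation is the only difference. The function name harmonize_levels and the sample values other than those quoted above are hypothetical, and this is not necessarily the procedure used in the competition.

```python
def harmonize_levels(train_values, test_values):
    """Map each level to its full form across both datasets.

    Assumption: a level that is a strict prefix of exactly one other
    level in the combined set is a truncated copy of that level.
    """
    levels = set(train_values) | set(test_values)
    mapping = {}
    for level in levels:
        # Candidate full forms: other levels that start with this one.
        longer = [other for other in levels
                  if other != level and other.startswith(level)]
        # Remap only when the match is unambiguous; otherwise keep as-is.
        mapping[level] = longer[0] if len(longer) == 1 else level
    return mapping


# Hypothetical values for one character variable in each dataset.
train = ["YES", "NO", "HILLSBOROUGH", "ORANGE"]
test = ["YE", "NO", "HILLSBOROUG", "ORANGE"]

mapping = harmonize_levels(train, test)
clean_test = [mapping[value] for value in test]
print(clean_test)
```

Ambiguous prefixes are deliberately left untouched so that they can be reviewed by hand.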
Variable Creation
Capping – Extreme tails are typically capped to reduce their undue influence and to produce more robust parameter estimates.
Binning – Small and insignificant levels of character variables are regrouped.
Box-Cox Transformations – These transformations are commonly included, especially the square root and logarithm.
Johnson Transformations – Performed on numeric variables to make them more ‘normal’.
Weight of Evidence – Created for character variables and binned numeric variables.
Interaction – Explore possible interactions with the help of decision tree analyses.
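The slides do not state the formulas used for weight of evidence or information value. The sketch below uses the common definition, WoE = ln(share of responders in the level / share of non-responders in the level), with the information value as the sum over levels of (difference of the two shares) × WoE. The function name, the smoothing constant and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def weight_of_evidence(levels, responses, smoothing=0.5):
    """Return ({level: WoE}, information value) for one character variable.

    `smoothing` is added to every count so that a level with no
    responders (or no non-responders) does not produce log(0).
    """
    resp = Counter()
    nonresp = Counter()
    for level, y in zip(levels, responses):
        if y == 1:
            resp[level] += 1
        else:
            nonresp[level] += 1

    all_levels = set(resp) | set(nonresp)
    total_resp = sum(resp.values()) + smoothing * len(all_levels)
    total_nonresp = sum(nonresp.values()) + smoothing * len(all_levels)

    woe = {}
    iv = 0.0
    for level in all_levels:
        share_resp = (resp[level] + smoothing) / total_resp
        share_nonresp = (nonresp[level] + smoothing) / total_nonresp
        woe[level] = math.log(share_resp / share_nonresp)
        iv += (share_resp - share_nonresp) * woe[level]
    return woe, iv


# Toy data: level "A" has 8 responders out of 10, level "B" has 2 out of 10.
levels = ["A"] * 10 + ["B"] * 10
responses = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

woe, iv = weight_of_evidence(levels, responses)
print(woe, iv)
```

Replacing each level with its WoE value turns a character variable (or a binned numeric variable) into a single numeric model input.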
Modeling Methodology
Step 1 – Pick an integer from 3 through 16 and draw 10 bootstrap samples.
Step 2 – Develop a LOGISTIC model on each sample with responders’ weight equal to the integer and non-responders’ weight equal to 1.
Step 3 – Average the 10 probabilities to produce an ensemble LOGISTIC model. In this way, we create 14 ensemble LOGISTIC models, one for each integer from 3 through 16.
Steps 4-6 – Similarly, we obtain 14 ensemble PROBIT models.
Together there are 28 models.
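Steps 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the logistic fit is a plain fixed-step gradient ascent written for self-containment (the original work presumably used a statistics package), and all function names are hypothetical. The PROBIT ensembles of Steps 4 to 6 would follow the same pattern with the normal CDF as the link function.

```python
import math
import random


def predict(coef, x):
    """Logistic probability for one row; coef[0] is the intercept."""
    z = coef[0] + sum(c * xj for c, xj in zip(coef[1:], x))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logistic(rows, labels, weights, steps=100, lr=1.0):
    """Gradient ascent on the weighted log-likelihood (fixed step count)."""
    coef = [0.0] * (len(rows[0]) + 1)
    total_w = sum(weights)
    for _ in range(steps):
        grad = [0.0] * len(coef)
        for x, y, w in zip(rows, labels, weights):
            err = w * (y - predict(coef, x))
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        coef = [c + lr * g / total_w for c, g in zip(coef, grad)]
    return coef


def ensemble_logistic(rows, labels, score_rows, responder_weight,
                      n_samples=10, seed=0):
    """Average the predictions of models fit on bootstrap samples."""
    rng = random.Random(seed)
    n = len(rows)
    totals = [0.0] * len(score_rows)
    for _ in range(n_samples):
        # Step 1: draw a bootstrap sample (same size, with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [rows[i] for i in idx]
        ys = [labels[i] for i in idx]
        # Step 2: responders get the chosen weight, non-responders get 1.
        ws = [responder_weight if y == 1 else 1 for y in ys]
        coef = fit_weighted_logistic(xs, ys, ws)
        for k, x in enumerate(score_rows):
            totals[k] += predict(coef, x)
    # Step 3: average the probabilities from the bootstrap models.
    return [t / n_samples for t in totals]


# Synthetic training data with one predictor (an assumption for the demo).
data_rng = random.Random(1)
train_rows = [[data_rng.gauss(0.0, 1.0)] for _ in range(60)]
train_labels = [1 if data_rng.random() < predict([-1.0, 2.0], x) else 0
                for x in train_rows]
score_rows = [[-1.0], [0.0], [1.0]]

# One ensemble LOGISTIC model per responder weight from 3 through 16.
ensembles = {w: ensemble_logistic(train_rows, train_labels, score_rows, w)
             for w in range(3, 17)}
print(len(ensembles))
```

Larger responder weights shift the predicted probabilities upward, so the 14 ensembles differ mainly in how aggressively they score likely responders.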
Ensemble Models
Use each of the 28 models to rank the 95,960 observations in the test dataset from 95,960 down to 1 in order of decreasing predicted probability.
The average of the 28 ranks for each observation is the final score.
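A minimal sketch of this rank averaging, using three hypothetical models and four observations in place of the 28 models and 95,960 observations. The function names are illustrative, and tie handling is an assumption because the slides do not describe it.

```python
def ranks_ascending(probabilities):
    """Rank observations 1..n so that the highest probability gets rank n.

    Ties are broken by original order (an assumption for this sketch).
    """
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i])
    ranks = [0] * len(probabilities)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def rank_average(model_probabilities):
    """Average the per-model ranks into one final score per observation."""
    per_model = [ranks_ascending(p) for p in model_probabilities]
    n_models = len(per_model)
    return [sum(r) / n_models for r in zip(*per_model)]


# Hypothetical predicted probabilities: one list per model.
model_probabilities = [
    [0.10, 0.40, 0.35, 0.80],
    [0.20, 0.30, 0.50, 0.90],
    [0.05, 0.60, 0.55, 0.70],
]
final_scores = rank_average(model_probabilities)
print(final_scores)
```

Averaging ranks rather than raw probabilities puts LOGISTIC and PROBIT outputs, and models fit with different responder weights, on a common scale before combining them.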
What has been considered throughout the process?
The two judging criteria: the c-statistic and the response rate in the top 10K. The top-10K response rate rewards a model that pushes as many responders as possible to the very top, even if its rank ordering in the middle is weak. The c-statistic rewards a model that rank orders the whole population well. See the chart below.
Chart: Comparing Alternative Models – cumulative responders (0 to 1) by model decile (0 to 10) for Model I and Model II.
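The two criteria can be computed as sketched below. The function names and the toy scores are hypothetical, and the pairwise c-statistic shown here is meant only for small illustrations, since it compares every responder with every non-responder.

```python
def c_statistic(scores, labels):
    """Share of responder/non-responder pairs in which the responder has
    the higher score; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_k_response_rate(scores, labels, k):
    """Response rate among the k highest-scored observations."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


# Toy example: eight observations, four of them responders.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

c_stat = c_statistic(scores, labels)
top2 = top_k_response_rate(scores, labels, k=2)
top4 = top_k_response_rate(scores, labels, k=4)
print(c_stat, top2, top4)  # 0.75 1.0 0.75
```

In this toy case the top of the list is perfect (top-2 rate of 1.0) while the overall rank ordering is weaker (c-statistic of 0.75), which is the kind of trade-off the two criteria pull against each other.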
Modeling methods: There are a few options for modeling the response, such as LOGISTIC models, PROBIT models (or any other in the family), decision trees, SVM, TreeNet and neural networks. I decided to use the one that I had used before and that was known to work well in a similar situation. The slight difference is that this time I combined both LOGISTIC and PROBIT models instead of choosing one over the other.
My Experiences
• Play the MUSIC gracefully or face it! It usually pays off to develop disciplined procedures to discover and deal with data issues.
• Develop models with different methods and then combine them. In general, ensemble models outperform models with any single method.
• Spend a good amount of time trying to discover trends, patterns and other true data relationships. Make good use of them in modeling.
Thanks!
Many thanks:
To the Data Mining Program at University of Central Florida and BlueCross BlueShield of Florida for organizing and sponsoring the competition.
Especially to Professor Su for his analytical work and timely responses to our inquiries.
To the 4th Annual Business Intelligence Symposium for providing the opportunity for us to present and discuss the problem and the competition.