© Deloitte Consulting, 2004
Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia
March 11-12, 2004
Themes
- What is Data Mining? How does it relate to statistics?
- Insurance applications
- Data sources
- The Data Mining Process
- Model Design
- Modeling Techniques
- Louise Francis' presentation
Themes
How does data mining need actuarial science?
- Variable creation
- Model design
- Model evaluation

How does actuarial science need data mining?
- Advances in computing and modeling techniques
- Ideas from other fields can be applied to insurance problems
Themes
“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”
-- Ian Hacking
Data mining gives us new ways of approaching the age-old problems of risk selection and pricing….
….and other problems not traditionally considered ‘actuarial’.
What is Data Mining?
What is Data Mining?
My definition: "Statistics for the Computer Age"
- Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of "statistics"
- Not a radical break with traditional statistics; complements and builds on traditional statistics
- Statistics enriched with the brute-force capabilities of modern computing
- Opens the door to new techniques
- Therefore Data Mining tends to be associated with industrial-sized data sets
Buzz-words
- Data Mining
- Knowledge Discovery
- Machine Learning
- Statistical Learning
- Predictive Modeling
- Supervised Learning
- Unsupervised Learning
- …etc.
What is Data Mining?
Supervised learning: predict the value of a target variable based on several predictive variables
- "Predictive Modeling"
- Credit / non-credit scoring engines
- Retention, cross-sell models

Unsupervised learning: describe associations and patterns along many dimensions without any target information
- Customer segmentation
- Data clustering
- Market basket analysis ("diapers and beer")
So Why Should Actuaries Do This Stuff?
Any application of statistics requires subject-matter expertise
- Psychometricians, econometricians, bioinformaticians, marketing scientists… are all applied statisticians with a particular subject-matter expertise and area of specialty
- Add actuarial modelers to this list! "Insurometricians"!?
- Actuarial knowledge is critical to the success of insurance data mining projects
Three Concepts
- Scoring engines: a "predictive model" by any other name…
- Lift curves: how much worse than average are the policies with the worst scores?
- Out-of-sample tests: how well will the model work in the real world? An unbiased estimate of predictive power
Classic Application: Scoring Engines
Scoring engine: a formula that classifies or separates policies (or risks, accounts, agents…) into
- profitable vs. unprofitable
- retaining vs. non-retaining
- …

A (non-)linear equation f( ) of several predictive variables produces a continuous range of scores:

score = f(X1, X2, …, XN)
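A minimal sketch of such a scoring engine, assuming a simple linear f( ). The variable names and weights are illustrative assumptions, not from the presentation:

```python
def score(policy, weights, intercept=0.0):
    """Combine predictive variables X1..XN into one continuous score."""
    return intercept + sum(w * policy[name] for name, w in weights.items())

# Hypothetical policy with three predictive variables
policy = {"vehicle_age": 8, "prior_claims": 1, "credit_tier": 3}
weights = {"vehicle_age": 0.5, "prior_claims": 2.0, "credit_tier": -1.0}

s = score(policy, weights)  # 0.5*8 + 2.0*1 - 1.0*3 = 3.0
```

In practice f( ) may be any fitted model (GLM, tree, neural net); the point is that it reduces many predictive variables to a single continuous score.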
What “Powers” a Scoring Engine?
Scoring engine:

score = f(X1, X2, …, XN)

- The X1, X2, …, XN are at least as important as the f( )!
- Again, this is why actuarial expertise is necessary
- Think of the predictive power of credit variables
- A large part of the modeling process consists of variable creation and selection
- Usually possible to generate hundreds of variables
- Steepest part of the learning curve
Model Evaluation: Lift Curves
- Sort data by score
- Break the dataset into 10 equal pieces
- Best "decile": lowest score → lowest loss ratio
- Worst "decile": highest score → highest loss ratio
- Difference: "lift"
- Lift = segmentation power
- Lift translates into the ROI of the modeling project
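The decile-lift calculation above can be sketched as follows; the data here is synthetic, purely for illustration:

```python
def decile_lift(scores, loss_ratios, n_buckets=10):
    """Sort policies by score, split into equal buckets, and return the
    average loss ratio per bucket (lowest-score bucket first)."""
    paired = sorted(zip(scores, loss_ratios))
    size = len(paired) // n_buckets
    buckets = []
    for i in range(n_buckets):
        chunk = paired[i * size:(i + 1) * size]
        buckets.append(sum(lr for _, lr in chunk) / len(chunk))
    return buckets

# 100 synthetic policies where a higher score means a higher loss ratio
scores = list(range(100))
loss_ratios = [0.5 + 0.005 * s for s in scores]

buckets = decile_lift(scores, loss_ratios)
lift = buckets[-1] - buckets[0]   # worst-decile LR minus best-decile LR
```

A large gap between the worst and best deciles is what "segmentation power" means in practice; a flat curve means the model is not separating risks.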
Out-of-Sample Testing
- Randomly divide the data into 3 pieces: Training data, Test data, Validation data
- Use Training data to fit models
- Score the Test data to create a lift curve
- Perform the train/test steps iteratively until you have a model you're happy with
- During this iterative phase, Validation data is set aside in a "lock box"
- Once the model has been finalized, score the Validation data and produce a lift curve: an unbiased estimate of future performance
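The three-way split can be sketched as below; the 60/20/20 proportions are an assumption, not from the slides:

```python
import random

def three_way_split(records, seed=0, train_frac=0.6, test_frac=0.2):
    """Randomly partition records into train / test / validation sets.
    The remainder after train and test becomes the validation 'lock box'."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

train_set, test_set, valid_set = three_way_split(list(range(1000)))
```

Only the train and test pieces are touched during the iterative model-building phase; the validation piece is scored exactly once, at the end.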
Data Mining: Applications
The classic: the Profitability Scoring Model, with underwriting/pricing applications. Others:
- Credit models
- Retention models
- Elasticity models
- Cross-sell models
- Lifetime Value models
- Agent/agency monitoring
- Target marketing
- Fraud detection
- Customer segmentation: no target variable ("unsupervised learning")
Skills needed
- Statistical: beyond college/actuarial exams… a fast-moving field
- Actuarial: the subject-matter expertise
- Programming! Need scalable software and a computing environment
- IT / Systems Administration: data extraction, data load, model implementation
- Project Management: absolutely critical because of the scope and multidisciplinary nature of data mining projects
Data Sources
Company's internal data:
- Policy-level records
- Loss & premium transactions
- Billing
- VIN
- …

Externally purchased data:
- Credit
- CLUE
- MVR
- Census
- …
The Data Mining Process
Raw Data
- Research/evaluate possible data sources: availability, hit rate, implementability, cost-effectiveness
- Extract/purchase data
- Check data for quality (QA)
- At this stage, data is still in a "raw" form
- Often start with voluminous transactional data
- Much of the data mining process is "messy"
Variable Creation
- Create predictive and target variables: need good programming skills; need domain and business expertise
- Steepest part of the learning curve
- Discuss specifics of variable creation with company experts: underwriters, actuaries, marketers…
- An opportunity to quantify tribal wisdom
Variable Transformation
- Univariate analysis of predictive variables
- Exploratory Data Analysis (EDA)
- Data visualization
- Use EDA to cap / transform predictive variables: extreme values, missing values, …etc.
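A small sketch of the capping/transforming step. The percentile cap and mean-fill choices here are illustrative assumptions; in a real project EDA would drive those choices:

```python
def cap_and_fill(values, cap_pct=0.90):
    """Cap values above the cap_pct percentile; fill missing values (None)
    with the mean of the observed values."""
    present = sorted(v for v in values if v is not None)
    cap = present[min(int(cap_pct * len(present)), len(present) - 1)]
    mean = sum(present) / len(present)
    return [mean if v is None else min(v, cap) for v in values]

# Hypothetical variable: one missing value and one extreme outlier
raw = list(range(1, 20)) + [None, 1000]
cleaned = cap_and_fill(raw)
```

The outlier (1000) is pulled down to the 90th-percentile value, and the missing entry is replaced with the (outlier-inflated) mean, which is itself a reason to cap before filling in real work.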
Multivariate Analysis
- Examine correlations among the variables
- Weed out redundant, weak, or poorly distributed variables
- Model design
- Build candidate models: Regression/GLM, Decision Trees/MARS, Neural Networks
- Select the final model
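One simple way to weed out redundant variables is to drop one of any pair that is nearly perfectly correlated. This sketch uses plain Pearson correlation; the 0.95 cutoff and the variable names are assumptions for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(variables, threshold=0.95):
    """Keep the first variable of any pair whose |correlation| exceeds threshold."""
    kept = []
    for name, vals in variables.items():
        if all(abs(pearson(vals, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

variables = {
    "credit_score":      [1.0, 2.0, 3.0, 4.0],
    "credit_score_copy": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated twin
    "vehicle_age":       [4.0, 1.0, 3.0, 2.0],
}
kept = drop_redundant(variables)
```

Here the duplicated credit variable is dropped while the weakly correlated vehicle-age variable survives; real projects would also check distributions and predictive strength before discarding anything.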
Model Analysis & Implementation
- Perform model analytics: necessary for the client to gain comfort with the model
- Calibrate models: create a user-friendly "scale" (the client dictates)
- Implement models: programming skills are again critical
- Monitor performance: distribution of scores/variables, usage of the models, …etc.
- Plan a model maintenance schedule
Model Design
Where Data Mining Needs Actuarial Science
Model Design Issues
Which target variable to use?
- Frequency & severity
- Loss ratio, other profitability measures
- Binary targets: defection, cross-sell
- …etc.

How to prepare the target variable?
- Period: 1-year or multi-year?
- Losses evaluated as of what date?
- Cap large losses? Cat losses?
- How / whether to re-rate and adjust premium?
- What counts as a "retaining" policy?
- …etc.
Model Design Issues
Which data points to include/exclude?
- Certain classes of business?
- Certain states?
- …etc.

Which variables to consider?
- Credit, or non-credit only?
- Include rating variables in the model?
- Exclude certain variables for regulatory reasons?
- …etc.

What is the "level" of the model?
- Policy-term level, household level, risk level, …etc.
- Or should data be summarized into "cells" à la minimum bias?
Model Design Issues
How should the model be evaluated?
- Lift curves, gains chart, ROC curve?
- How to measure ROI?
- How to split data into train/test/validation? Or cross-validation?
- Is there enough data for the lift curve to be "credible"?
- Are your "incredible" results credible?
- …etc.

Not an exhaustive list: every project raises different actuarial issues!
Reference
My favorite textbook:
The Elements of Statistical Learning, by Jerome Friedman, Trevor Hastie, and Robert Tibshirani