© Deloitte Consulting, 2004
Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia
March 11-12, 2004
Themes
- What is Data Mining? How does it relate to statistics?
- Insurance applications
- Data sources
- The Data Mining Process
- Model Design
- Modeling Techniques
- Louise Francis' presentation
Themes
How does data mining need actuarial science?
- Variable creation
- Model design
- Model evaluation

How does actuarial science need data mining?
- Advances in computing and modeling techniques
- Ideas from other fields can be applied to insurance problems
Themes
“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”
-- Ian Hacking
Data mining gives us new ways of approaching the age-old problems of risk selection and pricing….
….and other problems not traditionally considered ‘actuarial’.
What is Data Mining?
What is Data Mining?
My definition: "Statistics for the Computer Age"
- Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of "statistics"
- Not a radical break with traditional statistics; complements and builds on traditional statistics
- Statistics enriched with the brute-force capabilities of modern computing
- Opens the door to new techniques
- Therefore Data Mining tends to be associated with industrial-sized data sets
Buzz-words
- Data Mining
- Knowledge Discovery
- Machine Learning
- Statistical Learning
- Predictive Modeling
- Supervised Learning
- Unsupervised Learning
- …etc.
What is Data Mining?
Supervised learning: predict the value of a target variable based on several predictive variables
- "Predictive Modeling"
- Credit / non-credit scoring engines
- Retention, cross-sell models

Unsupervised learning: describe associations and patterns along many dimensions without any target information
- Customer segmentation
- Data clustering
- Market basket analysis ("diapers and beer")
So Why Should Actuaries Do This Stuff?
Any application of statistics requires subject-matter expertise
- Psychometricians, econometricians, bioinformaticians, marketing scientists… are all applied statisticians with a particular subject-matter expertise and area of specialty
- Add actuarial modelers to this list! "Insurometricians"!?
- Actuarial knowledge is critical to the success of insurance data mining projects
Three Concepts
- Scoring engines: a "predictive model" by any other name…
- Lift curves: how much worse than average are the policies with the worst scores?
- Out-of-sample tests: how well will the model work in the real world? An unbiased estimate of predictive power
Classic Application: Scoring Engines
Scoring engine: a formula that classifies or separates policies (or risks, accounts, agents…) into
- profitable vs. unprofitable
- retaining vs. non-retaining
- …

A (non-)linear equation f( ) of several predictive variables produces a continuous range of scores:

score = f(X1, X2, …, XN)
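A minimal sketch of such a scoring engine, assuming a simple linear f( ). The variable names and weights are illustrative assumptions, not from the presentation:

```python
def score(policy, weights, intercept=0.0):
    """Combine predictive variables X1..XN into one continuous score."""
    return intercept + sum(w * policy[name] for name, w in weights.items())

# Hypothetical policy with three predictive variables
policy = {"vehicle_age": 8, "prior_claims": 1, "credit_tier": 3}
weights = {"vehicle_age": 0.5, "prior_claims": 2.0, "credit_tier": -1.0}

s = score(policy, weights)  # 0.5*8 + 2.0*1 - 1.0*3 = 3.0
```

In practice f( ) may be any fitted model (GLM, tree, neural net); the point is that it reduces many predictive variables to a single continuous score.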
What “Powers” a Scoring Engine?
Scoring engine:

score = f(X1, X2, …, XN)

- The X1, X2, …, XN are at least as important as the f( )!
- Again, this is why actuarial expertise is necessary
- Think of the predictive power of credit variables
- A large part of the modeling process consists of variable creation and selection
- Usually possible to generate hundreds of variables
- Steepest part of the learning curve
Model Evaluation: Lift Curves
- Sort data by score
- Break the dataset into 10 equal pieces
- Best "decile": lowest score → lowest loss ratio
- Worst "decile": highest score → highest loss ratio
- Difference: "lift"
- Lift = segmentation power
- Lift translates into the ROI of the modeling project
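The decile-lift calculation above can be sketched as follows; the data here is synthetic, purely for illustration:

```python
def decile_lift(scores, loss_ratios, n_buckets=10):
    """Sort policies by score, split into equal buckets, and return the
    average loss ratio per bucket (lowest-score bucket first)."""
    paired = sorted(zip(scores, loss_ratios))
    size = len(paired) // n_buckets
    buckets = []
    for i in range(n_buckets):
        chunk = paired[i * size:(i + 1) * size]
        buckets.append(sum(lr for _, lr in chunk) / len(chunk))
    return buckets

# 100 synthetic policies where a higher score means a higher loss ratio
scores = list(range(100))
loss_ratios = [0.5 + 0.005 * s for s in scores]

buckets = decile_lift(scores, loss_ratios)
lift = buckets[-1] - buckets[0]   # worst-decile LR minus best-decile LR
```

A large gap between the worst and best deciles is what "segmentation power" means in practice; a flat curve means the model is not separating risks.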
Out-of-Sample Testing
- Randomly divide the data into 3 pieces: Training data, Test data, Validation data
- Use Training data to fit models
- Score the Test data to create a lift curve
- Perform the train/test steps iteratively until you have a model you're happy with
- During this iterative phase, Validation data is set aside in a "lock box"
- Once the model has been finalized, score the Validation data and produce a lift curve: an unbiased estimate of future performance
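The three-way split can be sketched as below; the 60/20/20 proportions are an assumption, not from the slides:

```python
import random

def three_way_split(records, seed=0, train_frac=0.6, test_frac=0.2):
    """Randomly partition records into train / test / validation sets.
    The remainder after train and test becomes the validation 'lock box'."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

train_set, test_set, valid_set = three_way_split(list(range(1000)))
```

Only the train and test pieces are touched during the iterative model-building phase; the validation piece is scored exactly once, at the end.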
Data Mining: Applications
The classic: the Profitability Scoring Model, with underwriting/pricing applications. Others:
- Credit models
- Retention models
- Elasticity models
- Cross-sell models
- Lifetime Value models
- Agent/agency monitoring
- Target marketing
- Fraud detection
- Customer segmentation: no target variable ("unsupervised learning")
Skills needed
- Statistical: beyond college/actuarial exams… a fast-moving field
- Actuarial: the subject-matter expertise
- Programming! Need scalable software and a computing environment
- IT / Systems Administration: data extraction, data load, model implementation
- Project Management: absolutely critical because of the scope and multidisciplinary nature of data mining projects
Data Sources
Company's internal data:
- Policy-level records
- Loss & premium transactions
- Billing
- VIN
- …

Externally purchased data:
- Credit
- CLUE
- MVR
- Census
- …
The Data Mining Process
Raw Data
- Research/evaluate possible data sources: availability, hit rate, implementability, cost-effectiveness
- Extract/purchase data
- Check data for quality (QA)
- At this stage, data is still in a "raw" form
- Often start with voluminous transactional data
- Much of the data mining process is "messy"
Variable Creation
- Create predictive and target variables: need good programming skills; need domain and business expertise
- Steepest part of the learning curve
- Discuss specifics of variable creation with company experts: underwriters, actuaries, marketers…
- An opportunity to quantify tribal wisdom
Variable Transformation
- Univariate analysis of predictive variables
- Exploratory Data Analysis (EDA)
- Data visualization
- Use EDA to cap / transform predictive variables: extreme values, missing values, …etc.
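A small sketch of the capping/transforming step. The percentile cap and mean-fill choices here are illustrative assumptions; in a real project EDA would drive those choices:

```python
def cap_and_fill(values, cap_pct=0.90):
    """Cap values above the cap_pct percentile; fill missing values (None)
    with the mean of the observed values."""
    present = sorted(v for v in values if v is not None)
    cap = present[min(int(cap_pct * len(present)), len(present) - 1)]
    mean = sum(present) / len(present)
    return [mean if v is None else min(v, cap) for v in values]

# Hypothetical variable: one missing value and one extreme outlier
raw = list(range(1, 20)) + [None, 1000]
cleaned = cap_and_fill(raw)
```

The outlier (1000) is pulled down to the 90th-percentile value, and the missing entry is replaced with the (outlier-inflated) mean, which is itself a reason to cap before filling in real work.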
Multivariate Analysis
- Examine correlations among the variables
- Weed out redundant, weak, or poorly distributed variables
- Model design
- Build candidate models: Regression/GLM, Decision Trees/MARS, Neural Networks
- Select the final model
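One simple way to weed out redundant variables is to drop one of any pair that is nearly perfectly correlated. This sketch uses plain Pearson correlation; the 0.95 cutoff and the variable names are assumptions for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(variables, threshold=0.95):
    """Keep the first variable of any pair whose |correlation| exceeds threshold."""
    kept = []
    for name, vals in variables.items():
        if all(abs(pearson(vals, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

variables = {
    "credit_score":      [1.0, 2.0, 3.0, 4.0],
    "credit_score_copy": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated twin
    "vehicle_age":       [4.0, 1.0, 3.0, 2.0],
}
kept = drop_redundant(variables)
```

Here the duplicated credit variable is dropped while the weakly correlated vehicle-age variable survives; real projects would also check distributions and predictive strength before discarding anything.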
Model Analysis & Implementation
- Perform model analytics: necessary for the client to gain comfort with the model
- Calibrate models: create a user-friendly "scale" (the client dictates)
- Implement models: programming skills are again critical
- Monitor performance: distribution of scores/variables, usage of the models, …etc.
- Plan a model maintenance schedule
Model Design
Where Data Mining Needs Actuarial Science
Model Design Issues
Which target variable to use?
- Frequency & severity
- Loss ratio, other profitability measures
- Binary targets: defection, cross-sell
- …etc.

How to prepare the target variable?
- Period: 1-year or multi-year?
- Losses evaluated as of what date?
- Cap large losses? Cat losses?
- How / whether to re-rate and adjust premium?
- What counts as a "retaining" policy?
- …etc.
Model Design Issues
Which data points to include/exclude?
- Certain classes of business?
- Certain states?
- …etc.

Which variables to consider?
- Credit, or non-credit only?
- Include rating variables in the model?
- Exclude certain variables for regulatory reasons?
- …etc.

What is the "level" of the model?
- Policy-term level, household level, risk level, …etc.
- Or should data be summarized into "cells" à la minimum bias?
Model Design Issues
How should the model be evaluated?
- Lift curves, gains chart, ROC curve?
- How to measure ROI?
- How to split data into train/test/validation? Or cross-validation?
- Is there enough data for the lift curve to be "credible"?
- Are your "incredible" results credible?
- …etc.

Not an exhaustive list: every project raises different actuarial issues!
Reference
My favorite textbook:
The Elements of Statistical Learning, by Jerome Friedman, Trevor Hastie, and Robert Tibshirani