2017 Predictive Analytics Symposium - SOA · A panel of Kaggle Masters share their tips on how to...

Post on 20-May-2020

2017 Predictive Analytics Symposium Session 35, Kaggle Contests--Tips From Actuaries Who Have Placed Well

Moderator: Kyle A. Nobbe, FSA, MAAA

Presenters:

Thomas DeGodoy

Shea Kee Parkes, FSA, MAAA

SOA Antitrust Compliance Guidelines

SOA Presentation Disclaimer

Predictive Modeling Contests

Tom de Godoy

CTO & Co-founder, DataRobot

● 15 years of experience in Insurance Analytics

● Previously, Director of Research & Modeling at Travelers Insurance

● Advisor to the DataRobot Insurance practice that works with a large number of insurance companies

DataRobot is an Automated Machine Learning Platform

● Founded in 2012, Funding over $100 million

● Experts with 70+ years of insurance analytics experience

● DataRobot’s insurance portfolio includes Fortune 100, Regional players, global players and InsurTechs

This session is based on my experience working with leading insurers and then founding a Machine Learning company that is helping hundreds of companies in their Machine Learning journey.

Why Kaggle?

- Money prizes? Glory?

- Learn Machine Learning by doing it!

- Be part of a large community of data scientists

Why Learn Machine Learning?

open source programming

democratization

Open-Source Innovations ➔ ML driven by open-source and academics

Low-cost computing

Disruptive Competition ➔ New business models around data

Unstructured data

Traditional data

“90% of the data in the world today has been created in the last two years alone”

Avalanche of new data ➔ “Big Data” environment: Velocity, Volume, Variety

Better Product

Better Service

Optimised Operations

Predictive Analytics is a Competition

Keys to Building a Competitive Advantage:

1. The ability to identify opportunities

2. The ability to execute on these opportunities

3. Better predictive models than your competitor

Keys to Winning This Competition

1. Knowledge of the data and of the business problem

2. Large and diverse set of algorithms

3. Robust model validation

4. Speed

Develop Models Better and Faster Than Your Competitors

Know Your Data

Simple ways to “know your data”:

- Data dictionary

- Simple profiles & summaries

- Interactive queries

Insights from machine learning models:

- Identify important features

- Visualize partial dependencies

- Discover non-linear effects and interactions

- Discover prediction outliers and their reasons

Most useful insights about the data come from machine learning models.
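The model insights listed above can be sketched with the standard library alone. This is a minimal illustration, not the presenter's method: `toy_model` is a hypothetical stand-in for a fitted model, and permutation importance is one common, model-agnostic way to rank features (others exist).

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def permutation_importance(predict, rows, y, n_features, seed=0):
    """Increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(y, [predict(r) for r in shuffled]) - baseline)
    return importances


def partial_dependence(predict, rows, j, grid):
    """Average prediction when feature j is forced to each grid value."""
    return [mean(predict(r[:j] + (v,) + r[j + 1:]) for r in rows) for v in grid]


# Hypothetical fitted model: non-linear in x0, weak in x1, ignores x2.
def toy_model(row):
    x0, x1, x2 = row
    return x0 ** 2 + 0.1 * x1


rng = random.Random(42)
rows = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(500)]
y = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, y, n_features=3)
pd_curve = partial_dependence(toy_model, rows, j=0, grid=[-1.0, 0.0, 1.0])
```

Here `importances` ranks x0 above x1 and gives x2 zero importance, and `pd_curve` is higher at both ends of the grid than in the middle, which is the kind of non-linear effect a simple univariate summary would miss.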

Quick Prototype & Rapid Iterations

[Diagram: prototype, socialize, rapid iteration, feedback loop]

Rapid iteration and early socialization are the key!

Leverage a Large and Diverse Set of Algorithms

“For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data”

Source: http://statweb.stanford.edu/~tibs/ElemStatLearn/

How to Leverage More Algorithms?

For each new algorithm, you need to figure out:

● What library/implementation should you use?

● How do you tune the model?

● How do you prepare the data for the model?

● How do you score new data with this model?

● How do you run it faster and at lower cost?

How to Leverage More Algorithms?

Automated Machine Learning Platform

Having a diverse set of algorithms is key to maximizing accuracy.
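A rough way to see why diversity matters: score several candidate algorithms on the same holdout and note that the winner changes with the data. The sketch below is illustrative only. The three toy learners, the synthetic datasets, and the names `leaderboard` and `holdout_mse` are all assumptions made for this example, not part of any platform.

```python
import math
import random
from statistics import mean


def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    # Ordinary least squares for a single feature.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def holdout_mse(fit, xs, ys, n_train):
    # Fit on the first n_train points, score on the rest.
    model = fit(xs[:n_train], ys[:n_train])
    return mean((model(x) - y) ** 2 for x, y in zip(xs[n_train:], ys[n_train:]))


def leaderboard(xs, ys, n_train=200):
    candidates = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
    return {name: holdout_mse(fit, xs, ys, n_train) for name, fit in candidates.items()}


rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(1200)]
noise = [rng.gauss(0, 0.3) for _ in xs]

linear_scores = leaderboard(xs, [2 * x + e for x, e in zip(xs, noise)])
wavy_scores = leaderboard(xs, [math.sin(12 * x) + e for x, e in zip(xs, noise)])
```

On the linear data the straight-line fit should land near the noise floor, while on the wavy data the nearest-neighbour model should win by a wide margin. Neither learner is best on both.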

Robust Validation

On your own cross-validation framework, evaluate your models using:

- Ranking & accuracy metrics (AUC, Gini, R-Squared, MSE, etc.)

- Lift charts & dual-lift charts

- Feature importance plots

- Partial dependency plots

- Reason codes
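Two of the items above, ranking metrics and lift charts, are simple enough to sketch by hand. The example assumes a binary target and uses the common convention Gini = 2 × AUC − 1. Insurance practice sometimes defines Gini differently for continuous outcomes, so treat this as one convention among several. The labels and scores are made up for illustration.

```python
def auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    return 2 * auc(y_true, scores) - 1


def lift_table(y_true, scores, n_bins=5):
    """Sort by predicted score, cut into bins, return (mean predicted, mean actual) per bin."""
    ranked = sorted(zip(scores, y_true))
    size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        # The last bin absorbs any leftover rows.
        chunk = ranked[b * size:] if b == n_bins - 1 else ranked[b * size:(b + 1) * size]
        predicted = sum(s for s, _ in chunk) / len(chunk)
        actual = sum(y for _, y in chunk) / len(chunk)
        table.append((predicted, actual))
    return table


# Made-up holdout labels and model scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]

model_auc = auc(y_true, scores)
model_gini = gini(y_true, scores)
lift = lift_table(y_true, scores)
```

Plotting the two columns of `lift` against the bin number gives a basic lift chart: a well-calibrated, well-ranked model shows actual rates rising in step with predicted scores.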

This cross-validation framework should be used only for evaluation and not for tuning

Don’t trust the leaderboard. Trust your own cross-validation.

A Lesson from Kaggle

Trust Your Own Cross-Validation
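One way to keep evaluation separate from tuning is nested cross-validation: an inner loop picks hyperparameters using only the training portion, and an outer loop scores the result on folds the tuning never saw. This is a swapped-in technique chosen for illustration, not necessarily the presenter's setup. The k-nearest-neighbour learner, the synthetic data, and all function names are assumptions.

```python
import math
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)


def cv_mse(data, k_neighbors, folds):
    """MSE over all folds; each point is predicted by a model fit without its fold."""
    errors = []
    for fold in folds:
        held = set(fold)
        train = [p for i, p in enumerate(data) if i not in held]
        errors.extend((knn_predict(train, data[i][0], k_neighbors) - data[i][1]) ** 2
                      for i in fold)
    return mean(errors)


def nested_cv(data, candidates, outer_k=5, inner_k=4):
    outer_scores = []
    for fold in k_fold_indices(len(data), outer_k, seed=1):
        held = set(fold)
        train = [p for i, p in enumerate(data) if i not in held]
        test = [data[i] for i in fold]
        # Tuning only ever sees `train`; the outer fold stays untouched.
        inner_folds = k_fold_indices(len(train), inner_k, seed=2)
        best = min(candidates, key=lambda c: cv_mse(train, c, inner_folds))
        outer_scores.append(mean((knn_predict(train, x, best) - y) ** 2 for x, y in test))
    return mean(outer_scores)


rng = random.Random(7)
data = [(x, math.sin(6 * x) + rng.gauss(0, 0.2))
        for x in (rng.uniform(0, 1) for _ in range(120))]

folds = k_fold_indices(len(data), 5)
honest_mse = nested_cv(data, candidates=[1, 5, 15])
```

Because the number of neighbours is chosen inside each outer training set, `honest_mse` is not flattered by the tuning step, which is the same reason a local validation score is more trustworthy than a public leaderboard that has been probed repeatedly.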

Speed

Speed is a limiting factor for:

- Leveraging a large number of features (and data sources)

- Modeling complex types of data

- Using many models to discover the best solution

- Maximizing model accuracy

- Doing robust validation of any model

You Must be Faster than Your Competitors

The #1 Barrier: The Traditional Approach is Hard!

[Diagram: Data Science as the overlap of Math & Stats, Domain Expertise, and Hacking & Coding Skills]

R, Python, Spark, Hadoop

Logistic Regression, GLM, GBM, Random Forest, Decision Trees, Neural Nets, Deep Learning, Text Mining, Feature Engineering, Blending, Cross Validation

Advantages of Automated Machine Learning

1. Time to Value: 10x faster to build and deploy predictive models.

2. Accuracy: Unprecedented accuracy of models “out-of-the-box”.

3. Transparency: Easy to know your model and collaborate on projects.

4. Pervasiveness: Simple UI and workflow for people of various backgrounds to leverage machine learning.

5. Democratization: Not limited to data scientists.

6. Consistency: Best practices in model building, validation and deployment applied consistently in every project.

Summary

- Know your data (with multivariate model insights)

- Leverage a large and diverse set of algorithms

- Apply robust model validation

- Speed is critical!

- Leverage automation as much as possible

Questions?

How to do well at a Kaggle contest

Shea Parkes, FSA, MAAA

Limitations

The views expressed in this presentation are those of the presenter, and not those of Milliman or the Society of Actuaries. Nothing in this presentation is intended to represent a professional opinion or be an interpretation of actuarial standards of practice.

http://blog.kaggle.com/

Focus on a single contest

Join a contest when it begins

Read all contest information

Participate in forums

Participate in notebooks and kernels

Join a team

Spend a ton of time feature engineering

Set up an appropriate validation framework

Use existing implementations of common algorithms

Write code

Use GitHub
