2017 Predictive Analytics Symposium - SOA · A panel of Kaggle Masters share their tips on how to...
2017 Predictive Analytics Symposium Session 35, Kaggle Contests--Tips From Actuaries Who Have Placed Well
Moderator: Kyle A. Nobbe, FSA, MAAA
Presenters:
Thomas DeGodoy
Shea Kee Parkes, FSA, MAAA
Predictive Modeling Contests
Tom de Godoy
CTO & Co-founder, DataRobot
● 15 years of experience in Insurance Analytics
● Previously, Director of Research & Modeling at Travelers Insurance
● Advisor to the DataRobot Insurance practice that works with a large number of insurance companies
DataRobot is an Automated Machine Learning Platform
● Founded in 2012, Funding over $100 million
● Experts with 70+ years of insurance analytics experience
● DataRobot’s insurance portfolio includes Fortune 100, Regional players, global players and InsurTechs
This session is based on my experience working with leading insurers and then founding a Machine Learning company that is helping hundreds of companies in their Machine Learning journey.
Why Kaggle?
- Money prizes? Glory?
- Learn Machine Learning by doing it!
- Be part of a large community of data scientists
Why Learn Machine Learning?
open source programming
democratization
Open-Source Innovations ➔ ML driven by open-source and academics
Low-cost computing
Disruptive Competition ➔ New business models around data
Unstructured data
Traditional data
“90% of the data in the world today has been created in the last two years alone”
Avalanche of new data ➔ “Big Data” environment: Velocity, Volume, Variety
Better Product
Better Service
Optimised Operations
Predictive Analytics is a Competition
Keys to Building a Competitive Advantage:
1. The ability to identify opportunities
2. The ability to execute on these opportunities
3. Better predictive models than your competitor
Keys to Winning This Competition
1. Knowledge of the data and of the business problem
2. Large and diverse set of algorithms
3. Robust model validation
4. Speed
Develop Models Better and Faster Than Your Competitors
Know Your Data
Simple ways to “know your data”:
- Data dictionary
- Simple profiles & summaries
- Interactive queries
Insights from machine learning models:
- Identify important features
- Visualize partial dependencies
- Discover non-linear effects and interactions
- Discover prediction outliers and their reasons
Most useful insights about the data come from machine learning models.
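The slides list these model insights without showing how they are computed. As a minimal sketch, assuming a hypothetical black-box `predict` function standing in for any fitted model, permutation importance and a one-feature partial dependence curve can be worked out with the Python standard library alone. The toy data and every name below are illustrative assumptions, not material from the presentation.

```python
import random

def predict(row):
    # Hypothetical stand-in for a fitted model: non-linear in x0,
    # linear in x1, and it ignores x2 entirely.
    x0, x1, x2 = row
    return x0 * x0 + 0.5 * x1

def mse(rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, col, rng):
    """Increase in MSE after shuffling one column (bigger = more important)."""
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(permuted, targets) - mse(rows, targets)

def partial_dependence(rows, col, grid):
    """Average prediction when one column is forced to each grid value."""
    curve = []
    for v in grid:
        forced = [r[:col] + (v,) + r[col + 1:] for r in rows]
        curve.append(sum(predict(r) for r in forced) / len(forced))
    return curve

rng = random.Random(0)
rows = [tuple(rng.uniform(-2, 2) for _ in range(3)) for _ in range(500)]
targets = [predict(r) for r in rows]  # noiseless toy target

importances = [permutation_importance(rows, targets, c, rng) for c in range(3)]
pd_x0 = partial_dependence(rows, 0, [-2, -1, 0, 1, 2])
```

In this toy setup the curve for x0 is U-shaped, which exposes the non-linear effect, and x2 receives zero importance because the stand-in model never uses it.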
Quick Prototype & Rapid Iterations
[Diagram: feedback loop of prototype, socialize, rapid iteration]
Rapid iteration and early socialization are the key!
Leverage a Large and Diverse Set of Algorithms
“For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data”
Source: http://statweb.stanford.edu/~tibs/ElemStatLearn/
How to Leverage More Algorithms?
For each new algorithm, you need to figure out...
● What library/implementation should you use?
● How do you tune the model?
● How do you prepare the data for the model?
● How do you score new data with this model?
● How do you run it faster and at lower cost?
How to Leverage More Algorithms?
Automated Machine Learning Platform
Having a diverse set of algorithms is key to maximizing accuracy.
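One way to keep the per-algorithm questions manageable is to wrap every candidate behind the same fit/predict interface, so that all of them are scored on identical held-out data. The sketch below assumes three deliberately tiny, hand-written regressors (hypothetical stand-ins for real library implementations) and a synthetic dataset; none of it comes from the presentation.

```python
import random

class MeanModel:
    """Baseline: always predicts the training mean."""
    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)
        return self
    def predict(self, x):
        return self.mean

class LinearModel:
    """One-feature ordinary least squares."""
    def fit(self, xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        return self
    def predict(self, x):
        return self.intercept + self.slope * x

class NearestNeighborModel:
    """Averages the targets of the k closest training points."""
    def __init__(self, k=5):
        self.k = k
    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self
    def predict(self, x):
        nearest = sorted(self.points, key=lambda p: abs(p[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

def holdout_mse(model, train, valid):
    """Fit on the training rows, then report MSE on the held-out rows."""
    model.fit([x for x, _ in train], [y for _, y in train])
    return sum((model.predict(x) - y) ** 2 for x, y in valid) / len(valid)

rng = random.Random(1)
# Synthetic curved target: y = x^2 plus a little Gaussian noise.
data = [(x, x * x + rng.gauss(0, 0.1)) for x in (rng.uniform(-2, 2) for _ in range(400))]
train, valid = data[:300], data[300:]

candidates = {"mean": MeanModel(), "linear": LinearModel(), "knn": NearestNeighborModel()}
scores = {name: holdout_mse(m, train, valid) for name, m in candidates.items()}
best = min(scores, key=scores.get)
```

On this curved toy target the nearest-neighbour model scores best; on a linear target the ranking would differ, which is why it helps to try several kinds of algorithm rather than commit to one in advance.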
Robust Validation
On your own cross-validation framework, evaluate your models using:
- Ranking & accuracy metrics (AUC, Gini, R-Squared, MSE, etc.)
- Lift charts & dual-lift charts
- Feature importance plots
- Partial dependency plots
- Reason codes
This cross-validation framework should be used only for evaluation and not for tuning.
Don’t trust the leaderboard. Trust your own cross validation.
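As a rough illustration of such a framework, the sketch below builds K-fold splits, collects out-of-fold predictions, and scores them with AUC and Gini, using the common convention Gini = 2 * AUC - 1. The `fit_predict` function is a hypothetical placeholder for whatever model is being evaluated, and the data are synthetic.

```python
import random

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_predict(train_x, train_y, test_x):
    # Hypothetical model: score each test point by its signed distance from
    # the midpoint between the two class means learned on the training fold.
    mean1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
    mean0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
    mid = (mean0 + mean1) / 2
    sign = 1 if mean1 >= mean0 else -1
    return [sign * (x - mid) for x in test_x]

rng = random.Random(42)
y = [rng.randint(0, 1) for _ in range(600)]
x = [rng.gauss(1.0 if label else 0.0, 1.0) for label in y]

oof = [None] * len(y)  # out-of-fold predictions, one per row
for fold in kfold_indices(len(y), 5, rng):
    held_out = set(fold)
    tr = [i for i in range(len(y)) if i not in held_out]
    preds = fit_predict([x[i] for i in tr], [y[i] for i in tr], [x[i] for i in fold])
    for i, p in zip(fold, preds):
        oof[i] = p

cv_auc = auc(y, oof)
cv_gini = 2 * cv_auc - 1
```

Every row is scored by a model that never saw it during fitting. Any hyperparameter search would need its own inner split so that these folds stay reserved for evaluation, in line with the advice on the slide.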
A Lesson from Kaggle
Trust Your Own Cross-Validation
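One reason behind this advice is that a public leaderboard is typically scored on only a subset of the test rows, so small differences in leaderboard rank can be noise. A back-of-the-envelope sketch, assuming accuracy as the metric and purely hypothetical sample sizes:

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate measured on n rows."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical sizes: a small public-leaderboard subset versus a larger pool
# of out-of-fold predictions from your own cross-validation.
se_public = accuracy_standard_error(0.80, 1_000)
se_cv = accuracy_standard_error(0.80, 25_000)
```

Under these assumed numbers the leaderboard estimate is five times noisier than the cross-validated one, so a gain that shows up only on the leaderboard deserves suspicion.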
Speed
Speed is a limiting factor for:
- Leveraging a large number of features (and data sources)
- Modeling complex types of data
- Using many models to discover the best solution
- Maximizing model accuracy
- Doing robust validation of any model
You Must be Faster than Your Competitors
The #1 Barrier: The Traditional Approach is Hard!
[Diagram: DATA SCIENCE combining Math & Stats, Domain Expertise, and Hacking & Coding Skills]
R, Python, Spark, Hadoop
Logistic Regression, GLM, GBM, Random Forest, Decision Trees, Neural Nets, Deep Learning, Text Mining, Feature Engineering, Blending, Cross Validation
Advantages of Automated Machine Learning
1. Time to Value: 10x faster to build and deploy predictive models.
2. Accuracy: Unprecedented accuracy of models “out-of-the-box”.
3. Transparency: Easy to know your model and collaborate on projects.
4. Pervasiveness: Simple UI and workflow for people of various backgrounds to leverage machine learning.
5. Democratization: Not limited to data scientists.
6. Consistency: Best practices in model building, validation and deployment applied consistently in every project.
Summary
- Know your data (with multivariate model insights)
- Leverage a large and diverse set of algorithms
- Apply robust model validation
- Speed is critical!
- Leverage automation as much as possible
Questions?
How to do well at a Kaggle contest
Shea Parkes, FSA, MAAA
Limitations
The views expressed in this presentation are those of the presenter, and not those of Milliman or the Society of Actuaries. Nothing in this presentation is intended to represent a professional opinion or be an interpretation of actuarial standards of practice.
Focus on a single contest
Join a contest when it begins
Read all contest information
Participate in forums
Participate in notebooks and kernels
Join a team
Spend a ton of time feature engineering
Set up an appropriate validation framework
Use existing implementations of common algorithms
Write code
Use GitHub