Transcript · 2016-05-11 · 34 pages

Page 1

Building smarter financial applications with open-source projects H2O and Spark

Michal Malohlava @mmalohlava and @h2oai

Spark Saturday 2016/04/30

Page 2

H2O.ai Machine Intelligence

Who Am I?

Background

• PhD in CS from Charles University in Prague, Czech Republic

• Postdoc at Purdue University experimenting with algos for large-scale computation

• Now SW engineer at H2O.ai

Experience with domain-specific languages, distributed systems, software engineering, and big data.

Page 3

H2O team

Co-Founders: Sri Ambati, Cliff Click

Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie

Page 4

H2O

Open-Source In-Memory Data Science Platform

• Highly optimized Java code (in-house)

• Distributed in-memory K-V store and map/reduce computation framework

• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)

• Read/write access to distributed data frames (R/Pandas-style)

• ML algos: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

• REST API with clients: interactive UI, R, Python

Page 5

How are people using H2O?

Page 6

H2O Use Cases: Available Videos and Talks

• Progressive (Auto Insurance, UBI Telematics): Pawan Divarkarla, Chief Data Officer. “H2O is an enabler in how people are thinking about data.”

• Zurich (Commercial Insurance, Risk Analytics): Conor Jensen, Analytics Director. “Advanced analytics was one of the key investments we decided to make.”

• Capital One (Financial Services, Customer Insights): Brendan Herger, Data Scientist. “H2O is the best solution to iterate very quickly on large datasets and produce meaningful models.”

• Nielsen (Digital Marketing, Consumer Behavior): Satya Satyamoorthy, Director, Software Dev. “I am a big fan of open source. H2O is the best fit in terms of cost as well as ease of use and scalability and usability.”

Page 7

H2O Use Cases: Available Videos and Talks

• Marketshare (Digital Marketing, Marketing Optimization): Prateem Mandal, Technical Lead Architect. “H2O gave us the capability to do Big Modeling. There is no limit to scaling in H2O.”

• Kaiser (Healthcare, Advanced Alert Monitoring): Taposh Dutta Roy, Data & Science Manager. Talk: Machine Learning to Save Lives

• PayPal (Financial Services, Customer Churn): Julian Bharadwaj, Data Scientist. Talk: Solving Customer Churn with Machine Learning

• Transamerica (Insurance, Product Recommendation): Vishal Bamba, VP, Strategy & Architecture. Talk: Transamerica Product Recommendation Platform

Page 8

FRAUD PREVENTION

PROBLEM

• Transaction level: needed state-of-the-art ML and statistical models to pre-empt fraudulent behavior

• Account level: needed to monitor account-level activity to identify abusive behavior. Abusive patterns include frequent payments and suspicious profile changes.

• Network level: needed to monitor account-to-account interaction, and frequent money transfers from several accounts into one central account

• 160M records, 1500 features (150 categorical), 0.6TB compressed in HDFS, 800-node Hadoop (CDH3) cluster

• Decision: fraud/not-fraud

WHY H2O

• Feature engineering with Deep Learning to model new and complex attack patterns quickly

• Highly scalable, superior performance, flexible deployment, works seamlessly with other big data frameworks

• Easy to use, enterprise ready, fully featured

“The company estimates that a 1% reduction in fraud results in $1 million savings per month.” – Risk Management, Data Science & Fraud Prevention

IMPACT

• 11% improvement in accuracy

• Every basis point results in a $1M savings monthly

• “Fantastic support from H2O team.” – Risk Mgmt, Data Science & Fraud Prevention

Leading Online Payments Provider

Financial

Page 9

CONSUMER CHURN

PROBLEM

• Predict consumer churn based on behavior patterns and cadence of transactions

• Identify leading indicators and intercept consumer churn before it happens

SOLUTION

• Calculated a daily probability for churn; the goal was early detection of churn

• Trained and scored on entire consumer base

• Implemented on an R, H2O, Hadoop stack

“The penetration of H2O is very focused and growing… I see it increasing tenfold. It’s been so successful that there is now a program built around the output of these ML algorithms.” – Anonymized, Data Scientist

IMPACT

• Significant time savings in building models, from 6-7 hours down to less than 30 minutes

• Expansion: “The inventory of ML projects in 2016 is growing now that people have seen the impact of H2O ML on consumer churn and how successful that has been.” – Anonymized, Data Scientist

Leading Online Payments Provider

Financial

Page 10

RISK ANALYTICS

PROBLEM

• Insurance carrier with 140 years serving 200+ countries worldwide, including 100 years in the U.S.

• Business is based entirely on statistics and probability: the cost of goods sold for their products is unknown, because they don’t know the risks that their customers are actually going to face

• Need to figure out the predictors of risk

• Traditional analytics tools aren’t moving fast enough

“Advanced analytics is one of the top key investments for our company because it’s the key differentiator for insurance companies going into the next couple of decades.” – Anonymized, Analytics Director

WHY OPEN SOURCE

• Did not want to be captive to one toolset; wanted to mix and match different tools

• R and Python integration

• Can take bets on new technology without having to go all-in, with no infrastructure and training investment. If you’re wrong, you haven’t lost as much; it allows you to see how it works before you scale it

WHY H2O

• Visualization to tell the story with the data as they build products for clients

• For recruiting, it was important to have an environment that would attract the right talent

• H2O has a vibrant, growing community

Leading Insurance Provider

Insurance

Page 11

HEALTHCARE: PATIENT MONITORING

Machine Learning to Save Lives

PROBLEM

• Vast amounts of data: 10 million patients

• Highly regulated healthcare industry

• Zero tolerance for failure

• Infrequent occurrence of critical deterioration events among patients

• Patients who undergo an unplanned transfer to the ICU have higher mortality rates than patients directly admitted to the ICU; they represent 25% of all ICU admissions and 20% of all deaths in the hospital

SOLUTION

• Built models to predict the probability of a “patient crash” in patients requiring unplanned transfers to the ICU

• Identifies patients who are likely to crash, intervenes 12 hours before they experience deterioration

IMPACT

• Clinicians receive an alert if a threshold is exceeded to evaluate the patient and determine further course of action

• The results are currently available every 6 hours, but will be configured to calculate the likelihood of critical deterioration on an hourly basis

Leading Healthcare Provider

Healthcare

Page 12

FLEET TELEMATICS: PREVENTIVE MAINTENANCE

PROBLEM

• Fleet telematics: analyze maintenance records and vehicle performance to make predictions on when to do preventive maintenance

• Couldn’t scale by sampling data

• Took days to create models

WHY H2O

• H2O support for customer’s Kerberos authentication mechanism for Hadoop

• Support for MapReduce, YARN, R, Python and Spark in Hadoop

• In-memory, distributed architecture

• Rapid deployment to production with POJO

• Quick prototyping with H2O Flow

IMPACT

“Annual Savings are $7M” – Anonymized, Member Technical Staff

• “When you look at the cost of towing a stranded vehicle, technician loss of productivity, and the customer lifetime value, the annual savings is $7M.” – Anonymized, Lead Member Technical Staff

Leading Mobile Telecom Operator

Telecommunications

Page 13

FRAUD PREVENTION

ISSUE

• Terabytes of data and needed to iterate through modeling quickly

• Diverse and dynamic datasets

WHY OPEN SOURCE

• Wanted to avoid vendor lock-in, which doesn’t allow for rapid growth and innovation

• Have the option of baking something in if it doesn’t exist

• Can poke through source code and see how algos are being run

• Contribute to and grow the community

WHY H2O

“Universally it’s been a one-stop shop that just helps us do all our modeling in one framework.” – Anonymized, Data Scientist

• “The best solution to be able to iterate very quickly on large datasets and produce meaningful models”

• “H2O is enterprise ready and can operate on very large data sets”

• “We evaluated a large number of hard and soft metrics, H2O just scored really well with all of these areas, relative to the machine learning frameworks that are available at the moment”

• H2O Flow allows data scientists to show executives what modeling is occurring

Leading Financial Services Provider

Financial

Page 14

Page 15

H2O + Spark = Sparkling Water

Page 16

Sparkling Water

Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform for building Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Functionality missing in H2O can be supplied by Spark, and vice versa

Page 17

Benefits

Spark

• Additional algorithms

• NLP

• Powerful data munging

• ML Pipelines

H2O

• Advanced algorithms

• speed v. accuracy

• advanced parameters

• Fully distributed and parallelized

• Graphical environment

• R/Python interface

Page 18

How to use Sparkling Water?

Page 19

Model Building

Data Source → Data munging → Modelling → Prediction processing

Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles

Page 20

Data Munging

Data Source → Data load/munging/exploration → Modelling

Page 21

Stream processing

Off-line model training: Data Source, Data munging, Modelling

Export model in a binary format or as code

Stream processing: Data Stream, Spark Streaming/Storm/Flink, Deploy the model, Model prediction
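
The flow on this slide (train off-line, export the model, score inside a stream processor) can be sketched in plain Python. This is a toy illustration under stated assumptions, not H2O's export mechanism: the names `train_offline`, `export_model`, and `score_stream`, the single-cutoff "model", and the 8-byte binary layout are all hypothetical stand-ins.

```python
import struct
from statistics import mean


def train_offline(amounts):
    """Off-line step: 'train' a toy model, here a single cutoff value.

    Hypothetical rule for illustration only: flag anything above 3x the mean.
    """
    return 3 * mean(amounts)


def export_model(cutoff):
    """Export the model in a binary format (one little-endian float64)."""
    return struct.pack("<d", cutoff)


def score_stream(model_bytes, stream):
    """Stream step: load the exported model once, then score each record."""
    (cutoff,) = struct.unpack("<d", model_bytes)
    for amount in stream:
        yield amount, amount > cutoff


# Train on historical data, export, then score records as they arrive.
blob = export_model(train_offline([10.0, 20.0, 30.0]))
for amount, flagged in score_stream(blob, [15.0, 100.0]):
    print(amount, "flagged" if flagged else "ok")
```

In a real deployment the generator loop would be replaced by the stream processor's own per-record operation, and the byte string by whatever artifact the training system exports.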

Page 22

What is inside?

Page 23

Cluster

Driver node: Scala/Py main program, SparkContext, H2OContext

Worker node: Spark executor

Worker node: Spark executor

Worker node: Spark executor

Page 24

Spark Cluster: three Spark Executors, each with H2O Services

Data Source → DataFrame (Spark)

Data Source → H2OFrame (H2O)

h2oContext.asH2OFrame: DataFrame → H2OFrame

h2oContext.asDataFrame: H2OFrame → DataFrame
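
The two conversions named on this slide can be sketched from Python as a round trip. This is a hedged sketch, not the code from the talk: the slide shows the Scala calls, while the Python method names used here (`H2OContext.getOrCreate`, `asH2OFrame`, `asSparkFrame`) are assumed from recent PySparkling releases and may differ in the version you run. The function name `round_trip` is hypothetical.

```python
def round_trip(spark_df):
    """Convert a Spark DataFrame to an H2OFrame and back.

    Assumes pyspark and the Sparkling Water Python package (pysparkling)
    are installed and a Spark session is already running.
    """
    # Imported lazily so this file can be loaded without Spark installed.
    from pysparkling import H2OContext

    hc = H2OContext.getOrCreate()        # start or attach to H2O inside Spark
    h2o_frame = hc.asH2OFrame(spark_df)  # Spark DataFrame -> H2OFrame
    return hc.asSparkFrame(h2o_frame)    # H2OFrame -> Spark DataFrame
```

Calling `round_trip(df)` on an existing Spark DataFrame should return a Spark DataFrame with the same columns; the intermediate H2OFrame is where H2O algorithms would be applied.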

Page 25

DEMO Time!

Page 26

What do we need?

Spark

+ Maven coordinate of Sparkling Water

+ data

+ and a cool machine learning idea!

Page 27

Lending Club

Train two models to help decide on a loan application (accept/decline) and its interest rate

Loan data publicly available

• https://www.lendingclub.com/info/download-data.action

Deploy models as a service

Page 28

H2O Tour

open.h2o.ai

Page 29

More info

Check out H2O.ai Training Books

http://h2o.ai/resources

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai YouTube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

Page 30

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O

Learn more at h2o.ai

Follow us at @h2oai

Page 31

CONSUMER BEHAVIOR ANALYTICS

World’s largest provider of TV and online behavior analytics

PROBLEM

• Has demographic data on every single home: 140M+ homes and 320M+ persons in the US

• Has purchase behavior for 40M+ homes and TV watch behavior for 70M homes. Can identify 15-20M actual homes for ROI and 40M homes for ad targeting

• Predict ROI and ad effectiveness: correlate watch behavior with buy behavior

• Run concurrent analytics on the same dataset

SOLUTION

• Platform uses H2O for its programmatic buying algorithms

• Used H2O’s gradient boosting method for predictions

• Technical stack included standalone H2O cluster for large-scale data munging and scoring

WHY H2O

• Java integration with REST API

• Ease of use and scalability across large-scale data sets

• Open source, ability to pick and choose feature sets

• Ease of development and ease of implementation

• Speed of support, i.e. data munging algorithms

Leading Marketing Analytics Provider

Digital Marketing

Page 32

MARKETING ANALYTICS & OPTIMIZATION

PROBLEM

• Diverse and wide datasets: cluster size of 25 machines with tens of terabytes of data per client

• Analyze current state of marketing budget allocation

• Predict revenue of marketing campaigns

• Make recommendations to improve current projections

SOLUTION

• “Big Modeling” on their own “Predictive Analytics as a Service” platform

• Attribution modeling across multiple channels and TBs of data per customer

• Solution built on a broad stack, including R, H2O, and Hadoop as enablers

“There is no limit to scaling in H2O. The team is amazing.” – Anonymized, Technical Lead Architect

IMPACT

• “The business value we have gained from Advanced Analytics is enormous. Our entire portfolio that deals with Digital Data depends on this, and this is the section that is growing the most and will dominate in the near future.” – Anonymized, Technical Lead Architect

Leading Marketing Analytics Provider

Digital Marketing

Page 33

PRODUCT RECOMMENDATION

PROBLEM

• 100-year-old drugstore leader with 10,000 stores worldwide

• Senior leadership asked for personalization for all customers for many different buying scenarios: sending relevant offers based on past purchase patterns, product recommender for high-frequency purchases using known open source algos to present customers with products at the right time

• Petabytes of data, billions of rows, across dynamic and wide data sets: one of the largest retail data sets in the U.S.

• Could not scale with existing infrastructure

“The time savings that we get is semi-ridiculous. Models that used to take months to build, now takes days, at scale, in lightning speed. This is a game changer.” – Anonymized, Advanced Analytics Manager

WHY OPEN SOURCE

• Access established and growing community of developers

• Agility, speed and flexibility

• Future expansion: pent-up demand to solve many use cases with new H2O infrastructure

WHY H2O

• Direct access to developers for enhancements and deep understanding of how the applications work

• Turned to distributed data systems and in-memory open source applications to tackle the volume of data and advanced modeling on large data sets, at scale, and in production, every day

• Advanced algos in-memory at this size and speed allow the customer to fail fast

Leading Global Retailer

Retail

Page 34

PRODUCT RECOMMENDATION

PROBLEM

• Factoring in multiple product and customer variables to help provide optimal product recommendations

• Quickly leveraging massive data sets in order to improve marketing and sales efforts

SOLUTION

• Established big data stack in Hadoop environment, aggregating data from many disparate data sets

• Employed H2O running in Hadoop cluster

• Enabled analysts to work with R, while leveraging complete data sets in the big data stack

“With H2O, we could continue to work with our existing R environments, but now access all the data sitting in the cluster. This made it easy to harness a wealth of information, while leveraging our existing skills and investments.” – Anonymized, Innovation Executive

IMPACT

• Built and demonstrated product recommendation prototype within a couple of weeks

• Gained insights that can fuel improved product recommendations, fostering improved services and revenues

• Enabled multiple teams of analysts to leverage same tools and datasets, helping spur future innovation across the organization

Leading Insurance Provider

Insurance