Michal Malohlava @mmalohlava and @h2oai Building smarter … · 2016-05-11 · WHY H2O FRAUD...
Building smarter financial applications with open-source projects H2O and Spark
Michal Malohlava @mmalohlava and @h2oai
Spark Saturday 2016/04/30
H2O.ai Machine Intelligence
Who Am I?
Background
• PhD in CS from Charles University in Prague, Czech Republic
• Postdoc at Purdue University experimenting with algos for large-scale computation
• Now SW engineer at H2O.ai
Experience with domain-specific languages, distributed systems, software engineering, and big data.
H2O team
Co-Founders: Sri Ambati, Cliff Click
Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie
H2O
Open-Source In-Memory Data Science Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API: clients Interactive UI/R/Python
How are people using H2O?
H2O Use Cases: Available Videos and Talks
• Progressive (Auto Insurance, UBI Telematics): Pawan Divarkarla, Chief Data Officer. “H2O is an enabler in how people are thinking about data.”
• Zurich (Commercial Insurance, Risk Analytics): Conor Jensen, Analytics Director. “Advanced analytics was one of the key investments we decided to make.”
• Capital One (Financial Services, Customer Insights): Brendan Herger, Data Scientist. “H2O is the best solution to iterate very quickly on large datasets and produce meaningful models.”
• Nielsen (Digital Marketing, Consumer Behavior): Satya Satyamoorthy, Director, Software Dev. “I am a big fan of open source. H2O is the best fit in terms of cost as well as ease of use and scalability and usability.”
H2O Use Cases: Available Videos and Talks
• Marketshare (Digital Marketing, Marketing Optimization): Prateem Mandal, Technical Lead Architect. “H2O gave us the capability to do Big Modeling. There is no limit to scaling in H2O.”
• Kaiser (Healthcare, Advanced Alert Monitoring): Taposh Dutta Roy, Data & Science Manager. Talk: Machine Learning to Save Lives
• PayPal (Financial Services, Customer Churn): Julian Bharadwaj, Data Scientist. Talk: Solving Customer Churn with Machine Learning
• Transamerica (Insurance, Product Recommendation): Vishal Bamba, VP, Strategy & Architecture. Talk: Transamerica Product Recommendation Platform
FRAUD PREVENTION
Leading Online Payments Provider
Financial
PROBLEM
• Transaction level: needed state-of-the-art ML and statistical models to pre-empt fraudulent behavior
• Account level: needed to monitor account-level activity to identify abusive behavior. Abusive patterns include frequent payments and suspicious profile changes.
• Network level: needed to monitor account-to-account interaction and frequent money transfers from several accounts into one central account
• 160M records, 1500 features (150 categorical), 0.6TB compressed in HDFS, 800-node Hadoop (CDH3) cluster
• Decision: fraud/not-fraud
WHY H2O
• Feature engineering with Deep Learning to model new and complex attack patterns quickly
• Highly scalable, superior performance, flexible deployment, works seamlessly with other big data frameworks
• Easy to use, enterprise ready, fully featured
IMPACT
• 11% improvement in accuracy
• Every basis point results in a $1M savings monthly
• “Fantastic support from H2O team.” – Risk Mgmt, Data Science & Fraud Prevention
“The company estimates that a 1% reduction in fraud results in $1 million savings per month.” – Risk Management, Data Science & Fraud Prevention
CONSUMER CHURN
Leading Online Payments Provider
Financial
PROBLEM
• Predict consumer churn based on behavior patterns and cadence of transactions
• Identify leading indicators and intercept consumer churn before it happens
SOLUTION
• Calculated a daily probability of churn; the goal was early detection of churn
• Trained and scored on entire consumer base
• Implemented on an R, H2O, Hadoop stack
IMPACT
• Significant time savings in building models, from 6-7 hours down to less than 30 minutes
• Expansion: “The inventory of ML projects in 2016 is growing now that people have seen the impact of H2O ML on consumer churn and how successful that has been.” – Anonymized, Data Scientist
“The penetration of H2O is very focused and growing… I see it increasing tenfold. It’s been so successful that there is now a program built around the output of these ML algorithms.” – Anonymized, Data Scientist
RISK ANALYTICS
Leading Insurance Provider
Insurance
PROBLEM
• Insurance carrier with 140 years serving 200+ countries worldwide, including 100 years in the U.S.
• Business is based entirely on statistics and probability: since the cost of goods sold for their products is unknown, they don’t know the risks that their customers are actually going to face
• Need to figure out the predictors of risk
• Traditional analytics tools aren’t moving fast enough
WHY H2O / WHY OPEN SOURCE
• Did not want to be captive to one toolset, wanted to mix and match different tools
• R and Python integration
• Can take bets on new technology without having to go all-in, no infrastructure and training investment. If you’re wrong, you haven’t lost as much; it allows you to see how it works before you scale it
• Visualization to tell the story with the data as they build products for clients
• For recruiting, it was important to have an environment that would attract the right talent
• H2O has a vibrant growing community
“Advanced analytics is one of the top key investments for our company because it’s the key differentiator for insurance companies going into the next couple of decades.” – Anonymized, Analytics Director
HEALTHCARE: PATIENT MONITORING
Machine Learning to Save Lives
Leading Healthcare Provider
Healthcare
PROBLEM
• Vast amounts of data: 10 million patients
• Highly regulated healthcare industry
• Zero tolerance for failure
• Infrequent occurrence of critical deterioration events among patients
• Patients who undergo an unplanned transfer to the ICU have higher mortality rates than patients directly admitted to the ICU; they represent 25% of all ICU admissions and 20% of all deaths in the hospital
SOLUTION / IMPACT
• Built models to predict the probability of a “patient crash” in patients requiring unplanned transfers to the ICU
• Identifies patients who are likely to crash, intervenes 12 hours before they experience deterioration
• Clinicians receive an alert if a threshold is exceeded, so they can evaluate the patient and determine further course of action
• The results are currently available every 6 hours, but will be configured to calculate the likelihood of critical deterioration on an hourly basis
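The threshold-based alert described above could look roughly like the sketch below. It is an illustration only, not the provider's system: the threshold value, the patient IDs, and the function name `patients_to_alert` are hypothetical, and the probabilities are made-up stand-ins for the output of a trained model.

```python
from typing import Dict, List

# Hypothetical threshold; the slide does not state the real value.
ALERT_THRESHOLD = 0.8


def patients_to_alert(crash_probability: Dict[str, float],
                      threshold: float = ALERT_THRESHOLD) -> List[str]:
    """Return the IDs of patients whose predicted probability exceeds the threshold."""
    return sorted(pid for pid, p in crash_probability.items() if p > threshold)


# Made-up scores standing in for one scoring run
# (the slide mentions that results are available every 6 hours).
scores = {"patient-a": 0.12, "patient-b": 0.91, "patient-c": 0.85}
alerts = patients_to_alert(scores)
print(alerts)
```

The comparison is strict, so a score exactly at the threshold raises no alert; whether the real system treats that boundary the same way is not stated in the talk.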
FLEET TELEMATICS: PREVENTIVE MAINTENANCE
Leading Mobile Telecom Operator
Telecommunications
PROBLEM
• Fleet telematics: analyze maintenance records and vehicle performance to make predictions on when to do preventive maintenance
• Couldn’t scale by sampling data
• Took days to create models
WHY H2O
• H2O support for customer’s Kerberos authentication mechanism for Hadoop
• Support for MapReduce, YARN, R, Python and Spark in Hadoop
• In-memory, distributed architecture
• Rapid deployment to production with POJO
• Quick prototyping with H2O Flow
IMPACT
• “Annual Savings are $7M” – Anonymized, Member Technical Staff
• “When you look at the cost of towing a stranded vehicle, technician loss of productivity, and the customer lifetime value, the annual savings is $7M.” – Anonymized, Lead Member Technical Staff
FRAUD PREVENTION
Leading Financial Services Provider
Financial
ISSUE
• Terabytes of data and needed to iterate through modeling quickly
• Diverse and dynamic datasets
WHY OPEN SOURCE
• Wanted to avoid vendor lock-in, which doesn’t allow for rapid growth and innovation
• Have the option of baking something in if it doesn’t exist
• Can poke through source code and see how algos are being run
• Contribute to and grow the community
WHY H2O
• “The best solution to be able to iterate very quickly on large datasets and produce meaningful models”
• “H2O is enterprise ready and can operate on very large data sets”
• “We evaluated a large number of hard and soft metrics, H2O just scored really well with all of these areas, relative to the machine learning frameworks that are available at the moment”
• H2O Flow allows data scientists to show executives what modeling is occurring
“Universally it’s been a one-stop shop that just helps us do all our modeling in one framework.” – Anonymized, Data Scientist
H2O + Spark = Sparkling Water
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa
Benefits
Spark
• Additional algorithms
• NLP
• Powerful data munging
• ML Pipelines
H2O
• Advanced algorithms
• speed v. accuracy
• advanced parameters
• Fully distributed and parallelized
• Graphical environment
• R/Python interface
How to use Sparkling Water?
Model Building
Data Source
Data munging
Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Prediction processing
Data Munging
Data Source
Data load/munging/exploration
Modelling
Stream processing
Off-line model training: Data Source, Data munging, Modelling
Deploy the model: Export model in a binary format or as code
Stream processing: Data Stream (Spark Streaming/Storm/Flink), Model prediction
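The split between off-line training, model export, and stream-side prediction can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the "model" is a single cut-off value, JSON stands in for the binary/code export mentioned on the slide, and none of the function names (`train_offline`, `export_model`, `score_stream`) come from H2O, Spark Streaming, Storm, or Flink.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def train_offline(rows: List[Tuple[float, int]]) -> Dict[str, float]:
    """Toy 'training': use the midpoint between the two class means as a cut-off."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return {"cutoff": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def export_model(model: Dict[str, float]) -> str:
    """Stand-in for model export; real H2O export formats are different."""
    return json.dumps(model)


def score_stream(exported: str, events: Iterable[float]) -> Iterator[Tuple[float, int]]:
    """Load the exported model once, then score incoming events one at a time."""
    model = json.loads(exported)
    for value in events:
        yield value, int(value > model["cutoff"])


history = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]  # made-up training data
exported = export_model(train_offline(history))
predictions = list(score_stream(exported, iter([0.5, 7.5, 5.0])))
print(predictions)
```

The point of the shape is that the stream side depends only on the exported artifact, not on the training code or the training cluster.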
What is inside?
Cluster
Driver node: Scala/Py main program, SparkContext, H2OContext
Worker node: Spark executor
Worker node: Spark executor
Worker node: Spark executor
[Diagram: a Spark cluster of three Spark executors, each running H2O services. Data loaded from a Data Source is held either as a Spark DataFrame or as an H2OFrame; h2oContext.asH2OFrame and h2oContext.asDataFrame convert between the two.]
DEMO Time!
What do we need?
Spark
+ maven coordinate of Sparkling Water
+ data
+ And some cool machine learning idea!
Lending Club
Train 2 models that help decide on a loan application (accept/decline) and its interest rate
Loan data publicly available
• https://www.lendingclub.com/info/download-data.action
Deploy models as a service
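One way the two models could be combined behind a single service call is sketched below. Everything here is an assumption made for illustration: the feature names, the `max_risk` cut-off, and the two toy functions standing in for the trained models are hypothetical and are not taken from the demo.

```python
from typing import Callable, Dict

Application = Dict[str, float]


def decide(application: Application,
           bad_loan_probability: Callable[[Application], float],
           interest_rate: Callable[[Application], float],
           max_risk: float = 0.5) -> Dict[str, object]:
    """Decline applications the first model considers too risky; otherwise quote a rate."""
    risk = bad_loan_probability(application)
    if risk > max_risk:
        return {"accepted": False, "interest_rate": None}
    return {"accepted": True, "interest_rate": interest_rate(application)}


# Toy stand-ins for the two trained models (hypothetical formulas).
def toy_risk_model(app: Application) -> float:
    return min(1.0, app["loan_amount"] / (app["annual_income"] + 1.0))


def toy_rate_model(app: Application) -> float:
    return 5.0 + 10.0 * toy_risk_model(app)


good = decide({"loan_amount": 10000.0, "annual_income": 49999.0},
              toy_risk_model, toy_rate_model)
risky = decide({"loan_amount": 40000.0, "annual_income": 49999.0},
               toy_risk_model, toy_rate_model)
print(good, risky)
```

Passing the models in as callables keeps the decision logic independent of how each model is trained or served.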
H2O Tour
open.h2o.ai
Check out H2O.ai Training Books
http://h2o.ai/resources
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
Learn more at h2o.ai Follow us at @h2oai
Thank you!
Sparkling Water is an open-source ML application platform combining the power of Spark and H2O
CONSUMER BEHAVIOR ANALYTICS
Leading Marketing Analytics Provider
Digital Marketing
World’s largest provider of TV and online behavior analytics
PROBLEM
• Has demographic data on every single home: 140M+ homes and 320M+ persons in the US
• Has purchase behavior for 40M+ homes + TV watch behavior for 70M homes. Can identify 15-20M actual homes for ROI + 40M homes for ad targeting
• Predict ROI and ad effectiveness: correlate watch behavior with buy behavior
• Run concurrent analytics on the same dataset
SOLUTION
• Platform uses H2O for its programmatic buying algorithms
• Used H2O’s gradient boosting method for predictions
• Technical stack included standalone H2O cluster for large-scale data munging and scoring
WHY H2O
• Java integration with REST API
• Ease of use and scalability across large-scale data sets
• Open source, ability to pick and choose feature sets
• Ease of development and ease of implementation
• Speed of support, i.e. data munging algorithms
MARKETING ANALYTICS & OPTIMIZATION
Leading Marketing Analytics Provider
Digital Marketing
PROBLEM
• Diverse and wide datasets: cluster size of 25 machines with tens of terabytes of data per client
• Analyze current state of marketing budget allocation
• Predict revenue of marketing campaigns
• Make recommendations to improve current projections
SOLUTION
• “Big Modeling” on their own “Predictive Analytics as a Service” platform
• Attribution modeling across multiple channels and TBs of data per customer
• Solution built on a broad stack, including R, H2O, and Hadoop as enablers
IMPACT
• “The business value we have gained from Advanced Analytics is enormous. Our entire portfolio that deals with Digital Data depends on this, and this is the section that is growing the most and will dominate in the near future.” – Anonymized, Technical Lead Architect
“There is no limit to scaling in H2O. The team is amazing.” – Anonymized, Technical Lead Architect
PRODUCT RECOMMENDATION
Leading Global Retailer
Retail
PROBLEM
• 100 year old drugstore leader with 10,000 stores worldwide
• Senior leadership asked for personalization for all customers for many different buying scenarios: sending relevant offers based on past purchase patterns, product recommender for high frequency purchases using known open source algos to present customers with products at the right time
• Petabytes of data, billions of rows, across dynamic and wide data sets, one of the largest retail data sets in the U.S.
• Could not scale with existing infrastructure
WHY OPEN SOURCE
• Access established and growing community of developers
• Agility, speed and flexibility
• Future expansion: pent up demand to solve many use cases with new H2O infrastructure
WHY H2O
• Direct access to developers for enhancements and deep understanding of how the applications work
• Turned to distributed data systems and in-memory open source applications to tackle the volume of data and advanced modeling on large data sets, at scale, and in production, every day
• Advanced algos in-memory at this size and speed allows customer to fail fast
“The time savings that we get is semi-ridiculous. Models that used to take months to build, now takes days – at scale, in lightning speed. This is a game changer.” – Anonymized, Advanced Analytics Manager
PRODUCT RECOMMENDATION
Leading Insurance Provider
Insurance
PROBLEM
• Factoring in multiple product and customer variables to help provide optimal product recommendations
• Quickly leveraging massive data sets in order to improve marketing and sales efforts
SOLUTION
• Established big data stack in Hadoop environment, aggregating data from many disparate data sets
• Employed H2O running in Hadoop cluster
• Enabled analysts to work with R, while leveraging complete data sets in the big data stack
IMPACT
• Built and demonstrated product recommendation prototype within a couple weeks
• Gained insights that can fuel improved product recommendations, fostering improved services and revenues
• Enabled multiple teams of analysts to leverage same tools and datasets, helping spur future innovation across the organization
“With H2O, we could continue to work with our existing R environments, but now access all the data sitting in the cluster. This made it easy to harness a wealth of information, while leveraging our existing skills and investments.” – Anonymized, Innovation Executive