End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
Jorge Martinez, May 20th, 2015
About me
FPGAs (Barcelona, 2009)
Embedded software, GPUs (Barcelona, 2011)
Distributed systems and ML (SF, 2013)
@jorgemarsal
http://jorgemarsal.github.io
Horizontal scaling
The shift from BI to Data Science
Happens!
https://www.youtube.com/watch?v=vbb-AjiXyh0
Predictive analytics applications
Marketing
Sales
Logistics
Risk
Customer support
Human resources
…
Healthcare
Consumer financial
Retail
Insurance
Life sciences
Travel
…
Predictive analytics workflow
Stages: Build Models, Evaluate Models, Deploy Models (In-DB or Web), BI Integration
1. Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB).
2. Build and evaluate predictive models on large datasets using Distributed R.
3. Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications. Alternatively, deploy the model as a web service.
© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Training predictive models
Data scientists' preferred languages: R & SQL
Adoption of R increased across industries
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
R is …
“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” - Bo Cowgill, Google
R is …
• Popular
• Open source
• Flexible
• Extensible
But also:
• Not scalable
• No parallel algorithms
• Limited pre/post processing
Horizontal scaling
“The future has arrived, it’s just not evenly distributed yet” - William Gibson
Ship code to data, functional programming
Distributed R
A new enterprise-class predictive analytics platform
A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source
Use familiar GUIs and packages
Analyze data too large for vanilla R
Leverage multiple nodes for distributed processing
Vastly improved performance
Distributed R: architecture
Master
• Schedules tasks across the cluster
• Sends commands/code to workers
Workers
• Hold data partitions
• Apply functions to data partitions in parallel
Distributed R: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data frames and lists
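To make user-defined partitioning concrete, here is a minimal Python sketch of an array stored as explicit blocks. It is an illustration only, not the Distributed R `darray` API: the class `PartitionedArray` and its methods are hypothetical names, and all blocks live in one process rather than on separate workers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionedArray:
    """Hypothetical stand-in for a distributed array: a list of blocks."""
    partitions: List[List[float]]

    @classmethod
    def from_list(cls, data, block_size):
        # The caller picks the block size, mirroring user-defined partitioning.
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        return cls(blocks)

    def npartitions(self):
        # Number of blocks the data was split into.
        return len(self.partitions)

    def collect(self):
        # Gather every partition back into one local list.
        return [x for block in self.partitions for x in block]


arr = PartitionedArray.from_list([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2)
```

In a real cluster each block would be held by a worker, and `collect` would be the expensive step that moves data back to the master.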
Distributed R: Distributed code
foreach: f(x)
• Express computations over partitions
• Execute across the cluster
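The foreach idea, applying a function f(x) to each partition and then combining the small results, can be sketched in Python as follows. This is an analogy under stated assumptions, not Distributed R code: `foreach_partition` is a hypothetical helper, and a local thread pool stands in for the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach_partition(partitions, fn, max_workers=4):
    """Apply fn to every partition concurrently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, partitions))


# Hypothetical data already split into three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

# Step 1: each worker computes a partial sum over its own partition.
partial_sums = foreach_partition(partitions, sum)

# Step 2: the "master" combines the small partial results.
total = sum(partial_sums)
```

Only the per-partition results travel back to the caller, which is the point of shipping code to the data rather than the other way around.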
Distributed R: Built-in distributed algorithms
• Similar signature and accuracy to the corresponding R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes
Algorithm | Use cases
Linear Regression (GLM) | Risk analysis, trend analysis, etc.
Logistic Regression (GLM) | Customer response modeling, healthcare analytics (disease analysis)
Random Forest | Customer churn, market campaign analysis
K-Means Clustering | Customer segmentation, fraud detection, anomaly detection
Page Rank | Identify influencers
Predicting March Madness results using ML
• Load data from Vertica/HDFS/Local FS
• Optionally add additional data using IDOL OnDemand APIs
• Train model using Random Forest in Distributed R
• Deploy model to Vertica or as a web service
Model training
• Use team and opponent features to train a model (Blocks, steals, assists …)
• Learn what’s important in a basketball game (using Random Forest) and use that knowledge to predict game results.
Parallel Random Forest
• Random Forest – building an ensemble of deep decision trees.
• Each tree is created with a different subset of the data and different features to generalize better.
• E.g. build 100 decision trees on 4 machines. Each machine builds 25 decision trees.
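The split of 100 trees over 4 machines can be sketched in Python as below. Several simplifications are assumed and are not from the talk: threads stand in for machines, each "tree" is a one-split stump on a single random feature rather than a deep decision tree, and every function name and the tiny dataset are hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_trees(n_trees, n_workers):
    """Divide n_trees as evenly as possible among n_workers."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def train_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample using one random feature."""
    idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
    feature = rng.randrange(len(rows[0]))            # random feature choice
    values = sorted(rows[i][feature] for i in idx)
    threshold = values[len(values) // 2]             # split at the median
    left = Counter(labels[i] for i in idx if rows[i][feature] < threshold)
    right = Counter(labels[i] for i in idx if rows[i][feature] >= threshold)
    fallback = Counter(labels[i] for i in idx).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else fallback
    right_label = right.most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def build_trees(job):
    """Work done by one machine: build its share of the forest."""
    rows, labels, n_trees, seed = job
    rng = random.Random(seed)
    return [train_stump(rows, labels, rng) for _ in range(n_trees)]


def predict(forest, row):
    """Majority vote over all trees in the combined forest."""
    votes = Counter(lo if row[f] < t else hi for f, t, lo, hi in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per game, one label per game.
rows = [[1.0, 9.0], [2.0, 8.0], [8.0, 2.0], [9.0, 1.0]]
labels = ["loss", "loss", "win", "win"]

shares = split_trees(100, 4)  # 25 trees per worker
jobs = [(rows, labels, n, seed) for seed, n in enumerate(shares)]
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = [tree for part in pool.map(build_trees, jobs) for tree in part]
```

Because the trees are independent of each other, the workers never need to communicate during training; the forest is simply the concatenation of their outputs.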
Game result prediction
• Group games by teams and get the average of each team’s features
• Predict the result of a game using the model
• Fill out the bracket by predicting one game at a time
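The prediction steps above can be sketched in Python. Everything here is illustrative: `toy_model` is a placeholder for the trained Random Forest, the feature values are invented for the example, and the bracket helper assumes the number of teams is a power of two.

```python
from collections import defaultdict


def average_team_features(games):
    """games: list of (team, feature_dict). Returns team -> mean of each feature."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for team, feats in games:
        counts[team] += 1
        for name, value in feats.items():
            sums[team][name] += value
    return {
        team: {name: total / counts[team] for name, total in feats.items()}
        for team, feats in sums.items()
    }


def toy_model(team_feats, opp_feats):
    """Placeholder for the trained model: True if the first team should win."""
    return sum(team_feats.values()) >= sum(opp_feats.values())


def fill_bracket(teams, averages, model):
    """Predict one game at a time; winners advance until one team remains."""
    rounds = []
    while len(teams) > 1:
        winners = []
        for a, b in zip(teams[0::2], teams[1::2]):
            winners.append(a if model(averages[a], averages[b]) else b)
        rounds.append(winners)
        teams = winners
    return rounds


# Invented per-game statistics for four hypothetical teams.
games = [
    ("A", {"blocks": 4, "steals": 6}),
    ("A", {"blocks": 6, "steals": 8}),
    ("B", {"blocks": 3, "steals": 5}),
    ("C", {"blocks": 7, "steals": 9}),
    ("D", {"blocks": 2, "steals": 2}),
]
averages = average_team_features(games)
bracket = fill_bracket(["A", "B", "C", "D"], averages, toy_model)
```

Each round's winners feed the next round, so the whole bracket comes from repeated single-game predictions.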
Demo
Distributed R: summary
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!
That’s cool… what can I do with it?
• Collaborate
  • GitHub (report issues, send PRs): https://github.com/vertica/DistributedR
  • Standardization with R-core: http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
• Buy commercial support
Publishing the model as a Web Service
Steps
• Expose R as a web service using OpenCPU (https://www.opencpu.org/)
• Create a web app that makes predictions using that service
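A web app could call such a service over HTTP. The Python sketch below builds a POST request against OpenCPU's `/ocpu/library/<package>/R/<function>/json` endpoint, which calls an R function and returns its result as JSON. The server address, the package name `marchmadness`, the function name `predict_game`, and the argument names are all hypothetical placeholders, not details from the talk.

```python
import json
import urllib.request


def build_prediction_request(base_url, package, function, features):
    """Build a POST request for OpenCPU's /ocpu/library/<pkg>/R/<fn>/json endpoint."""
    url = f"{base_url.rstrip('/')}/ocpu/library/{package}/R/{function}/json"
    body = json.dumps(features).encode("utf-8")  # function arguments as JSON
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request, timeout=10):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical server, package, function, and argument names.
req = build_prediction_request(
    "http://localhost:5656",
    "marchmadness",
    "predict_game",
    {"team": "A", "opponent": "B"},
)
```

Calling `predict(req)` would perform the actual request; it is left out of the example because it requires an OpenCPU server with the model package installed.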
Demo
Conclusions
HP Haven Big Data Platform
Turn 100% of your data into action.
Human Data, Business Data, Machine Data
Powering Big Data Analytics to Applications
Insight
Haven OnDemand
• Vertica OnDemand
• IDOL OnDemand
Haven Enterprise
• Vertica Enterprise
• IDOL Enterprise
• Vertica for SQL on Hadoop
• Vertica Distributed R
• KeyView
HP Haven Ecosystem for Developers
Haven OnDemand
• IDOLOnDemand.com
• VerticaOnDemand.com
Developer Community
• Ask questions
• Learn through code tutorials, events…
• Share ideas
• Code libraries and quick-starts
• Let us know if something is broken
HP Haven Marketplace
• Downloads for apps, extensions, plugins, widgets for HP Haven
• HP and 3rd Party developer apps
• Promote your app
• New revenue streams