Post on 07-Jan-2017
© 2016 MapR Technologies, MapR Confidential
Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform
Mathieu Dumoulin (mdumoulin@mapr.com)
Mateusz Dymczyk (mateusz@h2o.ai)
Hadoop Summit Tokyo 2016
Mathieu Dumoulin, Data Engineer
• Master’s degree in text classification on Hadoop at Fujitsu Canada’s Innovation Lab
• In Tokyo, I’ve worked as Data Scientist, Search Engineer and Data Engineer
• My favorite ML libs are Scikit-Learn and H2O
• I love Japanese food, especially nabe (hot pot) and shabu-shabu.
Mateusz Dymczyk, Software Engineer
• About me
A Machine Learning Pipeline
Image from scikit-learn.org
… Meets the Real World
Data comes from many sources, maybe very large
Does it come from production systems? Is it real time?
Needs ETL and cleaning
Data isn’t always labeled!
Finding the best algorithm and parameters can use a lot of CPU
Must be integrated with a production system
The predictions are used by another system...
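To give a rough sense of why finding the best algorithm and parameters can use a lot of CPU, the sketch below counts how many model trainings an exhaustive grid search with cross-validation would run. The parameter names, values, and fold count are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 10, 20],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_trees": [50, 100, 200, 500],
}
cv_folds = 5  # assumed number of cross-validation folds


def grid_candidates(grid):
    """Return every parameter combination in the grid as a dict."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def total_fits(grid, folds):
    """Number of model trainings an exhaustive grid search performs."""
    return len(grid_candidates(grid)) * folds


if __name__ == "__main__":
    print(len(grid_candidates(param_grid)), "candidate settings")
    print(total_fits(param_grid, cv_folds), "model fits in total")
```

Each extra parameter multiplies the number of candidates, so even a small grid quickly turns into hundreds of training runs, which is one reason a shared cluster becomes attractive.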
Doing Machine Learning here...
Is very different than here
“We don’t have better algorithms, we just have more data.”
Peter Norvig, Director of Research at Google
Machine Learning at scale matters
Growing number of ML use cases at successful companies
• Anomaly Detection
• Customer 360
• Fraud Detection
• Log Security Analysis
• Recommender
• Sensor Data (IoT)
• Personalized Offers
• Ad Tech
ML at scale matters… but it’s HARD
Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline
There must be a better way...
A platform for big data ML
What’s an ideal big data platform for ML?
• ML projects can start simple and show value
• It just works: integrates with existing systems and tools
• Integrates common technologies, not just YARN
• Easy, unified administration
• Shares the cluster (multi-tenancy)
• Keeps your data safe and secure
MapR Converged Data Platform
MapR Converged Data Platform
[Architecture diagram]
• Data Processing: Open Source Engines & Tools; Commercial Engines & Applications (Cloud & Managed Services, Search & Others, Custom Apps)
• Utility-Grade Platform Services: Enterprise Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• Platform capabilities: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy
• Unified Management and Monitoring
Unique MapR features useful for ML
● MapR-FS and NFS mount
● Topologies
● Mirrors and Snapshots
● Reliability
● Multi-tenancy
● Data Governance
● Security
MapR MCS
● Unified view
● Easy use of features
● REST API and maprcli utility
NFS Mount
Mount the cluster as a regular folder
$> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr
$> ll /mapr/hadoopsummit/
total 3
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 apps
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbase
drwxr-xr-x. 3 root root 1 Oct 13 11:21 installer
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 opt
drwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmp
drwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 user
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var
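Because the mounted cluster behaves like a regular folder, ordinary file I/O should be enough to read and write data on it. The following is a minimal sketch of that idea in Python, assuming the mount point and cluster name shown in the listing above; the file name `training.csv` is hypothetical.

```python
import csv
from pathlib import Path

# Assumption: the cluster is mounted as in the listing above. The file name
# is hypothetical and only illustrates the idea.
CLUSTER_ROOT = Path("/mapr/hadoopsummit")
TRAINING_FILE = CLUSTER_ROOT / "user" / "mapr" / "training.csv"


def write_rows(path, fieldnames, rows):
    """Write dict rows to a CSV file using ordinary file I/O."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file into a list of dicts (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    if TRAINING_FILE.exists():
        rows = read_rows(TRAINING_FILE)
        print(f"Loaded {len(rows)} rows from {TRAINING_FILE}")
    else:
        print(f"{TRAINING_FILE} not found; is the cluster mounted?")
```

Nothing in this code is specific to Hadoop or MapR, which is the point: tools that only know how to read local files can work on cluster data through the mount.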
MapR NFS and Volumes
[mapr@ip-10-0-0-110 mapr]$ pwd
/mapr/hadoopsummit/user/mapr
Match a Volume to a Topology
Match data to nodes or groups of nodes precisely
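As a rough illustration of matching a volume to a topology, the sketch below only assembles a `maprcli volume create` command line and prints it without running anything. The volume name, mount path, and topology path are hypothetical, and the `-name`, `-path`, and `-topology` options reflect my understanding of the maprcli utility; check them against the documentation for your MapR version before use.

```python
import shlex


def volume_create_command(name, mount_path, topology):
    """Build the argument list for creating a volume on a given topology.

    Assumption: `maprcli volume create` accepts -name, -path and -topology.
    """
    return [
        "maprcli", "volume", "create",
        "-name", name,
        "-path", mount_path,
        "-topology", topology,
    ]


if __name__ == "__main__":
    # Hypothetical values: a volume for training data placed on a group of
    # nodes registered under the topology path /data/fast-nodes.
    argv = volume_create_command(
        name="ml-training",
        mount_path="/user/mapr/ml-training",
        topology="/data/fast-nodes",
    )
    # Print the command instead of executing it, so it can be reviewed first.
    print(" ".join(shlex.quote(part) for part in argv))
```

Building the argument list separately from execution keeps the sketch safe to run anywhere; on a real cluster the list could be passed to `subprocess.run` by an administrator.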
CRISP-DM Model
● Industry Standard Model
● Full project view, from business idea to production deployment
● Realistic: lots of cycles
MapR Features for Data Understanding
Data Collection:
• NFS Mount
• POSIX Client
• MapR Streams (Kafka API)
• MapR-DB (HBase API)
Data Exploration:
• <Insert your favorite tool>
MapR Features for Data Preparation
Data Cleaning and Feature Engineering (ETL):
• NFS Mount, POSIX Client
• Snapshots
• StreamSets Data Collector w/ MapR support
• Apache Spark
• <Your favorite tool>
MapR Features for Modeling
MapR does not “do” machine learning, that’s your job!
• MapR Filesystem
• NFS mount/POSIX client
• Mirrors and Snapshots
• Topologies
• Use your existing tools
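One way the snapshots and NFS mount listed above could support modeling is to train against a read-only snapshot, so that a run can be repeated later on exactly the same data. The sketch below assumes that snapshots are exposed under a `.snapshot` directory at the volume's mount point, which is my understanding of the MapR layout and should be verified on your cluster; the volume path, snapshot name, and file name are hypothetical.

```python
from pathlib import Path


def snapshot_view(volume_mount, snapshot_name, relative_path):
    """Path of a file as it existed when the named snapshot was taken.

    Assumption: snapshots appear under <volume mount>/.snapshot/<name>/.
    """
    return Path(volume_mount) / ".snapshot" / snapshot_name / relative_path


def read_training_lines(volume_mount, snapshot_name, relative_path):
    """Read the training file from the snapshot rather than the live volume."""
    path = snapshot_view(volume_mount, snapshot_name, relative_path)
    return path.read_text().splitlines()


if __name__ == "__main__":
    # Hypothetical volume mount, snapshot name and file name.
    pinned = snapshot_view(
        "/mapr/hadoopsummit/user/mapr/ml-training",
        "before-model-v1",
        "train.csv",
    )
    print("Training data would be read from:", pinned)
```

Recording the snapshot name next to the trained model gives a simple way to trace which data a model was built from.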
MapR Features for Evaluation
- Collect data
- Explore data
• MapR-FS
• Mirrors
• Snapshots
• Support any tools
MapR Features for Deployment
- Collect data
- Explore data
• NFS/POSIX client
• Mirrors
• Snapshots
• Microservice model *
• MapR-DB, MapR Streams
• Security
* Check out the converged application blueprint: https://www.mapr.com/appblueprint/overview
Converged Data Platform Machine Learning
• Features that work together to support all phases of real production ML
• Supports all the tools you know and the state-of-the-art frameworks
• Easier to manage, more robust and secure
• MapR is made for the enterprise
Demo: ML with H2O on MapR
Demo of H2O on MapR: Features in Action
Q & A
Engage with us!
@mapr
mdumoulin@mapr.com
mapr-technologies