Big Data Heterogeneous Mixture Learning on Spark
Masato Asahara and Ryohei Fujimaki
NEC Data Science Research Labs.
Jun/30/2016 @Hadoop Summit 2016
© NEC Corporation 2016
Who We Are
▌Masato Asahara (Ph.D.)
Researcher, NEC Data Science Research Laboratory
Masato received his Ph.D. from Keio University and has worked at NEC for six years in distributed computing systems and computing-resource management technologies. He currently leads development of Spark-based machine learning and data analytics systems, particularly those using NEC's Heterogeneous Mixture Learning technology.
▌Ryohei Fujimaki (Ph.D.)
Research Fellow, NEC Data Science Research Laboratory
Ryohei is responsible for the core technologies behind NEC's leading predictive and prescriptive analytics solutions, including "heterogeneous mixture learning" and "predictive optimization". In addition to technology R&D, he co-develops cutting-edge analytics solutions with NEC's global business clients and partners in North America and the APAC region. Ryohei received his Ph.D. from the University of Tokyo and became the youngest research fellow in NEC Corporation's 115-year history.
Agenda
1. NEC's Predictive Analytics and Heterogeneous Mixture Learning (HML)
2. Distributed HML on Spark
3. Benchmark Performance Evaluations
4. Evaluation in a Real Case
5. NEC's HML Data Analytics Appliance powered by Micro Modular Server
NEC’s Predictive Analytics and Heterogeneous Mixture Learning
Enterprise Applications of HML
Driver Risk Assessment
Inventory Optimization
Churn Retention
Predictive Maintenance
Product Price Optimization
Sales Optimization
Energy/Water Operation Management
NEC's Heterogeneous Mixture Learning (HML)
NEC's machine learning technology automatically derives accurate and transparent formulas behind Big Data: given heterogeneous mixture data, it explores massive candidate formulas and produces a transparent data segmentation with a predictive formula per segment (e.g., Y = a1x1 + ... + anxn for one segment, Y = b1x2 + ... + bnxn for another).
The HML Model
A rule-based tree segments the data (e.g., by gender, then by height tall/short), and each segment gets its own predictive formula over its own selected features (age, smoking, food, weight, sports, tea/coffee, beer/wine). For example, the health risk of a tall man can be predicted by age, alcohol, and smoking.
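The model above can be sketched as a rule tree whose leaves hold small linear formulas. The following toy sketch shows how a prediction walks the tree and applies the matching segment's formula; all rules, feature names, and coefficients are invented for illustration and are not NEC's actual model.

```python
# Toy sketch of an HML-style model: a rule-based tree segments the
# input, and each segment (leaf) has its own sparse linear formula.
# Rules, features, and coefficients here are made up for illustration.

def segment_of(x):
    """Walk the rule tree to find the data segment for sample x."""
    if x["gender"] == "male":
        return "tall_man" if x["height"] > 180 else "short_man"
    return "woman"

# Each segment uses only its own small set of selected features.
FORMULAS = {
    "tall_man":  {"age": 0.02, "alcohol": 0.5, "smoke": 0.8},
    "short_man": {"age": 0.03, "weight": 0.01},
    "woman":     {"age": 0.02, "food": 0.1},
}

def predict_risk(x):
    coeffs = FORMULAS[segment_of(x)]
    return sum(w * x[f] for f, w in coeffs.items())

sample = {"gender": "male", "height": 185, "age": 40, "alcohol": 2, "smoke": 1}
risk = predict_risk(sample)  # tall_man: 0.02*40 + 0.5*2 + 0.8*1 = 2.6
```

The point of the structure is transparency: which segment a sample falls into, and which features drive its prediction, are both readable directly from the model.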
HML (Heterogeneous Mixture Learning)
HML Algorithm
The algorithm iterates three steps over the training data:
Data Segmentation: segment the data by a rule-based tree.
Feature Selection: select features and fit predictive models for the data segments.
Pruning: prune the tree.
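The three steps can be sketched end to end in miniature. This is a heavily simplified stand-in, not NEC's algorithm: segmentation is a single hand-written rule, feature selection keeps the one feature whose 1-D least-squares fit has the smallest error, and pruning just drops undersized segments.

```python
# Toy skeleton of the HML loop: Data Segmentation -> Feature
# Selection -> Pruning. Every choice below is a simplified stand-in.

def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def train_hml_toy(rows, features=("x0", "x1"), min_segment_size=3):
    # 1) Data Segmentation: a one-rule tree on feature "x0".
    segments = {"low":  [r for r in rows if r["x0"] <= 0],
                "high": [r for r in rows if r["x0"] > 0]}
    model = {}
    for name, seg in segments.items():
        if len(seg) < min_segment_size:   # 3) Pruning (stand-in rule)
            continue
        ys = [r["y"] for r in seg]
        # 2) Feature Selection: keep the feature whose 1-D fit has
        #    the smallest squared error on this segment.
        def err(f):
            a, b = fit_line([r[f] for r in seg], ys)
            return sum((y - (a * r[f] + b)) ** 2 for r, y in zip(seg, ys))
        best = min(features, key=err)
        model[name] = (best, fit_line([r[best] for r in seg], ys))
    return model

rows = [
    {"x0": -1, "x1": 5, "y": -1}, {"x0": -2, "x1": 5, "y": -2},
    {"x0": -3, "x1": 5, "y": -3},
    {"x0": 1, "x1": 10, "y": 20}, {"x0": 2, "x1": 25, "y": 50},
    {"x0": 3, "x1": 30, "y": 60},
]
model = train_hml_toy(rows)
```

Here the "low" segment is best explained by x0 and the "high" segment by x1, so each segment ends up with a different formula over different features, which is the shape of an HML model.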
Demand for Big Data Analysis
These applications demand Big Data scale, e.g., 5 million (customers) × 12 (months) = 60,000,000 training samples in one application, and 100 (items) × 1,000 (shops) = 100,000 predictive models in another.
Distributed HML on Spark
Why Spark, not Hadoop MapReduce?
Because Spark's in-memory architecture runs HML faster: the iterative loop of data segmentation, feature selection, and pruning keeps intermediate results in executor memory across iterations, instead of writing them to disk between jobs as MapReduce does.
Data Scalability powered by Spark
Handle training data at virtually unlimited scale by adding executors: the driver coordinates executors that read training data from HDFS, and adding executors brings more memory and CPU power.
Distributed HML Engine: Architecture
A data scientist invokes training through the Distributed HML Interface (Python), which drives the Distributed HML Core (Scala). The core runs on YARN, reads training data from HDFS, uses matrix computation libraries (Breeze, BLAS), and outputs the HML model.
3 Technical Key Points to Run ML Fast on Spark
1. Avoid data shuffling
2. Balance the computational effort across executors
3. Design RDDs to leverage high-speed matrix libraries, and cut long-chained RDD operations
Challenge: Avoiding Data Shuffling
A naïve design shuffles training data (TB scale or more) between executors. Distributed HML instead keeps training data local to each executor and moves only models (10 KB to 10 MB) to the driver, where the HML Model Merge Algorithm combines them.
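The "move models, not data" idea can be illustrated with a merge of per-partition summaries. This is a generic sufficient-statistics sketch, not NEC's merge algorithm: each simulated executor reduces its local partition to a tiny tuple, and the driver combines tuples, so only KB-scale objects ever travel.

```python
# Sketch of model merging instead of data shuffling: executors emit
# tiny sufficient-statistics summaries for a 1-D linear fit, and the
# driver adds them up and solves. (Illustrative only; not NEC's
# HML Model Merge Algorithm.)
from functools import reduce

def local_stats(partition):
    """Per-executor pass over local (x, y) samples."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)

def merge(s1, s2):
    """Driver-side merge: sufficient statistics simply add."""
    return tuple(a + b for a, b in zip(s1, s2))

def solve(stats):
    """Recover slope and intercept of y = a*x + b from merged stats."""
    n, sx, sy, sxx, sxy = stats
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

partitions = [[(1, 3), (2, 5)], [(3, 7), (4, 9)], [(5, 11)]]
a, b = solve(reduce(merge, (local_stats(p) for p in partitions)))
# All five points lie on y = 2x + 1, so a == 2.0 and b == 1.0
```

The merged result is identical to fitting on the concatenated data, which is what makes driver-side merging a valid substitute for shuffling the samples themselves.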
Execution Flow of Distributed HML
1. Executors build a rule-based tree from their local data in parallel.
2. The driver merges the trees (HML Segment Merge Algorithm).
3. The driver broadcasts the merged tree to the executors.
4. Executors perform feature selection on their local data in parallel.
5. The driver merges the results of feature selection (HML Feature Merge Algorithm).
6. The driver broadcasts the merged feature-selection results to the executors.
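The flow above is one repeated pattern: executors compute on local partitions, the driver merges the small results, and the merged result becomes the broadcast state for the next step. A minimal sketch of that skeleton, with placeholder compute and merge functions standing in for the real tree-building and feature-selection steps:

```python
# Skeleton of the local-compute / driver-merge / broadcast cycle.
# local_compute and driver_merge are placeholders, not NEC's
# Segment/Feature Merge Algorithms.
from functools import reduce

def run_step(partitions, shared_state, local_compute, driver_merge):
    # Executors: work on local data in parallel (simulated by a loop).
    local_results = [local_compute(p, shared_state) for p in partitions]
    # Driver: merge the small per-executor results; the return value
    # is then "broadcast" by passing it into the next run_step call.
    return reduce(driver_merge, local_results)

partitions = [[1, 2], [3, 4], [5, 6, 7]]
# "Tree building" stand-in: local max, merged by max.
tree = run_step(partitions, None, lambda p, s: max(p), max)
# "Feature selection" stand-in: count values above the broadcast
# threshold tree/2, merged by addition.
count = run_step(partitions, tree,
                 lambda p, s: sum(1 for v in p if v > s / 2),
                 lambda a, b: a + b)
```

Each round moves only the merged result between driver and executors, matching the shuffle-free design of the previous section.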
Wait Time for Other Executors Delays Execution
Machine learning is likely to cause unbalanced computational load: when each executor optimizes the predictive formulas of only its own data segments, executors with small segments finish early and sit idle waiting for the others.
Balance the Computational Effort Across Executors
Instead, every executor optimizes all predictive formulas using an equally divided share of the training data, so all executors finish at about the same time and the completion time is reduced.
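The load-balancing effect can be quantified with a toy comparison (segment sizes invented for illustration): completion time tracks the busiest executor, so splitting rows evenly beats assigning one skewed segment per executor.

```python
# Toy comparison of the two work assignments from the slides above:
# one-segment-per-executor (skewed) vs. equal row slices where every
# executor works on all segments' formulas. Segment sizes are made up.

def max_load(assignments):
    """Completion time ~ rows handled by the busiest executor."""
    return max(len(rows) for rows in assignments)

segments = {"seg_a": list(range(900)),  # one heavy segment
            "seg_b": list(range(60)),
            "seg_c": list(range(40))}

# Unbalanced: one segment per executor.
per_segment = list(segments.values())

# Balanced: the same 1000 rows split evenly across 3 executors.
all_rows = [r for seg in segments.values() for r in seg]
per_executor = [all_rows[i::3] for i in range(3)]

unbalanced = max_load(per_segment)   # dominated by seg_a: 900 rows
balanced = max_load(per_executor)    # about 1000/3 rows each
```

With skew this strong, the balanced assignment cuts the busiest executor's load from 900 rows to 334, which is the "completion time reduced" effect on the slide.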
Machine Learning Consists of Many Matrix Computations
Each HML step (data segmentation, feature selection, pruning) boils down to basic matrix computation, matrix inversion, and combinatorial optimization on the CPU.
RDD Design for Leveraging High-Speed Libraries
In a straightforward RDD design, each partition holds training samples as individual elements, and per-element map calls cannot exploit fast matrix libraries. Distributed HML's RDD design instead uses mapPartitions to turn the sequence of training samples in each partition into a single matrix object, so matrix computation libraries (Breeze, BLAS) can process a whole partition at once.
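The two designs can be contrasted in miniature. Here the "matrix" is just a list of rows and the "library call" a plain column sum, purely to show the mapPartitions-style restructuring: one matrix object per partition instead of one element per sample.

```python
# Sketch of the RDD-design idea: mapPartitions-style processing turns
# each partition's samples into ONE matrix object, which a fast
# library (Breeze/BLAS in the real engine) could process in bulk.
# The "matrix" and "library call" below are stand-ins.

def to_matrix(sample_iter):
    """mapPartitions-style: yield one matrix object per partition."""
    yield [list(sample) for sample in sample_iter]

def column_sums(matrix):
    """Stand-in for a batched library call over a whole partition."""
    return [sum(col) for col in zip(*matrix)]

partitions = [[(1, 2), (3, 4)], [(5, 6)]]
matrices = [m for part in partitions for m in to_matrix(iter(part))]
partials = [column_sums(m) for m in matrices]       # per-executor work
total = [sum(col) for col in zip(*partials)]        # driver-side merge
```

One call now covers a whole partition, which is what lets a BLAS-backed routine amortize its per-call overhead over many samples.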
Performance Degradation by Long-Chained RDDs
Long-chained RDD operations cause high-cost recalculations: recomputation and GC overhead grow with the length of the lineage. The HML iteration of data segmentation, feature selection, and pruning builds exactly such a chain:

RDD.map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   ...
   .map(PruningMapFunc)

The fix is to cut long-chained operations periodically with checkpoint(): calling savedRDD.checkpoint() materializes the RDD and truncates its lineage, so later recomputation restarts from the checkpoint instead of from the original input.
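The lineage-cutting effect can be modeled with a tiny mock. This is not Spark's API (Spark's checkpoint() returns None and materializes lazily); the mock only counts how many operations would need replaying on recomputation, to show why periodic checkpoints bound that cost.

```python
# Toy model of RDD lineage: each transformation lengthens the
# recomputation chain, and checkpoint() truncates it by materializing
# the data. Mock API for illustration only, not Spark's.

class MockRDD:
    def __init__(self, data, lineage_len=0):
        self.data = data
        self.lineage_len = lineage_len   # ops replayed on recompute

    def map(self, f):
        return MockRDD([f(x) for x in self.data], self.lineage_len + 1)

    def checkpoint(self):
        # Materialize: recomputation now restarts from here.
        return MockRDD(list(self.data), lineage_len=0)

rdd = MockRDD([1, 2, 3])
for i in range(10):
    rdd = rdd.map(lambda x: x + 1)
    if (i + 1) % 4 == 0:             # cut the chain periodically
        rdd = rdd.checkpoint()
```

Without the checkpoint calls the final lineage would be 10 operations long; with a cut every 4 operations it never exceeds 4, which is the bound on recomputation cost the slide is after.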
Benchmark Performance Evaluations
Eval 1: Prediction Error (vs. Spark MLlib algorithms)
Distributed HML achieves low error, competitive with more complex models:

data                          # samples   Distributed HML  Logistic Regression  Decision Tree  Random Forests
gas sensor array (CO)*        4,208,261   0.542 (1st)      0.597                0.587          0.576
household power consumption*  2,075,259   0.524 (1st)      0.531                0.529          0.655
HIGGS*                        11,000,000  0.335 (2nd)      0.358                0.337          0.317
HEPMASS*                      7,000,000   0.156 (1st)      0.163                0.167          0.175

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Eval 2: Performance Scalability (vs. Spark MLlib algorithms)
Distributed HML's scaling with the number of CPU cores is competitive with Spark MLlib's Logistic Regression on both HIGGS* (11M samples) and HEPMASS* (7M samples).
* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Evaluation in a Real Case
ATM Cash Balance Prediction
Around 20,000 ATMs. Too much cash means high inventory cost; too little cash means high risk of running out.
Training Speed
Serial HML*: 9 days → Distributed HML: 2 hours (110x speed-up)
▌Summary of data
 # ATMs: around 20,000 (in Japan)
 # training samples: around 10M
▌Cluster spec (10 nodes)
 # CPU cores: 128 (Xeon E5-2640 v3)
 Memory: 2.5TB
 Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
* Run with 1 CPU core and 256GB memory
Prediction Error
Distributed HML's average error is 17% lower than Serial HML's.
▌Summary of data
 # ATMs: around 20,000 (in Japan)
 # training samples: around 23M
▌Cluster spec (10 nodes)
 # CPU cores: 128
 Memory: 2.5TB
 Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
NEC’s HML Data Analytics Appliance powered by Micro Modular Server
Dilemma between Cloud and On-Premises
Cloud
 Pros: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort
 Cons: high pay-for-use charge; complicated application software design; data leakage risk
On-Premises
 Pros: no pay-for-use charge; low data leakage risk; simple application software design
 Cons: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost
NEC's HML Data Analytics Appliance: Concept
Plug-and-Analytics for data scientists:
 Quick deployment
 Low data leakage risk
 Low operation effort for Spark/YARN/HDFS, powered by HDP
 Significant space savings
 Low power cost
 Simple application software
328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!
Speed Test (ATM Cash Balance Prediction)
Standard Cluster: 7,275 sec (2h) → Micro Modular Server: 4,529 sec (1.25h), a 1.6x speed-up
▌Summary of data
 # ATMs: around 20,000 (in Japan)
 # training samples: around 10M
▌Standard Cluster spec (10 nodes)
 # CPU cores: 128 (Xeon E5-2640 v3)
 Memory: 2.5TB
 Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)
Toward More Big Data!
From the Micro Modular Server (DX1000) to the Scalable Modular Server (DX2000): CPU 2.3x speed-up, memory 1.7x increase, SSD 3.3x increase. 272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space.
Summary
NEC's Heterogeneous Mixture Learning (HML) derives accurate and transparent predictive formulas. Distributed HML brings it to Big Data scale on Spark through model merging instead of data shuffling, balanced executor load, and matrix-friendly RDD design with periodic checkpoints. NEC's HML Data Analytics Appliance packages the stack on the Micro Modular Server.
Appendix
Micro Modular Server (DX1000) Spec.: Server Module
Micro Modular Server (DX1000) Spec.: Module Enclosure
Scalable Modular Server (DX2000) Spec.: Server Module
Scalable Modular Server (DX2000) Spec.: Module Enclosure