Big Data Heterogeneous Mixture Learning on Spark

62
Big Data Heterogeneous Mixture Learning on Spark Masato Asahara and Ryohei Fujimaki NEC Data Science Research Labs. Jun/30/2016 @Hadoop Summit 2016

Transcript of Big Data Heterogeneous Mixture Learning on Spark

Page 1: Big Data Heterogeneous Mixture Learning on Spark

Big Data Heterogeneous Mixture Learning on Spark

Masato Asahara and Ryohei Fujimaki

NEC Data Science Research Labs.

Jun/30/2016 @Hadoop Summit 2016

Page 2: Big Data Heterogeneous Mixture Learning on Spark

2 © NEC Corporation 2016

Who we are?

▌Masato Asahara (Ph.D.)

▌Researcher, NEC Data Science Research Laboratory Masato received his Ph.D. degree from Keio University, and has worked at

NEC for 6 years in the research field of distributed computing systems and computing resource management technologies.

Masato is currently leading developments of Spark-based machine learning and data analytics systems, particularly using NEC’s Heterogeneous Mixture Learning technology.

▌Ryohei Fujimaki (Ph.D.)

▌Research Fellow, NEC Data Science Research Laboratory Ryohei is responsible for cores to NEC’s leading predictive and prescriptive

analytics solutions, including "heterogeneous mixture learning” and “predictive optimization” technologies. In addition to technology R&D, Ryohei is also involved with co-developing cutting-edge advanced analytics solutions with NEC’s global business clients and partners in the North American and APAC regions.

Ryohei received his Ph.D. degree from the University of Tokyo, and became the youngest research fellow ever in NEC Corporation’s 115-year history.

Page 3: Big Data Heterogeneous Mixture Learning on Spark

3 © NEC Corporation 2016

Agenda

HML

Page 4: Big Data Heterogeneous Mixture Learning on Spark

4 © NEC Corporation 2016

Agenda

CPU12

SEP11

SEP

10SEP

HML

Page 5: Big Data Heterogeneous Mixture Learning on Spark

5 © NEC Corporation 2016

Agenda

HML

Page 6: Big Data Heterogeneous Mixture Learning on Spark

6 © NEC Corporation 2016

Agenda

Micro Modular Server

HML

Page 7: Big Data Heterogeneous Mixture Learning on Spark

NEC’s Predictive Analytics and Heterogeneous Mixture Learning

Page 8: Big Data Heterogeneous Mixture Learning on Spark

8 © NEC Corporation 2016

Enterprise Applications of HML

Driver RiskAssessment

Inventory Optimization

Churn Retention

PredictiveMaintenance

Product Price Optimization

Sales Optimization

Energy/Water Operation Mgmt.

Page 9: Big Data Heterogeneous Mixture Learning on Spark

9 © NEC Corporation 2016

NEC’s Heterogeneous Mixture Learning (HML)

NEC’s machine learning that automatically derives “accurate” and “transparent” formulas behind Big Data.

Sun

Sat

Y=a1x1+ ・・・ +anxn

Y=a1x1+ ・・・+knx3

Y=b1x2+ ・・・+hnxn

Y=c1x4+ ・・・+fnxn

Mon

Sun

Explore Massive Formulas

Transparentdata segmentation

and predictive formulasHeterogeneous

mixture data

Page 10: Big Data Heterogeneous Mixture Learning on Spark

10 © NEC Corporation 2016

The HML Model

Gender

Heighttallshort

age

smoke

food

weightsports

sports

tea/coffee

beer/wine

The health risk of a tall man can be

predicted by age, alcohol and smoke!

Page 11: Big Data Heterogeneous Mixture Learning on Spark

11 © NEC Corporation 2016

HML (Heterogeneous Mixture Learning)

Page 12: Big Data Heterogeneous Mixture Learning on Spark

12 © NEC Corporation 2016

HML (Heterogeneous Mixture Learning)

Page 13: Big Data Heterogeneous Mixture Learning on Spark

13 © NEC Corporation 2016

HML Algorithm

Segment data by a rule-based tree

Feature Selection

Pruning

Data Segmentation

Training data

Page 14: Big Data Heterogeneous Mixture Learning on Spark

14 © NEC Corporation 2016

HML Algorithm

Select features and fit predictive models for data segments

Feature Selection

Pruning

Data Segmentation

Training data

Page 15: Big Data Heterogeneous Mixture Learning on Spark

15 © NEC Corporation 2016

Enterprise Applications of HML

Driver RiskAssessment

Inventory Optimization

Churn Retention

PredictiveMaintenance

Energy/Water Operation Mgmt.

Sales Optimization

Product Price Optimization

Page 16: Big Data Heterogeneous Mixture Learning on Spark

16 © NEC Corporation 2016

Driver RiskAssessment

Inventory Optimization

Churn Retention

PredictiveMaintenance

Energy/Water Demand Forecasting

Sales Forecasting

Product Price Optimization

Demand for Big Data Analysis

5 million(customers)×12(months)=60,000,000 training samples

Page 17: Big Data Heterogeneous Mixture Learning on Spark

17 © NEC Corporation 2016

Driver RiskAssessment

Inventory Optimization

Churn Retention

PredictiveMaintenanc

e

Energy/Water Operation Mgmt.

Product Price Optimization

Demand for Big Data Analysis

100(items)×1000(shops)=100,000 predictive models

Sales Optimization

Page 18: Big Data Heterogeneous Mixture Learning on Spark

18 © NEC Corporation 2016

Driver RiskAssessment

Inventory Optimization

Churn Retention

PredictiveMaintenanc

e

Energy/Water Operation Mgmt.

Sales Optimization

Product Price Optimization

Demand for Big Data Analysis

Page 19: Big Data Heterogeneous Mixture Learning on Spark

Distributed HML on Spark

Page 20: Big Data Heterogeneous Mixture Learning on Spark

20 © NEC Corporation 2016

Why Spark, not Hadoop?

Because Spark’s in-memory architecture can run HML faster

Feature Selection

Pruning

Data Segmentation

CPU CPU

Data segmentation Feature selection

CPU CPU CPU

Data segmentation Feature selection

~

Pruning

Page 21: Big Data Heterogeneous Mixture Learning on Spark

21 © NEC Corporation 2016

Data Scalability powered by Spark

Treat unlimited scale of training data by adding executors

executor executor executor

CPU CPU CPU

driver

HDFS

Add executors,and get

more memory and CPU power

Training data

Page 22: Big Data Heterogeneous Mixture Learning on Spark

22 © NEC Corporation 2016

Distributed HML Engine: Architecture

HDFS

Distributed HMLCore (Scala)

YARN

invoke

HML model

Distributed HMLInterface (Python)

Matrix computation

libraries(Breeze, BLAS)

Data Scientist

Page 23: Big Data Heterogeneous Mixture Learning on Spark

23 © NEC Corporation 2016

3 Technical Key Points to Fast Run ML on Spark

CPU12

SEP11

SEP

10SEP

Page 24: Big Data Heterogeneous Mixture Learning on Spark

24 © NEC Corporation 2016

3 Technical Key Points to Fast Run ML on Spark

CPU12

SEP11

SEP

10SEP

Page 25: Big Data Heterogeneous Mixture Learning on Spark

25 © NEC Corporation 2016

Challenge to Avoid Data Shuffling

executor executor executor

driver

Model: 10KB~10MB

HML Model Merge AlgorithmTraining data: TB~

executor executor executor

driver

Naïve Design Distributed HML

Page 26: Big Data Heterogeneous Mixture Learning on Spark

26 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors build a rule-based tree from their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Page 27: Big Data Heterogeneous Mixture Learning on Spark

27 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the trees

Driver

Executor

Feature Selection

Pruning

Data Segmentation

HML Segment Merge Algorithm

Page 28: Big Data Heterogeneous Mixture Learning on Spark

28 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged tree to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Page 29: Big Data Heterogeneous Mixture Learning on Spark

29 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors perform feature selection with their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Sat

Sat

Sat

Sat

Sat

Page 30: Big Data Heterogeneous Mixture Learning on Spark

30 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Sat

Sat

Sat

Sat

Sat

HML Feature Merge Algorithm

Page 31: Big Data Heterogeneous Mixture Learning on Spark

31 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Sat

HML FeatureMerge Algorithm

Page 32: Big Data Heterogeneous Mixture Learning on Spark

32 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged results of feature selection to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

Sat

SatSatSat

Page 33: Big Data Heterogeneous Mixture Learning on Spark

33 © NEC Corporation 2016

3 Technical Key Points to Fast Run ML on Spark

CPU12

SEP11

SEP

10SEP

Page 34: Big Data Heterogeneous Mixture Learning on Spark

34 © NEC Corporation 2016

Wait Time for Other Executors Delays Execution

Machine learning is likely to cause unbalanced computation load

Executor

Sat

Sat

Time

Executor Executor

Page 35: Big Data Heterogeneous Mixture Learning on Spark

35 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Executor optimizes all predictive formulas with equally-divided training data

Executor

Sat

Sat

Time

Executor Executor

Page 36: Big Data Heterogeneous Mixture Learning on Spark

36 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Executor optimizes all predictive formulas with equally-divided training data

Executor

Sat

Sat

Time

Executor Executor

Sat Sat

Sat Sat

Completion time reduced!

Page 37: Big Data Heterogeneous Mixture Learning on Spark

37 © NEC Corporation 2016

3 Technical Key Points to Fast Run ML on Spark

1TBCPU

12

SEP11

SEP

10SEP

Page 38: Big Data Heterogeneous Mixture Learning on Spark

38 © NEC Corporation 2016

Machine Learning Consists of Many Matrix Computations

Feature Selection

Pruning

Data Segmentation

Basic Matrix ComputationMatrix Inversion

Combinatorial Optimization…

CPU

Page 39: Big Data Heterogeneous Mixture Learning on Spark

39 © NEC Corporation 2016

RDD Design for Leveraging High-Speed Libraries

Distributed HML’s RDD Design

Matrix object1

element

Straightforward RDD Design

Training sample1

partition

Training sample2CPU

CPU

Matrix computation libraries(Breeze, BLAS)

~Training sample1

Training sample2

seq(Training sample1,Training sample2, …)

mapPartition

Matrix object1

Training sample1

Training sample2

map

Iterator

Matrixobject

creation

element

Page 40: Big Data Heterogeneous Mixture Learning on Spark

40 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

Feature Selection

Pruning

Data Segmentation RDD.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

.map(PruningMapFunc)

Page 41: Big Data Heterogeneous Mixture Learning on Spark

41 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

Feature Selection

Pruning

Data Segmentation RDD.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

GC

.map(TreeMapFunc)

CPU

.map(PruningMapFunc)

Page 42: Big Data Heterogeneous Mixture Learning on Spark

42 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Cut long-chained operations periodically by checkpoint()

Feature Selection

Pruning

Data Segmentation RDD.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

.map(TreeMapFunc)

.map(FeatureSelectionMapFunc)

.reduce(TreeReduceFunc)

.reduce(FeatureSelectionReduceFunc)

GC

.map(TreeMapFunc)

savedRDD

.checkpoint()

=

CPU

~

.map(PruningMapFunc)

Page 43: Big Data Heterogeneous Mixture Learning on Spark

Benchmark Performance Evaluations

Page 44: Big Data Heterogeneous Mixture Learning on Spark

44 © NEC Corporation 2016

Eval 1: Prediction Error (vs. Spark MLlib algorithms)

Distributed HML achieves low error competitive to a complex model

data # samples Distributed HML

Logistic Regression

Decision Tree

Random Forests

gas sensor array (CO)*

4,208,261 0.542 0.597 0.587 0.576

household power consumption*

2,075,259 0.524 0.531 0.529 0.655

HIGGS* 11,000,000 0.335 0.358 0.337 0.317

HEPMASS* 7,000,000 0.156 0.163 0.167 0.175

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

1st

1st

1st

2nd

Page 45: Big Data Heterogeneous Mixture Learning on Spark

45 © NEC Corporation 2016

Eval 2: Performance Scalability (vs. Spark MLlib algos)

Distributed HML is competitive with Spark MLlib implementations.

HIGGS* (11M samples) HEPMASS* (7M samples)# CPU cores

DistributedHML

LogisticRegression

DistributedHML

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

LogisticRegression

Page 46: Big Data Heterogeneous Mixture Learning on Spark

Evaluation in Real Case

Page 47: Big Data Heterogeneous Mixture Learning on Spark

47 © NEC Corporation 2016

ATM Cash Balance Prediction

Much cash High inventory cost

Little cash High risk ofout of stock

Around 20,000 ATMs

Page 48: Big Data Heterogeneous Mixture Learning on Spark

48 © NEC Corporation 2016

Training Speed

2 hours9 days

▌Summary of data

# ATMs: around 20,000 (in Japan)

# training samples: around 10M

▌Cluster spec (10 nodes) # CPU cores: 128 (Xeon E5-2640 v3)

Memory: 2.5TB

Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

* Run w/ 1 CPU core and 256GB memory

110xSpeed up

Serial HML* Distributed HML

Page 49: Big Data Heterogeneous Mixture Learning on Spark

49 © NEC Corporation 2016

Prediction Error

▌Summary of data

# ATMs: around 20,000 (in Japan)

# training samples: around 23M

▌Cluster spec (10 nodes)# CPU cores: 128

Memory: 2.5TB

Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

Average Error:

17% lower

SerialHML

Distributed HMLSerialHML

SerialHML

vs

Better: Distributed HML

Page 50: Big Data Heterogeneous Mixture Learning on Spark

NEC’s HML Data Analytics Appliance powered by Micro Modular Server

Page 51: Big Data Heterogeneous Mixture Learning on Spark

51 © NEC Corporation 2016

Dilemma between Cloud and On-Premises

Cloud On-Premises

•High pay-for-use charge

•Complicated application software design

•Data leakage risk

• Complicated operation for Spark/YARN/HDFS

• High installation cost

• High space/power cost

•Quick start

•Low operation effort for Spark/YARN/HDFS

•No space/power cost

•No installation effort

• No pay-for-use charge

• Low data leakage risk

• Simple application software design

Page 52: Big Data Heterogeneous Mixture Learning on Spark

52 © NEC Corporation 2016

NEC’s HML Data Analytics Appliance: Concept

Micro Modular Server

Data Scientist ~Plug-and-Analytics!

•Quick deploy!

•Low data leakage risk

•Low operation effort for Spark/YARN/HDFS powered by HDP

•Significantly space saving

•Low power cost

•Simple application software

328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!

HML

Page 53: Big Data Heterogeneous Mixture Learning on Spark

53 © NEC Corporation 2016

Speed Test (ATM Cash Balance Prediction)

7275 sec(2h)

▌Summary of data

# ATMs: around 20,000 (in Japan)

# training samples: around 10M

▌Standard Cluster spec (10 nodes) # CPU cores: 128 (Xeon E5-2640 v3)

Memory: 2.5TB

Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

Standard Cluster Micro Modular Server

4529 sec(1.25h)

1.6xSpeed up

Page 54: Big Data Heterogeneous Mixture Learning on Spark

54 © NEC Corporation 2016

Toward More Big Data!

Micro Modular Server(DX1000)

Scalable Modular Server(DX2000)

• CPU 2.3x speed up• MEM 1.7x increased• SSD 3.3x increased

272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space

Page 55: Big Data Heterogeneous Mixture Learning on Spark

55 © NEC Corporation 2016

Summary

CPU12

SEP11

SEP

10SEP

HMLMicro Modular Server

Page 56: Big Data Heterogeneous Mixture Learning on Spark

56 © NEC Corporation 2016

Summary

HMLMicro Modular Server

Page 57: Big Data Heterogeneous Mixture Learning on Spark
Page 58: Big Data Heterogeneous Mixture Learning on Spark

Appendix

Page 59: Big Data Heterogeneous Mixture Learning on Spark

59 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Server Module

Page 60: Big Data Heterogeneous Mixture Learning on Spark

60 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Module Enclosure

Page 61: Big Data Heterogeneous Mixture Learning on Spark

61 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Server Module

Page 62: Big Data Heterogeneous Mixture Learning on Spark

62 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Module Enclosure