Big Data Heterogeneous Mixture Learning on Spark


Masato Asahara and Ryohei Fujimaki

NEC Data Science Research Labs.

Jun/30/2016 @Hadoop Summit 2016

2 © NEC Corporation 2016

Who We Are

▌Masato Asahara (Ph.D.)
Researcher, NEC Data Science Research Laboratory

Masato received his Ph.D. from Keio University and has worked at NEC for six years in the research field of distributed computing systems and computing-resource management technologies. He currently leads the development of Spark-based machine learning and data analytics systems, particularly those using NEC’s Heterogeneous Mixture Learning technology.

▌Ryohei Fujimaki (Ph.D.)
Research Fellow, NEC Data Science Research Laboratory

Ryohei is responsible for the core technologies behind NEC’s leading predictive and prescriptive analytics solutions, including the “heterogeneous mixture learning” and “predictive optimization” technologies. In addition to technology R&D, he co-develops cutting-edge advanced analytics solutions with NEC’s global business clients and partners in the North American and APAC regions.

Ryohei received his Ph.D. from the University of Tokyo and became the youngest research fellow ever in NEC Corporation’s 115-year history.

3 © NEC Corporation 2016

Agenda

▌NEC’s Predictive Analytics and Heterogeneous Mixture Learning (HML)
▌Distributed HML on Spark
▌3 Technical Key Points to Run ML Fast on Spark
▌Benchmark Performance Evaluations
▌Evaluation in Real Case
▌NEC’s HML Data Analytics Appliance powered by Micro Modular Server

NEC’s Predictive Analytics and Heterogeneous Mixture Learning

8 © NEC Corporation 2016

Enterprise Applications of HML

Driver Risk Assessment

Inventory Optimization

Churn Retention

Predictive Maintenance

Product Price Optimization

Sales Optimization

Energy/Water Operation Mgmt.

9 © NEC Corporation 2016

NEC’s Heterogeneous Mixture Learning (HML)

NEC’s machine learning that automatically derives “accurate” and “transparent” formulas behind Big Data.

[Figure: heterogeneous mixture data (Mon … Sun) receive a transparent data segmentation, and a separate predictive formula is explored for each segment, e.g.

Y = a1x1 + … + anxn
Y = a1x1 + … + knx3
Y = b1x2 + … + hnxn
Y = c1x4 + … + fnxn

exploring massive numbers of candidate formulas.]

10 © NEC Corporation 2016

The HML Model

[Figure: a rule-based tree first splits on gender, then on height (tall/short); each leaf segment gets its own formula over features such as age, smoking, food, weight, sports, tea/coffee and beer/wine.]

The health risk of a tall man can be predicted by age, alcohol and smoking!
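As a concrete illustration, a heterogeneous-mixture model of this shape can be written down in a few lines: a rule-based tree routes each sample to a segment, and each segment owns a sparse linear formula. The gender/height split, the feature names and every coefficient below are invented for illustration; they are not taken from a real HML model.

```python
# A minimal sketch of an HML-style model: a rule-based tree routes a
# sample to a segment; each segment has its own sparse linear formula.
# All rules, features and coefficients here are illustrative only.

SEGMENT_FORMULAS = {
    # segment -> {selected feature: coefficient}
    ("male", "tall"): {"age": 0.4, "alcohol": 0.3, "smoke": 0.5},
    ("male", "short"): {"age": 0.2, "sports": -0.3},
    ("female", "tall"): {"food": 0.1, "tea_coffee": -0.05},
    ("female", "short"): {"age": 0.3, "smoke": 0.4},
}

def segment_of(sample):
    """Rule-based tree: split on gender, then on height."""
    height = "tall" if sample["height_cm"] >= 170 else "short"
    return (sample["gender"], height)

def predict(sample):
    """Apply the segment's formula: y = sum(coef * feature)."""
    formula = SEGMENT_FORMULAS[segment_of(sample)]
    return sum(coef * sample[feat] for feat, coef in formula.items())

x = {"gender": "male", "height_cm": 182, "age": 45, "alcohol": 2, "smoke": 1}
print(predict(x))  # risk score from the (male, tall) formula
```

Because the model is just a small tree plus a few coefficients per leaf, both the segmentation and each prediction stay human-readable, which is the "transparent" half of the claim above.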

11 © NEC Corporation 2016

HML (Heterogeneous Mixture Learning)


13 © NEC Corporation 2016

HML Algorithm

Segment the data by a rule-based tree.

[Diagram: training data cycle through Data Segmentation, Feature Selection and Pruning.]

14 © NEC Corporation 2016

HML Algorithm

Select features and fit a predictive model for each data segment.

[Diagram: training data cycle through Data Segmentation, Feature Selection and Pruning.]

15 © NEC Corporation 2016


16 © NEC Corporation 2016

Demand for Big Data Analysis

Energy/Water Demand Forecasting, Sales Forecasting:
5 million (customers) × 12 (months) = 60,000,000 training samples

17 © NEC Corporation 2016

Demand for Big Data Analysis

100 (items) × 1,000 (shops) = 100,000 predictive models

18 © NEC Corporation 2016


Distributed HML on Spark

20 © NEC Corporation 2016

Why Spark, not Hadoop?

Because Spark’s in-memory architecture keeps intermediate results in memory across HML’s iterations of data segmentation, feature selection and pruning, so the whole loop runs faster.

[Diagram: executors’ CPUs repeat data segmentation and feature selection, with pruning in between.]

21 © NEC Corporation 2016

Data Scalability powered by Spark

Handle training data at any scale by adding executors.

[Diagram: the driver coordinates executors that read training data from HDFS; adding executors adds memory and CPU power.]

22 © NEC Corporation 2016

Distributed HML Engine: Architecture

[Diagram: a data scientist uses the Distributed HML Interface (Python), which invokes the Distributed HML Core (Scala); the core runs on YARN, reads training data from HDFS, calls matrix computation libraries (Breeze, BLAS), and outputs the HML model.]

23 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

1. Avoid data shuffling
2. Balance the computational load across executors
3. Design RDDs to leverage high-speed matrix computation libraries

Challenge: Avoid Data Shuffling

[Diagram: a naïve design shuffles TB-scale training data among executors; Distributed HML instead has each executor send its model, only 10KB~10MB, to the driver, which merges them (HML Model Merge Algorithm).]
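A rough back-of-envelope, using only the sizes quoted on this slide (training data on the order of a TB, models of 10 KB~10 MB), shows why shipping models beats shuffling data. The executor count below is an assumed round number, not from the deck.

```python
# Back-of-envelope: per-iteration network traffic of the naive design
# (reshuffle training data) vs. the model-merge design (ship models).
# Sizes restate the slide's rough figures; they are not measurements.

TB = 1024**4
MB = 1024**2

training_data_bytes = 1 * TB   # "Training data: TB~"
model_bytes = 10 * MB          # "Model: 10KB~10MB" (upper end)
num_executors = 100            # assumed cluster size

# Naive: each iteration moves the training data across the network.
naive_traffic = training_data_bytes

# Model merge: each executor sends its small model to the driver, and
# the driver broadcasts one merged model back to every executor.
hml_traffic = num_executors * model_bytes + num_executors * model_bytes

print(naive_traffic // hml_traffic)  # traffic reduction factor
```

Even with pessimistic model sizes, the model-merge design moves hundreds of times fewer bytes per iteration, and the gap widens as the training data grows.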

26 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors build a rule-based tree from their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation

27 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the trees

Driver

Executor

Feature Selection

Pruning

Data Segmentation

HML Segment Merge Algorithm

28 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged tree to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

29 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors perform feature selection with their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation


30 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation


HML Feature Merge Algorithm

31 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation

HML Feature Merge Algorithm

32 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged results of feature selection to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

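Putting the flow on slides 26–33 together, each round is map (local work on executors), reduce (merge on the driver), then broadcast. The sketch below mimics that loop in plain Python, with lists standing in for executors; the "keep features that vary locally" rule and the set-union merge are placeholders, not the actual HML segment/feature merge algorithms.

```python
# Sketch of one Distributed HML round: executors compute results from
# their local shards, the driver merges the small per-executor results,
# and the merged result is broadcast back to every executor.
# The selection rule and union merge are illustrative placeholders.

def local_feature_selection(local_data):
    """Executor-side: keep features with a nonzero value in the shard."""
    features = set()
    for sample in local_data:
        features.update(k for k, v in sample.items() if v != 0)
    return features

def driver_merge(local_results):
    """Driver-side: merge per-executor results into one global result."""
    merged = set()
    for result in local_results:
        merged |= result
    return merged

# Three "executors", each holding one shard of the training data.
shards = [
    [{"age": 40, "smoke": 1}, {"age": 31, "smoke": 0}],
    [{"age": 0, "smoke": 1}],
    [{"age": 55, "smoke": 0}],
]

local_results = [local_feature_selection(s) for s in shards]  # map
merged = driver_merge(local_results)                          # reduce
broadcast = [merged for _ in shards]                          # broadcast
print(sorted(merged))
```

Only the small `local_results` and `merged` objects cross the executor/driver boundary; the shards themselves never move, which is the whole point of the flow above.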

33 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

34 © NEC Corporation 2016

Wait Time for Other Executors Delays Execution

Machine learning tends to produce an unbalanced computational load, so executors that finish early sit idle waiting for the slowest one.

[Diagram: three executor timelines; one runs far longer than the others.]

35 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Each executor optimizes all predictive formulas on an equally-divided share of the training data.

[Diagram: executor timelines before balancing.]

36 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Each executor optimizes all predictive formulas on an equally-divided share of the training data.

[Diagram: executor timelines after balancing; all executors finish together.]

Completion time reduced!
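The benefit of equal sharding can be seen with a toy makespan calculation: completion time is set by the slowest executor. The segment sizes below are invented, and one time unit per sample is assumed.

```python
# Completion time = time of the slowest executor (the others wait).
# Per-segment assignment: executor i optimizes segment i's formula over
# that whole segment, so skewed segment sizes skew the load.
# Equal sharding: every executor gets an equal slice of the data and
# optimizes all formulas on it, so the load is balanced.
# Segment sizes are invented; assume 1 time unit per sample.

segment_sizes = [900, 60, 25, 15]   # samples per data segment (skewed)
num_executors = 4

# One segment per executor: the makespan is the biggest segment.
per_segment_makespan = max(segment_sizes)

# Equal sharding: each executor processes total / num_executors samples.
total = sum(segment_sizes)
equal_shard_makespan = total / num_executors

print(per_segment_makespan, equal_shard_makespan)  # 900 vs 250.0
```

With the skew above, balancing cuts the makespan from 900 to 250 time units, matching the "completion time reduced" effect on the slide.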

37 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

38 © NEC Corporation 2016

Machine Learning Consists of Many Matrix Computations

[Diagram: each step of the loop (data segmentation, feature selection, pruning) is dominated by basic matrix computation, matrix inversion, combinatorial optimization and similar CPU-heavy kernels.]

39 © NEC Corporation 2016

RDD Design for Leveraging High-Speed Libraries

Straightforward RDD design: one training sample per RDD element. map() wraps every sample in its own tiny matrix object, so the matrix libraries are invoked once per sample.

Distributed HML’s RDD design: one partition holds a sequence of training samples, seq(Training sample1, Training sample2, …). mapPartitions() packs the whole partition into a single matrix object, which is handed to the matrix computation libraries (Breeze, BLAS) in one call.
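The contrast between the two designs can be mimicked without Spark: a map-style pass creates one tiny matrix per sample, while a mapPartitions-style pass sees the whole partition's iterator and packs it into a single matrix before any library call. Nested Python lists stand in for real Breeze/BLAS matrices here.

```python
# Straightforward design: one matrix object per training sample (map),
# so a fast matrix library would be called once per sample.
# Distributed HML design: one matrix object per partition
# (mapPartitions), so one library call covers many samples.
# Nested lists stand in for real Breeze/BLAS matrices.

partition = [      # one RDD partition: a sequence of training samples
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# map-style: a 1-row matrix per element.
per_sample_matrices = [[row] for row in partition]

# mapPartitions-style: the whole partition becomes one n x d matrix.
def pack_partition(samples_iter):
    matrix = [row for row in samples_iter]  # single n x d matrix
    yield matrix                            # single output element

per_partition_matrices = list(pack_partition(iter(partition)))

print(len(per_sample_matrices), len(per_partition_matrices))  # 3 1
```

One large matrix per partition is what lets a BLAS-backed library amortize its per-call overhead across the whole shard instead of paying it per sample.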

40 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

[The data segmentation, feature selection and pruning loop unrolls into a chain of RDD operations:]

RDD.map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(PruningMapFunc)

41 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

[The same chain keeps growing with every iteration; replaying the long lineage triggers heavy recomputation and GC, wasting CPU time.]

42 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Cut long-chained operations periodically with checkpoint():

savedRDD = RDD.map(TreeMapFunc)
              .map(FeatureSelectionMapFunc)
              .reduce(TreeReduceFunc)
              .reduce(FeatureSelectionReduceFunc)
savedRDD.checkpoint()

savedRDD.map(TreeMapFunc)
        .map(FeatureSelectionMapFunc)
        .map(PruningMapFunc)

Later recomputations replay only the operations after the last checkpoint.
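The payoff of checkpoint() can be modeled without Spark: treat the lineage as a count of operations that must be replayed to recompute the latest RDD, and reset it whenever a checkpoint materializes the data. The operation count and checkpoint interval below are illustrative.

```python
# Toy model of RDD lineage: every operation lengthens the chain that
# must be replayed to recompute the latest RDD; checkpoint() saves the
# data to stable storage and truncates the chain.
# The op count and checkpoint interval are illustrative.

def replay_length(num_ops, checkpoint_every=None):
    """Ops that must be replayed to recompute the final RDD once."""
    lineage = 0
    for i in range(1, num_ops + 1):
        lineage += 1  # one more map/reduce in the chain
        if checkpoint_every and i % checkpoint_every == 0:
            lineage = 0  # checkpoint(): lineage cut here
    return lineage

print(replay_length(10))     # 10: the whole chain is recomputed
print(replay_length(10, 4))  # 2: only ops since the last checkpoint
```

Without checkpoints the replay cost grows linearly with the number of HML iterations; with a fixed checkpoint interval it stays bounded by that interval.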

Benchmark Performance Evaluations

44 © NEC Corporation 2016

Eval 1: Prediction Error (vs. Spark MLlib algorithms)

Distributed HML achieves low error, competitive with more complex models.

data                          # samples     Distributed HML   Logistic Regression   Decision Tree   Random Forests
gas sensor array (CO)*         4,208,261    0.542 (1st)       0.597                 0.587           0.576
household power consumption*   2,075,259    0.524 (1st)       0.531                 0.529           0.655
HIGGS*                        11,000,000    0.335 (2nd)       0.358                 0.337           0.317
HEPMASS*                       7,000,000    0.156 (1st)       0.163                 0.167           0.175

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

45 © NEC Corporation 2016

Eval 2: Performance Scalability (vs. Spark MLlib algos)

Distributed HML’s training time scales with the number of CPU cores, competitive with Spark MLlib implementations.

[Charts: training time vs. # CPU cores for Distributed HML and MLlib Logistic Regression on HIGGS* (11M samples) and HEPMASS* (7M samples).]

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

Evaluation in Real Case

47 © NEC Corporation 2016

ATM Cash Balance Prediction

Too much cash: high inventory cost. Too little cash: high risk of running out of stock.

Around 20,000 ATMs

48 © NEC Corporation 2016

Training Speed

Serial HML*: 9 days → Distributed HML: 2 hours (110x speed-up)

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M

▌Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

* Run w/ 1 CPU core and 256GB memory

49 © NEC Corporation 2016

Prediction Error

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 23M

▌Cluster spec (10 nodes)
# CPU cores: 128
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

[Chart: per-ATM prediction error, Distributed HML vs. Serial HML.]

Average error: 17% lower with Distributed HML

NEC’s HML Data Analytics Appliance powered by Micro Modular Server

51 © NEC Corporation 2016

Dilemma between Cloud and On-Premises

Cloud
Pros: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort.
Cons: high pay-for-use charge; data leakage risk; complicated application software design.

On-Premises
Pros: no pay-for-use charge; low data leakage risk; simple application software design.
Cons: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost.

52 © NEC Corporation 2016

NEC’s HML Data Analytics Appliance: Concept

Micro Modular Server

Plug-and-Analytics for the data scientist:

•Quick deploy!
•Low data leakage risk
•Low operation effort for Spark/YARN/HDFS, powered by HDP
•Significant space savings
•Low power cost
•Simple application software

328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!

53 © NEC Corporation 2016

Speed Test (ATM Cash Balance Prediction)

Standard Cluster: 7275 sec (~2h) → Micro Modular Server: 4529 sec (~1.25h), a 1.6x speed-up

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M

▌Standard Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

54 © NEC Corporation 2016

Toward More Big Data!

Micro Modular Server (DX1000) → Scalable Modular Server (DX2000)

• CPU: 2.3x speed-up
• Memory: 1.7x increase
• SSD: 3.3x increase

272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space

55 © NEC Corporation 2016

Summary

▌NEC’s Heterogeneous Mixture Learning (HML) automatically derives accurate and transparent predictive formulas from Big Data.
▌Distributed HML runs HML on Spark using three key techniques: merging small models instead of shuffling data, balancing the computational load across executors, and designing RDDs to leverage fast matrix libraries (with periodic checkpoint() to cut long lineages).
▌NEC’s HML Data Analytics Appliance, powered by the Micro Modular Server, packs the whole stack into a plug-and-analytics box.

Appendix

59 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Server Module

60 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Module Enclosure

61 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Server Module

62 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Module Enclosure