Big Data Heterogeneous Mixture Learning on Spark


Masato Asahara and Ryohei Fujimaki

NEC Data Science Research Labs.

Jun/30/2016 @Hadoop Summit 2016

2 © NEC Corporation 2016

Who We Are

▌Masato Asahara (Ph.D.)
Researcher, NEC Data Science Research Laboratory

Masato received his Ph.D. from Keio University and has worked at NEC for six years in the research field of distributed computing systems and computing-resource management technologies. He currently leads the development of Spark-based machine learning and data analytics systems, particularly those using NEC’s Heterogeneous Mixture Learning technology.

▌Ryohei Fujimaki (Ph.D.)
Research Fellow, NEC Data Science Research Laboratory

Ryohei is responsible for the core technologies behind NEC’s leading predictive and prescriptive analytics solutions, including the “heterogeneous mixture learning” and “predictive optimization” technologies. In addition to technology R&D, he co-develops cutting-edge advanced analytics solutions with NEC’s global business clients and partners in the North American and APAC regions.

Ryohei received his Ph.D. from the University of Tokyo and became the youngest research fellow ever in NEC Corporation’s 115-year history.

3 © NEC Corporation 2016

Agenda

▌NEC’s Predictive Analytics and Heterogeneous Mixture Learning (HML)
▌Distributed HML on Spark
▌3 Technical Key Points to Run ML Fast on Spark
▌Benchmark Performance Evaluations
▌Evaluation in Real Case
▌NEC’s HML Data Analytics Appliance powered by Micro Modular Server

NEC’s Predictive Analytics and Heterogeneous Mixture Learning

8 © NEC Corporation 2016

Enterprise Applications of HML

Driver Risk Assessment

Inventory Optimization

Churn Retention

Predictive Maintenance

Product Price Optimization

Sales Optimization

Energy/Water Operation Mgmt.

9 © NEC Corporation 2016

NEC’s Heterogeneous Mixture Learning (HML)

NEC’s machine learning that automatically derives “accurate” and “transparent” formulas behind Big Data.

[Figure: heterogeneous mixture data (Mon … Sun) receive a transparent data segmentation, and a separate predictive formula is explored for each segment, e.g.

Y = a1x1 + … + anxn
Y = a1x1 + … + knx3
Y = b1x2 + … + hnxn
Y = c1x4 + … + fnxn

exploring massive numbers of candidate formulas.]

10 © NEC Corporation 2016

The HML Model

[Figure: a rule-based tree first splits on gender, then on height (tall/short); each leaf segment gets its own formula over features such as age, smoking, food, weight, sports, tea/coffee and beer/wine.]

The health risk of a tall man can be predicted by age, alcohol and smoking!
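As a concrete illustration, a heterogeneous-mixture model of this shape can be written down in a few lines: a rule-based tree routes each sample to a segment, and each segment owns a sparse linear formula. The gender/height split, the feature names and every coefficient below are invented for illustration; they are not taken from a real HML model.

```python
# A minimal sketch of an HML-style model: a rule-based tree routes a
# sample to a segment; each segment has its own sparse linear formula.
# All rules, features and coefficients here are illustrative only.

SEGMENT_FORMULAS = {
    # segment -> {selected feature: coefficient}
    ("male", "tall"): {"age": 0.4, "alcohol": 0.3, "smoke": 0.5},
    ("male", "short"): {"age": 0.2, "sports": -0.3},
    ("female", "tall"): {"food": 0.1, "tea_coffee": -0.05},
    ("female", "short"): {"age": 0.3, "smoke": 0.4},
}

def segment_of(sample):
    """Rule-based tree: split on gender, then on height."""
    height = "tall" if sample["height_cm"] >= 170 else "short"
    return (sample["gender"], height)

def predict(sample):
    """Apply the segment's formula: y = sum(coef * feature)."""
    formula = SEGMENT_FORMULAS[segment_of(sample)]
    return sum(coef * sample[feat] for feat, coef in formula.items())

x = {"gender": "male", "height_cm": 182, "age": 45, "alcohol": 2, "smoke": 1}
print(predict(x))  # risk score from the (male, tall) formula
```

Because the model is just a small tree plus a few coefficients per leaf, both the segmentation and each prediction stay human-readable, which is the "transparent" half of the claim above.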

11 © NEC Corporation 2016

HML (Heterogeneous Mixture Learning)


13 © NEC Corporation 2016

HML Algorithm

Segment the data by a rule-based tree.

[Diagram: training data cycle through Data Segmentation, Feature Selection and Pruning.]

14 © NEC Corporation 2016

HML Algorithm

Select features and fit a predictive model for each data segment.

[Diagram: training data cycle through Data Segmentation, Feature Selection and Pruning.]

15 © NEC Corporation 2016


16 © NEC Corporation 2016

Demand for Big Data Analysis

Energy/Water Demand Forecasting, Sales Forecasting:
5 million (customers) × 12 (months) = 60,000,000 training samples

17 © NEC Corporation 2016

Demand for Big Data Analysis

100 (items) × 1,000 (shops) = 100,000 predictive models

18 © NEC Corporation 2016


Distributed HML on Spark

20 © NEC Corporation 2016

Why Spark, not Hadoop?

Because Spark’s in-memory architecture keeps intermediate results in memory across HML’s iterations of data segmentation, feature selection and pruning, so the whole loop runs faster.

[Diagram: executors’ CPUs repeat data segmentation and feature selection, with pruning in between.]

21 © NEC Corporation 2016

Data Scalability powered by Spark

Handle training data at any scale by adding executors.

[Diagram: the driver coordinates executors that read training data from HDFS; adding executors adds memory and CPU power.]

22 © NEC Corporation 2016

Distributed HML Engine: Architecture

[Diagram: a data scientist uses the Distributed HML Interface (Python), which invokes the Distributed HML Core (Scala); the core runs on YARN, reads training data from HDFS, calls matrix computation libraries (Breeze, BLAS), and outputs the HML model.]

23 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

1. Avoid data shuffling
2. Balance the computational load across executors
3. Design RDDs to leverage high-speed matrix computation libraries

Challenge: Avoid Data Shuffling

[Diagram: a naïve design shuffles TB-scale training data among executors; Distributed HML instead has each executor send its model, only 10KB~10MB, to the driver, which merges them (HML Model Merge Algorithm).]
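A rough back-of-envelope, using only the sizes quoted on this slide (training data on the order of a TB, models of 10 KB~10 MB), shows why shipping models beats shuffling data. The executor count below is an assumed round number, not from the deck.

```python
# Back-of-envelope: per-iteration network traffic of the naive design
# (reshuffle training data) vs. the model-merge design (ship models).
# Sizes restate the slide's rough figures; they are not measurements.

TB = 1024**4
MB = 1024**2

training_data_bytes = 1 * TB   # "Training data: TB~"
model_bytes = 10 * MB          # "Model: 10KB~10MB" (upper end)
num_executors = 100            # assumed cluster size

# Naive: each iteration moves the training data across the network.
naive_traffic = training_data_bytes

# Model merge: each executor sends its small model to the driver, and
# the driver broadcasts one merged model back to every executor.
hml_traffic = num_executors * model_bytes + num_executors * model_bytes

print(naive_traffic // hml_traffic)  # traffic reduction factor
```

Even with pessimistic model sizes, the model-merge design moves hundreds of times fewer bytes per iteration, and the gap widens as the training data grows.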

26 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors build a rule-based tree from their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation

27 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the trees

Driver

Executor

Feature Selection

Pruning

Data Segmentation

HML Segment Merge Algorithm

28 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged tree to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

29 © NEC Corporation 2016

Execution Flow of Distributed HML

Executors perform feature selection with their local data in parallel

Driver

Executor

Feature Selection

Pruning

Data Segmentation


30 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation


HML Feature Merge Algorithm

31 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver merges the results of feature selection

Driver

Executor

Feature Selection

Pruning

Data Segmentation

HML Feature Merge Algorithm

32 © NEC Corporation 2016

Execution Flow of Distributed HML

Driver broadcasts the merged results of feature selection to executors

Driver

Executor

Feature Selection

Pruning

Data Segmentation

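Putting the flow on slides 26–33 together, each round is map (local work on executors), reduce (merge on the driver), then broadcast. The sketch below mimics that loop in plain Python, with lists standing in for executors; the "keep features that vary locally" rule and the set-union merge are placeholders, not the actual HML segment/feature merge algorithms.

```python
# Sketch of one Distributed HML round: executors compute results from
# their local shards, the driver merges the small per-executor results,
# and the merged result is broadcast back to every executor.
# The selection rule and union merge are illustrative placeholders.

def local_feature_selection(local_data):
    """Executor-side: keep features with a nonzero value in the shard."""
    features = set()
    for sample in local_data:
        features.update(k for k, v in sample.items() if v != 0)
    return features

def driver_merge(local_results):
    """Driver-side: merge per-executor results into one global result."""
    merged = set()
    for result in local_results:
        merged |= result
    return merged

# Three "executors", each holding one shard of the training data.
shards = [
    [{"age": 40, "smoke": 1}, {"age": 31, "smoke": 0}],
    [{"age": 0, "smoke": 1}],
    [{"age": 55, "smoke": 0}],
]

local_results = [local_feature_selection(s) for s in shards]  # map
merged = driver_merge(local_results)                          # reduce
broadcast = [merged for _ in shards]                          # broadcast
print(sorted(merged))
```

Only the small `local_results` and `merged` objects cross the executor/driver boundary; the shards themselves never move, which is the whole point of the flow above.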

33 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

34 © NEC Corporation 2016

Wait Time for Other Executors Delays Execution

Machine learning tends to produce an unbalanced computational load, so executors that finish early sit idle waiting for the slowest one.

[Diagram: three executor timelines; one runs far longer than the others.]

35 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Each executor optimizes all predictive formulas on an equally-divided share of the training data.

[Diagram: executor timelines before balancing.]

36 © NEC Corporation 2016

Balance Computational Effort for Each Executor

Each executor optimizes all predictive formulas on an equally-divided share of the training data.

[Diagram: executor timelines after balancing; all executors finish together.]

Completion time reduced!
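The benefit of equal sharding can be seen with a toy makespan calculation: completion time is set by the slowest executor. The segment sizes below are invented, and one time unit per sample is assumed.

```python
# Completion time = time of the slowest executor (the others wait).
# Per-segment assignment: executor i optimizes segment i's formula over
# that whole segment, so skewed segment sizes skew the load.
# Equal sharding: every executor gets an equal slice of the data and
# optimizes all formulas on it, so the load is balanced.
# Segment sizes are invented; assume 1 time unit per sample.

segment_sizes = [900, 60, 25, 15]   # samples per data segment (skewed)
num_executors = 4

# One segment per executor: the makespan is the biggest segment.
per_segment_makespan = max(segment_sizes)

# Equal sharding: each executor processes total / num_executors samples.
total = sum(segment_sizes)
equal_shard_makespan = total / num_executors

print(per_segment_makespan, equal_shard_makespan)  # 900 vs 250.0
```

With the skew above, balancing cuts the makespan from 900 to 250 time units, matching the "completion time reduced" effect on the slide.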

37 © NEC Corporation 2016

3 Technical Key Points to Run ML Fast on Spark

38 © NEC Corporation 2016

Machine Learning Consists of Many Matrix Computations

[Diagram: each step of the loop (data segmentation, feature selection, pruning) is dominated by basic matrix computation, matrix inversion, combinatorial optimization and similar CPU-heavy kernels.]

39 © NEC Corporation 2016

RDD Design for Leveraging High-Speed Libraries

Straightforward RDD design: one training sample per RDD element. map() wraps every sample in its own tiny matrix object, so the matrix libraries are invoked once per sample.

Distributed HML’s RDD design: one partition holds a sequence of training samples, seq(Training sample1, Training sample2, …). mapPartitions() packs the whole partition into a single matrix object, which is handed to the matrix computation libraries (Breeze, BLAS) in one call.
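The contrast between the two designs can be mimicked without Spark: a map-style pass creates one tiny matrix per sample, while a mapPartitions-style pass sees the whole partition's iterator and packs it into a single matrix before any library call. Nested Python lists stand in for real Breeze/BLAS matrices here.

```python
# Straightforward design: one matrix object per training sample (map),
# so a fast matrix library would be called once per sample.
# Distributed HML design: one matrix object per partition
# (mapPartitions), so one library call covers many samples.
# Nested lists stand in for real Breeze/BLAS matrices.

partition = [      # one RDD partition: a sequence of training samples
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# map-style: a 1-row matrix per element.
per_sample_matrices = [[row] for row in partition]

# mapPartitions-style: the whole partition becomes one n x d matrix.
def pack_partition(samples_iter):
    matrix = [row for row in samples_iter]  # single n x d matrix
    yield matrix                            # single output element

per_partition_matrices = list(pack_partition(iter(partition)))

print(len(per_sample_matrices), len(per_partition_matrices))  # 3 1
```

One large matrix per partition is what lets a BLAS-backed library amortize its per-call overhead across the whole shard instead of paying it per sample.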

40 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

[The data segmentation, feature selection and pruning loop unrolls into a chain of RDD operations:]

RDD.map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(TreeMapFunc)
   .map(FeatureSelectionMapFunc)
   .reduce(TreeReduceFunc)
   .reduce(FeatureSelectionReduceFunc)
   .map(PruningMapFunc)

41 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Long-chained RDD operations cause high-cost recalculations

[The same chain keeps growing with every iteration; replaying the long lineage triggers heavy recomputation and GC, wasting CPU time.]

42 © NEC Corporation 2016

Performance Degradation by Long-Chained RDD

Cut long-chained operations periodically with checkpoint():

savedRDD = RDD.map(TreeMapFunc)
              .map(FeatureSelectionMapFunc)
              .reduce(TreeReduceFunc)
              .reduce(FeatureSelectionReduceFunc)
savedRDD.checkpoint()

savedRDD.map(TreeMapFunc)
        .map(FeatureSelectionMapFunc)
        .map(PruningMapFunc)

Later recomputations replay only the operations after the last checkpoint.
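The payoff of checkpoint() can be modeled without Spark: treat the lineage as a count of operations that must be replayed to recompute the latest RDD, and reset it whenever a checkpoint materializes the data. The operation count and checkpoint interval below are illustrative.

```python
# Toy model of RDD lineage: every operation lengthens the chain that
# must be replayed to recompute the latest RDD; checkpoint() saves the
# data to stable storage and truncates the chain.
# The op count and checkpoint interval are illustrative.

def replay_length(num_ops, checkpoint_every=None):
    """Ops that must be replayed to recompute the final RDD once."""
    lineage = 0
    for i in range(1, num_ops + 1):
        lineage += 1  # one more map/reduce in the chain
        if checkpoint_every and i % checkpoint_every == 0:
            lineage = 0  # checkpoint(): lineage cut here
    return lineage

print(replay_length(10))     # 10: the whole chain is recomputed
print(replay_length(10, 4))  # 2: only ops since the last checkpoint
```

Without checkpoints the replay cost grows linearly with the number of HML iterations; with a fixed checkpoint interval it stays bounded by that interval.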

Benchmark Performance Evaluations

44 © NEC Corporation 2016

Eval 1: Prediction Error (vs. Spark MLlib algorithms)

Distributed HML achieves low error, competitive with more complex models.

data                          # samples     Distributed HML   Logistic Regression   Decision Tree   Random Forests
gas sensor array (CO)*         4,208,261    0.542 (1st)       0.597                 0.587           0.576
household power consumption*   2,075,259    0.524 (1st)       0.531                 0.529           0.655
HIGGS*                        11,000,000    0.335 (2nd)       0.358                 0.337           0.317
HEPMASS*                       7,000,000    0.156 (1st)       0.163                 0.167           0.175

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

45 © NEC Corporation 2016

Eval 2: Performance Scalability (vs. Spark MLlib algos)

Distributed HML’s training time scales with the number of CPU cores, competitive with Spark MLlib implementations.

[Charts: training time vs. # CPU cores for Distributed HML and MLlib Logistic Regression on HIGGS* (11M samples) and HEPMASS* (7M samples).]

* UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)

Evaluation in Real Case

47 © NEC Corporation 2016

ATM Cash Balance Prediction

Too much cash: high inventory cost. Too little cash: high risk of running out of stock.

Around 20,000 ATMs

48 © NEC Corporation 2016

Training Speed

Serial HML*: 9 days → Distributed HML: 2 hours (110x speed-up)

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M

▌Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

* Run w/ 1 CPU core and 256GB memory

49 © NEC Corporation 2016

Prediction Error

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 23M

▌Cluster spec (10 nodes)
# CPU cores: 128
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

[Chart: per-ATM prediction error, Distributed HML vs. Serial HML.]

Average error: 17% lower with Distributed HML

NEC’s HML Data Analytics Appliance powered by Micro Modular Server

51 © NEC Corporation 2016

Dilemma between Cloud and On-Premises

Cloud
Pros: quick start; low operation effort for Spark/YARN/HDFS; no space/power cost; no installation effort.
Cons: high pay-for-use charge; data leakage risk; complicated application software design.

On-Premises
Pros: no pay-for-use charge; low data leakage risk; simple application software design.
Cons: complicated operation for Spark/YARN/HDFS; high installation cost; high space/power cost.

52 © NEC Corporation 2016

NEC’s HML Data Analytics Appliance: Concept

Micro Modular Server

Plug-and-Analytics for the data scientist:

•Quick deploy!
•Low data leakage risk
•Low operation effort for Spark/YARN/HDFS, powered by HDP
•Significant space savings
•Low power cost
•Simple application software

328 CPU cores, 1.3TB memory and 5TB SSD in 2U space!

53 © NEC Corporation 2016

Speed Test (ATM Cash Balance Prediction)

Standard Cluster: 7275 sec (~2h) → Micro Modular Server: 4529 sec (~1.25h), a 1.6x speed-up

▌Summary of data
# ATMs: around 20,000 (in Japan)
# training samples: around 10M

▌Standard Cluster spec (10 nodes)
# CPU cores: 128 (Xeon E5-2640 v3)
Memory: 2.5TB
Spark 1.6.1, Hadoop 2.7.1 (HDP 2.3.2)

54 © NEC Corporation 2016

Toward More Big Data!

Micro Modular Server (DX1000) → Scalable Modular Server (DX2000)

• CPU: 2.3x speed-up
• Memory: 1.7x increase
• SSD: 3.3x increase

272 high-speed Xeon-D 2.1GHz cores, 2TB memory and 17TB SSD in 3U space

55 © NEC Corporation 2016

Summary

▌NEC’s Heterogeneous Mixture Learning (HML) automatically derives accurate and transparent predictive formulas from Big Data.
▌Distributed HML runs HML on Spark using three key techniques: merging small models instead of shuffling data, balancing the computational load across executors, and designing RDDs to leverage fast matrix libraries (with periodic checkpoint() to cut long lineages).
▌NEC’s HML Data Analytics Appliance, powered by the Micro Modular Server, packs the whole stack into a plug-and-analytics box.

Appendix

59 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Server Module

60 © NEC Corporation 2016

Micro Modular Server (DX1000) Spec.: Module Enclosure

61 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Server Module

62 © NEC Corporation 2016

Scalable Modular Server (DX2000) Spec.: Module Enclosure