Impact of hybrid optimization strategies on distributed machine learning algorithms
Prateek Gaur
Technische Universität Berlin, IT4BI
Thesis Advisors: Max Heimel and Christoph Boden
Thesis Supervisor: Dr. Volker Markl
Thesis defense, September 4, 2014
Prateek Gaur Hybrid Optimization September 4, 2014 1 / 27
Motivation
Close enough is not good enough!!
Hadoop is slower than it has to be
Performance tuning and optimizations
Hadoop and its successors are still interesting
Computation is cheap
Large-scale iterative tasks pose a major challenge
Existing MR-based solutions trade off performance against accuracy
Outline
1 Introduction
  What is Large Scale Machine Learning?
  Contributions
  Iterations and MapReduce
  Why a new Technique?
2 Proposed Approach
  Existing Approaches
  Hybrid Training for Clustering
3 Evaluation
  Where do our datasets come from?
  A sample Use case
  Results
4 Summary
Large Scale ML
Descriptive => Predictive Analytics
Training examples can't fit on a single machine
100+ features present per example
Training data arrives continuously
Subsampling undesirable
Contributions
Proposed and tested a large-scale supervised and unsupervised ML technique using existing solutions
Can solve a variety of use cases by adjusting various properties
Simple architecture, built from existing techniques
Platform agnostic at the algorithmic level: not comparing different computing platforms [1]
No custom distributed computing
Iterative MapReduce
What are Iterative Learning Algorithms?
PageRank, KMeans, Neural Networks
Shortcomings
High startup costs
Awkward state retainment
Single Reducer Problem, Straggler Effect
MapReduce is not designed to run iterations efficiently
Existing Techniques: Global/Batch vs. Local/Online
"When all you have is a hammer, then get rid of everything that's not a nail!" [J. Lin, Twitter]
Can't give up on accuracy: YouTube recommendations
Can't give up on speed: Twitter's trending topics
Proposed Technique: Hybrid
1 Examples divided uniformly across all the nodes
2 Single online pass
3 Centrally, average the online results to get a warm start for batch optimization
4 Centralized batch step across the cluster
Logistic Regression
Machine Learning as an optimization problem
Numerical Optimizations: Existing Solutions
Gradient Descent (batch learning):
$w^{(t+1)} = w^{(t)} + \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla l(h(x_i; \theta^{(t)}), y_i)$

Stochastic Gradient Descent (online learning):
$w^{(t+1)} = w^{(t)} + \gamma^{(t)} \nabla l(h(x; \theta^{(t)}), y)$
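As a toy contrast of the two update rules, here is a least-squares instance, where l(w) = ½(x·w − y)² and ∇l = (x·w − y)·x (step sizes, data, and names are invented for the sketch):

```python
import numpy as np

def batch_gd_step(w, X, y, lr):
    """Batch: average the gradient over all n examples per update."""
    grad = (X @ w - y) @ X / len(y)
    return w - lr * grad

def sgd_step(w, x_i, y_i, lr):
    """Online: take the gradient of a single example per update."""
    return w - lr * (x_i @ w - y_i) * x_i

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                      # noiseless linear data

w_batch = np.zeros(2)
for _ in range(100):                # many passes over all the data
    w_batch = batch_gd_step(w_batch, X, y, lr=0.5)

w_sgd = np.zeros(2)
for x_i, y_i in zip(X, y):          # one cheap online pass
    w_sgd = sgd_step(w_sgd, x_i, y_i, lr=0.05)
```

Both recover the true weights here; the point is that SGD needs only one sequential pass, which is why it suits the local/online stage.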
Solves the iteration problem, but what about the Single Reducer problem?
Ensembles
Set of classifiers outperform a single classifier
Learn multiple alternative models for a single concept
Combine their decisions into a final decision
Some are sequential, not MR friendly
Boosting
Others rely on randomization
Bagging
Embarrassingly parallel
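A minimal bagging sketch, assuming decision stumps as the base learner (helper names and data are invented; in a distributed setting each member would train on a separate worker):

```python
import numpy as np

def train_stump(X, y):
    """Pick the feature/threshold pair with the lowest training error."""
    best = (0, 0.0, 1.0)  # (feature, threshold, error)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            err = np.mean((X[:, f] > t) != y)
            if err < best[2]:
                best = (f, t, err)
    return best[:2]

def bagged_predict(models, X):
    """Majority vote across the ensemble members."""
    votes = np.mean([(X[:, f] > t) for f, t in models], axis=0)
    return votes > 0.5

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X[:, 0] > 0.2

models = []
for _ in range(8):                         # members are independent,
    idx = rng.integers(0, len(X), len(X))  # so training parallelizes
    models.append(train_stump(X[idx], y[idx]))  # bootstrap sample

pred = bagged_predict(models, X)
```

Because each bootstrap sample and model is independent, bagging maps cleanly onto MapReduce, unlike the sequential dependencies in boosting.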
Classification: Hybrid Training Architecture
KMeans
Group n d-dimensional data points into k disjoint sets $S_1, \dots, S_k$ to minimize the within-cluster sum of squares $\sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2$
Batch Learning: Lloyd’s KMeans
Online Learning: Streaming KMeans
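The two variants can be contrasted in a few lines (simplified sketches; the online rule below is the classic running-mean update, not Mahout's exact StreamingKMeans, and the data is synthetic):

```python
import numpy as np

def lloyd_kmeans(X, centers, iters=10):
    """Batch: assign every point, then recompute every centre."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def streaming_kmeans(X, centers):
    """Online: one pass, nudging the nearest centre toward each point."""
    counts = np.ones(len(centers))
    for x in X:
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # running mean
    return centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (100, 2)),   # two well-separated blobs
               rng.normal(3, 1, (100, 2))])
init = np.vstack([X[0], X[100]]).copy()       # one seed per blob

c_online = streaming_kmeans(X, init.copy())   # fast single pass
c_batch = lloyd_kmeans(X, c_online.copy())    # refined from warm start
```

This mirrors the hybrid idea for clustering: the cheap online pass supplies a warm start that the batch (Lloyd's) iterations then refine.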
Clustering: Hybrid Training
Datasets
Example Usecase: Named Entity Recognition
Identify persons from the ClueWeb09 dataset¹
¹ J. Callan, M. Hoy, C. Yoo, and L. Zhao. ClueWeb09 data set, 2009.
Accuracy Paradox -> evaluate with F-score instead of raw accuracy
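A tiny worked example of why raw accuracy misleads on imbalanced NER-style data (the counts are invented for illustration):

```python
def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 1000 tokens, only 10 of which are person mentions.
# A degenerate "always negative" tagger: tp=0, fp=0, fn=10, tn=990.
accuracy = 990 / 1000            # 99% accurate...
f1 = f_score(tp=0, fp=0, fn=10)  # ...yet F1 = 0: it finds nothing

# A real tagger: tp=8, fp=4, fn=2, tn=986.
accuracy_real = (8 + 986) / 1000  # barely different accuracy (99.4%)
f1_real = f_score(tp=8, fp=4, fn=2)  # F1 = 8/11, clearly better
```

Accuracy ranks the two taggers as nearly identical, while F-score separates them decisively, which is why the evaluation uses F-score.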
Overview of the Results
Tested for Quality and Performance
Online training
  is faster than batch, but suboptimal
  performance and quality can be improved by ensembling, but only up to a limit
Hybrid training
  achieves better performance than batch and online
  achieves better quality faster
  is insensitive to the size and complexity of the dataset and the choice of quality metric
  is sensitive to the choice of hyperparameters
Classification: Precision (Avg. 3 Runs, Ensemble=4, 50 iterations)
Classification: Scaling
Classification: Running Time (Avg. 3 Runs, Ensemble=4, 50 Iterations)
Approx. 1 iteration required to break even
Clustering Quality (Avg. 3 Runs, 10 iterations)
Other Works
Large Scale ML
Google's Sibyl¹: improved iterative computation
Different computation platforms [2]: Spark, Flink, GraphLab
Iterative extensions for Hadoop
HaLoop, Twister, PrIter [3]
Hybrid
Terascale learning²
Twitter's Summingbird³: a framework for integrating batch and online MapReduce computations
¹ T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning. Keynote presentation, Jul 28, 2010.
² A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system.
³ O. Boykin, S. Ritchie, I. O'Connell, and J. Lin. Summingbird: A framework for integrating batch and online MapReduce computations.
Conclusion
Contributions:
Evaluated existing techniques: Online and Batch
Proposed a "hybrid" technique that offers the "best of both worlds"
Different scenarios to show its effectiveness
Identified the shortcomings of the proposed approach against the existing ones
Hybrid approach proves promising
re-uses existing code
platform agnostic
Github: https://github.com/gaurprateek/parallelml
Future Work
Test and contribute to Flink and extend to Spark
Semi-supervised learning and Ranking
Implementation-based optimizations
An extra knob to choose between online and batch
Acknowledgements
Thanks to:
My committee
My Advisors: Max Heimel and Christoph Boden
Dr. Volker Markl, academic supervisor
Dr. Ralf-Detlef Kutsche, program coordinator
Johannes Kirschnick, database queries
Bibliography
[1] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl et al., "The Stratosphere platform for big data analytics," The VLDB Journal, pp. 1–26, 2014.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10.
[3] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.