Impact of hybrid optimization strategies on distributed machine learning algorithms
Prateek Gaur
Technische Universität Berlin, IT4BI
Thesis Advisors: Max Heimel and Christoph Boden
Thesis Supervisor: Dr. Volker Markl
Thesis defense, September 4, 2014
Prateek Gaur Hybrid Optimization September 4, 2014 1 / 27
Motivation
Close enough is not good enough!!
Hadoop is slower than it has to be
Performance tuning and optimizations
Hadoop and its successors are still interesting
Computation is cheap
Large-scale iterative tasks pose a major challenge
Existing MR-based solutions trade off performance against accuracy
Outline
1 Introduction
  What is Large Scale Machine Learning?
  Contributions
  Iterations and MapReduce
  Why a new Technique?
2 Proposed Approach
  Existing Approaches
  Hybrid Training for Clustering
3 Evaluation
  Where do our datasets come from?
  A sample Use case
  Results
4 Summary
Large Scale ML
Descriptive => Predictive Analytics
Training examples can't fit on a single machine
100+ features present per example
Training data arrives continuously
Subsampling undesirable
Contributions
Proposed and tested a large-scale supervised and unsupervised ML technique using existing solutions
Can solve a variety of use cases by adjusting various properties
Simple architecture, built from existing techniques
Platform agnostic at the algorithmic level: not comparing different computing platforms [1]
No custom distributed computing
Iterative MapReduce
What are Iterative Learning Algorithms?
PageRank, KMeans, Neural Networks
Shortcomings
High startup costs
Awkward state retainment
Single Reducer Problem, Straggler Effect
MapReduce is not designed to run iterations efficiently
Existing Techniques: Global/Batch vs. Local/Online
"When all you have is a hammer, then get rid of everything that's not a nail!" [J. Lin, Twitter]
Can't give up on accuracy: YouTube recommendations
Can't give up on speed: Twitter's trending topics
Proposed Technique: Hybrid
1 Examples divided uniformly across all the nodes
2 Single online pass
3 Centrally, average the online results to get a warm start for batch optimization
4 Centralized batch step across the cluster
Logistic Regression
Machine Learning as an optimization problem
Numerical Optimizations: Existing Solutions
Gradient Descent (batch learning):
$w^{(t+1)} = w^{(t)} + \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla l(h(x_i; \theta^{(t)}), y_i)$

Stochastic Gradient Descent (online learning):
$w^{(t+1)} = w^{(t)} + \gamma^{(t)} \nabla l(h(x; \theta^{(t)}), y)$
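As a toy contrast of the two update rules, here is a least-squares instance, where l(w) = ½(x·w − y)² and ∇l = (x·w − y)·x (step sizes, data, and names are invented for the sketch):

```python
import numpy as np

def batch_gd_step(w, X, y, lr):
    """Batch: average the gradient over all n examples per update."""
    grad = (X @ w - y) @ X / len(y)
    return w - lr * grad

def sgd_step(w, x_i, y_i, lr):
    """Online: take the gradient of a single example per update."""
    return w - lr * (x_i @ w - y_i) * x_i

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                      # noiseless linear data

w_batch = np.zeros(2)
for _ in range(100):                # many passes over all the data
    w_batch = batch_gd_step(w_batch, X, y, lr=0.5)

w_sgd = np.zeros(2)
for x_i, y_i in zip(X, y):          # one cheap online pass
    w_sgd = sgd_step(w_sgd, x_i, y_i, lr=0.05)
```

Both recover the true weights here; the point is that SGD needs only one sequential pass, which is why it suits the local/online stage.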
Solves the iteration problem, but what about the Single Reducer problem?
Ensembles
Set of classifiers outperform a single classifier
Learn multiple alternative models for a single concept
Combine their decisions into a final decision
Some are sequential, not MR friendly
Boosting
Others rely on randomization
Bagging
Embarrassingly parallel
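A minimal bagging sketch, assuming decision stumps as the base learner (helper names and data are invented; in a distributed setting each member would train on a separate worker):

```python
import numpy as np

def train_stump(X, y):
    """Pick the feature/threshold pair with the lowest training error."""
    best = (0, 0.0, 1.0)  # (feature, threshold, error)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            err = np.mean((X[:, f] > t) != y)
            if err < best[2]:
                best = (f, t, err)
    return best[:2]

def bagged_predict(models, X):
    """Majority vote across the ensemble members."""
    votes = np.mean([(X[:, f] > t) for f, t in models], axis=0)
    return votes > 0.5

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X[:, 0] > 0.2

models = []
for _ in range(8):                         # members are independent,
    idx = rng.integers(0, len(X), len(X))  # so training parallelizes
    models.append(train_stump(X[idx], y[idx]))  # bootstrap sample

pred = bagged_predict(models, X)
```

Because each bootstrap sample and model is independent, bagging maps cleanly onto MapReduce, unlike the sequential dependencies in boosting.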
Classification: Hybrid Training Architecture
KMeans
Group n d-dimensional data points into k disjoint sets $S_1, \dots, S_k$ to minimize the within-cluster sum of squares $\sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2$
Batch Learning: Lloyd’s KMeans
Online Learning: Streaming KMeans
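The two variants can be contrasted in a few lines (simplified sketches; the online rule below is the classic running-mean update, not Mahout's exact StreamingKMeans, and the data is synthetic):

```python
import numpy as np

def lloyd_kmeans(X, centers, iters=10):
    """Batch: assign every point, then recompute every centre."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def streaming_kmeans(X, centers):
    """Online: one pass, nudging the nearest centre toward each point."""
    counts = np.ones(len(centers))
    for x in X:
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # running mean
    return centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (100, 2)),   # two well-separated blobs
               rng.normal(3, 1, (100, 2))])
init = np.vstack([X[0], X[100]]).copy()       # one seed per blob

c_online = streaming_kmeans(X, init.copy())   # fast single pass
c_batch = lloyd_kmeans(X, c_online.copy())    # refined from warm start
```

This mirrors the hybrid idea for clustering: the cheap online pass supplies a warm start that the batch (Lloyd's) iterations then refine.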
Clustering: Hybrid Training
Datasets
Example Usecase: Named Entity Recognition
Identify persons from the ClueWeb09 dataset¹
¹ J. Callan, M. Hoy, C. Yoo, and L. Zhao. ClueWeb09 data set, 2009.
Accuracy Paradox -> evaluate with F-score instead of raw accuracy
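A tiny worked example of why raw accuracy misleads on imbalanced NER-style data (the counts are invented for illustration):

```python
def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 1000 tokens, only 10 of which are person mentions.
# A degenerate "always negative" tagger: tp=0, fp=0, fn=10, tn=990.
accuracy = 990 / 1000            # 99% accurate...
f1 = f_score(tp=0, fp=0, fn=10)  # ...yet F1 = 0: it finds nothing

# A real tagger: tp=8, fp=4, fn=2, tn=986.
accuracy_real = (8 + 986) / 1000  # barely different accuracy (99.4%)
f1_real = f_score(tp=8, fp=4, fn=2)  # F1 = 8/11, clearly better
```

Accuracy ranks the two taggers as nearly identical, while F-score separates them decisively, which is why the evaluation uses F-score.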
Overview of the Results
Tested for Quality and Performance
Online training
  is faster than batch, but suboptimal
  performance and quality can be improved by ensembling, but only up to a limit
Hybrid training
  achieves better performance than batch and online
  achieves better quality faster
  is insensitive to the size and complexity of the dataset and the choice of quality metric
  is sensitive to the choice of hyperparameters
Classification: Precision (Avg. 3 Runs, Ensemble=4, 50 iterations)
Classification: Scaling
Classification: Running Time (Avg. 3 Runs, Ensemble=4, 50 Iterations)
Approx. 1 iteration required to break even
Clustering Quality (Avg. 3 Runs, 10 iterations)
Other Works
Large Scale ML
Google's Sibyl¹: improved iterative computation
Different computation platforms [2]: Spark, Flink, GraphLab
Iterative extensions for Hadoop
HaLoop, Twister, PrIter [3]
Hybrid
Terascale learning²
Twitter's Summingbird³: a framework for integrating batch and online MapReduce computations
¹ T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning. Keynote presentation, Jul 28, 2010.
² A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system.
³ O. Boykin, S. Ritchie, I. O'Connell, and J. Lin. Summingbird: A framework for integrating batch and online MapReduce computations.
Conclusion
Contributions:
Evaluated existing techniques: Online and Batch
Proposed a "hybrid" technique that offers the "best of both worlds"
Different scenarios to show its effectiveness
Identified the shortcomings of the proposed approach against the existing ones
Hybrid approach proves promising
re-uses existing code
platform agnostic
Github: https://github.com/gaurprateek/parallelml
Future Work
Test and contribute to Flink and extend to Spark
Semi-supervised learning and Ranking
Implementation-based optimizations
An extra knob to choose between online and batch
Acknowledgements
Thanks to:
My committee
My Advisors: Max Heimel and Christoph Boden
Dr. Volker Markl, academic supervisor
Dr. Ralf-Detlef Kutsche, program coordinator
Johannes Kirschnick, database queries
Bibliography
[1] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl et al., "The Stratosphere platform for big data analytics," The VLDB Journal, pp. 1–26, 2014.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10.
[3] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.