Automatic optimization of MapReduce Programs. Michael Cafarella, Eaman Jahani, Christopher Re
![Page 1: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/1.jpg)
Automatic optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Re
August 2011
![Page 2: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/2.jpg)
MapReduce is victorious
• Google statistics:

| | Aug 04 | Mar 06 | Sept 07 | May 10 |
|---|---|---|---|---|
| Number of jobs | 29K | 171K | 2127K | 4474K |
| Machine years used | 217 | 2002 | 11081 | 39121 |
| Input Data (TB) | 3,288 | 52,254 | 403,152 | 946,460 |
| Output Data (TB) | 193 | 2,970 | 14,018 | 45,720 |
| Average worker machines | 157 | 268 | 394 | 368 |

• Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
[1] Omer Trajman, Cloudera VP, http://www.dbms2.com/
![Page 3: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/3.jpg)
MapReduce in relational land
• Designers' original intention: free-form data
o Web-scale indexing/log processing
• But many workloads are relational [1]
o Complex queries/data analysis
• Caveat: MR performance lags RDBMS performance
[1] Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010
![Page 4: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/4.jpg)
Selection is Slower with MapReduce
Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
![Page 5: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/5.jpg)
Join is Even Slower
Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
![Page 6: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/6.jpg)
MR Lags in Relational Land
• Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
• Query processing tasks
o No metadata, semantics, indices
o Free-form input is a double-edged sword
[1] MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008
![Page 7: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/7.jpg)
Manimal
• Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques
• These techniques are found today only in RDBMSs, but they should be in MapReduce, too.
![Page 8: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/8.jpg)
Manimal Approach
[Diagram labels: bytecode (*.class), Static Analyzer, optimization opportunities, Optimizer logic, execution path, Execution Framework, MR Engine]
void map(Text key, WebPage w) {
  if (w.rank > 10) emit(w.url, w.rank);
}
SELECTION from B+Tree index on W.RANK
• Challenges:
o Safely detect query semantics that allow an optimization
o How much performance gain?
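To make the selection idea concrete, here is a minimal sketch, not Manimal's actual implementation. It assumes an in-memory `java.util.TreeMap` as a stand-in for the B+Tree index on `rank`; the class and method names (`IndexedSelectionSketch`, `fullScan`, `buildIndex`, `indexedScan`) are hypothetical. Both paths emit the same records, but the indexed path never touches pages whose rank is at or below the threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: a TreeMap stands in for a B+Tree index on rank.
final class IndexedSelectionSketch {
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Unoptimized path: evaluate the map() predicate on every record.
    static List<String> fullScan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    // Build an ordered index from rank to the pages that have that rank.
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Optimized path: visit only index entries with rank strictly above the threshold.
    static List<String> indexedScan(TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<WebPage>> e : index.tailMap(threshold, false).entrySet()) {
            for (WebPage w : e.getValue()) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = Arrays.asList(
            new WebPage("a.example", 3), new WebPage("b.example", 42),
            new WebPage("c.example", 11), new WebPage("d.example", 10));
        System.out.println(fullScan(pages, 10));
        System.out.println(indexedScan(buildIndex(pages), 10));
    }
}
```

The two paths return the same set of records but possibly in a different order, since the index iterates by rank rather than by input position.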
![Page 9: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/9.jpg)
Manimal Contributions
• Our Manimal system:
o Detects safe relational optimizations in users' compiled MapReduce programs
• Our results:
o Runs with unmodified MapReduce code
o Runs up to 11x faster on same code
o Provides framework for more optimizations
![Page 10: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/10.jpg)
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
![Page 11: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/11.jpg)
Execution framework
public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) {
  if (w.rank > 10) emit(w.url, w.rank);
}
![Page 12: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/12.jpg)
Execution Framework
varload 'value'
invokevirtual
astore 'text'
…
ifeq …
Analyzer → Optimizer → Execution
![Page 13: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/13.jpg)
Execution Framework
Analyzer → Optimizer → Execution
Analyzer in: user program
varload 'value'
invokevirtual
astore 'text'
…
ifeq …
Analyzer out: optimization descriptor, index-generation program
o Optimization descriptor: (SELECT f, w.rank>10)
o Index-generation program:
void map(k, w) {
  out.set(indexedOutputFormat);
  emit(w.rank, (k,w));
}
![Page 14: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/14.jpg)
Execution Framework
Analyzer → Optimizer → Execution
Optimizer in: optimization descriptor, catalog
o Optimization descriptor: (SELECT f, w.rank>10)
o Catalog:
/logs/log.1 /logs/log.1.idx select src…
/logs/log.2 /logs/log.2.idx select src…
Optimizer out: execution descriptor
o Execution descriptor: (SELECT, "log.1.idx", w.rank>10)
User program bytecode:
varload 'value'
invokevirtual
astore 'text'
…
ifeq …
![Page 15: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/15.jpg)
Execution Framework
Analyzer → Optimizer → Execution
Execution in: execution descriptor, user program
o Execution descriptor: (SELECT, "log.1.idx", w.rank>10)
o User program bytecode:
varload 'value'
invokevirtual
astore 'text'
…
ifeq …
Execution out: program output
o Program output: numwords 19519
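The hand-off between the optimizer and the execution stage can be sketched as follows. This is an illustration under assumptions, not the system's real data model: the class names, the two-field and three-field descriptor shapes, and the catalog modeled as a plain map from input path to index path are all hypothetical, and the third catalog column from the slides is left out because its content is elided there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the optimizer step: descriptor + catalog -> execution descriptor.
final class OptimizerSketch {
    // Mirrors (SELECT f, w.rank>10): the operation and the predicate found by the analyzer.
    static final class OptimizationDescriptor {
        final String op;
        final String predicate;
        OptimizationDescriptor(String op, String predicate) {
            this.op = op;
            this.predicate = predicate;
        }
    }

    // Mirrors (SELECT, "log.1.idx", w.rank>10): the operation bound to a concrete index file.
    static final class ExecutionDescriptor {
        final String op;
        final String indexFile;
        final String predicate;
        ExecutionDescriptor(String op, String indexFile, String predicate) {
            this.op = op;
            this.indexFile = indexFile;
            this.predicate = predicate;
        }
        @Override public String toString() {
            return "(" + op + ", \"" + indexFile + "\", " + predicate + ")";
        }
    }

    // Look up an index for the input file; with no index, the job runs unoptimized.
    static Optional<ExecutionDescriptor> plan(OptimizationDescriptor d, String inputFile,
                                              Map<String, String> catalog) {
        String indexFile = catalog.get(inputFile);
        if (indexFile == null) return Optional.empty();
        return Optional.of(new ExecutionDescriptor(d.op, indexFile, d.predicate));
    }

    public static void main(String[] args) {
        Map<String, String> catalog = new HashMap<>();
        catalog.put("/logs/log.1", "/logs/log.1.idx");
        catalog.put("/logs/log.2", "/logs/log.2.idx");
        OptimizationDescriptor d = new OptimizationDescriptor("SELECT", "w.rank>10");
        System.out.println(plan(d, "/logs/log.1", catalog));
        System.out.println(plan(d, "/logs/log.3", catalog));
    }
}
```

Returning an empty `Optional` when the catalog has no matching index keeps the fallback explicit: the unmodified user program can still run as a normal MapReduce job.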
![Page 16: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/16.jpg)
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
![Page 17: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/17.jpg)
An Optimization Example
//webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }
//mapper.java
void map(Text key, WebPage w) {
  // Compare string contents with equals(); == would compare references.
  if (w.url.equals("teaparty.fr")) emit(w.url, 1);
}
• Data-centric programming idioms == relational ops
PROJECTED view: (url,null,null)
DIRECT-OP on compressed WebPage
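A small sketch of why the projected view is safe for this mapper, under assumptions: the `ProjectionSketch` class and its `project` and `map` helpers are hypothetical, `rank` is boxed as `Integer` so it can be nulled, and `map` returns the emitted pair as a string instead of calling a real `emit`. Because the mapper reads only `url`, running it on `(url, null, null)` gives the same output as running it on the full record.

```java
// Hypothetical sketch: a mapper that reads only url behaves identically on a projected record.
final class ProjectionSketch {
    static final class WebPage {
        final String url;
        final Integer rank;     // boxed so the projected view can hold null
        final String content;
        WebPage(String url, Integer rank, String content) {
            this.url = url;
            this.rank = rank;
            this.content = content;
        }
    }

    // Projected view (url, null, null): drop every field the mapper never reads.
    static WebPage project(WebPage w) {
        return new WebPage(w.url, null, null);
    }

    // The mapper from the slide; returns the emitted pair, or null when nothing is emitted.
    static String map(WebPage w) {
        return "teaparty.fr".equals(w.url) ? w.url + "\t1" : null;
    }

    public static void main(String[] args) {
        WebPage full = new WebPage("teaparty.fr", 7, "<html>a long page body</html>");
        System.out.println(map(full));
        System.out.println(map(project(full)));
    }
}
```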
![Page 18: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/18.jpg)
Semantic Extraction
• Query semantics are obvious to human readers, but not explicit in the code for the framework
• EXTRACT IT!
o Static code analysis
o Control-flow graph and data-flow graph
o Find opportunities: selection, projection, direct op
o Safe optimizations: same output
![Page 19: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/19.jpg)
Analyzer: An Example
//webpage.java
class WebPage { String url; int rank; String content; }
//mapper.java
void map(Text key, WebPage w) {
  if (w.rank > 10) emit(w.url, w.rank);
}
Analyzer control-flow graph nodes: Fn Entry, w.rank > 10, emit(url,rank), Fn Exit
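One part of the safety check on this control-flow graph can be sketched as a dominance test: the selection is only a candidate if every path from the function entry to the emit passes through the guard. The sketch below is an assumption-laden illustration, not the analyzer's actual algorithm; the graph encoding and the names `GuardedEmitSketch`, `reachableAvoiding`, and `guardDominatesEmit` are hypothetical. A real analysis would also need to confirm that the emit sits on the guard's true branch and that the predicate depends only on the input record.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: check that a guard node dominates the emit node in a tiny CFG.
final class GuardedEmitSketch {
    // Adjacency-list CFG using the node names from the slide.
    static Map<String, List<String>> slideCfg() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("Fn Entry", Arrays.asList("w.rank > 10"));
        g.put("w.rank > 10", Arrays.asList("emit(url,rank)", "Fn Exit")); // true edge, false edge
        g.put("emit(url,rank)", Arrays.asList("Fn Exit"));
        g.put("Fn Exit", Collections.<String>emptyList());
        return g;
    }

    // True if target is reachable from start on some path that never visits blocked.
    static boolean reachableAvoiding(Map<String, List<String>> g, String start,
                                     String target, String blocked) {
        if (start.equals(blocked)) return false;
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(start);
        seen.add(start);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (node.equals(target)) return true;
            for (String next : g.getOrDefault(node, Collections.<String>emptyList())) {
                if (!next.equals(blocked) && seen.add(next)) work.push(next);
            }
        }
        return false;
    }

    // The guard dominates the emit when no entry-to-emit path can bypass the guard.
    static boolean guardDominatesEmit(Map<String, List<String>> g, String entry,
                                      String guard, String emit) {
        return !reachableAvoiding(g, entry, emit, guard);
    }

    public static void main(String[] args) {
        System.out.println(guardDominatesEmit(
            slideCfg(), "Fn Entry", "w.rank > 10", "emit(url,rank)"));
    }
}
```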
![Page 20: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/20.jpg)
Current Optimizations
• B+-Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data
• Hadoop compression is not semantics-aware
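As an illustration of delta compression on numerics, the sketch below stores each value as its difference from the previous one, so slowly changing numeric fields turn into small numbers that a later byte-level encoder could pack tightly. The slides do not describe the exact scheme Manimal uses, so this is the textbook form and the class name `DeltaEncodingSketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of textbook delta encoding for a numeric column.
final class DeltaEncodingSketch {
    // Store each value as its difference from the previous value (the first is relative to 0).
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Rebuild the original values by accumulating the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            values[i] = prev;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] ranks = {100, 103, 104, 110};
        long[] deltas = encode(ranks);
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.toString(decode(deltas)));
    }
}
```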
![Page 21: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/21.jpg)
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
![Page 22: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/22.jpg)
Experiments: Analyzer
• Test MapReduce programs from Pavlo, SIGMOD '09:
• Detected 5 out of 8 opportunities:
o Two misses due to custom serialization class
o Another miss requires knowledge of java.util.Hashtable semantics
![Page 23: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/23.jpg)
Experiments: Performance
• Optimize four Web page handling tasks:
o Selection (filtering)
o Projection (aggregation on subfield of page)
o Join (pages to user visits)
o User Defined Functions (aggregation)
• 5 cluster nodes, 123 GB of data
![Page 24: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/24.jpg)
Experiments: Performance
| Description | Hadoop |
|---|---|
| Selection | 430 s |
| Projection | 5496 s |
| Join | 6078 s |
![Page 25: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/25.jpg)
Experiments: Performance
| Description | Hadoop | Manimal | Speedup |
|---|---|---|---|
| Selection | 430 s | 38 s | 11.2 |
| Projection | 5496 s | 1856 s | 2.96 |
| Join | 6078 s | 904 s | 6.73 |
![Page 26: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/26.jpg)
Experiments: Performance
• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo
• UDF not detected: running time identical

| Description | Hadoop | Manimal | Speedup | Space Overhead |
|---|---|---|---|---|
| Selection | 430 s | 38 s | 11.2 | 0.1% |
| Projection | 5496 s | 1856 s | 2.96 | 20% |
| Join | 6078 s | 904 s | 6.73 | 11.7% |
![Page 27: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/27.jpg)
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
![Page 28: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/28.jpg)
Related Work
• Lots of recent MapReduce activity
o Quincy: Task scheduling (Isard et al., SOSP 2009)
o HadoopDB (Abouzeid et al., PVLDB 2009)
o Hadoop++ (Dittrich et al., PVLDB 2010)
o HaLoop (Bu et al., PVLDB 2010)
o Twister (Ekanayake et al., HPDC 2010)
o Starfish (Herodotou et al., CIDR 2011)
• Manimal does not introduce new optimizations. It detects and applies existing optimizations to code
![Page 29: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/29.jpg)
Lessons Learned
• The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world
• The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)
![Page 30: Automatic optimization of MapReduce Programs Michael Cafarella , Eaman Jahani , Christopher Re](https://reader036.fdocuments.us/reader036/viewer/2022062315/5681665f550346895dd9e7a0/html5/thumbnails/30.jpg)
Conclusion
• Manimal provides a framework for applying well-known optimization techniques to MapReduce
o Automatic optimization of user code
o Up to 11x speed increase
o Provides framework for more optimizations