Big Data Ecosystem & The Stratosphere Project
![Page 1: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/1.jpg)
Stratosphere: Above the Clouds
Massively Parallel Analytics
Alexander Alexandrov, Stephan Ewen, Joseph Harjung, Fabian Hüske, Moritz Kaufmann, Aljoscha Krettek, Volker Markl, Kostas Tzoumas, Sebastian Schelter
![Page 2: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/2.jpg)
Stratosphere – Parallel Analytics Beyond MapReduce
The Big Data Context
- Large Quantities of Data
- Diverse Data Structures
- Complex Analysis Tasks
![Page 9: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/9.jpg)
NoMapReduce
SQL NoSQL
SQL--
?
Question 1: Is it faster to add a HiveQL parser and an HDFS adapter to your favorite parallel database, or develop a parallel engine from scratch?
Question 2: Have we closed the circle (“we want SQL!”) or is there more in analytics?
![Page 18: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/18.jpg)
- scripting
- SQL--
- column store--
- scalable parallel sort
- a query plan
- XQuery+/-
- not a sorting problem!
Question 3: How do we architect systems for the next wave of rich data analysis?
![Page 19: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/19.jpg)
≠
![Page 20: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/20.jpg)
10 commandments for Big Data Analytics
![Page 21: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/21.jpg)
case class Vertex(id: Int, component: Int)
case class Edge(from: Int, to: Int)

val vertices = hdfsFile(…)
val edges = hdfsFile(…)

val result = step iterate (vertices distinctBy {_.id}, vertices)

def step = (s: Data[Vertex], ws: Data[Vertex]) => {
  // Send each workset vertex's component id to its neighbors
  val allNeighbors = ws join edges on {_.id} isEqualTo {_.from} using {(v, e) => Vertex(e.to, v.component)}
  // Keep the smallest candidate component per vertex
  val minNeighbors = allNeighbors reduceBy {_.id} (minBy _.component)
  // Emit a vertex only if the candidate improves on the current solution
  val s1 = minNeighbors join s on {_.id} isEqualTo {_.id} using {(c, o) => if (c.component < o.component) Some(c) else None}
  (s1, s1)
}

(I) Thou shalt…
… use declarative languages!
Executive Summary
Connected components of a graph.
- Joins and aggregations on custom data types
- Incremental / Delta Iterations
- Mixture of operators and UDFs
![Page 23: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/23.jpg)
(II) Thou shalt…
… accept external (dynamic) sources! “In situ” data - no load
![Page 25: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/25.jpg)
(III) Thou shalt…
… use rich primitives! (beyond MapReduce)
Map
Reduce
Cross
Match
CoGroup
![Page 26: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/26.jpg)
(IV) Thou shalt…
… define queries and UDFs in the same language!
UDF
Query definition
![Page 27: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/27.jpg)
(V) Thou shalt…
… use an algebraic but rich data model!
Custom Object Oriented and Functional Data Types
Use functions as references to fields/attributes
![Page 28: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/28.jpg)
(VI) Thou shalt…
… optimize! Auto-parallelization and optimization à la relational databases.
![Page 29: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/29.jpg)
(VII) Thou shalt…
… not treat UDFs as black boxes!
Static code analysis of UDFs to determine field accesses and modifications
Vastly increases optimization potential
![Page 30: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/30.jpg)
(VIII) Thou shalt…
… iterate/recurse!
Step function
Needed for most interesting analysis cases
![Page 31: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/31.jpg)
(IX) Thou shalt…
… exploit dynamic computation!
[Figure: number of vertices (thousands) per superstep, Naïve (Bulk) vs. Incremental]
Pregel as a Stratosphere plan with comparable performance.
![Page 32: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/32.jpg)
(X) Thou shalt…
… use a scalable and efficient execution engine!
Pipeline and data parallelism, flexible checkpointing, optimized network data transfers
![Page 34: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/34.jpg)
Conclusion
Write like a programming language
Execute like a database
Add a bit of "languages and compilers" sauce to the database stack…
![Page 35: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/35.jpg)
Stratosphere Programming Stack
- Nephele Dataflow Engine
- Runtime Operators
- SOPREMO Compiler
- Meteor Script
- Scala
- Scala-Compiler Plugin
- Stratosphere Optimizer
- Nephele Parallel Dataflow
- PACT Program
Layered approach: several entry points to the system
![Page 37: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/37.jpg)
Pact program / Scala program
Scala compiler plug-in
Runtime: hash- and sort-based out-of-core operator implementations, memory management
Stratosphere optimizer: picks data shipping and local strategies, operator order
Execution plan
Nephele Execution Engine: task scheduling, network data transfers, resource allocation, checkpointing
Job graph / Execution graph
![Page 39: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/39.jpg)
Part 1: Parallel Programming Model
![Page 40: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/40.jpg)
Background: PACTs
D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
A second-order function wraps a first-order function (UDF): data in, data out
Second-order functions: Map, Reduce, Cross, Match, CoGroup
![Page 41: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/41.jpg)
Background: PACTs
■ Data flow operators (UDFs) are first-order functions
■ Application of UDFs to the data through second-order functions that define parallel semantics
■ Declarative, as execution strategies are not fixed
Example data flow operators:
Source 1: Extract (A,B)
Source 2: Extract (D,E)
Reduce (on A): sum(B), avg(C)
Match (A = D): if (A>3) emit
Map: C := max(A,B)
Map: if (D>4) emit
Sink 1
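To make the split between first-order UDFs and second-order functions concrete, here is a minimal single-machine sketch on plain Scala collections. It is not the PACT API: the object name `PactSketch` and all signatures are assumptions made for illustration, and a real engine would run each function in parallel with strategies picked by the optimizer.

```scala
object PactSketch {
  // Map: the UDF is applied to every record independently.
  def map[A, B](in: Seq[A])(udf: A => Seq[B]): Seq[B] =
    in.flatMap(udf)

  // Reduce: records are grouped by key; the UDF sees one whole group per call.
  def reduce[A, K, B](in: Seq[A])(key: A => K)(udf: (K, Seq[A]) => Seq[B]): Seq[B] =
    in.groupBy(key).toSeq.flatMap { case (k, group) => udf(k, group) }

  // Cross: the UDF is applied to every pair of the Cartesian product.
  def cross[A, B, C](l: Seq[A], r: Seq[B])(udf: (A, B) => Seq[C]): Seq[C] =
    for (a <- l; b <- r; c <- udf(a, b)) yield c

  // Match: the UDF is applied to every pair of records with equal keys.
  def matchOn[A, B, K, C](l: Seq[A], r: Seq[B])(lk: A => K, rk: B => K)(
      udf: (A, B) => Seq[C]): Seq[C] = {
    val index = r.groupBy(rk)
    for (a <- l; b <- index.getOrElse(lk(a), Seq.empty); c <- udf(a, b)) yield c
  }

  // CoGroup: the UDF is called once per key with the groups of both inputs.
  def coGroup[A, B, K, C](l: Seq[A], r: Seq[B])(lk: A => K, rk: B => K)(
      udf: (K, Seq[A], Seq[B]) => Seq[C]): Seq[C] = {
    val lg = l.groupBy(lk)
    val rg = r.groupBy(rk)
    (lg.keySet ++ rg.keySet).toSeq.flatMap { k =>
      udf(k, lg.getOrElse(k, Seq.empty), rg.getOrElse(k, Seq.empty))
    }
  }
}
```

The UDFs never say how records are shipped or grouped; that is exactly the freedom the second-order function leaves to the system.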
![Page 42: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/42.jpg)
Iterative Programs
S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl: Spinning Fast Iterative Data Flows. PVLDB 5(11), 2012
Bulk Iteration (PageRank):
- Inputs: p (pid, r) and A (pid, tid, p)
- Join P and A: Match (on pid), emits (tid, k=r*p)
- Sum up partial ranks: Reduce (on tid), emits (pid=tid, r=∑ k)
Incremental Iteration (Connected Components):
- Inputs: Wi, Si and Edges
- Match: (v1,v2), (vid,cid) → (v2, cid)
- CoGroup: [(vid,cid)], (vid,cid) → (vid, cid)
- Outputs: Wi+1 and Di+1
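The bulk PageRank iteration can be sketched on plain Scala collections as follows. This is an illustration only, under assumptions: the names `Rank`, `Transition`, `step` and `iterate` are hypothetical, there is no damping factor (the slide shows none), and the whole rank vector is recomputed in every superstep.

```scala
object PageRankSketch {
  final case class Rank(pid: Int, r: Double)                 // p: (pid, r)
  final case class Transition(pid: Int, tid: Int, p: Double) // A: (pid, tid, p)

  // One superstep: Match on pid, then Reduce on tid.
  def step(ranks: Seq[Rank], a: Seq[Transition]): Seq[Rank] = {
    val byPid = ranks.map(x => x.pid -> x.r).toMap
    // Match (on pid): emit the partial rank (tid, k = r * p)
    val partial = for (t <- a; r <- byPid.get(t.pid)) yield (t.tid, r * t.p)
    // Reduce (on tid): sum up the partial ranks per target page
    partial.groupBy(_._1).toSeq
      .map { case (tid, ks) => Rank(tid, ks.map(_._2).sum) }
      .sortBy(_.pid)
  }

  // Bulk iteration: feed the complete result of one step into the next.
  def iterate(ranks: Seq[Rank], a: Seq[Transition], supersteps: Int): Seq[Rank] =
    (1 to supersteps).foldLeft(ranks)((p, _) => step(p, a))
}
```

Note that `a` is the same in every superstep, which is what makes it a candidate for caching on the loop-invariant path.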
![Page 43: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/43.jpg)
How does it look in code

val result = step iterate (vertices distinctBy {_.id}, messages)

def step = (s: Data[Vertex], ws: Data[Message]) => {
  val sNext = ws join s on {…} isEqualTo {…} using {…}
  val wNext = sNext join edges on …
  (sNext, wNext)
}
Java
Scala
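For readers without a Stratosphere installation, the same incremental iteration can be imitated on plain Scala collections. This is a sketch under assumptions, not the Stratosphere API: `run` is a hypothetical helper, the solution set is an in-memory `Map`, and edges are expected in both directions for an undirected graph. It also records the workset size per superstep.

```scala
object ConnectedComponentsSketch {
  final case class Vertex(id: Int, component: Int)
  final case class Edge(from: Int, to: Int)

  // Returns the final (vertex id -> component id) map and the workset size per superstep.
  def run(vertices: Seq[Vertex], edges: Seq[Edge]): (Map[Int, Int], List[Int]) = {
    val outgoing = edges.groupBy(_.from)
    var solution = vertices.map(v => v.id -> v.component).toMap
    var workset: Seq[Vertex] = vertices
    var worksetSizes = List.empty[Int]
    while (workset.nonEmpty) {
      worksetSizes = worksetSizes :+ workset.size
      // Join workset with edges: propose the component id to every neighbor
      val candidates =
        for (v <- workset; e <- outgoing.getOrElse(v.id, Seq.empty))
          yield Vertex(e.to, v.component)
      // Reduce by vertex id: keep the smallest proposed component
      val minPerVertex = candidates.groupBy(_.id).map {
        case (id, cs) => Vertex(id, cs.map(_.component).min)
      }
      // Join with the solution set: keep only real improvements (the delta)
      val delta = minPerVertex.filter(c => solution.get(c.id).exists(c.component < _)).toSeq
      solution = solution ++ delta.map(v => v.id -> v.component)
      workset = delta // only changed vertices are processed in the next superstep
    }
    (solution, worksetSizes)
  }
}
```

The loop ends as soon as a superstep changes nothing, and each superstep touches only the vertices that changed in the previous one.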
![Page 44: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/44.jpg)
Incremental Iterations matter…
[Figure: number of vertices (thousands) per superstep (0 to 33), Naïve (Bulk) vs. Incremental; runtime chart for the Twitter and Webbase (20) datasets]
Changes to the iteration's result for Connected Components in each superstep…
… and runtime.
![Page 45: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/45.jpg)
Pregel as a Pact program
![Page 46: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/46.jpg)
Part 2: The Program Compiler and Optimizer
![Page 47: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/47.jpg)
Why an Optimizer for such Programs?
Do you want to hand-optimize that?
![Page 48: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/48.jpg)
Pact Optimizer Overview
■ Cost-based optimizer produces physical execution plan given PACT program
□ Annotates data channels with distribution patterns, e.g., broadcast, partition
□ Chooses physical execution strategies (e.g., hash/sort)
□ Reorders PACT functions
□ Deeply embeds MapReduce style UDFs in the optimization
■ Optimization of iterative programs
□ Passing data between super-steps
□ Loop-invariant data
□ Efficient state maintenance in partitioned indexes
■ Challenge: Semantics of user-defined functions unknown
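As a rough illustration of choosing a distribution pattern by cost, the sketch below compares broadcasting one input of a two-input operator against partitioning both. The cost formulas are assumptions made for this example (broadcast ships the input to every parallel instance, partitioning ships each record once); they are not the actual Stratosphere cost model, and all names are hypothetical.

```scala
object ShippingChoice {
  sealed trait Strategy
  case object BroadcastLeft extends Strategy
  case object BroadcastRight extends Strategy
  case object PartitionBoth extends Strategy

  // Assumed network cost in records shipped; a real optimizer also weighs disk and CPU.
  def networkCost(s: Strategy, left: Long, right: Long, parallelism: Int): Long = s match {
    case BroadcastLeft  => left * parallelism  // every instance receives all of left
    case BroadcastRight => right * parallelism // every instance receives all of right
    case PartitionBoth  => left + right        // every record is shipped once
  }

  // Enumerate the candidate strategies and keep the cheapest one.
  def choose(left: Long, right: Long, parallelism: Int): Strategy =
    List[Strategy](BroadcastLeft, BroadcastRight, PartitionBoth)
      .minBy(networkCost(_, left, right, parallelism))
}
```

With a tiny left input and a huge right input, broadcasting the small side wins; with inputs of similar size, partitioning both is cheaper.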
![Page 49: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/49.jpg)
Current architecture
1) Analyze
2) Reorder
3) Parallelize
![Page 50: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/50.jpg)
1) Opening the Black Boxes …
Analyze user code to discover:
■ Read set R_f: Attributes of the input record(s) that might influence output
■ Write set W_f: Attributes of the output record(s) that might have different values from respective input attributes
■ Emit cardinality E_f: Bounds on records emitted per call (1, >1, …)
PACT f: (R_f, W_f, E_f)
![Page 51: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/51.jpg)
… via Static Code Analysis

void match(Record left, Record right, Collector col) {
  Record out = copy(left);
  if (left.get(0) > 3) {
    double a = right.get(2);
    out.set(2, 1.0 / a);
  }
  out.set(1, 42);
  out.set(3, right.get(0));
  out.set(4, right.get(1));
  out.set(5, right.get(2));
  col.emit(out);
}

Feasible:
1. No control flow between operators
2. Record data model, fixed API
Correct:
■ Difficulty comes from different code paths
■ Correctness guaranteed through conservatism
■ Add to R,W when in doubt
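The conservative "add to R,W when in doubt" rule can be sketched on a toy statement model. The sketch below is an assumption-laden illustration, not the Stratosphere analyzer: the `Stmt` encoding is hypothetical, and a branch simply contributes everything it might read or write, whichever code path is taken.

```scala
object ReadWriteSets {
  sealed trait Stmt
  final case class Get(input: String, field: Int) extends Stmt            // input.get(field)
  final case class SetField(field: Int) extends Stmt                      // out.set(field, ...)
  final case class If(cond: List[Stmt], body: List[Stmt]) extends Stmt    // if (cond) { body }

  final case class Sets(read: Set[(String, Int)], write: Set[Int]) {
    def ++(other: Sets): Sets = Sets(read ++ other.read, write ++ other.write)
  }

  // Union over all code paths: whatever might be read or written is included.
  def analyze(stmts: List[Stmt]): Sets =
    stmts.foldLeft(Sets(Set.empty, Set.empty)) {
      case (acc, Get(in, f))     => acc ++ Sets(Set((in, f)), Set.empty)
      case (acc, SetField(f))    => acc ++ Sets(Set.empty, Set(f))
      case (acc, If(cond, body)) => acc ++ analyze(cond) ++ analyze(body)
    }

  // Hand-written encoding of the match UDF shown on the slide.
  val matchUdf: List[Stmt] = List(
    If(List(Get("left", 0)), List(Get("right", 2), SetField(2))),
    SetField(1),
    Get("right", 0), SetField(3),
    Get("right", 1), SetField(4),
    Get("right", 2), SetField(5)
  )
}
```

For the slide's UDF this yields field 0 of `left` and fields 0 to 2 of `right` as the read set, and output fields 1 to 5 as the write set; field 2 is included even though it is written on only one path.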
![Page 52: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/52.jpg)
Conditions for reordering UDFs
Theorem 1: Two Map operators can be reordered if their UDFs have only read-read conflicts
Theorem 2: For a Map and a Reduce, we need in addition the Reduce key groups to be preserved
Enabled optimizations:
■ Selection push-down
■ (Bushy) join reordering
■ Aggregation push-down
□ Equivalent to invariant grouping transformation [Chaudhuri & Shim 1994]
■ Reordering of non-relational Reduce functions
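Given read and write sets per UDF, the two conditions can be checked mechanically. The sketch below is illustrative only: `UdfProps` is a hypothetical record of analysis results, and the group-preservation test for Theorem 2 is a deliberately simple sufficient condition assumed for this example (the Map leaves the key fields untouched and emits exactly one record per input).

```scala
object ReorderCheck {
  // Field positions read and written by a UDF, plus whether it emits exactly one record per call.
  final case class UdfProps(read: Set[Int], write: Set[Int], emitsExactlyOne: Boolean)

  // Theorem 1: only read-read conflicts, i.e. no field written by one UDF
  // is read or written by the other.
  def canReorderMaps(f: UdfProps, g: UdfProps): Boolean =
    (f.write & g.read).isEmpty &&
      (f.read & g.write).isEmpty &&
      (f.write & g.write).isEmpty

  // Theorem 2 (conservative approximation): additionally the Map must keep
  // the Reduce key groups intact.
  def canReorderMapReduce(map: UdfProps, reduce: UdfProps, key: Set[Int]): Boolean =
    canReorderMaps(map, reduce) &&
      (map.write & key).isEmpty &&
      map.emitsExactlyOne
}
```

A Map that only rewrites a field the other operator never touches passes the check; a Map that overwrites a key field, or a field the Reduce reads, is rejected.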
![Page 53: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/53.jpg)
Optimizer Architecture (I)
■ Simple enumeration algorithm that checks pairwise reordering for all neighboring operators
■ Current problem: Walking all points in the search space
■ Next: Deduce join-graph-like information from reordering degrees-of-freedom
![Page 54: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/54.jpg)
Optimizer Architecture (II)
■ Operators are defined in terms of possible global data properties (partitioning/replication/...) and local data properties (order/grouping/uniqueness/...)
■ Nodes propagate requested properties top-down
□ Filtered by UDF's field modification
□ Filtered by incompatibility
□ Every data flow edge has a set of possible requested properties
■ Requested properties are instantiated at each point
□ Global properties by exchange strategies
□ Local properties by local operators
■ Requested properties used for pruning candidates (as with interesting properties)
![Page 55: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/55.jpg)
Optimizer Architecture (III)
■ Determine static and dynamic data flow paths for iterations
□ Static path contains data that is loop-invariant
■ Use heuristics to place caches such that loop-invariant computations are not repeated
□ Cache loop-invariant data also in ordered form, or as hash tables
■ Weigh costs for static and dynamic path differently
□ Optimizer favors plans that "push" work into static path
![Page 56: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/56.jpg)
PageRank: Two Optimizer Plans
[Figure: two execution plans for the PageRank iteration, each with Join P and A as Match (on pid) emitting (tid, k=r*p) and Sum up partial ranks as Reduce (on tid) emitting (pid=tid, r=∑ k), over inputs p (pid, r) and A (pid, tid, p). Strategy annotations in the first plan: broadcast, part./sort (tid), buildHashTable (pid), probeHashTable (pid), CACHE, fifo. Strategy annotations in the second plan: partition (pid), buildHashTable (pid), CACHE, probeHashTable (pid), part./sort (tid), fifo.]
![Page 57: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/57.jpg)
Part 3: The Functional Language Compilation
![Page 58: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/58.jpg)
The Compiler Mismatch
The Database Approach (Query Compiler):
Parser/Checker → Optimizer → Code Generation → Runtime
Code generation AFTER context of operation is fixed.
UDF Systems: MapReduce & Stratosphere (original) (Language Compiler):
Parser/Checker → Code Generation → Optimizer → Runtime
Code generation BEFORE context of operation is fixed.
![Page 59: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/59.jpg)
The Program Compilation Pipeline
- Program Code
- Parser/Checker
- ByteCode Generator
- Analyzer and Code Generator
- Global Schema Generator
- Pact Optimizer
- Program Instantiation
- Schema and Code Finalization
- Parallel Data Flow Generator
- Parallel Data Flow
- Language Compiler
![Page 60: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/60.jpg)
1) Analyzing Data Types
■ Supported Types
□ Primitive (Integers, Floating-Point, Strings, …), Lists, Tuples, Product Types (classes), Summation Types (class hierarchies), Recursive Types
■ Data types are logically flattened
□ Some fields are transparent members of the flat model, some are black box members
■ Transparent members may be referenced in selector functions
■ Selector Functions are likewise analyzed and translated into logical positions
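Logical flattening can be pictured with a small type model. The sketch below is an assumption made for illustration, not the compiler plug-in's representation: `Ty`, `Record`, `flatten` and `position` are hypothetical names, selectors are written as dotted paths, and black box members occupy one slot that selectors cannot address.

```scala
object FlattenSketch {
  sealed trait Ty
  case object Prim extends Ty                                  // transparent leaf (Int, String, ...)
  case object Opaque extends Ty                                // black box member
  final case class Record(fields: List[(String, Ty)]) extends Ty // product type (class)

  // Flatten a nested type into (path, isTransparent) slots in declaration order.
  def flatten(ty: Ty, prefix: String = ""): List[(String, Boolean)] = ty match {
    case Prim   => List(prefix -> true)
    case Opaque => List(prefix -> false)
    case Record(fields) =>
      fields.flatMap { case (name, t) =>
        flatten(t, if (prefix.isEmpty) name else s"$prefix.$name")
      }
  }

  // Translate a selector path into its logical position, if it names a transparent member.
  def position(ty: Ty, selector: String): Option[Int] = {
    val i = flatten(ty).indexOf(selector -> true)
    if (i >= 0) Some(i) else None
  }
}
```

A selector such as `{_.component}` on `Vertex(id, component)` would correspond to the path `component` and logical position 1 in this model.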
![Page 61: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/61.jpg)
2) Generating Glue Code
■ User Code is pure Scala, no Stratosphere specific types, interfaces
■ Wrapper code necessary to run it as a UDF in Stratosphere
■ Serializer/Comparator Code is generated as a template (omitting exact field positions, storing logical positions)
■ Code is inserted by modifying the program's Abstract-Syntax-Tree
![Page 62: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/62.jpg)
3) Schema Generation
■ Schema generated from logical flattened model
■ Each field in every operator’s result gets a unique name
□ Unless exact copy of an input field (info from code analysis)
■ Run Stratosphere optimizer
□ Potentially reorders functions
■ Prune unused fields early
□ Information whether fields are accessed by UDF from code analysis
■ Create physical data layout
■ Finalize serializer / comparator code
![Page 63: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/63.jpg)
Some preliminary results...
![Page 64: Big Data Ecosystem & The Stratosphere Project](https://reader036.fdocuments.us/reader036/viewer/2022062513/554f9371b4c905435d8b51d6/html5/thumbnails/64.jpg)
Related Work
■ MapReduce
■ Pig, JAQL, Hive
■ AQL
■ Scope
■ Datalog for Machine Learning
■ BOOM
■ Twister / HaLoop
■ Spark
■ Naiad
■ Flume Java / Plume Java
■ Scalops
■ Jet
■ LINQ