
The Stratosphere Platform for Big Data Analytics

Hongyao Ma, Franco Solleza

April 20, 2015

Stratosphere


Big Data Analytics

● “BIG Data”

● Heterogeneous datasets: structured / unstructured / semi-structured

● Users have different needs for declarativity and expressivity

What we have covered so far

● Polybase

● Shark

● MLBase

● SharedDB

● BlinkDB

The Promises
● Declarative, high-level language

● “In situ” data analysis

● Richer set of primitives than MapReduce

● Treat UDFs as first-class citizens

● Automated parallelization and optimization

● Support for iterative programs

● Includes external memory query processing algorithms to support arbitrarily long programs

Outline

● Meteor & Sopremo

● PACT

● Nephele

● Experiment Results

● Future work & Discussions

Sopremo

Meteor Script

● Declarative interface
● High-level script

Meteor Translates to Sopremo

[Sopremo operator plan: Lineitem and Supplier inputs; Filter, ComputeRevenue, Join, and Group operators; Output sink]

Sopremo

● Modular and extensible
● Composable

Sopremo Compiled to PACT

[PACT plan: the same operators (Lineitem, Supplier, Filter, ComputeRevenue, Join, Group, Output) mapped onto PACT operators]
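The compiled plan is shown only as a figure in the slides. As a rough illustration of the dataflow it encodes, here is a minimal Python sketch of the same logical pipeline (filter, a per-record revenue UDF, join with suppliers, group-and-aggregate). This is not Meteor/Sopremo code; the field names (suppkey, extendedprice, discount, shipdate, name) are assumed TPC-H-style fields, and the filter predicate is illustrative.

```python
# Minimal sketch of the plan in the figure (assumed TPC-H-style fields).
lineitem = [
    {"suppkey": 1, "extendedprice": 100.0, "discount": 0.1, "shipdate": "1996-02-01"},
    {"suppkey": 2, "extendedprice": 200.0, "discount": 0.0, "shipdate": "1997-05-01"},
]
supplier = [{"suppkey": 1, "name": "Acme"}, {"suppkey": 2, "name": "Globex"}]

# Filter: keep only the line items of interest (predicate is illustrative).
filtered = [li for li in lineitem if li["shipdate"].startswith("1996")]

# ComputeRevenue: a per-record (Map-like) UDF.
for li in filtered:
    li["revenue"] = li["extendedprice"] * (1.0 - li["discount"])

# Join with Supplier on suppkey.
by_key = {s["suppkey"]: s for s in supplier}
joined = [{**li, "name": by_key[li["suppkey"]]["name"]}
          for li in filtered if li["suppkey"] in by_key]

# Group by supplier name, sum revenue, then "Output".
revenue_per_supplier = {}
for rec in joined:
    revenue_per_supplier[rec["name"]] = revenue_per_supplier.get(rec["name"], 0.0) + rec["revenue"]
print(revenue_per_supplier)
```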

PACT

PACT
● Programmer makes a “pact” with the system
● Uses one of 5 functions: Map, Reduce, Cross, Match, CoGroup


What’s a PACT?

● Data and a function
● Specifies how data are partitioned across the system
● An atomic(?) operation on all specified data (see the sketch below)
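The five second-order functions differ only in how they group input records into independent calls of the user-defined function. The following is a single-node Python sketch of those call semantics only, ignoring partitioning and parallel execution; the (key, value) record layout and the function names (pact_map, pact_reduce, ...) are illustrative assumptions, not the Stratosphere API.

```python
from itertools import groupby, product
from operator import itemgetter

key = itemgetter(0)  # assume each record is a (key, value) tuple

def pact_map(udf, records):
    # Map: one UDF call per record.
    for r in records:
        yield from udf(r)

def pact_reduce(udf, records):
    # Reduce: one UDF call per group of records sharing a key.
    for k, group in groupby(sorted(records, key=key), key=key):
        yield from udf(k, list(group))

def pact_cross(udf, left, right):
    # Cross: one UDF call per pair from the Cartesian product of both inputs.
    for l, r in product(left, right):
        yield from udf(l, r)

def pact_match(udf, left, right):
    # Match: one UDF call per pair of records with equal keys (equi-join style).
    right_index = {}
    for r in right:
        right_index.setdefault(key(r), []).append(r)
    for l in left:
        for r in right_index.get(key(l), []):
            yield from udf(l, r)

def pact_cogroup(udf, left, right):
    # CoGroup: one UDF call per key, with all matching records from both inputs.
    groups = {}
    for l in left:
        groups.setdefault(key(l), ([], []))[0].append(l)
    for r in right:
        groups.setdefault(key(r), ([], []))[1].append(r)
    for k, (ls, rs) in groups.items():
        yield from udf(k, ls, rs)

# Example: sum values per key with the Reduce contract.
records = [("a", 1), ("b", 2), ("a", 3)]
print(list(pact_reduce(lambda k, group: [(k, sum(v for _, v in group))], records)))
# -> [('a', 4), ('b', 2)]
```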

Iterative PACT Programs
● Implicitly, iteration mutates state
● How to do iteration without explicit mutation of state?

Iterative PACT Programs
● Bulk iteration (walkthrough; see the sketch after this list)
  ○ Starts with a solution set
  ○ Sends the group-by label to the neighbors
  ○ Finds the minimum among those neighbors
  ○ Outputs an incremental solution set
  ○ The incremental solution set becomes the input to the next iteration
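A minimal Python sketch of the bulk-iteration walkthrough above (connected components by label propagation). The graph representation and variable names are assumptions for illustration; the point is that each pass recomputes the whole solution set, and the result feeds the next iteration until a fixpoint is reached.

```python
def connected_components_bulk(vertices, edges):
    """Bulk iteration: recompute the full solution set each pass."""
    # Solution set: every vertex starts with its own id as its component label.
    labels = {v: v for v in vertices}
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    while True:
        # Send each vertex's label to its neighbors, then take the minimum
        # of the vertex's own label and all received labels.
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in neighbors[v]])
            for v in vertices
        }
        if new_labels == labels:     # fixpoint: no label changed
            return new_labels
        labels = new_labels          # result becomes input to the next iteration

# Example: two components, {1, 2, 3} and {4, 5}.
print(connected_components_bulk([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```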

Iterative PACT Programs
● Incremental iteration (walkthrough; see the sketch after this list)
  ○ Starts with a work set and a solution set
  ○ Calculates the minimum for each group
  ○ Merges the work set with the solution set and checks whether the label changed
  ○ If the label is new, it becomes part of the delta set...
  ○ ...which gets sent back to the next iteration
  ○ If changed, it also gets matched against the neighbors...
  ○ ...and those matches become the new work set
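A minimal Python sketch of the incremental (delta) iteration above, again for connected components. Only vertices whose label actually changed (the delta set) generate new work for the next iteration. The names workset, solution, and delta follow the slides' terminology; everything else is an assumption for illustration.

```python
def connected_components_incremental(vertices, edges):
    """Incremental (delta) iteration: only changed labels produce new work."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    solution = {v: v for v in vertices}                            # solution set
    workset = [(n, v) for v in vertices for n in neighbors[v]]     # (target, candidate label)

    while workset:
        # Calculate the minimum candidate label per target vertex.
        candidates = {}
        for target, label in workset:
            candidates[target] = min(label, candidates.get(target, label))

        # Merge with the solution set; changed labels form the delta set.
        delta = {}
        for target, label in candidates.items():
            if label < solution[target]:
                solution[target] = label
                delta[target] = label

        # The delta set, matched against the neighbors, becomes the new work set.
        workset = [(n, label) for target, label in delta.items()
                   for n in neighbors[target]]
    return solution

print(connected_components_incremental([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```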

PACT Optimization


Nephele

Nephele Execution
● Tasks, channels, scheduling
  ○ A task, with all local pipelines associated with it, is pushed to the slaves
  ○ Tasks can request to send data over the network (only when necessary or ready)

Nephele Execution
● Fault tolerance (see the sketch below)
  ○ Conceptually similar to lineage (RDDs), but...
  ○ [Figure: intermediate results with a blocking operator model]
  ○ [Figure: intermediate results with a non-blocking operator model]
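A rough Python sketch of the recovery idea behind these figures: intermediate results that were materialized act as recovery points, so after a failure only the tasks downstream of the nearest materialized result need to be re-executed instead of the whole job. The job graph, the function name, and the choice of which results are materialized are illustrative assumptions, not Nephele's API.

```python
def tasks_to_restart(failed_task, upstream, materialized):
    """Walk upstream from the failed task until a materialized result
    (or a source) is found; everything visited must be re-executed."""
    to_restart, stack = set(), [failed_task]
    while stack:
        task = stack.pop()
        if task in to_restart:
            continue
        to_restart.add(task)
        for producer in upstream.get(task, []):
            # Stop at producers whose output was materialized:
            # their result can simply be re-read.
            if producer not in materialized:
                stack.append(producer)
    return to_restart

# Illustrative job: source -> map -> join -> reduce, with the map output materialized.
upstream = {"map": ["source"], "join": ["map"], "reduce": ["join"]}
print(tasks_to_restart("reduce", upstream, materialized={"map"}))  # only join and reduce re-run
print(tasks_to_restart("reduce", upstream, materialized=set()))    # the whole chain re-runs
```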

Nephele Execution
● Runtime operators

Does it deliver?

● Maybe: what do the experiments say?
● What's old?
  ○ A lot of things
● What's new?
  ○ Second-order functions that abstract parallelization
  ○ Optimization in a UDF-heavy environment
  ○ Integrated iterative processing
  ○ An extensible query language and underlying operator model

Experimental Evaluation

Experimental Setup

● 1 master + 25 slave machines
● 16 cores @ 2.0 GHz with 32 GB of RAM (29 GB of operating memory)
● 80 TB HDFS in plain ASCII; 4 SATA drives at 500 MB/s read/write per node
● 8 parallel tasks per slave, total DOP 40-200

Comparison with Hadoop

● Vanilla MapReduce engine
● Apache Hive
● Apache Giraph

Summary of Results

● Stratosphere achieves linear speedup and similar performance to Hadoop for simple tasks (TeraSort, Word Count)

● Stratosphere is 5x faster than Hive and Hadoop for more complex tasks such as the TPC-H query and triangle enumeration, though there is no gain from increasing the DOP

● Stratosphere performs worse on Connected Components than Giraph, due to the latter's better-tuned implementation

● Checkpointing adds little overhead and saves substantial time when a failure occurs

TeraSort: Stratosphere vs. Hadoop
Stratosphere achieves similar performance to Hadoop and linear speedup

Word Count: Stratosphere vs. Hadoop
Stratosphere is 20% faster than Hadoop and achieves linear speedup

Triangle Enumeration: Reducer 1

Triangle Enumeration: Reducer 2

Triangle Enumeration: PACT
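The "Reducer 1 / Reducer 2 / PACT" slides are shown only as figures. A minimal Python sketch of the two-step formulation they presumably refer to: build triads (pairs of edges sharing a vertex), then close each triad by checking whether the missing edge exists; in PACT that closing step maps naturally to a Match against the edge set. The edge representation and ordering convention are assumptions.

```python
from itertools import combinations

def enumerate_triangles(edges):
    """Two-step triangle enumeration: build triads, then close them."""
    # Normalize undirected edges as sorted tuples.
    edge_set = {tuple(sorted(e)) for e in edges}

    # Step 1 ("Reducer 1"): group edges by their lower vertex and build
    # triads, i.e. pairs of edges sharing that vertex.
    by_vertex = {}
    for a, b in edge_set:
        by_vertex.setdefault(a, []).append(b)
    triads = [(v, x, y)
              for v, nbrs in by_vertex.items()
              for x, y in combinations(sorted(nbrs), 2)]

    # Step 2 ("Reducer 2", or a Match in PACT): a triad (v, x, y) is a
    # triangle iff the closing edge (x, y) exists in the edge set.
    return [t for t in triads if (t[1], t[2]) in edge_set]

# Example: triangle 1-2-3 plus a dangling edge 3-4.
print(enumerate_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))  # [(1, 2, 3)]
```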

Triangle Enumeration
Stratosphere is 5x faster than Hadoop, though parallelism does not help

TPC-H Query

TPC-H: Stratosphere vs. Hive
Parallelism does not seem to help; however, Stratosphere is 5x faster

Connected Components
Giraph is faster, due to its better-tuned implementation

Connected Components: execution time per superstep

Fault Tolerance
Checkpointing adds little overhead and saves substantial time when a failure occurs

What Else Do We Want to See?

For the presented experiments:
● Breakdown of execution time to identify bottlenecks
● What happens with an even smaller DOP?
● What happens with more or fewer tasks on each core?

Further:

● What happens with even larger data? The current size fits into RAM
● Comparison with MPP or split query processing systems like Polybase, or with Shark, given the size of the tested data

The Promises?
● Declarative, high-level language

● “In situ” data analysis

● Richer set of primitives than MapReduce

● Treat UDFs as first-class citizens

● Automated parallelization and optimization

● Support for iterative programs

● Includes external memory query processing algorithms to support arbitrarily long programs

Ongoing and Future Work
● One-pass optimizer unifying the PACT and Sopremo layers

● Strengthening fault-tolerant capabilities

● Improving scalability and efficiency of Nephele

● Design, compilation and optimization of higher-level languages

● Scalable, efficient, and adaptive algorithms and architecture

● “Stateful” systems for fast ingestion and low-latency data analysis

Discussions and Questions

● Declarativity vs. expressiveness tradeoff

○ More declarative -> less expressive, but easier to optimize

● Is run-time optimization the way to go?

○ Skewed data distribution may become a bottleneck for such systems

○ Detecting performance bottlenecks on the fly

QED
THANKS!