The Stratosphere Platform: Big Data Analytics at...

1
Partners and Collaborations Funding Organisations The Stratosphere Plaorm: Big Data Analycs at Scale Open Source Plaorm for Big Data analycs in massively parallel cluster and cloud environments. Combines key technologies of MapReduce, Compilers, Distributed Systems and Parallel Database Systems. Enables scalable data management, machine learning and deep analysis of Big Data. Declarave Analycs - Empower data analysts to use Big Data by unifying data and programming models in a declarave abstracon - Cross-opmize data extracon, querying and modeling Data Streaming - Streaming semancs for advanced analycal funcons - Modeling and managing distributed operator state - Opmizing data analysis program workloads Iterave Processing - Fault tolerance and numerical stability for iterave algorithms - Advanced opmizaon techniques for iterave jobs Soſtware Stack Enterprise Data Hub Enterprise Cloud Advanced Big Data Analytics JDBC Local Files HDFS NoSQL Storage YARN Direct EC2 Cluster Manager Stratosphere Runtime Stratosphere Optimizer Java API Scala API Meteor Scripting Language Hadoop MR Hive and others Spargel Graph Processing JDBC Pr ogramming Model/Operators reduce map cross join cogroup iterate union iterate delta Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memory pressure. People with Big Data Analycs Skills Today Data Analysts Systems Experts Future Big Data Users Example: K Means Clustering (Scala) val dataPoints = DataSource(dataPointInput, DelimitedInputFormat(parseInput)) val clusterPoints = DataSource(clusterInput, DelimitedInputFormat(parseInput)) def computeNewCenters(centers: DataSet[(Int, Point)]) = { val distances = dataPoints.cross(centers) . map computeDistance val nearestCenters = distances.groupBy { case (pid, _) => pid } .reduceGroup { ds => ds.minBy(_._2.distance) } .map asPointSum tupled val newCenters = nearestCenters groupBy { case (cid, _) => cid } .reduceGroup(sumPointSums) .map { case (cid, pSum) => cid -> pSum.toPoint() } return newCenters } val finalCenters = clusterPoints.iterate(numIterations, computeNewCenters) Incremental Iteraons can exploit sparse computaon dependencies without sacrificing dataflow programming abstracon. The performance of incremental iteraons in Stratosphere matches that of specialized engines. Iterave Algorithms Contact & Further Informaon Current and Future Work More Information: Project: http://stratosphere.eu Source Code: http://github.com/stratosphere Contact Us: Database Systems and Informaon Management, Technische Universität Berlin Participating in the Google Summer of Code 2014 - Cost-based opmizer choice of operators and shipping strategies. - In-memory pipelining of operators - Reducon of shipped and wrien data volume - Input Sampling to determine cardinalies Built-In Opmizer Big Data Analysts Sll Hard to Find e-mail: [email protected] REDUCE aggregate lineitem supplier output MAP filter MAP project MATCH join 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 REDUCE aggregate lineitem supplier output MAP filter MAP project MATCH join 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 REDUCE aggregate lineitem supplier output MAP filter MAP project MATCH join 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 5 5 5 5 0 5 5 0 Our goal: empower data analysts to focus on their domain problem without requiring systems programming. Scalable Data Management Machine Learning, Stascs, Data analysis ML DM Applicaons Big Data Analycs Iterave Algorithms Error Esmaon Acve Sampling Sketches Curse of Dimensionality Isolaon Convergence Monte Carlo Mathemacal Programming Linear Algebra Stochasc Gradient Descent Regression Stascs Hashing Control flow Parallelizaon Query Opmizaon Fault T olerance Relaonal Algebra/SQL Scalability Data Analysis Languages Compilers Memory Management Memory Hierarchy Dataflow Hardware Adapon Indexing Resource Management Data Warehouses/OLAP/XQuery

Transcript of The Stratosphere Platform: Big Data Analytics at...

Page 1: The Stratosphere Platform: Big Data Analytics at Scaleasterios.katsifodimos.com/assets/publications/posters/... · 2020. 6. 18. · - Optimizing data analysis program workloads Iterative

Partners and Collaborations

Funding Organisations

The Stratosphere Platform: Big Data Analytics at Scale

Open Source Platform for Big Data analytics in massively parallel cluster and cloud environments. Combines key technologies of MapReduce, Compilers, Distributed Systems and Parallel Database Systems.

Enables scalable data management, machine learning and deep analysis of Big Data.

Declarative Analytics- Empower data analysts to use Big Data by unifying data andprogramming models in a declarative abstraction- Cross-optimize data extraction, querying and modeling

Data Streaming- Streaming semantics for advanced analytical functions- Modeling and managing distributed operator state- Optimizing data analysis program workloads

Iterative Processing- Fault tolerance and numerical stability for iterative algorithms- Advanced optimization techniques for iterative jobs

Software Stack

Enterprise Data Hub

Enterprise Cloud

Advanced Big Data Analytics

JDBC Local Files HDFS NoSQL Storage

YARN Direct EC2 Cluster Manager

Stratosphere Runtime

Stratosphere Optimizer

Java API

Scala API

MeteorScripting Language

Hadoop MR

Hive and others

SpargelGraph

Processing

JDBC

Programming Model/Operators

reducemap crossjoin

cogroupiterateunion iteratedelta

Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memorypressure.

People with Big DataAnalytics Skills Today

Data Analysts

Systems Experts

Future Big Data Users

Example: KMeans Clustering (Scala) val dataPoints = DataSource(dataPointInput, DelimitedInputFormat(parseInput)) val clusterPoints = DataSource(clusterInput, DelimitedInputFormat(parseInput))

def computeNewCenters(centers: DataSet[(Int, Point)]) = { val distances = dataPoints.cross(centers). map computeDistance val nearestCenters = distances.groupBy { case (pid, _) => pid } .reduceGroup { ds => ds.minBy(_._2.distance) } .map asPointSum tupled val newCenters = nearestCenters groupBy { case (cid, _) => cid } .reduceGroup(sumPointSums) .map { case (cid, pSum) => cid -> pSum.toPoint() } return newCenters } val finalCenters = clusterPoints.iterate(numIterations, computeNewCenters)

Incremental Iterations can exploit sparse computation dependencies without sacrificing dataflow programming abstraction. The performance of incremental iterations in Stratosphere matches that of specialized engines.

Iterative Algorithms

Contact & Further Information

Current and Future Work

More Information:Project: http://stratosphere.euSource Code: http://github.com/stratosphere

Contact Us:

Database Systems and Information Management, Technische Universität Berlin

Participating in the Google Summer of Code 2014

- Cost-based optimizer choice of operators and shipping strategies.- In-memory pipelining of operators- Reduction of shipped and written data volume- Input Sampling to determine cardinalities

Built-In Optimizer

Big Data Analysts Still Hard to Find

e-mail: [email protected]

REDUCEaggregate

lineitem

supplier

output

MAPfilter

MAPproject

MATCHjoin

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 80 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

REDUCEaggregate

lineitem

supplier

output

MAPfilter

MAPproject

MATCHjoin

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 80 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

REDUCEaggregate

lineitem

supplier

output

MAPfilter

MAPproject

MATCHjoin

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 80 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 80 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 44 5 6 7 8

0 1 2 3 44 5 6 7 8

05

5

5

5 0

5

5 0

Our goal: empower data analysts to focus on their domain problem without requiring systems programming.

Scal

able

Dat

a M

anag

emen

t

Machine Learning,

Statistics, Data analysis M

L DM

Applications

Big Data Analytics

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Isolation

Convergence

Monte Carlo

Mathematical Programming

Linear Algebra

Stochastic Gradient Descent

Regression

Statistics

Hashing

Control flow

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra/SQL

Scalability

Data Analysis Languages

Compilers

Memory Management

Memory Hierarchy

Dataflow

Hardware Adaption

Indexing

Resource Management Data Warehouses/OLAP/XQuery