Welcome to the first Workshop on Big data Open Source Systems (BOSS), September 4th, 2015, Co-located with VLDB 2015, Tilmann Rabl

Source: boss.dima.tu-berlin.de/media/BOSS15-Introduction.pdf

Welcome to the first Workshop on Big data Open Source Systems (BOSS)

September 4th, 2015

Co-located with VLDB 2015

Tilmann Rabl

Hands on Big Data

• 8 parallel tutorials
  • 8 systems
• Open source
  • Publicly available
• Presenters
  • System experts
• Hands on
  • This is not a demo!
• You can pick two!

But why?

• Initial idea: Malu Castellanos
  • Mike Carey
• Doing It On Big Data: a Tutorial/Workshop
  • Driving force
• Other people involved
  • Volker Markl
  • Norman Patton
  • Lipyeow Lim
  • Kerstin Forster
• Experiment
  • Tell us what you think
  • Email: [email protected]

MOAR SYSTEMS!

Presented Systems

• Apache AsterixDB

• Apache Flink

• Apache Reef

• Apache Singa

• Apache Spark

• Padres

• rasdaman

• SciDB

Massively Parallel Program

• Bulk Synchronous Parallel

[Schedule diagram. Blocks: Introduction & Flash Session, Panel, Coffee Break, Lunch, Tutorials Part 2. Eight parallel tracks: Tutorial 1 Part 1 to Tutorial 8 Part 1]

Runtime Environment

You are here

Panel – Big Data and Exascale

• Panel Chair
  • Chaitanya Baru, San Diego Supercomputing Center
• Panelists
  • Arie Shoshani, LBNL
  • Guy Lohmann, IBM
  • Mike Carey, UC Irvine
  • Paul G. Brown, Paradigm4
  • Peter Baumann, Jacobs University
  • Volker Markl, TU Berlin

Apache AsterixDB (Incubating)

AsterixDB: “One Size Fits a Bunch!”

Wish-list:

• Able to manage data

• Flexible data model

• Full query capability

• Continuous data ingestion

• Efficient and robust parallel runtime

• Cost proportional to task at hand

• Support today’s “Big Data data types”

Semistructured data management

Parallel DB systems

First-gen BD analysis tools

Apache Flink

Apache Flink™: Stream and Batch processing at Scale

Marton Balassi (ELTE/SZTAKI, Hungary)

Paris Carbone (KTH, Stockholm, Sweden)

Gyula Fóra (SICS, Stockholm, Sweden)

Vasia Kalavri (KTH, Stockholm, Sweden)

Asterios Katsifodimos (TU Berlin, Germany)

What is Flink?

[Stack diagram. Layers: Applications; Data processing engines; App and resource management; Storage, streams. Systems shown: Kafka, MapReduce, Hive, Flink, Spark, Storm, Yarn, Mesos, HDFS, Mahout, Cascading, Tez, Pig, HBase, Crunch, Giraph]

What can I do with Flink?

Batch processing

Stream processing

Graph Analysis

Machine Learning at scale

Flink: An engine that can natively support all these workloads.

But what will I do with Flink today?

• Graph processing
  • ETL on Datasets
  • Graph creation & analysis
• Stream Processing
  • Rolling Aggregates
  • Windows & Alerts
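To make "Rolling Aggregates" and "Windows & Alerts" concrete before the hands-on part, here is a minimal sketch in plain Python. It deliberately does not use the Flink DataStream API; the function names, the event layout `(timestamp, key, value)`, and the threshold are assumptions chosen only for illustration.

```python
from collections import defaultdict


def rolling_sum(events):
    """Yield (key, running_total) after every event: a rolling aggregate per key."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]


def tumbling_window_alerts(timed_events, window_size, threshold):
    """Assign (timestamp, key, value) events to fixed, non-overlapping windows
    and return the (window_start, key, total) triples whose total exceeds threshold."""
    sums = defaultdict(int)
    for timestamp, key, value in timed_events:
        # Hypothetical window assignment: round the timestamp down to a window start.
        window_start = (timestamp // window_size) * window_size
        sums[(window_start, key)] += value
    return sorted(
        (start, key, total)
        for (start, key), total in sums.items()
        if total > threshold
    )


# Rolling aggregate: one output per input event.
rolling = list(rolling_sum([("a", 1), ("b", 5), ("a", 2), ("a", 4)]))

# Windows and alerts: one output per (window, key) that crosses the threshold.
timed = [(0, "a", 3), (4, "a", 4), (11, "a", 2), (12, "b", 9), (19, "b", 3)]
alerts = tumbling_window_alerts(timed, window_size=10, threshold=6)
```

The difference to keep in mind: a rolling aggregate emits an updated result for every incoming event, while a window emits one result per group of events, which is what makes threshold alerts natural to express.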

Agenda

• Introduction
  • 15' Overview
  • 15' Gelly (Graph) API
• 30' Break
• Graph Processing
  • 20' DataSet/Gelly Hands-on
• Stream processing with Flink
  • 10' DataStream API
  • 15' Fault Tolerance Demo
  • 45' Streaming Hands-on

Apache Reef

A meta-framework that eases the development of Big Data applications atop resource managers such as YARN and Mesos

Interactive Query

Business Intelligence

Stream Processing

Machine Learning

Batch Processing

REEF

• Reusable control plane for coordinating data plane tasks
• Adaptation layer for resource managers
• Container and state reuse across tasks from heterogeneous frameworks
• Simple and safe configuration management
• Scalable local, remote event handling
• Java and C# (.NET) support

Distributed File System

Resource Manager

Data Processing Lib (REEF, Third-party)

In production use (Microsoft Azure)

Deep Dive into Apache REEF (Incubating)

Byung-Gon Chun, Brian Cho (Seoul National University)

BOSS 2015, Sep. 4, 2015

Client

Tutorial
1. What is REEF?
2. Install REEF
3. Run your first REEF job: HelloREEF
4. Why would you want REEF?
5. Create your own Task Scheduler with REEF

Contact: Byung-Gon Chun [email protected], Brian Cho [email protected]

Apache Singa

A General Distributed Deep Learning Platform

Motivation
• Deep learning is effective for classification tasks, e.g., image recognition
• Training code is complex to write from scratch
• Training is time consuming, e.g., 10 days or weeks

Goals
• Easy to use
  • General to support popular deep learning models
  • Extensible for users to do customization, e.g., training new models
• Scalable
  • Reduce training time with more computation resources, e.g., machines
  • Improve efficiency of one training iteration by synchronous training
  • Reduce total number of training iterations by asynchronous training
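As a rough illustration of the synchronous training goal, the sketch below splits a batch across two hypothetical workers, averages their gradients, and applies a single update per iteration. It is plain Python and not the SINGA API; the 1-D model `y = w * x`, the data, the learning rate, and all function names are assumptions made for the example.

```python
def gradient(w, shard):
    """Gradient of the mean squared error for the 1-D model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr):
    """One synchronous iteration: each worker computes a gradient on its own shard,
    the gradients are averaged, and one update is applied to the shared weight."""
    grads = [gradient(w, shard) for shard in shards]  # would run in parallel on workers
    return w - lr * sum(grads) / len(grads)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # two equal-sized workers

# With equal-sized shards, the averaged gradient equals the full-batch gradient,
# so one synchronous step matches one single-machine step.
one_sync_step = synchronous_step(0.0, shards, lr=0.01)
one_full_step = 0.0 - 0.01 * gradient(0.0, data)

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.01)
```

Because every worker handles only part of the batch, the per-iteration work is divided while the update stays the same as on one machine. Asynchronous training drops the averaging barrier, which this sketch does not attempt to model.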

Apache Spark

Spark Tutorial

Reynold Xin @rxin

Sep 4, 2015 @ VLDB BOSS 2015

Apache Spark

Fast & general distributed data processing engine, with APIs in SQL, Scala, Java, Python, and R

800+ contributors and many academic papers

Largest open source project in (big) data & at Apache

A Brief History

2009 2010 2011 2012 2013 2014 2015

started @ Berkeley

HotCloud

NSDI (RDD)

SIGMOD Demo (Shark)

SIGMOD (Shark)

SOSP (Streaming)

Databricks started

Donated to ASF

OSDI (GraphX)

SIGMOD (Spark SQL)

Users

1000+ companies

Distributors + Apps

50+ companies

Our Goal for Spark

Unified engine across data workloads and platforms

SQL, ML, Graph, Streaming, Batch, …

Agenda Today

Spark 101: RDD Fundamentals

Spark 102: DataFrames

Spark 201: Understanding Spark Internals

(with exercises in Databricks notebooks)

PADRES

Broker Overlay

Pub/Sub is a communication paradigm / middleware.

Communication between information producers (publishers) and consumers (subscribers) is mediated by a set of brokers (p2p overlay).

08.10.2015, Middleware Systems Research Group

[Diagram: broker overlay connecting publishers (P) and subscribers (S, S1, S2) through "subscribe" and "publish" messages]

Features
• Content-based routing
• Composite subscription (event P1 and event P2 occurred within 2s)
• Load balancing (offload clients to less loaded brokers)
• Fault tolerance (maintains integrity of broker network)
• Historic access (subscribe to past events)
• System monitoring (overlay monitoring & visualization)
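The first two features can be sketched in a few lines of plain Python. This is a toy single-broker model and not the PADRES API or its subscription language; the `Broker` class, the predicate-dictionary subscription format, and the sample events are all assumptions made for illustration.

```python
def matches(subscription, event):
    """Content-based matching: every predicate in the subscription must hold
    for the corresponding attribute of the event."""
    return all(
        attr in event and predicate(event[attr])
        for attr, predicate in subscription.items()
    )


class Broker:
    """Toy single broker holding a routing table of (subscriber, subscription) pairs."""

    def __init__(self):
        self.routing_table = []

    def subscribe(self, subscriber, subscription):
        self.routing_table.append((subscriber, subscription))

    def publish(self, event):
        """Return the subscribers whose subscriptions match the event content."""
        return [s for s, sub in self.routing_table if matches(sub, event)]


def composite_occurred(times_p1, times_p2, within):
    """True if some P1 event and some P2 event occurred at most `within` seconds apart."""
    return any(abs(t1 - t2) <= within for t1 in times_p1 for t2 in times_p2)


broker = Broker()
broker.subscribe("S1", {"symbol": lambda v: v == "IBM", "price": lambda v: v > 100})
broker.subscribe("S2", {"symbol": lambda v: v == "IBM"})

delivered_high = broker.publish({"symbol": "IBM", "price": 120})
delivered_low = broker.publish({"symbol": "IBM", "price": 80})

# Composite subscription: P1 and P2 must both occur within 2 seconds of each other.
fired = composite_occurred([1.0, 10.0], [2.5], within=2)
not_fired = composite_occurred([1.0], [4.0], within=2)
```

Routing here depends only on what an event contains, not on a topic name or a destination address, which is the point of content-based routing. A real overlay would forward events broker to broker instead of matching everything in one place.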

[Diagrams: Matching Engine and Routing Table; atomic events (P1, P2) and composite event (Sc); Balancer and Broker Overlay with publishers (P) and subscribers (S)]

Presenter: Kaiwen Zhang, University of Toronto

rasdaman

the pioneer Array DBMS: analytics on n-D dense/sparse arrays

optimization & parallel QP on multicore, cloud, modern hw

scalable from cubesat to datacenter federations

seamless integration with R, python, ...

operationally deployed on Petascale, basis for ISO Array SQL

the Array Database

www.rasdaman.org

SciDB

SciDB: No cute animals …

No 5 color marketing brochure …

… just an …

Open Source, Transactional,

Massively Parallel, Array DBMS with

A Scalable Analytic Query Engine.

Let's go!

Intro & Panel