The Next Generation of Data Processing and Open Source
Uploaded by: hadoop-summit
![Page 1: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/1.jpg)
The Next Generation of Data Processing & Open Source
James Malone, Google Product Manager, Apache Beam PPMC
Eric Schmidt, Google Developer Relations
![Page 2: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/2.jpg)
Agenda
1. The Last Generation - Common historical challenges in large-scale data processing
2. The Next Generation - How large-scale data processing should work
3. Apache Beam - A solution for next generation data processing
4. Why Beam matters - A gaming example to show the power of the Beam model
5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds
6. Things to Remember - Recap and how you can get involved
![Page 3: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/3.jpg)
01 The Last Generation
Common historical challenges in large-scale data processing
![Page 4: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/4.jpg)
Setting up infrastructure
[Diagram: decide on tool, read docs, get infrastructure, set up tools, tune tools, productionize, get specialists; mood moves from optimistic to frustrated]
![Page 5: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/5.jpg)
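Programming models
[Diagram: a batch use case with its own batch model, batch engine, and batch output; a streaming use case with its own streaming model, streaming engine, and streaming output; a separate step to join the outputs; mood moves from optimistic to frustrated]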
![Page 6: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/6.jpg)
Data pipeline portability
[Diagram: three separate stacks, each with its own data model, data pipeline, and execution engine; mood labels: happy, frustrated]
![Page 7: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/7.jpg)
Infrastructure is a pain
Models are disconnected
Pipelines are not portable
![Page 8: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/8.jpg)
02 The Next Generation
How data processing should work
![Page 9: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/9.jpg)
Infrastructure is an afterthought (was: a pain)
Models are unified (was: disconnected)
Pipelines are portable (was: not portable)
![Page 10: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/10.jpg)
Setting up infrastructure
[Diagram: skim docs, decide on product, start service; mood moves from optimistic to happy]
![Page 11: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/11.jpg)
A flexible (unified) model
[Diagram: batch and streaming use cases share one unified model, execute on one or more runners, and produce output; mood moves from optimistic to happy]
![Page 12: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/12.jpg)
Portable data pipelines
[Diagram: one data model and one data pipeline run on three execution engines; mood moves from happy to happier]
![Page 13: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/13.jpg)
Why does this matter?
More time can be dedicated to examining data for actionable insights
Less time is spent wrangling code, infrastructure, and tools used to process data
Hands-on with data
Cloud setup and customization
![Page 14: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/14.jpg)
03 Apache Beam (incubating)
A solution for next generation data processing
![Page 15: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/15.jpg)
What is Apache Beam?
1. The (unified stream + batch) Beam programming model (formerly the Dataflow model)
2. Java and Python SDKs
3. Runners for Existing Distributed Processing Backends
a. Apache Flink (thanks to dataArtisans)
b. Apache Spark (thanks to Cloudera & PayPal)
c. Google Cloud Dataflow (fast, no-ops)
d. Local (in-process) runner for testing
+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, others!
![Page 16: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/16.jpg)
The Apache Beam vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines
[Diagram: Beam Java, Beam Python, and other languages feed "Beam Model: Pipeline Construction"; "Beam Model: Fn Runners" hands pipelines to execution on Apache Flink, Apache Spark, and Google Cloud Dataflow]
![Page 17: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/17.jpg)
Joining several threads into Beam
[Diagram of Google technologies: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam]
![Page 18: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/18.jpg)
Creating an Apache Beam community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem
Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)
![Page 19: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/19.jpg)
Apache Beam Roadmap
02/01/2016 - Enter Apache Incubator
02/25/2016 - 1st commit to ASF repository
Early 2016 - Design for use cases, begin refactoring
06/14/2016 - 1st incubating release
June 2016 - Python SDK moves to Beam
Mid 2016 - Additional refactoring, non-production uses
Late 2016 - Multiple runners execute Beam pipelines
End 2016 - Beam pipelines run on many runners in production uses
![Page 20: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/20.jpg)
04 Why Beam Matters
An example to show the power of the Beam model
![Page 21: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/21.jpg)
Apache Beam - A next generation model
Improved abstractions let you focus on your business logic
Batch and stream processing are both first-class citizens -- no need to choose.
Clearly separates event time from processing time.
![Page 22: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/22.jpg)
Processing time vs. event time
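The gap between the two clocks can be made concrete with a small sketch. This is plain Java, not Beam code; the class name `TimeSkewSketch` and the timestamps are hypothetical, chosen only for illustration. Event time is when something happened at the source, processing time is when the pipeline observed it, and the skew is the difference.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeSkewSketch {
    /** Skew between when an event happened and when the pipeline observed it. */
    static Duration skew(Instant eventTime, Instant processingTime) {
        return Duration.between(eventTime, processingTime);
    }

    public static void main(String[] args) {
        // Hypothetical timestamps: the event occurred at 12:00:00 and was processed at 12:00:07.
        Instant eventTime = Instant.parse("2016-06-28T12:00:00Z");
        Instant processingTime = Instant.parse("2016-06-28T12:00:07Z");
        System.out.println(skew(eventTime, processingTime).getSeconds()); // prints 7
    }
}
```

In practice the skew varies per element, which is why results grouped by event time cannot simply be emitted on a processing-time schedule.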
![Page 23: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/23.jpg)
Beam model - asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
![Page 24: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/24.jpg)
The Beam model - what is being computed?
    // "What": sum the integer scores for each key.
    PCollection<KV<String, Integer>> scores = input
        .apply(Sum.integersPerKey());
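As a rough illustration of what the per-key sum computes, here is a plain-Java sketch that does not use the Beam SDK. The class name `SumPerKeySketch` and the sample team scores are hypothetical; a real runner would perform this aggregation in parallel across workers.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SumPerKeySketch {
    /** Sums the integer values for each key, keeping keys in first-seen order. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : input) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical (team, score) pairs.
        List<Map.Entry<String, Integer>> scores = List.of(
            Map.entry("red", 5), Map.entry("blue", 3), Map.entry("red", 7));
        System.out.println(sumPerKey(scores)); // prints {red=12, blue=3}
    }
}
```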
![Page 25: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/25.jpg)
The Beam model - what is being computed?
![Page 26: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/26.jpg)
The Beam model - where in event time?
    // "Where": assign elements to fixed two-minute event-time windows, then sum per key and window.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Sum.integersPerKey());
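Fixed windowing can be sketched without Beam as a function from an event timestamp to the start of its window. The following is a minimal plain-Java illustration under assumptions: the `Event` class, the `team@windowStart` key format, and the sample data are all hypothetical, and the real SDK tracks windows as objects rather than strings.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSketch {
    /** Hypothetical event: a team name, a score, and an event-time timestamp in milliseconds. */
    static final class Event {
        final String team;
        final int score;
        final long eventTimeMillis;

        Event(String team, int score, long eventTimeMillis) {
            this.team = team;
            this.score = score;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Start of the fixed window that contains the given event time. */
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    /** Sums scores per (team, window); keys look like "red@120000". */
    static Map<String, Integer> sumPerTeamAndWindow(List<Event> events, long windowSizeMillis) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Event e : events) {
            String key = e.team + "@" + windowStart(e.eventTimeMillis, windowSizeMillis);
            totals.merge(key, e.score, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;
        List<Event> events = List.of(
            new Event("red", 5, 30_000L),
            new Event("red", 7, 90_000L),
            new Event("red", 4, 130_000L));
        System.out.println(sumPerTeamAndWindow(events, twoMinutes)); // prints {red@0=12, red@120000=4}
    }
}
```

The grouping depends only on event time, so the result is the same regardless of the order in which the elements arrive.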
![Page 27: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/27.jpg)
The Beam model - where in event time?
![Page 28: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/28.jpg)
The Beam model - when in processing time?
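    // "When": emit each window's result once the watermark passes the end of the window.
    // AtWatermark() is slide shorthand; the Beam Java SDK spells it AfterWatermark.pastEndOfWindow().
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());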
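The idea behind a watermark trigger can be approximated in a few lines of plain Java. This is a simplified sketch, not the Beam implementation: the class and method names are hypothetical, and the watermark is modeled as a single timestamp estimating how far event time has progressed.

```java
final class WatermarkTriggerSketch {
    /** End (exclusive) of the fixed window that contains the given event time. */
    static long windowEnd(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis) + windowSizeMillis;
    }

    /** A watermark trigger fires once the watermark reaches the end of the window. */
    static boolean shouldFire(long watermarkMillis, long windowEndMillis) {
        return watermarkMillis >= windowEndMillis;
    }

    /** An element is late if the watermark had already passed its window when it arrived. */
    static boolean isLate(long eventTimeMillis, long windowSizeMillis, long watermarkAtArrivalMillis) {
        return shouldFire(watermarkAtArrivalMillis, windowEnd(eventTimeMillis, windowSizeMillis));
    }

    public static void main(String[] args) {
        long twoMinutes = 120_000L;
        // An element with event time 90s belongs to the window [0s, 120s).
        System.out.println(isLate(90_000L, twoMinutes, 100_000L)); // prints false
        System.out.println(isLate(90_000L, twoMinutes, 125_000L)); // prints true
    }
}
```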
![Page 29: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/29.jpg)
The Beam model - when in processing time?
![Page 30: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/30.jpg)
The Beam model - how do refinements relate?
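    // "How": early, on-time, and late panes accumulate, so each firing refines the previous result.
    // AtWatermark(), AtPeriod(), and AtCount() are slide shorthand, not Beam SDK method names.
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());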
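The difference between accumulating and discarding panes can be shown with a small plain-Java sketch. It assumes a single window whose trigger fires three times (early, on time, late); the class name `PaneModeSketch` and the sample values are hypothetical, and no Beam API is used.

```java
import java.util.ArrayList;
import java.util.List;

final class PaneModeSketch {
    /**
     * Returns the value emitted at each trigger firing for one window.
     * Accumulating mode emits the running total; discarding mode emits only
     * the sum of the elements that arrived since the previous firing.
     */
    static List<Integer> firePanes(List<List<Integer>> arrivalsPerFiring, boolean accumulating) {
        List<Integer> panes = new ArrayList<>();
        int runningTotal = 0;
        for (List<Integer> arrivals : arrivalsPerFiring) {
            int paneSum = 0;
            for (int value : arrivals) {
                paneSum += value;
            }
            runningTotal += paneSum;
            panes.add(accumulating ? runningTotal : paneSum);
        }
        return panes;
    }

    public static void main(String[] args) {
        // Hypothetical arrivals before the early, on-time, and late firings.
        List<List<Integer>> arrivals = List.of(List.of(5, 7), List.of(3), List.of(4));
        System.out.println(firePanes(arrivals, true));  // prints [12, 15, 19]
        System.out.println(firePanes(arrivals, false)); // prints [12, 3, 4]
    }
}
```

With accumulation, a consumer can overwrite the previous value with the latest pane; with discarding, it has to add the panes up itself.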
![Page 31: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/31.jpg)
The Beam model - how do refinements relate?
![Page 32: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/32.jpg)
Customizing what where when how
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
![Page 33: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/33.jpg)
Apache Beam - the ecosystem
http://beam.incubator.apache.org/capability-matrix
![Page 34: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/34.jpg)
05 Demo
Let's run a Beam pipeline on 3 engines in 2 separate locations
![Page 35: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/35.jpg)
What we just did
Created 1 Beam pipeline
Ran that one pipeline on three execution engines in two places
● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc
● Local
  ○ Apache Beam local runner
  ○ Apache Flink
100% portability, 0 problems
![Page 36: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/36.jpg)
06 Things to remember
Recap and how you can get involved
![Page 37: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/37.jpg)
Apache Beam is designed to provide portable pipelines with a unified programming model
![Page 38: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/38.jpg)
Get involved with Apache Beam
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists! [email protected]@beam.incubator.apache.org
Join the Apache Beam Slack channel
https://apachebeam.slack.com
Follow @ApacheBeam on Twitter
![Page 39: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/39.jpg)
A special thank you
A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content which was used in this presentation.
![Page 40: The Next Generation of Data Processing and Open Source](https://reader031.fdocuments.us/reader031/viewer/2022030306/586fdea01a28ab18428b6c77/html5/thumbnails/40.jpg)
Thank you