Beyond MapReduce, Beyond Lambda
![Page 1: Beyond MapReduce, Beyond Lambda · Beyond MapReduce, Beyond Lambda Easy, unified, reliable processing for stream and batch William Vambenepe @vambenepe Lead Product Manager for Big](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/1.jpg)
Beyond MapReduce, Beyond Lambda
Easy, unified, reliable processing for stream and batch
William Vambenepe
@vambenepe
Lead Product Manager for Big Data on Google Cloud Platform
![Page 2](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/2.jpg)
http://research.google.com/archive/mapreduce.html
![Page 3](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/3.jpg)
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
![Page 4](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/4.jpg)
http://research.google.com/pubs/pub41378.html
![Page 5](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/5.jpg)
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
![Page 6](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/6.jpg)
The Lambda Architecture
![Page 7](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/7.jpg)
Timeline, 2002 to 2013 (years marked: 2002, 2004, 2006, 2008, 2010, 2012, 2013)
Systems shown: MapReduce, GFS, Bigtable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel, Google Cloud Dataflow
![Page 8](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/8.jpg)
Event Time - When Events Happened
Stream Time - When Events Are Processed
![Page 9](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/9.jpg)
Batch vs Streaming
![Page 10](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/10.jpg)
MapReduce
Batch
![Page 11](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/11.jpg)
MapReduce
Batch: Fixed Windows
[10:00 - 11:00) [11:00 - 12:00) [12:00 - 13:00) [13:00 - 14:00) [14:00 - 15:00) [15:00 - 16:00) [16:00 - 17:00) [18:00 - 19:00) [19:00 - 20:00) [21:00 - 22:00) [22:00 - 23:00) [23:00 - 0:00)
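As a rough illustration of the fixed hourly windows on this slide, the sketch below assigns each timestamp to a half-open `[start, end)` window and counts elements per window. The function names and sample timestamps are hypothetical; this is not Dataflow SDK code.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def fixed_window(ts, size=timedelta(hours=1)):
    """Return the half-open [start, end) window of length `size` that contains ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    index = (ts - midnight) // size  # number of whole windows since midnight
    start = midnight + index * size
    return start, start + size


def count_per_window(timestamps, size=timedelta(hours=1)):
    """Count how many timestamps fall into each fixed window."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[fixed_window(ts, size)] += 1
    return dict(counts)


# Hypothetical sample data.
events = [
    datetime(2015, 1, 1, 10, 5),
    datetime(2015, 1, 1, 10, 59),
    datetime(2015, 1, 1, 11, 0),  # exactly on the boundary: belongs to [11:00 - 12:00)
]
for (start, end), n in sorted(count_per_window(events).items()):
    print(f"[{start:%H:%M} - {end:%H:%M}) {n}")
```

Because the windows are half-open, an element stamped exactly 11:00 is counted once, in the later window.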
![Page 12](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/12.jpg)
MapReduce
Batch: User Sessions
Users: Joan, Larry, Ingo, Amanda, Cheryl, Arthur
Windows: [10:00 - 11:00) [11:00 - 12:00)
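A minimal sketch of per-user sessions follows, assuming a session ends after a gap of inactivity. The 30-minute gap, the function name, and the sample events are assumptions for illustration; the slide does not specify them. Times are minutes since midnight.

```python
from collections import defaultdict


def sessionize(events, gap_minutes=30):
    """Group (user, minute) events into per-user sessions.

    A new session starts when the time since the user's previous
    event exceeds gap_minutes (an assumed session definition).
    Returns {user: [(first_minute, last_minute), ...]}.
    """
    by_user = defaultdict(list)
    for user, minute in events:
        by_user[user].append(minute)

    sessions = {}
    for user, times in by_user.items():
        times.sort()
        groups = [[times[0]]]
        for t in times[1:]:
            if t - groups[-1][-1] > gap_minutes:
                groups.append([t])    # gap too large: open a new session
            else:
                groups[-1].append(t)  # still within the current session
        sessions[user] = [(g[0], g[-1]) for g in groups]
    return sessions


# Hypothetical events; 10:40 is minute 640, 11:10 is minute 670, and so on.
events = [
    ("Joan", 650), ("Joan", 670), ("Larry", 605),
    ("Larry", 700), ("Joan", 640),
]
print(sessionize(events))
```

In this sample Joan's single session runs from 10:40 to 11:10, so it straddles the boundary between the two hourly batch windows.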
![Page 13](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/13.jpg)
Streaming
10:00 11:00 12:00 13:00 14:00 15:00 16:00
![Page 14](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/14.jpg)
Confounding characteristics of data streams
Unordered
Unbounded
Of Varying Event Time Skew
![Page 15](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/15.jpg)
Event Time Skew
Stream Time
Event Time
Skew
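To make the notion of skew concrete, the sketch below computes it as stream (processing) time minus event time for a few observations. The function name and the timestamps are hypothetical.

```python
from datetime import datetime


def skew_seconds(event_time, stream_time):
    """Skew in seconds: when the event was processed minus when it happened."""
    return (stream_time - event_time).total_seconds()


# Hypothetical (event_time, stream_time) pairs, listed in arrival order.
observations = [
    (datetime(2015, 1, 1, 12, 0, 0), datetime(2015, 1, 1, 12, 0, 2)),
    (datetime(2015, 1, 1, 11, 58, 0), datetime(2015, 1, 1, 12, 0, 5)),  # delayed element
    (datetime(2015, 1, 1, 12, 0, 4), datetime(2015, 1, 1, 12, 0, 6)),
]
skews = [skew_seconds(e, s) for e, s in observations]
print(skews)  # [2.0, 125.0, 2.0]
```

The second element arrives after the first even though it happened earlier, which is the unordered, varying-skew behavior the deck describes.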
![Page 16](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/16.jpg)
Approaches
![Page 17](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/17.jpg)
Approaches to reasoning about time
1. Time-Agnostic Processing
2. Approximation
3. Stream Time Windowing
4. Event Time Windowing
![Page 18](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/18.jpg)
1. Time-Agnostic Processing - Filters
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
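A filter is time-agnostic because each element is judged on its own, whenever it arrives. A minimal sketch, with a hypothetical record layout and function name:

```python
def keep_matching(stream, predicate):
    """Lazily yield the elements of `stream` for which `predicate` is true."""
    for element in stream:
        if predicate(element):
            yield element


# Hypothetical records; no timestamp is needed to decide what passes.
records = [
    {"user": "Joan", "action": "click"},
    {"user": "Larry", "action": "view"},
    {"user": "Ingo", "action": "click"},
]
clicks = list(keep_matching(iter(records), lambda r: r["action"] == "click"))
print([r["user"] for r in clicks])  # ['Joan', 'Ingo']
```

Since the generator holds no state between elements, it works the same on a bounded batch or an unbounded stream.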
![Page 19](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/19.jpg)
1. Time-Agnostic Processing - Hash Join
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
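One way to picture a time-agnostic hash join over two streams is a symmetric hash join: each arriving element is stored under its key and immediately matched against everything already seen on the other side. The class and method names below are hypothetical, and the sketch keeps all state forever, which a real system would need to bound.

```python
from collections import defaultdict


class StreamingHashJoin:
    """Symmetric hash join: buffer both inputs by key, emit matches on arrival."""

    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def on_left(self, key, value):
        """Store a left element and return its joins with buffered right elements."""
        self.left[key].append(value)
        return [(key, value, r) for r in self.right.get(key, [])]

    def on_right(self, key, value):
        """Store a right element and return its joins with buffered left elements."""
        self.right[key].append(value)
        return [(key, l, value) for l in self.left.get(key, [])]


join = StreamingHashJoin()
print(join.on_left("u1", "click"))      # [] (nothing on the right yet)
print(join.on_right("u1", "purchase"))  # [('u1', 'click', 'purchase')]
print(join.on_left("u1", "view"))       # [('u1', 'view', 'purchase')]
```

Arrival order affects only when a match is emitted, not whether it is emitted, which is why no notion of time is required.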
![Page 20](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/20.jpg)
2. Approximation via Online Algorithms
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
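The slide does not name a specific online algorithm. As one example of the category, the sketch below uses the Misra-Gries heavy-hitters summary, which processes a stream in a single pass with bounded memory and returns undercounted estimates of the most frequent items. The choice of algorithm and the sample stream are illustrative assumptions.

```python
def misra_gries(stream, k):
    """Approximate frequent items using at most k - 1 counters.

    Each returned count underestimates the true count by at most n / k,
    where n is the number of elements seen.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = "a b a c a d a b a".split()
print(misra_gries(stream, k=3))  # {'a': 3}
```

Here `a` truly occurs 5 times out of 9 elements; the estimate of 3 is within the n / k = 3 error bound, and memory never exceeds two counters.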
![Page 21](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/21.jpg)
3. Windowing by Stream Time
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
![Page 22](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/22.jpg)
4. Windowing by Event Time - Fixed Windows
Event Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
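To contrast windowing by event time with windowing by stream time, the sketch below counts the same records per hourly window twice, once keyed on arrival time and once on event time. The record layout, field names, and values are hypothetical; times are minutes since midnight.

```python
from collections import defaultdict


def window_counts(records, time_field, size=60):
    """Count records per fixed window of `size` minutes, keyed by window start."""
    counts = defaultdict(int)
    for record in records:
        start = (record[time_field] // size) * size
        counts[start] += 1
    return dict(counts)


# Hypothetical records: "event" is when it happened, "stream" is when it was processed.
records = [
    {"event": 610, "stream": 612},  # 10:10, processed 10:12
    {"event": 655, "stream": 665},  # 10:55, processed 11:05 (crosses the hour)
    {"event": 680, "stream": 681},  # 11:20, processed 11:21
]
by_stream = window_counts(records, "stream")
by_event = window_counts(records, "event")
print(by_stream)  # {600: 1, 660: 2}
print(by_event)   # {600: 2, 660: 1}
```

The delayed 10:55 element lands in the 11:00 window under stream-time windowing but in the 10:00 window under event-time windowing, so only the latter reflects when things actually happened.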
![Page 23](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/23.jpg)
4. Windowing by Event Time - Sessions
Event Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
Stream Time: 10:00 11:00 12:00 13:00 14:00 15:00 16:00
![Page 24](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/24.jpg)
Dataflow API
![Page 25](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/25.jpg)
What are you computing?
Where in event time?
When in stream time?
![Page 26](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/26.jpg)
What = Aggregation API
Where = Windowing API
When = Watermarks + Triggers API
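The three questions can be sketched in a toy, single-process form. This is not the Dataflow SDK API: the function name, the heuristic watermark (largest event time seen minus an assumed maximum skew), and the policy of dropping late data are all assumptions made for illustration. Times are minutes.

```python
from collections import defaultdict


def run(stream, window_size=60, max_skew=10):
    """Process (event_time, value) pairs in arrival order.

    What:  a per-window sum of values.
    Where: fixed event-time windows of `window_size` minutes.
    When:  a window's result is emitted once the watermark passes its end.
    Returns (emitted_results, still_open_windows).
    """
    sums = defaultdict(int)
    closed = set()
    watermark = float("-inf")
    results = []
    for event_time, value in stream:
        start = (event_time // window_size) * window_size  # where
        if start in closed:
            continue  # late element for an emitted window; dropped in this sketch
        sums[start] += value                                # what
        watermark = max(watermark, event_time - max_skew)   # assumed heuristic
        for w in sorted(sums):                              # when
            if w + window_size <= watermark:
                results.append((w, sums.pop(w)))
                closed.add(w)
    return results, dict(sums)


# Hypothetical unordered stream; (58, 32) arrives after window [0, 60) was emitted.
stream = [(5, 1), (50, 2), (62, 4), (55, 8), (75, 16), (58, 32), (130, 64)]
print(run(stream))  # ([(0, 11), (60, 20)], {120: 64})
```

The out-of-order element at time 55 is still counted because the watermark had not yet passed 60, while the one at time 58 arrives too late. A richer trigger policy could instead emit early approximate results or refine results when late data shows up.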
![Page 27](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/27.jpg)
Dataflow improvements over Lambda
Low-latency, approximate results
Complete, correct results as soon as possible
One system: less to manage, fewer resources, one set of bugs
Tools for explicit reasoning about time
= Power + Flexibility + Clarity
![Page 28](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/28.jpg)
And those are just the programming model improvements…
What about the operational model improvements from marrying Dataflow with Cloud?
![Page 29](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/29.jpg)
Cloud Dataflow as a No-ops Cloud service
Google Cloud Platform
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Progress & Logs
Monitoring UI
Job Manager
![Page 30](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/30.jpg)
Putting it all together
Stream
Batch
Cloud Pub/Sub
Cloud Logs
Analytics Premium
Cloud Storage
App Engine
Cloud Dataflow
BigQuery Storage (tables)
Cloud Storage (files)
Cloud Dataflow
BigQuery Analytics (SQL)
Bigtable (NoSQL)
![Page 31](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/31.jpg)
Optimizing Time To Answer
Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements
Data Processing with Cloud Dataflow: Programming, More time to dig into your data
![Page 32](https://reader035.fdocuments.us/reader035/viewer/2022070710/5ec5db410efcdc47420f5b08/html5/thumbnails/32.jpg)
For more info
Google Cloud Services:
https://cloud.google.com/dataflow/
https://cloud.google.com/bigquery/
https://cloud.google.com/pubsub/
https://cloud.google.com/hadoop/
Contact me:
William Vambenepe
twitter: @vambenepe
email: [email protected]
Dataflow programming model is open-source:
SDK @ github/GoogleCloudPlatform/DataflowJavaSDK (Python SDK in progress)
Spark runner @ github/cloudera/spark-dataflow
Flink runner @ github/dataArtisans/flink-dataflow