A First Try at Google Dataflow (Google Dataflow 小試) - JCConf.tw
Google Dataflow 小試 (A First Try at Google Dataflow)
Simon Su @ LinkerNetworks {Google Developer Expert}
var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = '[email protected]';
simon.say('Good luck to everybody!');
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
● 1st Wave: Colocation
○ Your kit, someone else’s building. Yours to manage.
● 2nd Wave: Virtualized Data Centers
○ Standard virtual kit for rent. Still yours to manage.
● 3rd Wave (Next): An actual, global elastic cloud
○ Invest your energy in great apps.
● Assembly required -> True On Demand Cloud
● Storage, Processing, Memory, Network
● Clusters
● Containers
● Distributed Storage, Processing & Machine Learning
● Data Services: Breakthrough Insights, Breakthrough Applications
● Application Runtime Services: Enabling No-Touch Operations
● Foundation Infrastructure & Operations: The Gear that Powers Google
● Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
● Store: Cloud Storage, Cloud Datastore (NoSQL), Cloud SQL (MySQL), BigQuery Storage
● Process: Dataflow, Dataproc
● Analyze: BigQuery, Dataproc, Larger Hadoop Ecosystem
● Devices: smart devices, IoT devices and sensors
● Physical or virtual servers as the frontend data receiver
● A robust queue service for handling large-scale data ingestion
● MapReduce servers for large data transformations, in batch or streaming mode
● A large-scale data store for storing data and serving the query workload
● 1M devices, 16.6K events/sec, 43B events/month
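The throughput figures on this slide agree with each other if every device reports about once a minute. A quick check of that arithmetic, assuming one event per device per 60 seconds and a 30-day month (both are assumptions; the slide states only the totals):

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumptions (not stated on the slide): each device sends one event
# per minute, and a month has 30 days.
DEVICES = 1_000_000
SECONDS_BETWEEN_EVENTS = 60
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

events_per_sec = DEVICES / SECONDS_BETWEEN_EVENTS
events_per_month = events_per_sec * SECONDS_PER_MONTH

print(f"{events_per_sec / 1e3:.1f}K events/sec")
print(f"{events_per_month / 1e9:.1f}B events/month")
```

Under those assumptions the rate comes out just under 16.7K events/sec (the slide appears to truncate it to 16.6K) and roughly 43B events/month.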
Timeline (axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013):
GFS, MapReduce, Bigtable, Dremel, Colossus, Flume, Spanner, MillWheel
● User Code & SDK
● Deploy & Schedule
● GCP Managed Service: Job Manager, Work Manager
● Progress & Logs
● Monitoring UI
ETL
• Movement
• Filtering
• Enrichment
• Shaping
Analysis
• Reduction
• Batch computation
• Continuous computation
Orchestration
• Composition
• External orchestration
• Simulation
● Online cloud import (Cloud Storage Transfer Service)
● Offline import (third party)
● Regional buckets
● Object lifecycle management
● Object versioning
● Object change notification
● ACLs
MapReduce: Map -> Shuffle -> Reduce
Dataflow: ParDo -> GroupByKey -> ParDo
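The correspondence between the MapReduce phases and the Dataflow transforms can be sketched with a word count. This is plain Python that only models the three stages; it does not use the Dataflow SDK, and the input lines are made up for illustration:

```python
from collections import defaultdict

lines = ["go hawks", "go pack go"]  # hypothetical input

# Map (a ParDo): emit one (word, 1) pair per word, element by element.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle (a GroupByKey): gather the values that share a key.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce (another ParDo): process each (key, values) group independently.
counts = {word: sum(ones) for word, ones in grouped.items()}

print(counts)  # {'go': 3, 'hawks': 1, 'pack': 1}
```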
● A Directed Acyclic Graph of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline
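To make the graph idea concrete, here is a minimal sketch of a pipeline as a directed acyclic graph with two inputs and two outputs, executed in dependency order. The step names and the `run` helper are hypothetical, and this models only the shape of a pipeline, not how the Dataflow Service optimizes or runs one:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step lists the steps it reads from and a function over their outputs.
# All names here are made up for illustration.
steps = {
    "read_a": ([], lambda: [1, 2, 3]),
    "read_b": ([], lambda: [10, 20]),
    "merge": (["read_a", "read_b"], lambda a, b: a + b),
    "evens": (["merge"], lambda xs: [x for x in xs if x % 2 == 0]),
    "total": (["merge"], lambda xs: [sum(xs)]),
}


def run(steps):
    """Run every step after the steps it depends on."""
    graph = {name: set(inputs) for name, (inputs, _) in steps.items()}
    results = {}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    for name in TopologicalSorter(graph).static_order():
        inputs, fn = steps[name]
        results[name] = fn(*(results[i] for i in inputs))
    return results


results = run(steps)
print(results["evens"], results["total"])
```

The lists passed between steps stand in for the PCollections that flow through a real pipeline.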
Your Source/Sink Here
❯ Read from standard Google Cloud Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML, and Avro formatted data
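"Teaching Dataflow how to read it in parallel" roughly means telling the service how to split a bounded source into independent pieces. The sketch below illustrates that idea with a hypothetical `RangeSource` class; its `split` and `read` methods are invented for this example and are not the Dataflow SDK's custom source API:

```python
from concurrent.futures import ThreadPoolExecutor


class RangeSource:
    """Hypothetical bounded source over records[start:stop]."""

    def __init__(self, records, start=0, stop=None):
        self.records = records
        self.start = start
        self.stop = len(records) if stop is None else stop

    def split(self, bundle_size):
        """Break the source into sub-sources of at most bundle_size records."""
        return [
            RangeSource(self.records, lo, min(lo + bundle_size, self.stop))
            for lo in range(self.start, self.stop, bundle_size)
        ]

    def read(self):
        """Read only this sub-source's slice."""
        return self.records[self.start:self.stop]


source = RangeSource([f"record-{i}" for i in range(10)])
bundles = source.split(bundle_size=4)

# Each bundle can be read by a different worker; threads stand in for workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(lambda bundle: bundle.read(), bundles))

records = [record for part in parts for record in part]
print(len(bundles), len(records))
```

Splitting works here because the source is bounded: its size is known up front, which fits the slide's note that custom sources are currently for bounded data only.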
❯ A collection of data of type T in a pipeline: PCollection<T>
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs, using KV: PCollection<KV<K, V>>
{Seahawks, NFC, Champions, Seattle, ...}
{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...}
● A step, or a processing operation, that transforms data
○ Convert format, group, filter data
● Types of transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ If you have multiple PCollection objects that contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
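As a rough illustration of Flatten, the sketch below merges two collections of the same element type into one. Python lists stand in for PCollections, and the `flatten` helper is a stand-in written for this example, not the SDK transform:

```python
from itertools import chain


def flatten(*collections):
    """Merge several collections of the same element type into one."""
    return list(chain.from_iterable(collections))


first = ["Seahawks", "Seattle"]
second = ["NFC", "Champions"]
merged = flatten(first, second)
print(merged)  # ['Seahawks', 'Seattle', 'NFC', 'Champions']
```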
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
❯ Useful for:
• Filtering a data set
• Formatting or converting the type of each element in a data set
• Extracting parts of each element in a data set
• Performing computations on each element in a data set
{Seahawks, NFC, Champions, Seattle, ...}
KeyBySessionId
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
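A ParDo can be pictured as applying a function to each element, where the function may emit zero, one, or several outputs. The sketch below models that in plain Python; `par_do`, `key_by_first_letter`, and `drop_short` are hypothetical names, and keying by first letter is only a guess at what the slide's S, C, and N keys represent:

```python
def par_do(do_fn, elements):
    """Apply do_fn to each element independently and collect all outputs."""
    outputs = []
    for element in elements:
        outputs.extend(do_fn(element))
    return outputs


def key_by_first_letter(word):
    # Converting: turn each word into a (key, value) pair.
    yield (word[0], word)


def drop_short(word):
    # Filtering: emit nothing for words shorter than four characters.
    if len(word) >= 4:
        yield word


words = ["Seahawks", "NFC", "Champions", "Seattle"]
keyed = par_do(key_by_first_letter, words)
long_words = par_do(drop_short, words)
print(keyed)
print(long_words)
```

Because each call sees only one element, the work can be spread across workers in any order.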
GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, {Seahawks, Seattle, ...}>, KV<N, {NFC, ...}>, KV<C, {Champions, ...}>}
Wait a minute… How do you do a GroupByKey on an unbounded PCollection?
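The transcript does not include the slide that answers the question of how to run a GroupByKey on an unbounded PCollection. In Dataflow's model the usual answer is windowing: assign each element to a finite window by its timestamp, then group by key within each window. A minimal sketch with fixed windows follows; the timestamps, the 60-second window size, and the function name are all made up for illustration:

```python
from collections import defaultdict


def group_by_key_in_fixed_windows(events, window_secs):
    """Group values by (window start, key).

    events: iterable of (timestamp_secs, key, value) tuples.
    """
    groups = defaultdict(list)
    for timestamp, key, value in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - timestamp % window_secs
        groups[(window_start, key)].append(value)
    return dict(groups)


# Hypothetical timestamped events.
events = [
    (5, "S", "Seahawks"),
    (42, "N", "NFC"),
    (59, "S", "Seattle"),
    (61, "C", "Champions"),
    (75, "S", "Seahawks"),
]
grouped = group_by_key_in_fixed_windows(events, window_secs=60)
for (window_start, key), values in grouped.items():
    print(window_start, key, values)
```

This sketch processes a finite list after the fact. A streaming system also has to decide when a window is complete enough to emit, which is not modelled here.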
Cloud Pub/Sub
● Publisher A -> Message 1 -> Topic A -> Subscription XA -> Subscriber X
● Publisher B -> Message 2 -> Topic B -> Subscription XB -> Subscriber X
● Publisher C -> Message 3 -> Topic C -> Subscription YC -> Subscriber Y
● Publisher C -> Message 3 -> Topic C -> Subscription ZC -> Subscriber Z
● Globally redundant
● Low latency (sub sec.)
● N to N coupling
● Batched read/write
● Push & Pull
● Guaranteed Delivery
● Auto expiration
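The topic and subscription fan-out in this diagram can be modelled in a few lines. The class below is a toy in-memory sketch, not the Cloud Pub/Sub client API: `PubSub`, `subscribe`, `publish`, and `pull` are names invented here, and acknowledgements, push delivery, and expiration are left out:

```python
from collections import defaultdict, deque


class PubSub:
    """Toy in-memory model of topics and subscriptions."""

    def __init__(self):
        # topic name -> {subscription name: queue of pending messages}
        self.subscriptions = defaultdict(dict)

    def subscribe(self, topic, subscription):
        self.subscriptions[topic][subscription] = deque()

    def publish(self, topic, message):
        # Every subscription on the topic receives its own copy.
        for queue in self.subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic, subscription):
        queue = self.subscriptions[topic][subscription]
        return queue.popleft() if queue else None


bus = PubSub()
for topic, subscription in [("A", "XA"), ("B", "XB"), ("C", "YC"), ("C", "ZC")]:
    bus.subscribe(topic, subscription)

bus.publish("A", "Message 1")
bus.publish("B", "Message 2")
bus.publish("C", "Message 3")

subscriber_x = [bus.pull("A", "XA"), bus.pull("B", "XB")]
subscriber_y = [bus.pull("C", "YC")]
subscriber_z = [bus.pull("C", "ZC")]
print(subscriber_x, subscriber_y, subscriber_z)
```

Subscriber X receives Message 1 and Message 2 through its two subscriptions, while Subscribers Y and Z each receive their own copy of Message 3, matching the diagram.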
Datalab: An easy tool for analysis and reporting
BigQuery: An interactive analysis service