Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017
Transcript of Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017
Page 1
The Data Driven Network
Kapil Surlaker, Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das
Gobblin team, 26th Apr 2017
Big Data Meetup
github.com/linkedin/gobblin @ApacheGobblin gitter.im/gobblin
Page 2
Data Integration: key requirements
Source, Sink Diversity
Batch + Streaming
Data Quality
So, we built Gobblin.
Page 3
Simplifying Data Integration
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation under way
(Diagram: sources and sinks including SFTP, JDBC, REST, Azure Storage, …)
Page 4
Other Open Source Systems in this Space
Sqoop, Flume, Falcon, Nifi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate.
Most are tied to a specific execution model (batch / stream).
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.).
Page 5
Gobblin: Under the Hood
Page 6
Gobblin: The Logical Pipeline
(Diagram: Source → WorkUnits → Task: Extractor → Converter → Quality Checker → Writer → Publisher)
Page 7
WorkUnit: A logical unit of work, typically bounded but not necessarily.
Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200
HDFS Folder: /data/Login, File: part-0.avro
Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
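As a concrete sketch, a WorkUnit is essentially a bag of properties describing one such slice of work. A minimal example using Gobblin's WorkUnit class, assuming its createEmpty() factory; the Kafka property keys are hypothetical, chosen to mirror the first example above:

```java
import gobblin.source.workunit.WorkUnit;

public class WorkUnitSketch {
  // Minimal sketch: the property keys below are hypothetical,
  // chosen to mirror the Kafka example above.
  public static WorkUnit kafkaLoginEvents() {
    WorkUnit workUnit = WorkUnit.createEmpty();
    workUnit.setProp("kafka.topic", "LoginEvent");
    workUnit.setProp("kafka.partition", 10);
    workUnit.setProp("kafka.offset.low", 10L);
    workUnit.setProp("kafka.offset.high", 200L);
    return workUnit;
  }
}
```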
Page 8
Source: A provider of WorkUnits
(typically a system like Kafka, HDFS etc.)
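Paraphrasing the Gobblin API, the Source contract looks roughly like this (S is the schema type, D the record type; details may differ slightly from the shipped interface):

```java
import java.io.IOException;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

// Paraphrased from gobblin.source.Source.
public interface Source<S, D> {
  // Called once per run: slice the available work into WorkUnits.
  List<WorkUnit> getWorkunits(SourceState state);

  // Called once per WorkUnit: build the Extractor that reads its records.
  Extractor<S, D> getExtractor(WorkUnitState state) throws IOException;

  // Cleanup hook after the run completes.
  void shutdown(SourceState state);
}
```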
Page 9
Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes to the destination
Ends when WorkUnit is exhausted of records
(assigned to a Thread in a ThreadPool, a Mapper in MapReduce, etc.)
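In simplified form, a task's main loop is roughly the following. This is an illustrative sketch with stand-in interfaces, not the exact Gobblin runtime code; the real task also handles forking, quality checks, retries, and metrics:

```java
import java.io.IOException;

// Simplified sketch of a task's record loop with illustrative
// stand-ins for Gobblin's Extractor, Converter, and DataWriter.
final class TaskLoopSketch<S, D> {
  interface Extractor<S2, D2> { D2 readRecord(D2 reuse) throws IOException; }
  interface Converter<S2, D2> { Iterable<D2> convertRecord(S2 schema, D2 record) throws IOException; }
  interface Writer<D2> { void write(D2 record) throws IOException; void commit() throws IOException; }

  void run(Extractor<S, D> extractor, Converter<S, D> converter, Writer<D> writer, S schema)
      throws IOException {
    D record;
    while ((record = extractor.readRecord(null)) != null) { // null => WorkUnit exhausted
      for (D converted : converter.convertRecord(schema, record)) {
        writer.write(converted); // row-level quality checks would gate this write
      }
    }
    writer.commit(); // in batch mode, commit once the WorkUnit is done
  }
}
```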
Page 10
Extractor: A provider of records given a WorkUnit
Connects to Data Source
Deserializer of records
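The Extractor contract, again paraphrased from the Gobblin API, with exception signatures simplified:

```java
import java.io.Closeable;
import java.io.IOException;

// Paraphrased from gobblin.source.extractor.Extractor.
public interface Extractor<S, D> extends Closeable {
  S getSchema() throws IOException;         // schema of the records to come
  D readRecord(D reuse) throws IOException; // next record, or null when the WorkUnit is exhausted
  long getExpectedRecordCount();
  long getHighWatermark();
}
```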
Page 11
Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, schema projection, encryption)
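The Converter contract (paraphrased) returns an Iterable from convertRecord, which is exactly what makes the 1:N mapping possible: a converter can emit zero, one, or many output records per input record.

```java
import gobblin.configuration.WorkUnitState;
import gobblin.converter.DataConversionException;
import gobblin.converter.SchemaConversionException;

// Paraphrased from gobblin.converter.Converter: SI/SO are the input and
// output schema types, DI/DO the input and output record types.
public abstract class Converter<SI, SO, DI, DO> {
  public abstract SO convertSchema(SI inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException;

  // Zero, one, or many outputs per input: the "1:N" in the slide.
  public abstract Iterable<DO> convertRecord(SO outputSchema, DI inputRecord,
      WorkUnitState workUnit) throws DataConversionException;
}
```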
Page 12
Quality Checker: Checks whether the quality of the output is satisfactory
Row-level (e.g. time value check)
Task-level (e.g. audit check, schema compatibility)
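For flavor, a row-level time-value check might look like the following; this is a hypothetical predicate written for illustration, not Gobblin's exact policy API:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical row-level time check: reject records stamped in the
// future or older than one day. Illustrative only.
public final class TimeValueCheck {
  private static final long MAX_AGE_MS = TimeUnit.DAYS.toMillis(1);

  public static boolean passes(long recordTimestampMs) {
    long now = System.currentTimeMillis();
    return recordTimestampMs <= now && recordTimestampMs >= now - MAX_AGE_MS;
  }
}
```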
Page 13
Writer: Writes to the destination
Connection to the destination, Serializer of records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter
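The writer contract, paraphrased from the Gobblin API (details may differ slightly):

```java
import java.io.Closeable;
import java.io.IOException;

// Paraphrased from gobblin.writer.DataWriter.
public interface DataWriter<D> extends Closeable {
  void write(D record) throws IOException; // serialize and send one record
  void commit() throws IOException;        // finalize this task's output
  void cleanup() throws IOException;       // remove staging/temporary data
  long recordsWritten();
  long bytesWritten() throws IOException;
}
```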
Page 14
Publisher: Finalizes / Commits the data
Used for destinations that support atomicity
(e.g. move a tmp staging directory to the final output directory on HDFS)
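A minimal sketch of that atomic-publish idea using the Hadoop FileSystem API (the paths here are hypothetical): everything is written under a staging directory, then renamed into place in a single metadata operation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stage-then-rename publish on HDFS. Paths are hypothetical.
public final class HdfsPublishSketch {
  public static void publish(FileSystem fs, Path staging, Path output) throws IOException {
    // rename() is a single metadata operation, so readers never observe
    // a half-written output directory.
    if (!fs.rename(staging, output)) {
      throw new IOException("Publish failed: " + staging + " -> " + output);
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    publish(fs, new Path("/data/_staging/Login"), new Path("/data/Login"));
  }
}
```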
Page 15
Gobblin: The Logical Pipeline
Page 16
Gobblin: The Logical Pipeline is Stateful
State Store (HDFS, S3, MySQL, ZK, …)
Load config and previous watermarks at the start; save new watermarks at the end
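A sketch of what loading the previous watermark buys a batch source: the next run resumes exactly where the last committed run left off. The property names below are illustrative, not Gobblin's exact state-store keys:

```java
import gobblin.configuration.SourceState;
import gobblin.source.workunit.WorkUnit;

public final class WatermarkSketch {
  // Illustrative only: property names are hypothetical.
  public static WorkUnit nextWorkUnit(SourceState state, long latestOffsetInSource) {
    long previousHigh = state.getPropAsLong("previous.high.watermark", -1L);
    WorkUnit workUnit = WorkUnit.createEmpty();
    workUnit.setProp("offset.low", previousHigh + 1); // resume after the last commit
    workUnit.setProp("offset.high", latestOffsetInSource);
    return workUnit;
  }
}
```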
Page 17
Gobblin: Pipeline Specification
Page 18
Gobblin: Pipeline Specification
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5

wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace": "example.wikipedia.avro", …"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

converter.classes=gobblin.example.wikipedia.WikipediaConverter
Pipeline Name, Description
Source + configuration
Page 19
source.revisions.cnt=5

wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace": "example.wikipedia.avro", …"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia

writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

data.publisher.type=gobblin.publisher.BaseDataPublisher
Gobblin: Pipeline Specification
Converter
Writer + configuration
Page 20
converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia

writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

data.publisher.type=gobblin.publisher.BaseDataPublisher
Gobblin: Pipeline Specification
Publisher
Page 21
Gobblin: Pipeline Deployment
One Spec, Multiple Environments: the same Pipeline Specification runs on each target
Bare Metal / AWS / Azure / VM, from Small to Medium to Large:
One Box: Standalone, Single Instance
Static Cluster: Standalone Cluster, Hadoop (YARN / MR)
Elastic Cluster: AWS (EC2)
Page 22
Execution Model: Batch versus Streaming
Batch: Determine work, Acquire slots, Run, Checkpoint, Repeat
+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup, checkpoint costs dominate if "micro-batching"
Page 23
Execution Model: Batch versus Streaming
Streaming: Determine work streams, Run continuously, Checkpoint periodically
+ Low latency
- Higher cost because it is harder to provision accurately
- More sophistication needed to deal with change
Page 24
Execution Model Scorecard
(Scorecard matrix: routes JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, and Kafka <-> Kinesis, each scored by which execution model fits: Batch for some, Streaming for others)
Page 25
Can we run in both models using the same system?
Page 26
Gobblin: The Logical Pipeline
Page 27
Pipeline Stages: Start
Batch: Determine work
Streaming: Determine work (unbounded WorkUnits)
Page 28
Pipeline Stages: Run
Batch: Acquire slots, Run
Streaming: Run continuously, Checkpoint periodically, Shutdown gracefully
(Diagram: running tasks notify the Watermark Manager, which acks and persists watermarks to State Storage; a shutdown signal lets tasks finish gracefully)
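In streaming mode the watermark machinery runs on a timer: the latest acked watermark is flushed to state storage every commit interval (compare streaming.watermark.commitIntervalMillis in the spec later). A minimal sketch of that pattern, with a hypothetical store interface:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of periodic watermark commits; WatermarkStore is a hypothetical
// stand-in for ZooKeeper- or HDFS-backed state storage.
public final class PeriodicWatermarkCommitter {
  interface WatermarkStore { void save(long watermark); }

  private final AtomicLong lastAcked = new AtomicLong(-1L);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(WatermarkStore store, long commitIntervalMillis) {
    scheduler.scheduleAtFixedRate(() -> store.save(lastAcked.get()),
        commitIntervalMillis, commitIntervalMillis, TimeUnit.MILLISECONDS);
  }

  void ack(long watermark) { // tasks notify as records are written
    lastAcked.accumulateAndGet(watermark, Math::max);
  }

  void shutdownGracefully(WatermarkStore store) throws InterruptedException {
    scheduler.shutdown();
    scheduler.awaitTermination(10, TimeUnit.SECONDS);
    store.save(lastAcked.get()); // one final checkpoint on graceful shutdown
  }
}
```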
Page 29
Pipeline Stages: End
Batch: Checkpoint, Commit
Streaming: Do nothing (NoOpPublisher)
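Because a streaming writer commits continuously, there is nothing left to finalize at the end, so a no-op publisher is essentially empty methods. A sketch (the method set paraphrases the publisher contract and may differ in detail):

```java
import java.util.Collection;

// Sketch of a no-op publisher; paraphrased contract, details may differ.
public class NoopPublisherSketch {
  public void initialize() { /* nothing to set up */ }

  public void publishData(Collection<?> taskStates) {
    // Data was already committed continuously by the writer.
  }

  public void publishMetadata(Collection<?> taskStates) { /* nothing to publish */ }

  public void close() { /* no resources held */ }
}
```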
Page 30
Enabling Streaming mode
task.executionMode = streaming
Runs on any deployment: Standalone (Single Instance), Standalone Cluster, Hadoop (YARN / MR), AWS
Page 31
A Streaming Pipeline Spec: Kafka 2 Kafka
# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

source.class=gobblin.source….KafkaSimpleStreamingSource
Pipeline Name, Description
Page 32
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
Source, configuration
A Streaming Pipeline Spec: Kafka 2 Kafka
Page 33
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
data.publisher.type=gobblin.publisher.NoopPublisher
A Streaming Pipeline Spec: Kafka 2 Kafka
Converter, configuration
Page 34
# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

data.publisher.type=gobblin.publisher.NoopPublisher
task.executionMode=STREAMING
A Streaming Pipeline Spec: Kafka 2 Kafka
Writer, configuration
Publisher
Page 35
data.publisher.type=gobblin.publisher.NoopPublisher
task.executionMode=STREAMING
# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181

# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000
A Streaming Pipeline Spec: Kafka 2 Kafka
Execution Mode, watermark storage configuration
Page 36
Gobblin Streaming: Cluster view
Cluster of processes
Apache Helix: work-unit assignment, fault-tolerance, reassignment
(Diagram: a Cluster Master coordinates Workers 1-3 via Helix; workers read from the Stream Source and write to the Sink (Kafka, HDFS, …))
Page 37
Active Workstreams in Gobblin
Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications; logical flow specifications compile down to physical pipeline specs
Global Throttling: throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode, etc.); generic, so it can be used outside Gobblin
Metadata driven: integration with a Metadata Service (c.f. WhereHows); policy-driven replication, permissions, encryption, etc.
Page 38
Roadmap
Final LinkedIn Gobblin 0.10.0 release
Apache Incubator code donation and release
More Streaming runtimes: integration with Apache Samza, LinkedIn Brooklin
GDPR Compliance: data purge for Hadoop and other systems
Security improvements: credential storage, secure specs
Page 39
Gobblin Team @ LinkedIn