Streaming analytics better than batch when and why - (Big Data Tech 2017)
-
Upload
getindata -
Category
Technology
-
view
310 -
download
1
Transcript of Streaming analytics better than batch when and why - (Big Data Tech 2017)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Streaming analytics better than batch - when and why ?
_A. Kawa - D. Wysakowicz - K. Zarzycki_
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Have you ever built cool Big Data
pipelines?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Use-Case
■ Can be done in batch and real-time■ User session analytics at Spotify
● Simple stats■ Duration, number of songs, skips,
searches etc.
● Advanced analytics■ Mood, physical activity, real-time content,
ads
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen to a new edition of Discover Weekly?
_1. Dashboards_
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen to a new edition of Discover Weekly?
Australian users are listening to Discover Weekly too short !!!
_1. Dashboards_ _2. Alerts_
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen to a new edition of Discover Weekly?
Australian users are listening to Discover Weekly too short !!!
Recommend songs and ads based on
current activity.
_1. Dashboards_ _2. Alerts_ _3. Content_
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
1st - Batch Architecture
1h1h
1h
1h - 1d
1hUser
EventsUser
Sessions
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
1st - Batch Architecture
1h1h
1h
1d
1hUser
EventsUser
Sessions
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
The More Moving Parts …
⬇ The higher learning curve⬇ The more gluing code⬇ The larger administrative effort⬇ The more error-prone solution
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Long Waiting Time
Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
2nd - Micro-Batch Architecture
1m - 1h
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
♪ ♪
No Built-In Session Windows
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
[10:00 - 11:00) [11:00 - 12:00)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
♪ ♪
No Built-In Session Windows
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
[10:00 - 11:00) [11:00 - 12:00)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Late Data …
♪ ♪ ♪ ♪ ♪ ♪ Event Time
14:55 - 16:35
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
... Included in Current Batch
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪
♪
14:55 - 16:35 16:50 - …
Event Time
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ Event Time
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ ♪ ♪ ♫
♪ ♪ ♫
Event Time
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ ♪ ♪ ♫
♪ ♪ ♫
Event Time
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
... Breaks Correctness
♪ ♫ ♪ ♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫
♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫♪
Event Time
Processing Time
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Problems
FILES, BATCHES, DATA LAKES
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Solving Streaming Problem With Batch?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
3rd - Streaming-First Architecture
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
User Session Windows
♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
User 3 ♪ ♪ ♪ ♪ ♪ ♪
Session gap eg. 15 minutes
♪
♪♪
5
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
User Session Windows
♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
User 3 ♪ ♪ ♪ ♪ ♪ ♪
Session gap eg. 15 minutes
♪
♪♪
5
[3,2]
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Reading From Kafkaval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪ ♪ ♪ ♪ ♪ ♪
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Session Windows With Gapval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪
User 1
User 2
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Session Windows With Gapval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
User 1 ♪ ♪ ♪ ♪ ♪ ♪
Session gap - 15 minutes
♪♪
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Analyzing User Sessionval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Handling Late Eventsval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Triggering Early Resultsval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Sessionization Exampleval sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
Working example:https://github.com/getindata/flink-use-case
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Modern Stream Processing Engines
■ Rich stream processing semantic● Built-in support for event-time windows● Accurate results for late / out-of-order events and replays● Early triggers
■ Low latency and high-throughput■ Exactly-once stateful processing
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Modern Stream Processing Engines
■ Rich stream processing semantic● Built-in support for event-time windows● Accurate results for late / out-of-order events and replays● Early triggers
■ Low latency and high-throughput■ Exactly-once stateful processing
User survey:http://data-artisans.com/flink-user-survey-2016-part-1http://data-artisans.com/flink-user-survey-2016-part-2
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
How can I reprocess data?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Reprocessing Events In Flink
1. Take periodic snapshots of a job● It stores Kafka offsets, on-flight sessions, application state
2. Restart a job from a savepoint rather than from a beginning
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
What if data is no longer in Kafka?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Consuming Data From HDFS
■ Run your streaming code on HDFS (bounded data)● You need to read data in event-time based order
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
How to join with other data sets/streams?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Join With Other Datasets / Streams■ Flink can join windowed streams easily■ Join of data stream with data set is WIP
● Even with slowly changing data set!● Even keyed data
Stream 2
Stream 1
Joined Stream Input Stream Joined Stream
+
Id Name
1 John Doe
2 Jane Doe
Dataset
+
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
When is batch processing good?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Technical advantages● A large swaths of historical data in HDFS● High-level libraries in mature batch technologies
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Implementation advantages● Offline experiments over large historical data
■ Historical events are usually stored in HDFS, not Kafka
● High-level libraries in batch processing technologies■ Spark MLlib, H2O
(when data arrives continuously)
don’t solve streaming problem
with batch jobs
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
I like this streaming API.Can I use it for batch?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Unified batch and streaming API
■ Not with raw Flink API■ But with Flink Table API■ Apache Beam
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Who Are You, actually?
■ At GetInData, we build custom Big Data solutions● Hadoop, Flink, Spark, Kafka and more
■ Our team is today represented by
Krzysztof Zarzycki
Dawid Wysakowicz
Adam Kawa
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
■ Stream often the natural representation of your data
■ Stream processing is not only about low latency
Summary
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Q&A
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Thanks ! Big Data Tech Warsaw !
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Log Abstraction
11:00 - 12:00
12:00 - 13:00
… …
10:00 - …
10:00 - …
10:00 - 11:00
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Spark Structured Streaming
⬇ It’s still ALPHA and the APIs are still experimental⬇ Operates on top of micro-batches (Spark SQL engine)
⬆ Easy-to-learn API (Dataset/DataFrame)⬆ Rich ecosystem of tools and libraries e.g. MLlib⬆ Supports event-time
⬇ Sessionization not yet supported - SPARK-10816⬇ Queryable state not yet supported - SPARK-16738
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Kafka Streams
⬇ No exactly-once (just at-least-once)⬇ Kafka as the only data source⬇ No bounded streams (batch) optimizations
⬆ Simplicity⬆ Embedded into application⬆ Supports event-time
⬇ Lack of session windows
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Apache Beam
⬆ Unified API for batch and streaming⬆ Rich streaming processing semantics
⬆ Complex TriggerDSL⬆ Multiple runtime environments
⬆ Spark, Flink, Apex, Dataflow⬆ Side inputs and outputs
⬇ Verbose Java API⬇ New project - Top level since 01/2017
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Google Dataflow
■ Runtime environment for Apache Beam in Google Cloud
⬇ No support for Iterative Computations⬆ Supports Side Outputs⬆ Works with every Google Cloud Service (Pub/Sub, BigTable
etc.)