Windowing data in big data streams
ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI
BIG DATA? FAST DATA?
▸ What is big data?
▸ Shift of focus
▸ Processing speed
▸ Fast data -> streaming
WHAT IS STREAMING?
“A type of data processing engine that is designed with infinite data sets in mind”
Tyler Akidau, Google
WINDOWING
▸ Time becomes the focus point
▸ How many invalid password errors were there in the last 5 minutes?
▸ During which 30-minute window did we get most traffic?
▸ What’s the average 5-minute speed on a section of a highway throughout the day?
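The first question above (errors in the last 5 minutes) can be sketched as a trailing count. This is a minimal, framework-free illustration; the class name `TrailingCounter`, the 300-second span and the sample timestamps are assumptions made for the example, and it assumes timestamps arrive in order.

```python
from collections import deque


class TrailingCounter:
    """Counts events that fall in the trailing window (now - span_s, now]."""

    def __init__(self, span_s):
        self.span_s = span_s
        self.timestamps = deque()

    def add(self, ts):
        # Assumption: timestamps arrive in non-decreasing order.
        self.timestamps.append(ts)

    def count(self, now):
        # Evict events at or before the start of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.span_s:
            self.timestamps.popleft()
        return len(self.timestamps)


# Hypothetical "invalid password" events, timestamps in seconds.
errors = TrailingCounter(span_s=300)
for ts in [10, 100, 250, 320, 400]:
    errors.add(ts)

count_at_400 = errors.count(now=400)  # events at 250, 320 and 400 remain
```

Out-of-order events break the in-order assumption, which is why the later slides deal with watermarks and lateness.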
HOW TO DO STREAMING? WITH WINDOWS?
▸ Many possibilities:
▸ Spark Streaming
▸ Spark Structured Streaming
▸ Kafka Streams
▸ Flink
▸ Akka Streams
▸ …
/ME
▸ coder @
▸ Lightbend, Confluent, Datastax consulting partner
▸ mainly Scala
▸ open-source: MacWire, ElasticMQ, Quicklens, …
▸ http://www.warski.org / @adamwarski
WHAT’S THE TIME?
▸ How to associate time with an event:
▸ event time: “logical”, data-dependent
▸ ingestion time: when the event entered the system
▸ processing time: when the event is being processed
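The three notions of time can be made concrete with a small sketch. The `Event` type, the `timestamp_of` helper and the fixed clock are hypothetical names invented for this illustration and are not taken from any of the frameworks discussed.

```python
from dataclasses import dataclass


@dataclass
class Event:
    payload: str
    event_time: float      # carried in the data itself ("logical" time)
    ingestion_time: float  # stamped when the event entered the system


def timestamp_of(event, mode, processing_clock):
    """Returns the timestamp used for windowing under the chosen notion of time."""
    if mode == "event":
        return event.event_time
    if mode == "ingestion":
        return event.ingestion_time
    if mode == "processing":
        # Processing time is read from the clock when the event is handled.
        return processing_clock()
    raise ValueError(f"unknown time mode: {mode}")


# Hypothetical event: produced at t=100.0, ingested at t=103.5, processed at t=110.0.
event = Event("login-failed", event_time=100.0, ingestion_time=103.5)
fixed_clock = lambda: 110.0

stamps = {m: timestamp_of(event, m, fixed_clock) for m in ("event", "ingestion", "processing")}
```

The same event can therefore land in different windows depending on which notion of time is selected.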
TYPES OF WINDOWS
▸ Time-based
▸ fixed/tumbling
▸ sliding
▸ Session-based
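A rough sketch of how each window type assigns events to windows. It uses integer timestamps and half-open windows `[start, end)` for the time-based types; the function names and the sample values are assumptions made for the example.

```python
def tumbling(ts, size):
    """Fixed/tumbling: each timestamp belongs to exactly one window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding(ts, size, step):
    """Sliding: windows start every `step`, so a timestamp may fall into several."""
    last_start = ts - ts % step
    # A window [s, s + size) contains ts exactly when ts - size < s <= ts.
    starts = range(last_start, ts - size, -step)
    return sorted((s, s + size) for s in starts)


def sessions(timestamps, gap):
    """Session: consecutive events closer than `gap` are merged into one window.

    Returns (first_event, last_event) pairs for each session.
    """
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][1] <= gap:
            result[-1][1] = ts          # extend the current session
        else:
            result.append([ts, ts])     # start a new session
    return [tuple(w) for w in result]


tumbling_windows = tumbling(7, size=10)
sliding_windows = sliding(7, size=10, step=5)
session_windows = sessions([1, 3, 4, 20, 22, 50], gap=5)
```

Time-based windows can be computed per element, while session windows depend on neighbouring events, so they need to be merged as data arrives.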
OUT-OF-ORDER: WATERMARKS, LATENESS
▸ Windows GC
▸ At some point, enough is enough
▸ Watermark:
▸ all events before X have been observed
▸ heuristics
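One common watermark heuristic can be sketched as "the largest event time seen so far, minus a fixed allowed delay". The sketch below counts events in tumbling windows, closes a window once the watermark passes its end, and drops events that arrive for an already closed window. The class name, the `max_delay` parameter and the sample data are assumptions made for this illustration.

```python
from collections import defaultdict


class WatermarkedCounter:
    def __init__(self, window_size, max_delay):
        self.window_size = window_size
        self.max_delay = max_delay
        self.max_seen = None
        self.open = defaultdict(int)  # window start -> count
        self.emitted = []             # (window start, final count)
        self.dropped = []             # events that arrived too late

    def watermark(self):
        # Heuristic: assume no event is more than max_delay behind the newest one.
        if self.max_seen is None:
            return float("-inf")
        return self.max_seen - self.max_delay

    def on_event(self, ts):
        start = ts - ts % self.window_size
        if start + self.window_size <= self.watermark():
            self.dropped.append(ts)   # its window has already been closed
            return
        self.open[start] += 1
        self.max_seen = ts if self.max_seen is None else max(self.max_seen, ts)
        # Emit and garbage-collect every window the watermark has passed.
        for s in sorted(self.open):
            if s + self.window_size <= self.watermark():
                self.emitted.append((s, self.open.pop(s)))


counter = WatermarkedCounter(window_size=10, max_delay=5)
for ts in [1, 4, 12, 3, 16, 2, 27]:
    counter.on_event(ts)
```

Here the event at `3` is out of order but still counted, while the event at `2` arrives after window `[0, 10)` was closed and is dropped. A larger `max_delay` loses fewer events but keeps windows open longer.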
TRIGGERS
▸ When to emit window results
▸ Watermark progress
▸ Event time progress
▸ Processing time progress
▸ Punctuations
ACCUMULATION OF RESULTS
▸ If we trigger many times …
▸ discard
▸ accumulate
▸ retract & accumulate
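The three accumulation modes differ in what is emitted each time a trigger fires for the same window. A small sketch, assuming the window computes a sum and each "pane" holds the values that arrived since the previous firing; the function name `fire` and the sample panes are made up for the example.

```python
def fire(panes, mode):
    """Returns the sequence of values emitted downstream for one window."""
    outputs = []
    total = 0
    previous = None
    for pane in panes:
        pane_sum = sum(pane)
        total += pane_sum
        if mode == "discard":
            outputs.append(pane_sum)       # only what is new since the last firing
        elif mode == "accumulate":
            outputs.append(total)          # the full result so far
        elif mode == "retract":
            if previous is not None:
                outputs.append(-previous)  # take back the previously emitted result
            outputs.append(total)          # then emit the new full result
        else:
            raise ValueError(f"unknown mode: {mode}")
        previous = total
    return outputs


panes = [[5, 2], [3], [8, 4]]  # three firings for the same window

discarded = fire(panes, "discard")
accumulated = fire(panes, "accumulate")
retracted = fire(panes, "retract")
```

With discarding, a downstream sum of the outputs gives the right total; with accumulating, the latest output is the right total; retractions keep downstream sums correct even when full results are re-emitted.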
FINALLY … HOW TO MANIPULATE THE DATA
▸ map, flatMap, filter …
▸ stateful computation
▸ fold, reduce
▸ past-dependent operations
▸ where to store the state
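A stateful, past-dependent operation can be sketched as a per-key fold whose state lives in an explicit store. In this sketch the store is a plain dict; in a real system it could be local, replicated or external, which is the "where to store the state" question. The name `fold_by_key` and the sample events are assumptions made for the illustration.

```python
def fold_by_key(events, zero, step, store=None):
    """Folds values per key, emitting the updated state after every event."""
    store = {} if store is None else store
    for key, value in events:
        # The new state depends on the past: previous state plus the new value.
        store[key] = step(store.get(key, zero), value)
        yield key, store[key]


state = {}  # stand-in for a state store that outlives a single run
first_run = list(fold_by_key([("a", 1), ("b", 5), ("a", 2)], 0, lambda acc, v: acc + v, state))
second_run = list(fold_by_key([("a", 10)], 0, lambda acc, v: acc + v, state))
```

Because the store is passed in, the second run continues from the state left by the first one, which is what a persistent state backend provides after a restart.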
SUMMING UP
▸ Event/ingestion/processing time
▸ Tumbling/sliding/session windows
▸ Watermarks
▸ Triggers
▸ Accumulation of results
▸ State management
SPARK STREAMING
▸ Micro-batches (DStream)
▸ .window() API:
▸ tumbling/sliding windows
▸ only processing time
▸ no watermarks
▸ triggers at the end of the window
▸ state persisted in cluster (e.g. updateStateByKey())
SPARK STREAMING - WHY BOTHER?
▸ Popular
▸ Not only streaming
▸ ML
▸ SQL
▸ GraphX
▸ but …
SPARK STRUCTURED STREAMING
▸ Alpha in Spark 2.0
▸ Micro-batches not exposed
▸ groupBy(window(…))
▸ Event-time support
▸ No watermarks or session windows yet (2.1?)
▸ Trigger: processing time; outputs changed windows
▸ Exactly-once processing*
FLINK
▸ Mostly with keyed streams (parallelism)
▸ TimeCharacteristic: event/ingestion/processing
▸ TimestampAssigner: also generates watermarks
▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session
▸ Trigger: event/processing time, count, single/continuous
▸ Window function: fold/reduce/with-kv-state
▸ Exactly-once* / at-least-once
KAFKA STREAMS
▸ State: Kafka topics, or a local key-value store backed by a topic for resiliency
▸ Watermarks: no, but windows are retained for 1 day
▸ Time: event/ingestion/processing; TimestampExtractor
▸ Tumbling/sliding windows
▸ Trigger: after every element
▸ aggregate by key&window into an ever-updating KTable
▸ At-least-once
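The "trigger after every element" and "ever-updating table" model can be illustrated without Kafka itself. The sketch below is plain Python, not the Kafka Streams API: every record updates the count for its (key, window) pair and immediately emits the new value as a changelog entry. The function name and the sample records are assumptions made for the example.

```python
def windowed_count_changelog(records, window_size):
    """Counts records per (key, tumbling window) and logs every update."""
    table = {}       # (key, window start) -> latest count
    changelog = []   # one entry per incoming record
    for key, ts in records:
        window_start = ts - ts % window_size
        slot = (key, window_start)
        table[slot] = table.get(slot, 0) + 1
        changelog.append((slot, table[slot]))  # emitted after every element
    return table, changelog


# Hypothetical records: (key, event timestamp); the last one is out of order.
records = [("alice", 1), ("bob", 3), ("alice", 7), ("alice", 12), ("alice", 4)]
table, changelog = windowed_count_changelog(records, window_size=10)
```

Since windows are retained rather than closed by a watermark, the out-of-order record at timestamp 4 simply updates the older window, and consumers see a further update for it.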
AKKA STREAMS
▸ Single-node, no clustering
▸ No out-of-the-box windowing support, but quite easy to implement:
▸ Windows: arbitrary, assign windows to each element
▸ Trigger: only window-close
▸ State: local
▸ Watermarks: can be implemented
SUMMING UP
▸ Spark: widely used, some features missing
▸ Flink: versatile
▸ Kafka: simple model
▸ Akka: single-node
SUMMING UP
▸ Windowing is just one of the aspects
▸ Other:
▸ State management
▸ Work distribution
▸ Processing guarantees
SUMMING UP
▸ Other stream processing systems out there!
▸ Apache Storm
▸ Google Cloud Dataflow
▸ Amazon Kinesis
▸ Apache Beam
▸ …
LINKS
▸ Streaming 101 & 102:
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
▸ https://softwaremill.com/windowing-data-in-akka-streams/