REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018 · OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
TABLE OF CONTENTS
Agenda:
- our RTB platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-DC architecture
- the current iteration: Kafka Workers
- summary
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT
Bid requests:
- 2M/s (peak)
- ~30 SSP networks
- <50-100ms response time

User events:
- 1.5B tags/day
- 350M impressions/day
- 3.5M clicks/day
- 1.5M conversions/day

Other events:
- bid logs, access logs, domain events, etc.
OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ of data processed every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions

Docker (processing components only):
- 44 engines
- 1408 CPU cores, 5.5TB RAM
- 800+ containers
HDFS:
- 2PB+ data, up to 10GB/s

BigQuery:
- 1PB+ data, up to 10GB/min

Elasticsearch:
- 40TB data, up to 50K events/s

Aerospike (processing only):
- 80TB data, up to 8K events/s
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 1ST ITERATION: DRAWBACKS
Issues:
- long, overloading data migrations (30 days back)
- complex servlet logic, no ability to reprocess
- inflexible, varying schemas
- single-DC
- inconsistencies
THE SECOND ITERATION: DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topic partitioning
- partition replication
- log retention
- stateless brokers (consumers track their own offsets)
- efficient data consumption
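For readers new to Kafka's consuming model, a minimal Java consumer sketch (topic and group names are hypothetical, not from the talk): consumers in a group split a topic's partitions among themselves, and committing offsets only after processing gives at-least-once delivery.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "stats-counter");          // consumers in one group share partitions
        props.put("enable.auto.commit", "false");        // commit offsets only after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events"));  // hypothetical topic name
            while (true) {
                // poll() returns a batch of records from the assigned partitions
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync();                   // at-least-once: commit after processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```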
THE 2ND ITERATION: BATCH LOADING
Why Apache Camus:
- "Kafka to HDFS" pipeline
- batch tool
- map-reduce jobs
- offsets stored in log files
- data partitioning
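Camus itself runs as map-reduce jobs, so the sketch below is not its API; it is a plain-consumer illustration of the batch pattern the list describes, assuming a single partition and hypothetical local paths: read from the last stored offset up to the current log end, write the batch out, then persist the new offset.

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class BatchLoader {
    public static void main(String[] args) throws IOException {
        TopicPartition tp = new TopicPartition("user-events", 0);      // hypothetical topic
        Path offsetFile = Paths.get("/tmp/offsets/user-events-0");     // hypothetical path
        long start = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim()) : 0L;

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));
            long end = consumer.endOffsets(List.of(tp)).get(tp);       // batch boundary
            consumer.seek(tp, start);
            Path out = Paths.get("/tmp/batches/user-events-0." + start + "-" + end);
            Files.createDirectories(out.getParent());
            long next = start;
            while (next < end) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    if (r.offset() >= end) { next = end; break; }     // stop at the boundary
                    Files.writeString(out, r.value() + "\n",
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    next = r.offset() + 1;
                }
            }
            Files.createDirectories(offsetFile.getParent());
            Files.writeString(offsetFile, Long.toString(next));        // persist after the batch
        }
    }
}
```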
THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- compact, efficient format
- schema: JSON format, payload: binary format
- self-describing container files
- rich data structures
- schema evolution support, reader & writer schemas

Our approach:
- Avro for both Kafka messages and HDFS files
- schema registry
- avro-fastserde
(github.com/RTBHOUSE/avro-fastserde)
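As a small illustration of reader & writer schemas (plain Avro generic API, not avro-fastserde; the `Click` record and its fields are hypothetical): an event serialized with an older schema is read with a newer one, and Avro fills the added field from its default.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    // Writer schema: the version the event was serialized with.
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"time\",\"type\":\"long\"},"
        + "{\"name\":\"impressionId\",\"type\":\"string\"}]}");

    // Reader schema: a newer version that adds a field with a default.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"time\",\"type\":\"long\"},"
        + "{\"name\":\"impressionId\",\"type\":\"string\"},"
        + "{\"name\":\"deviceType\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord click = new GenericData.Record(WRITER);
        click.put("time", 1538992800000L);
        click.put("impressionId", "imp-42");

        // Serialize with the writer schema: binary payload, schema kept separately.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER).write(click, encoder);
        encoder.flush();

        // Deserialize with writer + reader schemas: Avro resolves the difference
        // and fills the new field with its default value.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord read = new GenericDatumReader<GenericRecord>(WRITER, READER)
                .read(null, decoder);
        System.out.println(read); // deviceType = "unknown"
    }
}
```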
THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault tolerance

Why Trident:
- transactions, exactly-once processing
- microbatches (trading a little latency for throughput)
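A minimal Trident sketch in the spirit of a stats-counter topology, with hypothetical spout and field names: tuples are grouped by campaign and counted into transactional state, which is what gives exactly-once aggregates. A real topology would use a transactional Kafka spout and a persistent state backend rather than the in-memory one used here.

```java
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.spout.ITridentSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class StatsCounterSketch {
    // The spout is assumed to emit tuples with a "campaignId" field.
    static TridentState buildCounts(TridentTopology topology, ITridentSpout<?> spout) {
        return topology.newStream("impressions", spout)          // microbatched stream
                .groupBy(new Fields("campaignId"))
                .persistentAggregate(new MemoryMapState.Factory(), // transactional state
                        new Count(), new Fields("count"));
    }
}
```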
THE 2ND ITERATION: STATS-COUNTER TOPOLOGY
THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)

Other issues:
- Hive joins
- mutable events
- complex servlet logic
THE THIRD ITERATION: NEW APPROACH
THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION”: "URL”, "TIME”, "CREATIVE”, ... "CLICKS”, "CONVERSIONS”}
{ "CLICK”: "TIME”, "IMPRESSION_ID”, ... "IMPRESSION”}
{ "CONVERSION”: "TIME”, "CLICK_ID”, ... "IMPRESSION”, "CLICK”}
New approach:
- real-time processing
- publishing light events
- immutable streams of events
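A hedged sketch of what "light events" can look like in code (hypothetical types, Java 16+ records): the servlet publishes a minimal Click, and the data-flow later joins it with its Impression into a new immutable event, instead of mutating an impression record in place as in the first iteration.

```java
import java.util.Map;

public class LightEvents {
    record Impression(String id, long time, String url, String creative) {}
    record Click(long time, String impressionId) {}                 // light event
    record EnrichedClick(Click click, Impression impression) {}     // joined downstream

    // Sketch of the enrichment step: look the impression up (e.g. in a key-value
    // store fed from the impression stream) and emit the joined, immutable event.
    static EnrichedClick enrich(Click click, Map<String, Impression> store) {
        Impression imp = store.get(click.impressionId());
        return imp == null ? null : new EnrichedClick(click, imp);  // null: retry later
    }
}
```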
THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE
THE 3RD ITERATION: DATA-FLOW TOPOLOGY
THE FOURTH ITERATION: MULTI-DC
THE 4TH ITERATION: NEW REQUIREMENTS
Main changes:
- 5-6x larger scale:
> from 350K to 2M bid requests/s within 1.5 years
- full multi-DC architecture:
> merging streams of events
> synchronization of user profiles
- end-to-end exactly-once processing:
> at-least-once output semantics + deduplication (see the sketch below)
- a few improved components:
> merger
> new stats-counter, new data-flow
> dispatcher & loader
> logstash
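The deduplication half of that equation can be illustrated with a small sketch: output is produced with at-least-once semantics (retries allowed), and a consumer-side filter drops redelivered events by their unique id. The bounded in-memory LRU window below is only an assumption for illustration; any persistent store whose retention covers the maximum redelivery horizon works the same way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// At-least-once + deduplication = effectively-once: drop events whose unique id
// has already been seen within the retention window.
public class Deduplicator {
    private final Map<String, Boolean> seen;

    public Deduplicator(int maxEntries) {
        this.seen = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > maxEntries;   // evict the oldest ids beyond the window
            }
        };
    }

    /** Returns true exactly once per event id; redeliveries return false. */
    public synchronized boolean firstTime(String eventId) {
        return seen.put(eventId, Boolean.TRUE) == null;
    }
}
```

Usage is a one-liner in the consumer loop: `if (dedup.firstTime(event.id())) process(event);`.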
THE 4TH ITERATION: MULTI-DC ARCHITECTURE
THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library, no separate stream-processing cluster
- no external dependencies
- Kafka's parallelism model and group membership mechanism
- event-at-a-time processing (not microbatches)
- exactly-once processing semantics (though at-least-once was good enough for us)
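A minimal Kafka Streams sketch of a data-flow-style stage, assuming hypothetical topic names (`clicks`, `enriched-clicks`): records are transformed one at a time, and the processing guarantee is set to at-least-once, matching the note above.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DataFlowSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "data-flow");   // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once is available, but at-least-once sufficed here:
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks");     // hypothetical topic
        clicks.filter((key, value) -> value != null)                   // event-at-a-time
              .mapValues(DataFlowSketch::enrich)
              .to("enriched-clicks");                                  // hypothetical topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String enrich(String click) {
        return click; // placeholder for the actual join/enrichment logic
    }
}
```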
THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API
THE CURRENT ITERATION: KAFKA WORKERS
THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- ability to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- failure handling
- multiple consumers (in progress)
- Kafka-to-Kafka, HDFS, BigQuery, Elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
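The library's actual API lives in the repository above; the sketch below is not that API, only an illustration of the mechanism named in the feature list, built on the plain Kafka Consumer API: per-partition workers for ordered asynchronous processing, pause/resume for backpressure, and manual offset commits for at-least-once semantics. Topic, group, and threshold values are hypothetical.

```java
import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PausingConsumerSketch {
    static final int MAX_PENDING = 100;   // hypothetical backpressure threshold

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-workers-sketch");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<TopicPartition, ExecutorService> workers = new HashMap<>();   // one thread per partition
        Map<TopicPartition, AtomicInteger> pending = new HashMap<>();     // queued records per partition
        Map<TopicPartition, Long> done = new ConcurrentHashMap<>();       // highest processed offset

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));                        // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(200))) {
                    TopicPartition tp = new TopicPartition(r.topic(), r.partition());
                    pending.computeIfAbsent(tp, k -> new AtomicInteger()).incrementAndGet();
                    // Single-threaded executor per partition keeps processing ordered,
                    // so "done" always holds the highest fully processed offset.
                    workers.computeIfAbsent(tp, k -> Executors.newSingleThreadExecutor())
                           .submit(() -> {
                               process(r);
                               done.put(tp, r.offset());
                               pending.get(tp).decrementAndGet();
                           });
                }
                for (Map.Entry<TopicPartition, AtomicInteger> e : pending.entrySet()) {
                    if (!consumer.assignment().contains(e.getKey())) continue;
                    // Backpressure: pause a partition whose backlog is too deep.
                    if (e.getValue().get() > MAX_PENDING) consumer.pause(Set.of(e.getKey()));
                    else consumer.resume(Set.of(e.getKey()));
                }
                // Tight offset control: commit only what has finished processing.
                Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
                done.forEach((tp, off) -> commits.put(tp, new OffsetAndMetadata(off + 1)));
                if (!commits.isEmpty()) consumer.commitSync(commits);
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```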
THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE
SUMMARY
What we have achieved:
- platform monitoring
- a much more stable platform
- higher-quality data processing
- streaming into HDFS, BigQuery & Elasticsearch
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018 · OCTOBER 10-11, 2018
THANK YOU FOR YOUR ATTENTION