BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ...

23
ARCHITECTURE & LESSONS LEARNED BARTOSZ ŁOŚ REAL-TIME DATA PROCESSING AT RTB HOUSE REAL-TIME DATA PROCESSING AT RTB HOUSE BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY 9, 2017

Transcript of BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ...

Page 1: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

ARCHITECTURE & LESSONS LEARNED

BARTOSZ ŁOŚ

REAL-TIME DATA PROCESSING AT RTB HOUSEREAL-TIME DATA PROCESSING AT RTB HOUSE

BIG DATA TECHNOLOGY WARSAW SUMMIT 2017FEBRUARY 9, 2017

Page 2: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

TABLE OF CONTENTS

Agenda:- real-time bidding- the first iteration: mutable structures- the second iteration: data-flow- the third iteration: immutable streams of events

02/23

Page 3: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

REAL-TIME BIDDING

Page 4: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

REAL-TIME BIDDING: RTB PLATFORM

Processing bid requests(350K/s, ~30 SSP networks, <50-100ms)

04/23

Page 5: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

REAL-TIME BIDDING: DATA & MACHINE LEARNING

Impressions:~ 150M events / day~ 4TB data / day

Clicks:~ 1M events / day~ 35GB data / day

Conversions: ~ 450K events / day ~ 25GB data / day

05/23

Page 6: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE FIRST ITERATION

Page 7: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 1ST ITERATION: MUTABLE IMPRESSIONS 07/23

Page 8: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 1ST ITERATION: DRAWBACKS

Issues:- long, overloading data migrations (30 days back)- complex servlets' logic, inability to reprocess- inflexible, various schemas- single-DC- inconsistencies

08/23

Page 9: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE SECOND ITERATION: DATA-FLOW

Page 10: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE 10/23

Page 11: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: DISTRIBUTED LOG

Why Apache Kafka:- distributed log- topics partitioning- partition replication- log retention- stateless- efficient data consuming

11/23

Page 12: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: BATCH LOADING

Why Apache Camus:- "Kafka to HDFS" pipeline- map-reduce jobs, batches- storing offsets in log files- data partitioning

12/23

Page 13: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: AVRO & SCHEMA VERSIONING

Why Apache Avro:- data serialization framework- rich data structures- self-describing container files- reader & writer schemas- binary data format- schema registry

13/23

Page 14: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: ACCURATE STATISTICS

Why Apache Storm:- real-time processing- streams of tuples, topologies- fault-tolerance

Why Trident:- transactions, exactly-once processing- microbatches (latency & throughput)

14/23

Page 15: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: STATS-COUNTER TOPOLOGY 15/23

Page 16: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 2ND ITERATION: DRAWBACKS

Hybrid architecture:- aggregates (real-time)- raw events (2-hour batches)- joined events (end-of-day batch jobs)

Other issues:- Hive joins- mutable events- servlets' complex logic

16/23

Page 17: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE THIRD ITERATION: NEW APPROACH

Page 18: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 3RD ITERATION: NEW APPROACH

{ "IMPRESSION”: "URL”, "TIME”, "CREATIVE”, ... "CLICKS”, "CONVERSIONS”}

{ "CLICK”: "TIME”, "IMPRESSION_ID”, ... "IMPRESSION”}

{ "CONVERSION”: "TIME”, "CLICK_ID”, ... "CLICK”}

New approach:- real-time processing- publishing light events- immutable streams of events

18/23

Page 19: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE 19/23

Page 20: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 3RD ITERATION: DATA-FLOW TOPOLOGY 20/23

Page 21: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THE 3RD ITERATION: EVENTS MERGE 21/23

Page 22: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

SUMMARY

What we have achieved:- multi-DC architecture- HDFS & BigQuery streaming- platform monitoring- much more stable platform- higher quality of data processing- better data-flow monitoring, deployment & maintenance

22/23

Page 23: BIG DATA TECHNOLOGY WARSAW SUMMIT 2017 FEBRUARY …architecture & lessons learned bartosz ŁoŚ real-time data processing at rtb house big data technology warsaw summit 2017 february

THANK YOUFOR YOUR ATTENTION