Experience with Kafka & Storm

Page 1: Experience with Kafka & Storm

Target and Connect Intelligently

Experience with Kafka & Storm

Otto Mok
Solution Architect, AcuityAds
April 30, 2014 – Toronto Hadoop User Group

Page 2: Experience with Kafka & Storm

Agenda

• Background – What does AcuityAds do?

• Use case – What are we trying to do?

• High-level System Architecture – How does the data flow?

• Kafka & Storm – What did we do wrong?

Page 3: Experience with Kafka & Storm

Background

Source: https://www.google.ca/search?q=banner+ads&tbm=isch&tbo=u

Page 4: Experience with Kafka & Storm

Background

• Digital advertising – website banners, pre-roll video, free mobile apps

• Buy ad impressions in real time – respond within 50 ms to each auction

• Find the best match between people and ads – show ads that you care about

• Use machine learning algorithms to 'learn' – data, data, data

Page 5: Experience with Kafka & Storm

Use case

• 10+ billion daily impressions
• 30,000+ new sites daily

• How many daily impressions by site?

• How are the impressions distributed?
– Country, Province, Gender, Age Range, etc...

Page 6: Experience with Kafka & Storm

High-level System Architecture

• 10+ billion daily bid requests

• Make up to 4 billion daily bids

• Serve millions of daily impressions

• 10+ TB of messages daily

• 300k+ message / second

Data flow: Bidder / Adserver → Kafka → Storm → HBase/Hadoop
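A quick sanity check on these figures: 10 billion messages a day averages out to roughly 116k messages/second, so the 300k+ messages/second figure reflects peak rather than average load. A minimal sketch of the arithmetic:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        long dailyMessages = 10_000_000_000L;  // 10+ billion bid requests per day
        long secondsPerDay = 24L * 60 * 60;    // 86,400 seconds in a day
        long avgPerSecond = dailyMessages / secondsPerDay;
        // 10e9 / 86,400 ≈ 115,740/s on average; 300k+/s is the peak rate
        System.out.println("Average messages/second: " + avgPerSecond);
    }
}
```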

Page 7: Experience with Kafka & Storm

Kafka

Source: http://kafka.apache.org/documentation.html

Page 8: Experience with Kafka & Storm

Kafka - Spec

• Kafka v0.8.0
• Servers – 10 × 2U (10 × 3 TB JBOD each)
• Total storage – 300 TB
• Replication – 3x
• Unique data – 100 TB
• Capacity – a few days
• Producer acknowledgment – never waits
• Topic – BIDREQUEST
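The "never waits" acknowledgment mode corresponds to `request.required.acks=0` in the 0.8 producer API: fire-and-forget, trading delivery guarantees for throughput. A minimal sketch of such a producer configuration; the broker host names are hypothetical:

```java
import java.util.Properties;

public class ProducerConfigSketch {
    public static Properties bidRequestProducerConfig() {
        Properties props = new Properties();
        // Hypothetical broker list for the 10-node cluster
        props.put("metadata.broker.list", "kafka01:9092,kafka02:9092,kafka03:9092");
        // 0 = fire-and-forget: the producer never waits for an acknowledgment,
        // matching the "never waits" setting in the spec above
        props.put("request.required.acks", "0");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(bidRequestProducerConfig().getProperty("request.required.acks"));
    }
}
```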

Page 9: Experience with Kafka & Storm

Kafka - Monitoring

• Nagios – ping, CPU, memory, network I/O, disk space

• Producer–consumer group message counting – hourly consumption-rate check

Topic       Consumer Group ID        Producer Count  Consumer Count  Error            Ratio
BIDREQUEST  InventoryTopology        122,450,812     122,444,294     None             1.00
BIDREQUEST  SearchTargetingTopology  122,450,812     107,755,295     Ratio below 98%  0.88
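The hourly check above boils down to comparing the consumer count against the producer count per consumer group and flagging any ratio below 98%. A minimal sketch, using the table's numbers:

```java
public class ConsumptionRatioCheck {
    // Returns an error message when the consumed/produced ratio drops below 98%,
    // or null when the consumer group is keeping up with the producers
    static String check(long producerCount, long consumerCount) {
        double ratio = (double) consumerCount / producerCount;
        return ratio < 0.98 ? String.format("Ratio below 98%%: %.2f", ratio) : null;
    }

    public static void main(String[] args) {
        // Numbers from the monitoring table above
        System.out.println(check(122_450_812L, 122_444_294L)); // prints "null": ratio ~1.00
        System.out.println(check(122_450_812L, 107_755_295L)); // prints "Ratio below 98%: 0.88"
    }
}
```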

Page 10: Experience with Kafka & Storm

Kafka - Monitoring

• Kafka Web Console – partition offset for each consumer group

Page 11: Experience with Kafka & Storm

Kafka - Issues

• Issue 1 – Partitions
– 10 partitions
– Each partition grows by > 1 TB a day
– 100 TB / 1 TB – no problem!

• But each partition is stored in a single directory, i.e. entirely on one JBOD disk:
– /disk05/kafka-logs/BIDREQUEST-09
– /disk09/kafka-logs/BIDREQUEST-03

• So a single 3 TB disk caps how much of a partition can be retained, even though the cluster as a whole has space to spare
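The cluster-level arithmetic hides a per-disk problem: because a partition lives in one directory on one disk, retention is bounded by a 3 TB disk rather than by the 300 TB cluster. A sketch of that arithmetic, assuming the ~1 TB/day per-partition growth rate above:

```java
public class PartitionRetentionCheck {
    public static void main(String[] args) {
        double diskTB = 3.0;             // one JBOD disk holds the whole partition directory
        double partitionTBPerDay = 1.0;  // each partition grows by > 1 TB a day
        double maxRetentionDays = diskTB / partitionTBPerDay;
        // A 3 TB disk caps retention at ~3 days per partition,
        // regardless of the 300 TB of total cluster storage
        System.out.println("Max retention days per partition: " + maxRetentionDays);
    }
}
```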

Page 12: Experience with Kafka & Storm

Kafka - Issues

• Issue 2 – Unbalanced partition distribution
– Some servers run out of space
– Some servers are not the "leader" for any partition

• A network glitch causes a server to drop out of the cluster; after rejoining, it is no longer the leader for its partitions

• Fix: auto.leader.rebalance.enable=true

Page 13: Experience with Kafka & Storm

Lots of data – now what?

Source: http://bookriotcom.c.presscdn.com/wp-content/uploads/2013/03/server-farm-shot.jpg

Page 14: Experience with Kafka & Storm

Use case - again

• 10+ billion daily impressions
• 30,000+ new sites daily

• How many daily impressions by site?

• How are the impressions distributed?
– Country, Province, Gender, Age Range, etc...

Page 15: Experience with Kafka & Storm

Storm

Source: http://storm.incubator.apache.org/documentation/Tutorial.html

Page 16: Experience with Kafka & Storm

Storm - Spec

• Storm v0.8.2
• Servers – 13 × dual quad-core Xeon, 36 GB RAM
• 4 worker slots per server
• Total logical CPUs – 208
• Total memory – 468 GB
• Total slots – 52 worker slots (JVMs)
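The totals follow directly from the per-server figures, assuming hyper-threading on the dual quad-core Xeons (16 logical CPUs per server):

```java
public class StormClusterTotals {
    public static void main(String[] args) {
        int servers = 13;
        int logicalCpusPerServer = 2 * 4 * 2; // dual quad-core with hyper-threading
        int ramGBPerServer = 36;
        int slotsPerServer = 4;
        System.out.println("Logical CPUs: " + servers * logicalCpusPerServer); // 208
        System.out.println("Memory (GB):  " + servers * ramGBPerServer);       // 468
        System.out.println("Worker slots: " + servers * slotsPerServer);       // 52
    }
}
```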

Page 17: Experience with Kafka & Storm

Storm - Monitor

Page 18: Experience with Kafka & Storm

Storm - Topology

• Spout reads each BidRequest from the Kafka topic
• Determines whether the inventory is new or existing, and emits tuples to different "streams"
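The new-vs-existing routing decision can be sketched independently of the Storm API: track which inventory keys have been seen, and pick the stream name accordingly. The stream names match the NewInventory/ExistingInventory streams the bolts below consume; the in-memory seen-set here is a hypothetical stand-in for whatever state the real topology consults:

```java
import java.util.HashSet;
import java.util.Set;

public class InventoryStreamRouter {
    private final Set<String> seenInventory = new HashSet<>();

    // Decide which stream a bid request's inventory tuple belongs on:
    // first sighting goes to NewInventory, later sightings to ExistingInventory
    public String route(String sourceId, String domainName) {
        String key = sourceId + "|" + domainName;
        return seenInventory.add(key) ? "NewInventory" : "ExistingInventory";
    }

    public static void main(String[] args) {
        InventoryStreamRouter router = new InventoryStreamRouter();
        System.out.println(router.route("42", "example.com")); // NewInventory
        System.out.println(router.route("42", "example.com")); // ExistingInventory
    }
}
```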

Page 19: Experience with Kafka & Storm

Storm - Topology

• InsertInventoryBolt
– Processes tuples from the NewInventory stream
– Field grouping on sourceId, domainName
– Tick tuple every 1 second

• UpdateInventoryBolt
– Processes tuples from the ExistingInventory stream
– Field grouping on inventoryId
– Tick tuple every 1 second

Page 20: Experience with Kafka & Storm

Storm - Topology

• LogInventoryBolt
– Processes tuples from the ExistingInventory stream
– Field grouping on inventoryId
– Tick tuple every 10 seconds
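All three bolts follow the same pattern: accumulate per-key updates in memory and flush them when a tick tuple arrives, rather than writing downstream per tuple. A minimal sketch of that accumulate-and-flush pattern, with the Storm tick mechanics left out:

```java
import java.util.HashMap;
import java.util.Map;

public class BatchingCounter {
    private final Map<String, Long> pending = new HashMap<>();

    // Called once per incoming tuple: just accumulate in memory
    public void count(String inventoryId) {
        pending.merge(inventoryId, 1L, Long::sum);
    }

    // Called on each tick tuple: hand back the batch and start fresh;
    // the real bolt would write this batch to HBase here
    public Map<String, Long> flush() {
        Map<String, Long> batch = new HashMap<>(pending);
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        BatchingCounter counter = new BatchingCounter();
        counter.count("inv-1");
        counter.count("inv-1");
        counter.count("inv-2");
        System.out.println(counter.flush()); // counts for inv-1 and inv-2
        System.out.println(counter.flush()); // empty: previous flush cleared the batch
    }
}
```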

Page 21: Experience with Kafka & Storm

Storm - Issues

• Issue – Low uptime
– 10 workers, 100 executors
– Not processing many tuples
– Process latency < 10 ms

• Bolt restarts due to uncaught exceptions, which take down the whole worker JVM
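In Storm 0.8, an exception that escapes a bolt kills the worker JVM, which the supervisor then restarts; hence the low uptime. The usual guard is to catch per-tuple failures inside the bolt and report the offending tuple instead of letting it crash the worker. A sketch of that guard, with the per-tuple work reduced to a hypothetical parse step:

```java
public class GuardedExecuteSketch {
    // Stand-in for the real per-tuple work, which may throw on malformed input
    static long parseImpressionCount(String field) {
        return Long.parseLong(field);
    }

    // Wrap per-tuple work so one bad tuple is logged and skipped instead of
    // crashing the worker; a real bolt would collector.fail(tuple) here
    static boolean safeExecute(String field) {
        try {
            parseImpressionCount(field);
            return true;   // processed (and would be acked)
        } catch (Exception e) {
            System.err.println("Skipping bad tuple: " + e);
            return false;  // failed, but the worker stays up
        }
    }

    public static void main(String[] args) {
        System.out.println(safeExecute("12345"));      // true
        System.out.println(safeExecute("not-a-long")); // false, no crash
    }
}
```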

Page 22: Experience with Kafka & Storm

Conclusion

• Cost
– Bleeding-edge technology bugs
– Support only via mailing lists
– Monitoring – roll your own
– Operations – dedicated personnel

• Benefit
– Near real-time data on site impression volume & distribution by geo, demo, etc...

Page 23: Experience with Kafka & Storm

Forward Looking

• Kafka v0.8.1.1
– Allows specifying the broker hostname for producers & consumers
– Allows changing the # of partitions of a topic online

• Storm v0.9.1
– Faster pure-Java Netty transport
– View logs from each server in the Storm UI
– Tick tuples using floating-point seconds
– Storm on Hadoop (HDP 2.1)

Page 24: Experience with Kafka & Storm

Thank you

Otto Mok
[email protected]

Source: http://jamesgieordano.files.wordpress.com/2011/05/babyelephant.jpg