Apache Kafka
Source: shadam1/491s16/lectures/04-Kafka.pdf

Apache Kafka, CMSC 491 Hadoop-Based Distributed Computing, Spring 2016, Adam Shook

Page 1: Apache Kafka

CMSC 491 Hadoop-Based Distributed Computing

Spring 2016, Adam Shook

Page 2: Overview

• Kafka is a “publish-subscribe messaging rethought as a distributed commit log”
  – Fast
  – Scalable
  – Durable
  – Distributed

Page 3: Kafka adoption and use cases

• LinkedIn: activity streams, operational metrics, data bus
  – 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix: real-time monitoring and event processing
• Twitter: as part of their Storm real-time data pipelines
• Spotify: log delivery (from 4h down to 10s), Hadoop
• Loggly: log collection and processing
• Mozilla: telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …

Page 4: How fast is Kafka?

• “Up to 2 million writes/sec on 3 cheap machines”
  – Using 3 producers on 3 different machines, 3x async replication
    • Only 1 producer/machine because NIC already saturated
• Sustained throughput as stored data grows
  – Slightly different test config than 2M writes/sec above.

Page 5: Why is Kafka so fast?

• Fast writes:
  – While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
• Fast reads:
  – Very efficient to transfer data from page cache to a network socket
  – Linux: sendfile() system call
• Combination of the two = fast Kafka!
  – Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks, as they will be serving data entirely from cache.
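The read path described above can be illustrated with a minimal Python sketch. It is not Kafka's code: it only shows the sendfile() idea using the standard library's `socket.sendfile()`, which uses `os.sendfile()` where the platform provides it and falls back to ordinary sends otherwise. The function names `send_log_segment` and `demo`, the payload, and the use of a local socket pair in place of a real consumer connection are all assumptions made for illustration.

```python
import os
import socket
import tempfile


def send_log_segment(path: str, sock: socket.socket) -> int:
    """Send the whole file at `path` over `sock` and return the bytes sent."""
    with open(path, "rb") as segment:
        # Where os.sendfile() is available, the kernel moves the bytes from
        # the page cache to the socket without copying them into user space.
        return sock.sendfile(segment)


def demo():
    """Write a tiny hypothetical 'log segment' and ship it over a socket pair."""
    payload = b"message-1\nmessage-2\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        received = consumer_side.recv(4096)
    finally:
        broker_side.close()
        consumer_side.close()
        os.remove(path)
    return sent, received
```

Because the file was just written, its contents are still in the page cache, so the transfer in `demo()` would normally not need to touch the disk, which mirrors the "no read activity on the disks" observation.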

Page 6: A first look

• The who is who
  – Producers write data to brokers.
  – Consumers read data from brokers.
  – All this is distributed.
• The data
  – Data is stored in topics.
  – Topics are split into partitions, which are replicated.

Page 7: A first look

[Figure: overview diagram]

Page 8: Topics

• Topic: feed name to which messages are published
  – Example: “zerg.hydra”

[Figure: a Kafka topic on the broker(s), drawn as a log running from older msgs to newer msgs. Producers A1, A2, … An write new messages.]

• Producers always append to “tail” (think: append to a file)
• Kafka prunes “head” based on age or max size or “key”
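The append-to-tail, prune-from-head behaviour can be sketched as a toy in-memory log. This is a simplified model, not Kafka's implementation: the class and parameter names (`TopicLog`, `max_age_s`, `max_bytes`) are hypothetical, it prunes message by message (Kafka deletes whole segment files), and “key”-based pruning (log compaction) is not modelled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    offset: int
    timestamp: float
    payload: bytes


class TopicLog:
    """Toy append-only log with age- and size-based pruning of the head."""

    def __init__(self, max_age_s: float, max_bytes: int) -> None:
        self.max_age_s = max_age_s
        self.max_bytes = max_bytes
        self.messages = deque()
        self.next_offset = 0
        self.size_bytes = 0

    def append(self, payload: bytes, now: float) -> int:
        """Append to the tail and return the offset assigned to the message."""
        self.messages.append(Message(self.next_offset, now, payload))
        self.next_offset += 1
        self.size_bytes += len(payload)
        return self.next_offset - 1

    def prune(self, now: float) -> int:
        """Drop messages from the head while they are too old or the log is too big."""
        removed = 0
        while self.messages:
            head = self.messages[0]
            too_old = now - head.timestamp > self.max_age_s
            too_big = self.size_bytes > self.max_bytes
            if not (too_old or too_big):
                break
            self.messages.popleft()
            self.size_bytes -= len(head.payload)
            removed += 1
        return removed
```

Offsets keep increasing after pruning, so a message's offset never changes once it has been assigned.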

Page 9: Topics

[Figure: the same topic log on the broker(s), from older msgs to newer msgs. Producers A1, A2, … An write new messages; consumer groups C1 and C2 read from it.]

• Producers always append to “tail” (think: append to a file)
• Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption)
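To make the “offset pointer” idea concrete, here is a minimal sketch in which two consumer groups read the same log at different paces. The log is just a Python list standing in for a single-partition topic, and `ConsumerGroup`, `poll`, and `seek` are hypothetical names chosen for this example rather than a real client API.

```python
class ConsumerGroup:
    """Tracks one read position (the offset pointer) into a shared log."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.offset = 0  # offset of the next message to read

    def poll(self, log: list, max_messages: int) -> list:
        """Read up to `max_messages` from the current offset and advance it."""
        batch = log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the pointer, for example to re-read older messages."""
        self.offset = offset


log = ["m0", "m1", "m2", "m3", "m4"]
c1 = ConsumerGroup("C1")
c2 = ConsumerGroup("C2")

fast_batch = c1.poll(log, max_messages=5)  # C1 reads everything at once
slow_batch = c2.poll(log, max_messages=2)  # C2 reads at a slower pace
c1.seek(3)                                 # C1 rewinds its own pointer
replayed = c1.poll(log, max_messages=10)
```

Reading does not remove anything from the log, which is why each group can choose its own pace and even rewind.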

Page 10: Partitions

• A topic consists of partitions.
• Partition: ordered + immutable sequence of messages that is continually appended to

Page 11: Partitions

• # partitions of a topic is configurable
• # partitions determines max consumer (group) parallelism
  – cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
  – Consumer group A, with 2 consumers, reads from a 4-partition topic
  – Consumer group B, with 4 consumers, reads from the same topic
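The link between partition count and consumer parallelism can be sketched as a small assignment function. Round-robin assignment is used here only as a simplifying assumption (Kafka's actual assignment strategies differ in detail), and `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Spread partitions over the consumers of one group, round-robin.

    Each partition goes to exactly one consumer in the group, so consumers
    beyond `num_partitions` end up with nothing to read.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Consumer group A: 2 consumers on a 4-partition topic, 2 partitions each.
group_a = assign_partitions(4, ["a1", "a2"])
# Consumer group B: 4 consumers on the same topic, 1 partition each.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
# A hypothetical group with 6 consumers: two of them stay idle.
group_c = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
```

Groups A and B are assigned independently, so both read all four partitions of the same topic.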

Page 12: Partition offsets

• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
  – Consumers track their pointers via (offset, partition, topic) tuples

[Figure label: Consumer group C1]
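A short sketch of per-partition offsets, under stated assumptions: `Partition` and `Consumer` are hypothetical toy classes, and the consumer stores its pointers as a dictionary from (topic, partition) to the next offset, which carries the same information as the (offset, partition, topic) tuples.

```python
class Partition:
    """Toy partition: an append-only list whose index is the offset."""

    def __init__(self) -> None:
        self.messages = []

    def append(self, payload: str) -> int:
        """Append a message and return its offset within this partition."""
        self.messages.append(payload)
        return len(self.messages) - 1


class Consumer:
    """Tracks the next offset to read for each (topic, partition) pair."""

    def __init__(self) -> None:
        self.positions = {}

    def fetch(self, topic: str, partition_id: int, partition: Partition,
              max_messages: int = 10) -> list:
        key = (topic, partition_id)
        start = self.positions.get(key, 0)
        batch = partition.messages[start:start + max_messages]
        self.positions[key] = start + len(batch)
        return batch


p0, p1 = Partition(), Partition()
offsets = [p0.append("a"), p0.append("b"), p1.append("c")]

consumer = Consumer()
batch = consumer.fetch("zerg.hydra", 0, p0)
```

Offsets are unique only within a partition: both partitions above contain a message with offset 0, which is why the topic and partition have to be part of the pointer.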

Page 13: Replicas of a partition

• Replicas: “backups” of a partition
  – They exist solely to prevent data loss.
  – Replicas are never read from, never written to.
    • They do NOT help to increase producer or consumer parallelism!
  – Kafka tolerates (numReplicas - 1) dead brokers before losing data
    • LinkedIn: numReplicas == 2 → 1 broker can die
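The (numReplicas - 1) rule is simple arithmetic, sketched below. The function names and the broker ids are hypothetical, and the model assumes each replica of a partition lives on a different broker and that every replica holds a full copy of the data.

```python
def tolerated_broker_failures(num_replicas: int) -> int:
    """Number of brokers that can die before a partition's data is lost."""
    if num_replicas < 1:
        raise ValueError("a partition needs at least one replica")
    return num_replicas - 1


def partition_survives(replica_brokers: set, dead_brokers: set) -> bool:
    """A partition's data survives while at least one replica's broker is alive."""
    return bool(replica_brokers - dead_brokers)


# The LinkedIn example: numReplicas == 2, so 1 broker can die.
linkedin_tolerance = tolerated_broker_failures(2)
one_dead = partition_survives({1, 2}, dead_brokers={1})
both_dead = partition_survives({1, 2}, dead_brokers={1, 2})
```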

Page 14: Kafka Quickstart

• Steps for downloading Kafka, starting a server, and creating a console-based consumer/producer
• Requires ZooKeeper to be installed and running
• https://kafka.apache.org/documentation.html#quickstart
• https://github.com/adamjshook/hadoop-demos/tree/master/kafka