Apache Kafka
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016 Adam Shook
Overview
• Kafka is a “publish-subscribe messaging rethought as a distributed commit log”
• Fast
• Scalable
• Durable
• Distributed
Kafka adoption and use cases
• LinkedIn: activity streams, operational metrics, data bus
– 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix: real-time monitoring and event processing
• Twitter: as part of their Storm real-time data pipelines
• Spotify: log delivery (from 4h down to 10s), Hadoop
• Loggly: log collection and processing
• Mozilla: telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …
How fast is Kafka?
• “Up to 2 million writes/sec on 3 cheap machines”
– Using 3 producers on 3 different machines, 3x async replication
• Only 1 producer/machine because NIC already saturated
• Sustained throughput as stored data grows
– Slightly different test config than 2M writes/sec above.
Why is Kafka so fast?
• Fast writes:
– While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
• Fast reads:
– Very efficient to transfer data from page cache to a network socket
– Linux: sendfile() system call
• Combination of the two = fast Kafka!
– Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.
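The read path described above can be tried out with a small sketch. It assumes Python 3 on a Unix-like system and uses `socket.sendfile()`, which relies on `os.sendfile()` where the platform supports it. The helper name `serve_log_segment` and the file contents are made up for illustration; Kafka itself runs on the JVM, so this is not its code.

```python
import os
import socket
import tempfile


def serve_log_segment(path, sock, offset=0, count=None):
    """Send file bytes to a connected socket; returns the number of bytes sent.

    socket.sendfile() uses os.sendfile() when available, so the kernel moves
    data from the page cache to the socket without copying it through a
    user-space buffer. Otherwise it falls back to plain send().
    """
    with open(path, "rb") as segment:
        return sock.sendfile(segment, offset=offset, count=count)


def demo():
    payload = b"msg-0\nmsg-1\nmsg-2\n"  # made-up log contents
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    # A connected socket pair stands in for the broker/consumer connection.
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = serve_log_segment(path, broker_side)
        broker_side.close()  # signals end of stream to the reader
        received = b""
        while True:
            chunk = consumer_side.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        broker_side.close()
        consumer_side.close()
        os.unlink(path)
    return sent, received, payload


if __name__ == "__main__":
    sent, received, payload = demo()
    print(sent, received == payload)
```

Because the file was just written, its bytes are most likely still in the page cache, which mirrors the "consumers are mostly caught up" case: the transfer needs no disk read.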
A first look
• The who is who
– Producers write data to brokers.
– Consumers read data from brokers.
– All this is distributed.
• The data
– Data is stored in topics.
– Topics are split into partitions, which are replicated.
Topics
• Topic: feed name to which messages are published
– Example: “zerg.hydra”
[Figure: producers A1, A2, …, An write new messages to a Kafka topic on the broker(s), ordered from older to newer msgs; consumer groups C1 and C2 read from it]
– Producers always append to “tail” (think: append to a file)
– Kafka prunes “head” based on age or max size or “key”
– Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption)
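The tail-append, head-prune, and offset-pointer behavior can be modelled with a toy in-memory log. This is only a sketch under stated assumptions, not Kafka's implementation: the class `TopicLog`, the helper `demo_topic_log`, and the sample messages are hypothetical, "max size" is approximated by a message count rather than bytes, and key-based pruning is left out.

```python
from collections import deque


class TopicLog:
    """Toy model of one topic log (hypothetical, not Kafka's code)."""

    def __init__(self):
        self._entries = deque()  # (offset, timestamp, message), oldest first
        self._next_offset = 0

    def append(self, message, timestamp):
        """Producers only ever append to the tail; returns the new offset."""
        offset = self._next_offset
        self._entries.append((offset, timestamp, message))
        self._next_offset += 1
        return offset

    def prune(self, now, max_age=None, max_messages=None):
        """Drop messages from the head by age and/or by a message-count cap."""
        while (self._entries and max_age is not None
               and now - self._entries[0][1] > max_age):
            self._entries.popleft()
        while max_messages is not None and len(self._entries) > max_messages:
            self._entries.popleft()

    def read(self, offset, limit):
        """Return up to `limit` (offset, message) pairs starting at `offset`.

        Simplification: a reader whose offset was already pruned silently
        continues at the oldest retained message.
        """
        return [(o, m) for o, _, m in self._entries if o >= offset][:limit]


def demo_topic_log():
    log = TopicLog()
    for t, msg in enumerate(["a", "b", "c", "d", "e"]):
        log.append(msg, timestamp=t)

    positions = {"C1": 0, "C2": 0}  # one offset pointer per consumer group

    def poll(group, limit):
        batch = log.read(positions[group], limit)
        if batch:
            positions[group] = batch[-1][0] + 1
        return [m for _, m in batch]

    fast = poll("C1", 4)          # C1 reads quickly
    slow = poll("C2", 1)          # C2 reads at its own, slower pace
    log.prune(now=5, max_age=3)   # drops "a" and "b" from the head
    after_prune = poll("C2", 10)  # C2 resumes at the oldest retained message
    return fast, slow, after_prune, positions
```

Each consumer group advances only its own pointer, so a slow group never holds back a fast one; the log itself is shared and unchanged by reads.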
Partitions
• A topic consists of partitions.
• Partition: ordered + immutable sequence of messages that is continually appended to
• #partitions of a topic is configurable
• #partitions determines max consumer (group) parallelism
– cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
– Consumer group A, with 2 consumers, reads from a 4-partition topic
– Consumer group B, with 4 consumers, reads from the same topic
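Why the partition count caps consumer parallelism can be seen in a short sketch. It assumes a simple round-robin assignment, which is an illustration only and may differ from the assignment strategies Kafka actually uses; the function name, the consumer names, and the third group with 5 consumers are hypothetical additions.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over the consumers of one group, round-robin.

    Within a group each partition is owned by exactly one consumer, so any
    consumers beyond `num_partitions` end up with nothing to read.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Group A: 2 consumers on a 4-partition topic, 2 partitions each.
group_a = assign_partitions(4, ["a1", "a2"])
# Group B: 4 consumers on the same topic, 1 partition each.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
# Hypothetical group with 5 consumers: the fifth one stays idle.
group_c = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])

if __name__ == "__main__":
    print(group_a)
    print(group_b)
    print(group_c)
```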
Partition offsets
• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
– Consumers track their pointers via (offset, partition, topic) tuples
[Figure: consumer group C1]
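A minimal sketch of per-partition offsets follows. The class `PartitionedTopic`, the function `poll_partition`, and the messages are hypothetical; the producer picks the partition explicitly here, whereas a real client would normally choose it for you.

```python
class PartitionedTopic:
    """Toy topic whose partitions are plain Python lists (not Kafka's code)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append to one partition; the offset is the index in that partition."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1


def poll_partition(topic, positions, partition, limit):
    """Read up to `limit` messages and advance the (topic, partition) pointer."""
    key = (topic.name, partition)
    start = positions[key]
    batch = topic.partitions[partition][start:start + limit]
    positions[key] = start + len(batch)
    return batch


topic = PartitionedTopic("zerg.hydra", num_partitions=2)
offsets = [
    topic.append(0, "m0"),
    topic.append(1, "m1"),
    topic.append(0, "m2"),
    topic.append(1, "m3"),
    topic.append(1, "m4"),
]

# One consumer's pointers: (topic, partition) -> next offset to read.
positions = {("zerg.hydra", 0): 0, ("zerg.hydra", 1): 0}
first_batch = poll_partition(topic, positions, partition=1, limit=2)
```

Offsets repeat across partitions (both partitions have an offset 0), which is why a read position is only meaningful together with its partition and topic.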
Replicas of a partition
• Replicas: “backups” of a partition
– They exist solely to prevent data loss.
– Replicas are never read from, never written to by clients.
• They do NOT help to increase producer or consumer parallelism!
– Kafka tolerates (numReplicas - 1) dead brokers before losing data
• LinkedIn: numReplicas == 2 → 1 broker can die
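The (numReplicas - 1) rule can be checked with a few lines. This is a sketch under assumptions: the 3-broker cluster and its replica placement are hypothetical, and numReplicas counts every copy of a partition, as in the LinkedIn example.

```python
def tolerated_broker_failures(num_replicas):
    """Dead brokers a partition can survive without losing data."""
    return num_replicas - 1


def lost_partitions(replica_assignment, dead_brokers):
    """Return the partitions whose replicas all sit on dead brokers."""
    dead = set(dead_brokers)
    return sorted(
        partition
        for partition, brokers in replica_assignment.items()
        if set(brokers) <= dead
    )


# Hypothetical placement: partition id -> brokers holding a copy (numReplicas == 2).
assignment = {0: [1, 2], 1: [2, 3], 2: [3, 1]}

one_dead = lost_partitions(assignment, dead_brokers=[1])     # nothing lost
two_dead = lost_partitions(assignment, dead_brokers=[1, 2])  # partition 0 lost

if __name__ == "__main__":
    print(tolerated_broker_failures(2), one_dead, two_dead)
```

With two copies per partition, any single broker can die safely, while a second failure loses whichever partition had both copies on the dead pair.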
Kafka Quickstart
• Steps for downloading Kafka, starting a server, and creating a console-based consumer/producer
• Requires ZooKeeper to be installed and running
• https://kafka.apache.org/documentation.html#quickstart
• https://github.com/adamjshook/hadoop-demos/tree/master/kafka