Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Streaming Revolution
Multi-Datacenter Kafka - Strata San Jose 2017
When One Data Center Is Not Enough: Building Large-scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka
Gwen Shapira
There’s a book on that!
Actually… a chapter
Outline
• Kafka overview
• Common multi-data-center patterns
• Future stuff
What is Kafka?
• It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or a “streaming data platform”
[Diagram: a partition as an append-only log (offsets 0–8); a data source writes to the head while data consumers A and B read at independent positions]
Topics and Partitions
• Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
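Since ordering only holds within a partition, producers route all messages with the same key to the same partition. A minimal Python sketch of that idea (Kafka’s real default partitioner uses murmur2 on the key bytes; md5 here is just a deterministic stand-in, and the keys are made up):

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's murmur2(keyBytes) % numPartitions
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All messages with the same key land in the same partition,
# so per-key ordering is preserved within that partition.
log = {p: [] for p in range(NUM_PARTITIONS)}
for i, key in enumerate(["user-a", "user-b", "user-a", "user-c", "user-a"]):
    log[partition_for(key)].append((key, i))

user_a_partition = partition_for("user-a")
user_a_events = [i for k, i in log[user_a_partition] if k == "user-a"]
print(user_a_events)  # events 0, 2, 4 arrive in order: [0, 2, 4]
```

Two messages with different keys may land in different partitions, so nothing is guaranteed about their relative order.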
[Diagram: a topic with partitions 0–2, each an ordered log (offsets 0–8), all fed by one data source]
Scalable consumption model
[Diagram: topic T1 with partitions 0–3. With a single consumer in Consumer Group 1, it reads all four partitions; with four consumers in the group, each is assigned one partition]
Kafka usage
Common use case
Large scale real time data integration
Other use cases
• Scaling databases
• Messaging
• Stream processing
• …
Important things to remember:
1. Consumers commit offsets
2. Within a cluster, each partition has replicas
3. Inter-cluster replication and producer/consumer defaults are all tuned for the LAN
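Offset commits are what make failover semantics tricky later in this talk. A toy simulation (no real Kafka client; all names invented) of why committing after processing yields at-least-once delivery: a crash between processing and commit means those messages are re-read on restart.

```python
# Minimal simulation of consumer offset commits: the consumer commits
# the next offset to read *after* processing. A crash between processing
# and committing re-delivers those messages on restart.
messages = ["m0", "m1", "m2", "m3", "m4"]
committed = 0          # last committed offset (next offset to read)
processed = []

def run(crash_after=None):
    global committed
    offset = committed
    handled = 0
    while offset < len(messages):
        processed.append(messages[offset])   # process the message
        offset += 1
        handled += 1
        if crash_after is not None and handled == crash_after:
            return                           # crash before committing
        committed = offset                   # commit after processing

run(crash_after=2)    # processes m0 and m1, but only commits through m0
run()                 # restart: resumes at the committed offset, so m1 repeats
print(processed)      # ['m0', 'm1', 'm1', 'm2', 'm3', 'm4']
```

Committing *before* processing flips this to at-most-once: a crash would skip messages instead of duplicating them.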
Why multiple data centers (DC)
• Offload work from the main cluster
• Disaster recovery
• Geo-localization
- Saving cross-DC bandwidth
- Better performance by being closer to users
- Some activity is just local
- Security / regulations
• Cloud
• Special case: producers with network issues
Why is this difficult?
1. It isn’t, really: you consume data from one cluster and produce to another
2. The network between two data centers can get tricky
3. Consumers have state (offsets), and syncing this between clusters gets tough
- And leads to some counterintuitive results
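The first point can be illustrated without any networking at all. This in-memory sketch (invented data, and a hypothetical round-robin repartitioning rule) mirrors records from a 2-partition "source" into a 3-partition "target" and shows that a record's (partition, offset) coordinates change, which is why consumer offsets cannot simply be carried across clusters:

```python
# Cross-cluster replication is conceptually just consume-then-produce.
# Offsets are NOT preserved: the target cluster may have a different
# partition count, so a record's (partition, offset) coordinates change.
source = {0: ["a", "b"], 1: ["c", "d"]}      # source cluster: 2 partitions
target = {0: [], 1: [], 2: []}               # target cluster: 3 partitions

def mirror(source, target):
    for src_part, records in source.items():
        for offset, value in enumerate(records):
            # Re-partition on the target (simple round-robin for the sketch)
            tgt_part = (src_part + offset) % len(target)
            target[tgt_part].append(value)

mirror(source, target)

# "c" sat at (partition 1, offset 0) on the source...
src_coord = (1, source[1].index("c"))
# ...but lands at different coordinates on the target.
tgt_part = next(p for p, recs in target.items() if "c" in recs)
tgt_coord = (tgt_part, target[tgt_part].index("c"))
print(src_coord, tgt_coord)  # (1, 0) (1, 1)
```

All the data arrives, but a consumer's committed offsets from the source cluster point at the wrong records on the target.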
Pattern #1: stretched cluster
Typically done on AWS in a single region
• Deploy ZooKeeper and brokers across 3 availability zones
• Rely on intra-cluster replication to replicate data across DCs
[Diagram: one stretched Kafka cluster spanning DC 1, DC 2, and DC 3, with producers and consumers in each DC]
On DC failure
• Producers/consumers fail over to the surviving DCs
• Existing data is preserved by intra-cluster replication
• Consumers resume from last committed offsets and will see the same data
[Diagram: after a DC failure, the producers and consumers from the failed DC reconnect to the surviving DCs]
When DC comes back
• Intra-cluster replication automatically re-replicates all missing data
• When re-replication completes, switch producers/consumers back
Be careful with replica assignment
• Don’t want all replicas in the same AZ
• Rack-aware support in 0.10.0
- Configure brokers in the same AZ with the same broker.rack
• Manual assignment pre-0.10.0
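A sketch of what that broker configuration might look like in server.properties (the AZ name and broker id are illustrative):

```
# server.properties for a broker running in availability zone us-east-1a
broker.id=1
broker.rack=us-east-1a

# Brokers in the same AZ share the same broker.rack value, so the
# rack-aware assignment (0.10.0+) spreads a topic's replicas across AZs.
```

With replication factor 3 and three racks configured, each partition then gets one replica per AZ.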
Stretched cluster NOT recommended across regions
• Asymmetric network partitioning
• Longer network latency => longer produce/consume time
• Cross-region bandwidth: no read affinity in Kafka
[Diagram: Kafka and ZooKeeper nodes spread across regions 1, 2, and 3]
Pattern #2: active/passive
• Producers in the active DC
• Consumers in either the active or the passive DC
[Diagram: an active Kafka cluster in DC 1 serving producers and consumers (critical apps); replication feeds a passive Kafka cluster in DC 2 serving consumers (nice-to-have reports)]
Cross Datacenter Replication
• Consumer & producer: read from a source cluster and write to a target cluster
• Per-key ordering preserved
• Asynchronous: target always slightly behind
• Offsets not preserved
- Source and target may not have the same number of partitions
- Retries for failed writes
Options:
• Confluent Multi-Datacenter Replication
• MirrorMaker
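For reference, a MirrorMaker invocation might look roughly like this (hostnames, group id, and topic whitelist are all made up; check the flags against your Kafka version before relying on them):

```
# consumer.properties — points at the SOURCE (remote) cluster
bootstrap.servers=source-kafka:9092
group.id=mirror-maker-group

# producer.properties — points at the TARGET (local) cluster
bootstrap.servers=target-kafka:9092

# Run MirrorMaker, whitelisting the topics to copy
bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist "clicks.*"
```

Note that MirrorMaker is itself a consumer group, so its replication lag can be monitored like any other consumer's.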
On active DC failure
• Fail over producers/consumers to the passive cluster
• Challenge: which offset to resume consumption from
- Offsets are not identical across clusters
Solutions for switching consumers
• Resume from smallest offset
- Duplicates
• Resume from largest offset
- May miss some messages (likely acceptable for real-time consumers)
• Replicate the offsets topic
- May miss some messages, may get duplicates
• Set offset based on timestamp
- Old API hard to use and not precise
- Better and more precise API in Apache Kafka 0.10.1 (Confluent 3.1)
- Nice tool coming up!
• Preserve offsets during replication
- Harder to do
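The timestamp approach works because timestamps, unlike offsets, travel with the messages during replication. A pure-Python stand-in for the lookup the 0.10.1 API (offsetsForTimes on the Java consumer) performs, using invented per-partition (timestamp, offset) data:

```python
# Stand-in for the timestamp-based offset lookup added in Kafka 0.10.1:
# for each partition, resume at the earliest offset whose message
# timestamp is at or after the chosen point in time.
def offsets_for_time(partitions, target_ts):
    """partitions: {partition: [(timestamp, offset), ...]} sorted by time."""
    result = {}
    for part, records in partitions.items():
        match = next((off for ts, off in records if ts >= target_ts), None)
        result[part] = match   # None: no messages at/after target_ts
    return result

# Offsets on the TARGET cluster differ from the source, but timestamps
# make time a cluster-independent cursor for resuming consumption.
target_cluster = {
    0: [(100, 0), (105, 1), (112, 2)],
    1: [(101, 0), (108, 1)],
}
resume = offsets_for_time(target_cluster, target_ts=105)
print(resume)  # {0: 1, 1: 1}
```

Resuming a little *before* the failover time trades a few duplicates for not missing messages, which is usually the right trade.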
When DC comes back
• Need to reverse replication
- Same challenge: determining the offsets
Limitations
• Reconfiguration of replication after failover
• Resources in the passive DC are underutilized
Pattern #3: active/active
• Local + aggregate clusters: replicate each local cluster into the aggregate clusters to avoid replication cycles
• Producers/consumers in both DCs
- Producers only write to local clusters
[Diagram: DC 1 and DC 2 each run a local Kafka cluster and an aggregate Kafka cluster; producers write to their local cluster, both local clusters replicate into both aggregate clusters, and consumers read from the local or aggregate clusters]
On DC failure
Same challenge when moving consumers of the aggregate cluster
• Offsets in the two aggregate clusters are not identical
• Unless the consumers are continuously running in both clusters
[Diagram: an SF Kafka cluster and a Houston Kafka cluster, each running all apps; West-coast users are served from SF, South-Central users from Houston]
When DC comes back
No need to reconfigure replication
Alternative: avoid aggregate clusters
• Prefix topic names with a DC tag
• Configure replication to replicate remote topics only
• Consumers need to subscribe to topics with both DC tags
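The naming scheme can be sketched as plain string logic (the DC tags and topic names are invented; a real deployment would drive its replication tooling with rules like these):

```python
# Sketch of the "no aggregate cluster" naming scheme: each DC writes to
# topics prefixed with its own tag, replication copies only
# remote-prefixed topics, and consumers subscribe to both prefixes.
DCS = ["dc1", "dc2"]

def local_topic(dc: str, topic: str) -> str:
    return f"{dc}.{topic}"

def topics_to_replicate(from_dc: str, all_topics: list) -> list:
    # Replicate only topics originating in the remote DC -- never copy a
    # topic back to its home DC, which would create a replication cycle.
    return [t for t in all_topics if t.startswith(from_dc + ".")]

def subscription(topic: str) -> list:
    # Consumers read the topic under every DC prefix to see all the data.
    return [local_topic(dc, topic) for dc in DCS]

all_topics = ["dc1.clicks", "dc2.clicks"]
print(topics_to_replicate("dc2", all_topics))  # ['dc2.clicks'] -> copy into dc1
print(subscription("clicks"))                  # ['dc1.clicks', 'dc2.clicks']
```

In practice a consumer would use a pattern subscription (e.g. all topics matching `*.clicks`) rather than listing prefixes by hand.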
[Diagram: DC 1 and DC 2 each with one Kafka cluster, producers, and consumers; bidirectional replication copies only the remote DC’s topics]
Beyond 2 DCs
More DCs, better resource utilization
• With 2 DCs, each DC needs to provision 100% of the traffic
• With 3 DCs, each DC only needs to provision 50% of the traffic
Setting up replication with many DCs can be daunting
• Only set up aggregate clusters in 2–3 of them
Comparison
• Stretched
- Pros: better utilization of resources; easy failover for consumers
- Cons: still need a cross-region story
• Active/passive
- Pros: needed for global ordering
- Cons: harder failover for consumers; reconfiguration during failover; resource under-utilization
• Active/active
- Pros: better utilization of resources; can be used to avoid consumer failover
- Cons: can be challenging to manage; more replication bandwidth
Multi-DC beyond Kafka
• Kafka is often used together with other data stores
• Need to make sure the multi-DC strategy is consistent
Example application
• Consumer reads from Kafka and computes a 1-min count
• Counts need to be stored in a DB and available in every DC
Independent database per DC
• Run the same consumer concurrently in both DCs
- No consumer failover needed
[Diagram: the local + aggregate topology, with a consumer in each DC reading from its aggregate cluster and writing to an independent DB in that DC]
Stretched database across DCs
Only run one consumer per DC at any given point in time
[Diagram: the same topology with a single DB stretched across both DCs; only one consumer runs at a time, and the other takes over on failover]
Practical tips
• Consume remote, produce local
- Unless you need encrypted data on the wire
• Monitor!
- Burrow for replication lag
- Confluent Control Center for end-to-end
- JMX metrics for rates and “busy-ness”
• Tune!
- Producer / consumer tuning
- Number of consumers, producers
- TCP tuning for WAN
• Don’t forget to replicate configuration
• Separate critical topics from nice-to-have topics
Future work
• Offset reset tool
• Offset preservation
• “Remote replicas”
• 2-DC stretch cluster
Other cool Kafka future:
• Exactly once
• Transactions
• Headers
THANK YOU!
Gwen Shapira | [email protected] | @gwenshap
Kafka Training with Confluent University
• Kafka Developer and Operations courses
• Visit www.confluent.io/training
Want more Kafka?
• Download Confluent Platform Enterprise at http://www.confluent.io/product
• Apache Kafka 0.10.2 upgrade documentation at http://docs.confluent.io/3.2.0/upgrade.html
• Kafka Summit recordings now available at http://kafka-summit.org/schedule/
Special Strata attendee discount: 25% off at www.kafka-summit.org with code kafstrata
• Kafka Summit New York: May 8
• Kafka Summit San Francisco: August 28