Lessons Learned from Building an Apache Kafka Managed Service

Twitter @instaclustr · instaclustr.com


Introduction

● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra

● Our platform provides automated provisioning, monitoring and management

● Available on AWS, GCP, Azure and IBM Cloud

● Managed Apache Kafka released in public preview on 21 May 2018


Agenda

● Context - our offering and development process

● Hardware choice and benchmarking

● Topic and user management

● Broker security configuration

● Monitoring

● Backup and Restore


Instaclustr Managed Kafka - Key Features

● Preview Release available:
  ○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
  ○ Broker monitoring
  ○ Instaclustr monitoring and provisioning API support
  ○ Private network clusters (AWS only)
  ○ Run in your cloud provider account or ours
  ○ Topic management via a custom CLI tool


Instaclustr Managed Kafka - Key Features

● For GA (end June):
  ○ SOC2 compliant
  ○ User & credential management
  ○ Providing more cluster config options
  ○ Topic level and synthetic transaction monitoring
  ○ Infrastructure config tuning


Instaclustr Managed Kafka - Development Process

● First customer requests 2016

● Internal infrastructure deployment and usage of Kafka mid 2017

● Managed service platform development commenced November 2017

● Early access program with 4 customers commenced December 2017

● Public preview release 21 May 2018

● GA expected 25 June 2018


Hardware Choice and Benchmarking - GP2 vs ST1

● Disk Type
  ○ AWS benchmark - r4.large with 500GB disks
    ■ 1 x 500GB ST1 volume
    ■ 10 x 50GB GP2 volumes in RAID0 configuration
  ○ Avg 10% improved throughput with ST1 vs GP2 EBS
  ○ ST1 is 45% of the cost of GP2
  ○ Non-RAIDed mount simplifies re-sizing EBS volumes

Type   Writes (msg/s)   Reads (msg/s)   Mixed (msg/s)
ST1    223,851          149,506         W: 171,305 / R: 49,898
GP2    203,409          127,127         W: 162,966 / R: 44,869
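As a rough check on the "Avg 10%" figure, the relative gains can be recomputed from the table. This is only a sketch: it reads the throughput columns as messages per second and takes an unweighted mean across the four columns, which may not be how the slide's average was derived.

```python
# Throughput figures copied from the benchmark table (assumed to be messages per second).
ST1 = {"writes": 223_851, "reads": 149_506, "mixed_writes": 171_305, "mixed_reads": 49_898}
GP2 = {"writes": 203_409, "reads": 127_127, "mixed_writes": 162_966, "mixed_reads": 44_869}


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (0.10 means 10% faster)."""
    return (new - old) / old


# Gain of ST1 over GP2 for each column, plus an unweighted mean over all columns.
gains = {name: relative_gain(ST1[name], GP2[name]) for name in ST1}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
print(f"mean: {mean_gain:.1%}")
```

On these numbers ST1 comes out about 10% ahead on writes and about 18% ahead on reads, with a mean of about 11% across all four columns, which is in line with the slide's summary.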


[Charts: ST1 and GP2 benchmark results]


Provider Comparison


Hardware Choice and Benchmarking - SSL vs non-SSL

● Encryption enabled on broker-to-broker and client-to-broker
  ○ AWS benchmark - r4.large with 1500GB ST1 disk
  ○ 512 byte messages
  ○ ~30% decrease in throughput with Broker and Client SSL enabled

● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
  ○ 50% increased throughput in writes
  ○ 80% increased throughput in reads
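For reference, enabling encryption on both paths comes down to a handful of broker and client properties. The sketch below renders them as properties text. The property names are standard Kafka settings, but the host name, file paths and passwords are placeholders, and this is not the configuration used in the benchmark.

```python
def render_properties(settings: dict) -> str:
    """Render a dict as Java-style `key=value` properties lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


# Broker side: TLS for client connections and for inter-broker traffic.
# Host name, file paths and passwords are placeholders.
broker_ssl = {
    "listeners": "SSL://broker1.example.com:9093",
    "security.inter.broker.protocol": "SSL",
    "ssl.keystore.location": "/etc/kafka/broker.keystore.jks",
    "ssl.keystore.password": "changeit",
    "ssl.truststore.location": "/etc/kafka/broker.truststore.jks",
    "ssl.truststore.password": "changeit",
}

# Client side: trust the broker certificate and connect over TLS.
client_ssl = {
    "security.protocol": "SSL",
    "ssl.truststore.location": "/etc/kafka/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}

print(render_properties(broker_ssl))
print()
print(render_properties(client_ssl))
```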


Hardware Choice and Benchmarking - Number of Topics

● Possible urban myth that increasing topics reduces performance

● However, more topics = more partitions

● Significantly slows recovery time from node failure
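The link between topic count and recovery time can be made concrete by counting the partition replicas each broker hosts, since each of them needs a new leader or a catch-up when a broker fails. A minimal sketch: the topic counts are the ones shown on the slide, while the partitions per topic, replication factor and broker count are illustrative assumptions.

```python
import math


def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> int:
    """Average number of partition replicas hosted on each broker (rounded up)."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return math.ceil(total_replicas / brokers)


# Topic counts from the slide; the other three values are illustrative only.
for topics in (10, 100, 1000, 5000):
    per_broker = replicas_per_broker(topics, partitions_per_topic=3,
                                     replication_factor=3, brokers=3)
    print(f"{topics:>5} topics -> {per_broker:>6} partition replicas per broker")
```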

[Charts: benchmark results for 10, 100, 1000 and 5000 topics]


Hardware Choice and Benchmarking - Colocated Zookeeper

● Often recommended to host Zookeeper separately from Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
  ○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters


Hardware Choice and Benchmarking - Colocated Zookeeper

[Charts: Consumer Rate - Separate vs. Consumer Rate - Colocated]

● 6 node cluster with broker restart
  ○ Similar results with dedicated Zookeeper disk vs. shared


Topic and User Configuration Management

● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● Felt that providing access to Zookeeper was a risk

● Solutions
  ○ Developed command line tool to use Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
    ■ Future: Console UI support?
    ■ Value topic configuration versioning and management
  ○ Adding user management to Instaclustr Console
    ■ Additional authentication required
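One way to picture topic configuration versioning is a declared topic list kept under version control and compared against what the cluster reports. The sketch below is a hypothetical illustration of that idea, not the ic-kafka-tools implementation: `TopicSpec` and `plan_changes` are invented names, and the resulting actions would still have to be applied through Kafka's admin API rather than through Zookeeper.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSpec:
    """Desired or observed state of one topic (hypothetical structure)."""
    name: str
    partitions: int
    replication_factor: int
    config: dict = field(default_factory=dict)


def plan_changes(desired: list, current: dict) -> list:
    """Compare desired specs with the cluster's current topics and list actions.

    `current` maps topic name -> TopicSpec as reported by the cluster.
    """
    actions = []
    for spec in desired:
        existing = current.get(spec.name)
        if existing is None:
            actions.append(("create", spec.name, spec))
            continue
        if spec.partitions > existing.partitions:
            # Kafka only allows a topic's partition count to grow.
            actions.append(("add_partitions", spec.name, spec.partitions))
        changed = {k: v for k, v in spec.config.items() if existing.config.get(k) != v}
        if changed:
            actions.append(("alter_config", spec.name, changed))
    return actions


# Illustrative topics; names and values are made up.
desired = [
    TopicSpec("orders", partitions=6, replication_factor=3,
              config={"retention.ms": "604800000"}),
    TopicSpec("audit", partitions=3, replication_factor=3),
]
current = {"orders": TopicSpec("orders", 3, 3, {"retention.ms": "86400000"})}

for action in plan_changes(desired, current):
    print(action)
```

Keeping the plan separate from its execution means a change can be reviewed, and the declared file diffed between versions, before anything touches the cluster.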


Broker Security Configuration

● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
  ○ Used for client->broker
  ○ Broker->broker uses SASL plaintext

● Using SASL plaintext authentication
  ○ Used for broker->broker
  ○ Were planning on integrating SCRAM authentication, but dynamic configuration still requires broker restart
  ○ Instead planning on short-lived signed broker keys as dynamic configuration does not require restart
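For readers unfamiliar with SCRAM, the sketch below shows the key derivation and proof check defined in RFC 5802, using SHA-256 (Kafka offers SCRAM-SHA-256 and SCRAM-SHA-512; the slides do not say which is used). It is a simplified illustration: the password, iteration count and `auth_message` are placeholders, SASLprep normalisation and the message exchange are omitted, and the point is only that the server stores derived keys rather than the password.

```python
import hashlib
import hmac
import os


def derive_keys(password: str, salt: bytes, iterations: int):
    """Derive SCRAM ClientKey, StoredKey and ServerKey (RFC 5802, section 3)."""
    # SaltedPassword = Hi(password, salt, i), which is PBKDF2 with HMAC.
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return client_key, stored_key, server_key


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def client_proof(client_key: bytes, stored_key: bytes, auth_message: bytes) -> bytes:
    """ClientProof = ClientKey XOR HMAC(StoredKey, AuthMessage)."""
    signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    return xor(client_key, signature)


def server_verifies(stored_key: bytes, auth_message: bytes, proof: bytes) -> bool:
    """Recover ClientKey from the proof and check its hash against StoredKey."""
    signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    recovered_client_key = xor(proof, signature)
    return hmac.compare_digest(hashlib.sha256(recovered_client_key).digest(), stored_key)


# Placeholder credentials and exchange; a real AuthMessage is built from the SASL messages.
salt = os.urandom(16)
client_key, stored_key, server_key = derive_keys("example-password", salt, 4096)
auth_message = b"client-first,server-first,client-final-without-proof"
proof = client_proof(client_key, stored_key, auth_message)
print(server_verifies(stored_key, auth_message, proof))
```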


Broker Security Configuration

● Access to managed clusters
  ○ Public IPs and whitelisting in firewall (security group or equivalent)
  ○ Private IPs with VPC Peering (or equivalent in other cloud providers)
  ○ Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
  ○ Don’t expose Zookeeper through firewall due to weak security model


Monitoring

● Metrics exposed via JMX
  ○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
  ○ Basics: service state, disk usage / free space, server still exists
  ○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
    ■ Active controller alert is very sensitive; we are re-assessing alert thresholds
  ○ Synthetic transactions: publish and consume a message on a controlled topic, measure success and latency
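The Kafka metric alerts above can be sketched as simple threshold checks on a cluster-wide sample. Requiring a condition to hold for several consecutive samples is one possible way to tame a sensitive check such as the active controller count. That debounce, the class and function names, and the sample counts are all assumptions here, not a description of the Instaclustr alerting pipeline.

```python
from collections import deque


class DebouncedAlert:
    """Fire only after the condition has held for `required` consecutive samples."""

    def __init__(self, name: str, required: int):
        self.name = name
        self.recent = deque(maxlen=required)

    def observe(self, breached: bool) -> bool:
        self.recent.append(breached)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


def evaluate(sample: dict, alerts: dict) -> list:
    """Return the names of alerts firing for one cluster-wide metrics sample."""
    checks = {
        "offline_partitions": sample["offline_partitions"] > 0,
        "active_controllers": sample["active_controllers"] != 1,
        "under_replicated": sample["under_replicated_partitions"] > 0,
    }
    return [name for name, breached in checks.items() if alerts[name].observe(breached)]


def new_alerts() -> dict:
    """Build a fresh alert set; the consecutive-sample counts are illustrative."""
    return {
        "offline_partitions": DebouncedAlert("offline_partitions", required=1),
        "active_controllers": DebouncedAlert("active_controllers", required=3),
        "under_replicated": DebouncedAlert("under_replicated", required=2),
    }


# Made-up samples: a one-off controller blip, then persistent under-replication.
samples = [
    {"offline_partitions": 0, "active_controllers": 0, "under_replicated_partitions": 0},
    {"offline_partitions": 0, "active_controllers": 1, "under_replicated_partitions": 4},
    {"offline_partitions": 0, "active_controllers": 1, "under_replicated_partitions": 4},
]
alerts = new_alerts()
for sample in samples:
    print(evaluate(sample, alerts))
```

In this run the single sample with zero active controllers does not fire, while under-replication fires once it has been seen twice in a row.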


Monitoring

● Central Logging
  ○ Fleet logs transferred via Kafka to an Elassandra cluster
  ○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
  ○ Kafka experience in this project has been very positive

● Only issue
  ○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
  ○ We weren’t monitoring consumer lag closely enough
  ○ Increased consumer session and request timeouts
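Consumer lag is the gap between the latest offset in each partition and the offset the consumer group has committed. A minimal sketch of the calculation follows, with hypothetical offsets and an illustrative alert threshold. In practice the offsets would come from the cluster, and the timeouts mentioned above correspond to the consumer settings `session.timeout.ms` and `request.timeout.ms`.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: latest offset in the log minus the group's committed offset.

    Simplification: a partition with no committed offset is treated as lagging
    by its full log end offset.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic read by the `logstash` group.
log_end = {0: 15_000, 1: 14_200, 2: 16_750}
committed = {0: 14_990, 1: 9_200, 2: 16_750}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())

LAG_ALERT_THRESHOLD = 1_000  # illustrative value, not a recommendation
lagging = sorted(p for p, value in lag.items() if value > LAG_ALERT_THRESHOLD)

print(lag)
print(total_lag, lagging)
```

Tracking the per-partition figure rather than only the total makes a single stuck partition visible even when the group as a whole looks healthy.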


Backup and Restore

● Internet wisdom = Kafka backups are not a thing
  ○ Rely on replication within cluster or MirrorMaker replication to another cluster

● Cassandra experience says backups are valuable
  ○ Hardware failure is not an issue but corruption due to app bugs or user error can occur and be spread by replication

● Future
  ○ Working on regular automated backup and restore of topic and security configuration
  ○ Consider using Kafka Connect to write important messages to offline backup


Thanks for listening!

● Currently in Preview
● Would love any feedback, suggestions or just telling us what we missed
● 14-day free trial option (no CC needed) - console.instaclustr.com