instaclustr.com | Twitter: @instaclustr
Lessons Learned from Building an Apache Kafka Managed Service
Introduction
● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra
● Our platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka released May 21st
Agenda
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore
Instaclustr Managed Kafka - Key Features
● Preview Release available:
  ○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
  ○ Broker monitoring
  ○ Instaclustr monitoring and provisioning API support
  ○ Private network clusters (AWS only)
  ○ Run in your cloud provider account or ours
  ○ Topic management via a custom CLI tool
Instaclustr Managed Kafka - Key Features
● For GA (end June):
  ○ SOC2 compliant
  ○ User & credential management
  ○ Providing more cluster config options
  ○ Topic level and synthetic transaction monitoring
  ○ Infrastructure config tuning
Instaclustr Managed Kafka - Development Process
● First customer requests 2016
● Internal infrastructure deployment and usage of Kafka mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018
Hardware Choice and Benchmarking - GP2 vs ST1
● Disk Type
  ○ AWS benchmark - r4.large w 500GB disks
    ■ 1 x 500GB ST1 volume
    ■ 10 x 50GB GP2 volumes in RAID0 configuration
  ○ Avg 10% improved throughput with ST1 vs GP2 EBS
  ○ ST1 is 45% of the cost of GP2
  ○ Non-RAIDed mount simplifies re-sizing EBS volumes

Type   Writes (m/s)   Reads (m/s)   Mixed (m/s)
ST1    223,851        149,506       W: 171,305 / R: 49,898
GP2    203,409        127,127       W: 162,966 / R: 44,869
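
The per-workload gain behind the "avg 10%" figure can be checked from the table. The sketch below is only arithmetic on the published numbers; the slide does not say how its average was computed, so the unweighted mean used here is an assumption.

```python
# Throughput figures copied from the benchmark table (units as given on the slide).
ST1 = {"writes": 223_851, "reads": 149_506, "mixed_writes": 171_305, "mixed_reads": 49_898}
GP2 = {"writes": 203_409, "reads": 127_127, "mixed_writes": 162_966, "mixed_reads": 44_869}


def improvement_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Gain of ST1 over GP2 for each workload.
gains = {name: improvement_pct(ST1[name], GP2[name]) for name in ST1}

# Assumption: a plain unweighted mean across the four measurements.
average_gain = sum(gains.values()) / len(gains)

for name, pct in gains.items():
    print(f"{name:>12}: {pct:5.1f}%")
print(f"{'average':>12}: {average_gain:5.1f}%")
```

The individual gains range from roughly 5% (mixed writes) to roughly 18% (reads), so the benefit of ST1 depends on the workload mix.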
[Figure: benchmark charts labelled ST1 and GP2]
Provider Comparison
Hardware Choice and Benchmarking - SSL vs non-SSL
● Encryption enabled on broker-to-broker and client-to-broker
  ○ AWS benchmark - r4.large w 1500GB ST1 disk
  ○ 512 byte messages
  ○ ~30% decrease in throughput with Broker and Client SSL enabled
● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
  ○ 50% increased throughput in writes
  ○ 80% increased throughput in reads
Hardware Choice and Benchmarking - Number of Topics
● Possible urban myth that increasing topics reduces performance
● However, more topics = more partitions
● Significantly slows recovery time from node failure
[Figure: benchmark charts for 10, 100, 1000 and 5000 topics]
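
The link between topic count and recovery time can be made concrete with back-of-envelope arithmetic: every partition replica hosted on a failed broker has to catch up when the broker returns, and every partition it led needs a new leader. The sketch below counts both per broker. The cluster size, partitions per topic and replication factor are illustrative assumptions, not figures from the benchmark, and even distribution across brokers is assumed.

```python
def replicas_per_broker(topics, partitions_per_topic, replication_factor, brokers):
    """Average number of partition replicas each broker hosts."""
    return topics * partitions_per_topic * replication_factor / brokers


def leaders_per_broker(topics, partitions_per_topic, brokers):
    """Average number of partitions each broker leads."""
    return topics * partitions_per_topic / brokers


# Hypothetical cluster: 6 brokers, 3 partitions per topic, replication factor 3.
BROKERS, PARTITIONS, RF = 6, 3, 3

for topics in (10, 100, 1000, 5000):
    replicas = replicas_per_broker(topics, PARTITIONS, RF, BROKERS)
    leaders = leaders_per_broker(topics, PARTITIONS, BROKERS)
    print(f"{topics:>5} topics: {replicas:>7.0f} replicas and {leaders:>6.0f} leaders per broker")
```

Under these assumptions the work triggered by a single broker failure grows linearly with the number of topics, even if steady-state throughput does not change.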
Hardware Choice and Benchmarking - Colocated Zookeeper
● Often recommended to host Zookeeper separately to Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
  ○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters
Hardware Choice and Benchmarking - Colocated Zookeeper
[Figure: consumer rate with separate Zookeeper vs. colocated Zookeeper]
● 6 node cluster with broker restart
  ○ Similar results with dedicated Zookeeper disk vs. shared
Topic and User Configuration Management
● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● Felt that providing access to Zookeeper was a risk
● Solutions
  ○ Developed command line tool to use Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
    ■ Future: Console UI support?
    ■ Value topic configuration versioning and management
  ○ Adding user management to Instaclustr Console
    ■ Additional authentication required
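
Managing topics through the Kafka API rather than Zookeeper might look like the sketch below. It is not the implementation of ic-kafka-tools; it assumes the third-party kafka-python package and an unauthenticated connection, and the function names, broker address and topic name are hypothetical.

```python
def retention_configs(retention_hours, cleanup_policy="delete"):
    """Build topic-level config overrides; Kafka expects string values."""
    return {
        "retention.ms": str(retention_hours * 60 * 60 * 1000),
        "cleanup.policy": cleanup_policy,
    }


def create_topic(bootstrap_servers, name, partitions, replication_factor, configs):
    """Create one topic by talking to the brokers only, with no Zookeeper connection."""
    # Imported lazily so the helper above works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([
            NewTopic(
                name=name,
                num_partitions=partitions,
                replication_factor=replication_factor,
                topic_configs=configs,
            )
        ])
    finally:
        admin.close()


# Example call (hypothetical broker address and topic name):
# create_topic("localhost:9092", "fleet-logs", 6, 3, retention_configs(72))
```

Because the client only needs the broker listener, the Zookeeper port can stay closed to users.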
Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
  ○ Used for client->broker
  ○ Broker->broker uses SASL plaintext
● Using SASL plaintext authentication
  ○ Used for broker->broker
  ○ Were planning on integrating SCRAM authentication, but dynamic configuration still requires broker restart
  ○ Instead planning on short-lived signed broker keys as dynamic configuration does not require restart
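
For readers unfamiliar with SCRAM, the core of the exchange is that the server stores only a derived key and the client proves knowledge of the password without sending it. The sketch below follows the key derivation in RFC 5802 using SHA-256. It is an illustration only, not Kafka's implementation: the auth message is a placeholder byte string rather than the real concatenation of handshake messages, and password normalisation is omitted.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _client_key(password: str, salt: bytes, iterations: int) -> bytes:
    # SaltedPassword is PBKDF2 with HMAC-SHA-256 (the "Hi" function in RFC 5802).
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return _hmac(salted, b"Client Key")


def stored_key(password: str, salt: bytes, iterations: int) -> bytes:
    """What the server keeps instead of the password: H(ClientKey)."""
    return hashlib.sha256(_client_key(password, salt, iterations)).digest()


def client_proof(password: str, salt: bytes, iterations: int, auth_message: bytes) -> bytes:
    """ClientProof = ClientKey XOR HMAC(StoredKey, AuthMessage)."""
    key = _client_key(password, salt, iterations)
    signature = _hmac(hashlib.sha256(key).digest(), auth_message)
    return _xor(key, signature)


def server_verify(stored: bytes, auth_message: bytes, proof: bytes) -> bool:
    """Recover ClientKey from the proof and check that its hash matches StoredKey."""
    recovered = _xor(proof, _hmac(stored, auth_message))
    return hmac.compare_digest(hashlib.sha256(recovered).digest(), stored)


# Hypothetical values for illustration only.
SALT = b"0123456789abcdef"
AUTH_MESSAGE = b"placeholder-auth-message"
server_side = stored_key("pencil", SALT, 4096)
print(server_verify(server_side, AUTH_MESSAGE, client_proof("pencil", SALT, 4096, AUTH_MESSAGE)))
```

Because the proof is bound to the per-session auth message, a captured proof cannot be replayed in a later session with different nonces.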
Broker Security Configuration
● Access to managed clusters
  ○ Public IPs and whitelisting in firewall (security group or equivalent)
  ○ Private IPs with VPC Peering (or equivalent in other cloud providers)
  ○ Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
  ○ Don’t expose Zookeeper through firewall due to weak security model
Monitoring
● Metrics exposed via JMX
  ○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
  ○ Basics: service state, disk usage / free space, server still exists
  ○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
    ■ Active controller alert is very sensitive; we are re-assessing alert thresholds
  ○ Synthetic transactions: publish and consume message to controlled topic, measure success and latency
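
The three Kafka alert conditions can be expressed as simple rules over cluster-wide metric values. In stock Kafka these are usually read from the JMX MBeans `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`, `kafka.controller:type=KafkaController,name=ActiveControllerCount` and `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`. The sketch below is not Instaclustr's alerting code: the class and function names are hypothetical, and the "consecutive samples" debounce is just one possible way to make a sensitive alert less noisy.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    offline_partitions: int
    active_controllers: int           # summed across all brokers
    under_replicated_partitions: int  # summed across all brokers


def evaluate(sample):
    """Return the names of the alert conditions that hold for one sample."""
    alerts = []
    if sample.offline_partitions > 0:
        alerts.append("offline partitions")
    if sample.active_controllers != 1:
        alerts.append("active controllers != 1")
    if sample.under_replicated_partitions > 0:
        alerts.append("under-replicated partitions")
    return alerts


def sustained(samples, alert_name, min_consecutive=3):
    """True only if the alert held for each of the last `min_consecutive` samples."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(alert_name in evaluate(s) for s in recent)


# Hypothetical history: the controller count dips for one sample during an election.
history = [ClusterSample(0, 1, 0), ClusterSample(0, 0, 0), ClusterSample(0, 1, 0)]
print(sustained(history, "active controllers != 1"))
```

With a debounce like this, a brief controller re-election does not page anyone, while a cluster that stays without a controller still does.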
Monitoring
● Central Logging
  ○ Fleet logs transferred via Kafka to an Elassandra cluster
  ○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
  ○ Kafka experience in this project has been very positive
● Only issue
  ○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
  ○ We weren’t monitoring consumer lag closely enough
  ○ Increased consumer session and request timeouts
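
Consumer lag, the metric that was not being watched closely enough, is the gap between the latest offset in each partition and the offset the consumer group has committed. A minimal sketch of that calculation follows; the offsets are hypothetical inputs that would normally come from the brokers, and treating a partition with no commit as lagging from offset 0 is a simplifying assumption. The timeouts mentioned above presumably correspond to the consumer settings `session.timeout.ms` and `request.timeout.ms`.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus last committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        # Assumption: a partition with no committed offset counts from offset 0.
        start = committed if committed is not None else 0
        lag[partition] = max(end - start, 0)
    return lag


def total_lag(end_offsets, committed_offsets):
    """Sum of lag across all partitions of a topic."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a three-partition topic.
end = {0: 1_500, 1: 2_000, 2: 900}
committed = {0: 1_500, 1: 1_250}
print(consumer_lag(end, committed))
print(total_lag(end, committed))
```

Alerting on sustained growth in total lag would flag a consumer that is falling behind before offset commits start failing.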
Backup and Restore
● Internet wisdom = Kafka Backups is not a thing
  ○ Rely on replication within cluster or MirrorMaker replication to another cluster
● Cassandra experience says backups are valuable
  ○ Hardware failure is not an issue but corruption due to app bugs or user error can occur and be spread by replication
● Future
  ○ Working on regular automated backup and restore of topic and security configuration
  ○ Consider using Kafka Connect to write important messages to offline backup
Thanks for listening!
● Currently in Preview
● Would love any feedback, suggestions or just telling us what we missed
● 14-day free trial option (no CC needed) - console.instaclustr.com