instaclustr.com, Twitter @instaclustr
Lessons Learned from Building an Apache Kafka Managed Service
Introduction
● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra
● Our platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka released May 21st
Agenda
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore
Instaclustr Managed Kafka - Key Features
● Preview Release available:
○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
○ Broker monitoring
○ Instaclustr monitoring and provisioning API support
○ Private network clusters (AWS only)
○ Run in your cloud provider account or ours
○ Topic management via a custom CLI tool
Instaclustr Managed Kafka - Key Features
● For GA (end June):
○ SOC2 compliant
○ User & credential management
○ Providing more cluster config options
○ Topic level and synthetic transaction monitoring
○ Infrastructure config tuning
Instaclustr Managed Kafka - Development Process
● First customer requests 2016
● Internal infrastructure deployment and usage of Kafka mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018
Hardware Choice and Benchmarking - GP2 vs ST1
● Disk Type
○ AWS benchmark - r4.large with 500GB disks
■ 1 x 500GB ST1 volume
■ 10 x 50GB GP2 volumes in RAID0 configuration
○ Avg 10% improved throughput with ST1 vs GP2 EBS
○ ST1 is 45% of the cost of GP2
○ Non-RAIDed mount simplifies re-sizing EBS volumes
Type   Writes (msg/s)   Reads (msg/s)   Mixed (msg/s)
ST1    223,851          149,506         W: 171,305 / R: 49,898
GP2    203,409          127,127         W: 162,966 / R: 44,869
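To see where the "Avg 10%" figure comes from, the relative gain can be recomputed from the table. This is a minimal sketch: the numbers are copied from the table, the m/s column is read here as messages per second (an assumption), and the function name `relative_gain` is purely illustrative.

```python
# Throughput figures copied from the benchmark table
# (the m/s column is assumed to mean messages per second).
BENCHMARK = {
    "ST1": {"writes": 223_851, "reads": 149_506},
    "GP2": {"writes": 203_409, "reads": 127_127},
}


def relative_gain(new: float, baseline: float) -> float:
    """Return the percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0


for workload in ("writes", "reads"):
    gain = relative_gain(BENCHMARK["ST1"][workload], BENCHMARK["GP2"][workload])
    print(f"{workload}: ST1 is {gain:.1f}% faster than GP2")
```

On these figures ST1 comes out roughly 10% ahead on writes and about 18% ahead on reads.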
[Figure: benchmark charts, one panel each for ST1 and GP2]
Provider Comparison
Hardware Choice and Benchmarking - SSL vs non-SSL
● Encryption enabled on broker-to-broker and client-to-broker
○ AWS benchmark - r4.large with 1500GB ST1 disk
○ 512 byte messages
○ ~30% decrease in throughput with Broker and Client SSL enabled
● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
○ 50% increased throughput in writes
○ 80% increased throughput in reads
Hardware Choice and Benchmarking - Number of Topics
● Possible urban myth that increasing topics reduces performance
● However, more topics = more partitions
● More partitions significantly slow recovery from node failure
[Charts: benchmark results for 10, 100, 1000 and 5000 topics]
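The "more topics = more partitions" point can be made concrete with some back-of-envelope arithmetic. The sketch below estimates how many partition replicas each broker ends up hosting, which is the amount of state that has to be recovered when a node fails. The topic counts are the ones from the benchmark runs; the partitions per topic, replication factor and broker count are hypothetical values chosen only for illustration.

```python
def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas hosted by each broker."""
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    return topics * partitions_per_topic * replication_factor / brokers


# Topic counts follow the benchmark runs; the other values are illustrative
# assumptions, not figures from the benchmark.
for topics in (10, 100, 1000, 5000):
    load = replicas_per_broker(topics, partitions_per_topic=3,
                               replication_factor=3, brokers=6)
    print(f"{topics:>5} topics -> {load:,.0f} replicas per broker")
```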
Hardware Choice and Benchmarking - Colocated Zookeeper
● Often recommended to host Zookeeper separately from Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters
Hardware Choice and Benchmarking - Colocated Zookeeper
[Charts: Consumer Rate - Separate; Consumer Rate - Colocated]
● 6 node cluster with broker restart
○ Similar results with dedicated Zookeeper disk vs. shared
Topic and User Configuration Management
● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● We felt that providing access to Zookeeper was a risk
● Solutions
○ Developed command line tool to use Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
■ Future: Console UI support?
■ We value topic configuration versioning and management
○ Adding user management to Instaclustr Console
■ Additional authentication required
Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
○ Used for client->broker
○ Broker->broker uses SASL plaintext
● Using SASL plaintext authentication
○ Used for broker->broker
○ Were planning on integrating SCRAM authentication here, but its dynamic configuration still requires a broker restart
○ Instead planning on short-lived signed broker keys, as their dynamic configuration does not require a restart
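For the client->broker side, a SCRAM client needs only a handful of settings. The sketch below builds them as standard Apache Kafka Java client properties. The property names and the `ScramLoginModule` class are Kafka's own; the choice of `SASL_SSL`, the default `SCRAM-SHA-256` mechanism, the helper names and any host, user or password values are assumptions for illustration, not details confirmed for the managed service.

```python
def scram_client_properties(username: str, password: str,
                            bootstrap_servers: str,
                            mechanism: str = "SCRAM-SHA-256") -> dict:
    """Build Kafka client properties for SASL/SCRAM authentication."""
    if mechanism not in ("SCRAM-SHA-256", "SCRAM-SHA-512"):
        raise ValueError(f"unsupported SCRAM mechanism: {mechanism}")
    # Quotes inside the credentials would break the JAAS string, so reject them
    # rather than attempt escaping in this sketch.
    if '"' in username or '"' in password:
        raise ValueError("credentials must not contain double quotes")
    jaas = (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",  # assumption: SCRAM over TLS
        "sasl.mechanism": mechanism,
        "sasl.jaas.config": jaas,
    }


def to_properties_file(props: dict) -> str:
    """Render the properties in Java .properties format, one key per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    print(to_properties_file(
        scram_client_properties("example-user", "example-pass", "broker1:9092")))
```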
Broker Security Configuration
● Access to managed clusters
○ Public IPs and whitelisting in firewall (security group or equivalent)
○ Private IPs with VPC Peering (or equivalent in other cloud providers)
○ Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
○ Don’t expose Zookeeper through firewall due to weak security model
Monitoring
● Metrics exposed via JMX
○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
○ Basics: service state, disk usage / free space, server still exists
○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
■ Active controller alert is very sensitive; we are re-assessing alert thresholds
○ Synthetic transactions: publish and consume a message on a controlled topic, measure success and latency
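The three Kafka-metric alerts can be expressed as simple rules over values already collected from JMX. The sketch below is one possible shape, not the production alerting code: the MBean names are the ones Apache Kafka brokers publish, while `evaluate_alerts` and the `samples` layout (broker id mapped to metric values) are hypothetical.

```python
from typing import Dict, List

# JMX MBean names published by Apache Kafka brokers.
OFFLINE_PARTITIONS = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
ACTIVE_CONTROLLERS = "kafka.controller:type=KafkaController,name=ActiveControllerCount"
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"


def evaluate_alerts(samples: Dict[str, Dict[str, int]]) -> List[str]:
    """Return alert messages for one collection cycle.

    `samples` maps a broker id to the metric values read from that broker.
    """
    alerts = []

    # Exactly one broker in the cluster should report itself as controller.
    active = sum(m.get(ACTIVE_CONTROLLERS, 0) for m in samples.values())
    if active != 1:
        alerts.append(f"active controllers = {active}, expected exactly 1")

    for broker, metrics in sorted(samples.items()):
        offline = metrics.get(OFFLINE_PARTITIONS, 0)
        if offline > 0:
            alerts.append(f"{broker}: {offline} offline partitions")
        under = metrics.get(UNDER_REPLICATED, 0)
        if under > 0:
            alerts.append(f"{broker}: {under} under-replicated partitions")
    return alerts
```

Because the controller count briefly differs from 1 during a controller election, a rule like this fires on every failover, which fits the sensitivity noted above. Requiring the condition to hold for several consecutive cycles is one way to soften it.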
Monitoring
● Central Logging
○ Fleet logs transferred via Kafka to an Elassandra cluster
○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
○ Kafka experience in this project has been very positive
● Only issue
○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
○ We weren’t monitoring consumer lag closely enough
○ Increased consumer session and request timeouts
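Consumer lag, the metric that was not being watched closely enough, is just the gap between each partition's log end offset and the offset the group has committed. A minimal sketch of that calculation follows; the function names and the threshold are illustrative, and the offsets are assumed to have been fetched already by whatever tooling is in use.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int],
                 committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus the group's committed offset.

    A partition with no committed offset is treated as starting from 0,
    which is a simplifying assumption of this sketch.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag: Dict[int, int], threshold: int) -> List[int]:
    """Return the partitions whose lag exceeds `threshold`, in sorted order."""
    return sorted(p for p, value in lag.items() if value > threshold)


if __name__ == "__main__":
    # Made-up offsets for a three-partition topic.
    lag = consumer_lag({0: 1_000, 1: 5_000, 2: 250}, {0: 990, 1: 1_200})
    print(lag, lagging_partitions(lag, threshold=500))
```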
Backup and Restore
● Internet wisdom = Kafka backups are not a thing
○ Rely on replication within cluster or MirrorMaker replication to another cluster
● Cassandra experience says backups are valuable
○ Hardware failure is not an issue, but corruption due to app bugs or user error can occur and be spread by replication
● Future
○ Working on regular automated backup and restore of topic and security configuration
○ Consider using Kafka Connect to write important messages to offline backup
Thanks for listening!
● Currently in Preview
● Would love any feedback or suggestions, or just tell us what we missed
● 14-day free trial option (no credit card needed) - console.instaclustr.com