Apache Kafka - Scalable Message-Processing and more !
-
Upload
guido-schmutz -
Category
Technology
-
view
559 -
download
0
Transcript of Apache Kafka - Scalable Message-Processing and more !
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Apache KafkaScalable Message Processing and more!
Guido Schmutz
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 20 yearsOracle ACE Director for Fusion Middleware and SOAConsultant, Trainer Software Architect for Java, Oracle, SOA andBig Data / Fast DataMember of Trivadis Architecture BoardTechnology Manager @ Trivadis
More than 30 years of software development experience
Contact: [email protected]: http://guidoschmutz.wordpress.comSlideshare: http://www.slideshare.net/gschmutzTwitter: gschmutz
2 8.12.2016 Big Data & Fast Data
Agenda
1. Introduction & Motivation2. Kafka Core
3. Kafka Connect
4. Kafka Streams
5. Kafka and ”Big Data” / ”Fast Data” Ecosystem
6. Confluent Data Platform7. Summary
Apache Kafka - Scalable Message Processing and more!3
Introduction & Motivation
Apache Kafka - Scalable Message Processing and more!4
Hadoop ClusterdHadoop Cluster
Big Data Cluster
Traditional Big Data Architecture
BITools
Enterprise Data Warehouse
Billing &Ordering
CRM / Profile
MarketingCampaigns
File Import / SQL Import
SQL
Search
Online&MobileApps
Search
NoSQL
ParallelProcessing
DistributedFilesystem
• MachineLearning• GraphAlgorithms• NaturalLanguageProcessing
Event HubEvent
Hub
Hadoop ClusterdHadoop Cluster
Big Data Cluster
Event Hub – handle event stream data
BITools
Enterprise Data Warehouse
Location
Social
Clickstream
Sensor Data
Billing &Ordering
CRM / Profile
MarketingCampaigns
Event Hub
CallCenter
WeatherData
MobileApps
SQL
Search
Online&MobileApps
Search
Data Flow
NoSQL
ParallelProcessing
DistributedFilesystem
• MachineLearning• GraphAlgorithms• NaturalLanguageProcessing
Hadoop ClusterdHadoop ClusterBig Data Cluster
Event Hub – taking Velocity into account
Location
Social
Clickstream
Sensor Data
Billing &Ordering
CRM / Profile
MarketingCampaigns
CallCenter
MobileApps
Batch Analytics
Streaming Analytics
Event HubEvent
HubEvent Hub
NoSQL
ParallelProcessing
DistributedFilesystem
Stream AnalyticsNoSQL
Reference /Models
SQL
Search
Dashboard
BITools
Enterprise Data Warehouse
Search
Online&MobileApps
File Import / SQL Import
WeatherData
Apache Kafka - Scalable Message Processing and more!7
Kafka Stream Data Platform
Source:ConfluentApache Kafka - Scalable Message Processing and more!8
Kafka Core
Apache Kafka - Scalable Message Processing and more!9
Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use JMS API and standards
Kafka maintains feeds of messages in topics
Apache Kafka - Scalable Message Processing and more!10
Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• “A unified platform for handling all the real-time data feeds a large company might have.”
Must haves
• High throughput to support high volume event feeds.
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems.
• Support low-latency delivery to handle more traditional messaging use cases.
• Guarantee fault-tolerance in the presence of machine failures.
Apache Kafka - Scalable Message Processing and more!11
Kafka High Level Architecture
The who is who• Producers write data to brokers.• Consumers read data from
brokers.• All this is distributed.
The data• Data is stored in topics.• Topics are split into partitions,
which are replicated.
Kafka Cluster
Consumer Consumer Consumer
Producer Producer Producer
Broker 1 Broker 2 Broker 3
ZookeeperEnsemble
Apache Kafka - Scalable Message Processing and more!12
Apache Kafka - Architecture
Kafka Broker
Movement Processor
MovementTopic
Engine-MetricsTopic
1 2 3 4 5 6
EngineProcessor1 2 3 4 5 6
Truck
Apache Kafka - Scalable Message Processing and more!13
Apache Kafka - Architecture
Kafka Broker
Movement Processor
MovementTopic
Engine-MetricsTopic
1 2 3 4 5 6
EngineProcessor
Partition0
1 2 3 4 5 6Partition0
1 2 3 4 5 6Partition1 Movement
ProcessorTruck
Apache Kafka - Scalable Message Processing and more!14
ApacheKafka
Kafka Broker 1
Movement Processor
Truck
MovementTopicP0
Movement Processor
1 2 3 4 5
P2 1 2 3 4 5
Kafka Broker 2MovementTopic
P2 1 2 3 4 5
P1 1 2 3 4 5
Kafka Broker 3MovementTopic
P0 1 2 3 4 5
P1 1 2 3 4 5
Movement Processor
Apache Kafka - Scalable Message Processing and more!15
Apache Kafka - Architecture
• Write Ahead Log / Commit Log
• Producers always append to tail
• think append to file
Kafka Broker
MovementTopic
1 2 3 4 5
Truck
6 6
Apache Kafka - Scalable Message Processing and more!16
Durability Guarantees
Producer can configure acknowledgements
Value Impact Durability0 • Producerdoesn’twaitforleader weak
1(default) • Producerwaitsforleader• Leadersends ack whenmessagewrittentolog• Nowaitforfollowers
medium
all • Producerwaitsforleader• Leadersendsack when allIn-SyncReplicahave
acknowledged
strong
Apache Kafka - Scalable Message Processing and more!17
Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
ConsumerGroupA ConsumerGroupB
Apache Kafka - Scalable Message Processing and more!18
Source:ApacheKafka
Data Retention – 3 options
1. Never
2. Time based (TTL) log.retention.{ms | minutes | hours}
3. Size based log.retention.bytes
4. Log compaction based (entries with same key are removed)kafka-topics.sh --zookeeper localhost:2181 \
--create --topic customers \--replication-factor 1 --partitions 1 \--config cleanup.policy=compact
Apache Kafka - Scalable Message Processing and more!19
Apache Kafka – Some numbers
Kafka at LinkedIn => over 1800+ broker machines / 79K+ Topics
Kafka Performance at our own infrastructure => 6 brokers (VM) / 1 cluster
• 445’622 messages/second• 31 MB / second • 3.0405 ms average latency between producer / consumer
1.3Trillionmessagesperday
330Terabytesin/day
1.2Petabytesout/day
Peakloadforasinglecluster2millionmessages/sec4.7Gigabits/secinbound15Gigabits/secoutbound
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
Apache Kafka - Scalable Message Processing and more!20
Kafka Topics
Creating a topic• Command line interface
• Using AdminUtils.createTopic method
• Auto-create via auto.create.topics.enable = true
Modifying a topichttps://kafka.apache.org/documentation.html#basic_ops_modify_topic
Deleting a topic
• Command Line interface
$ kafka-topics.sh –zookeeper zk1:2181 --create \--topic my.topic –-partitions 3 \–-replication-factor 2 --config x=y
Apache Kafka - Scalable Message Processing and more!21
Inspecting the current state of a topic
Use the --describe option
• Leader: brokerID of the currently elected leader broker
• Replica ID’s = broker ID’s
• ISR = “in-sync replica”, replicas that are in sync with the leader. In this example:
• Broker 0 is leader for partition 1.
• Broker 1 is leader for partitions 0 and 2.
• All replicas are in-sync with their respective leader partitions.
$ kafka-topics.sh –zookeeper zk1:2181 –-describe --topic my.topicTopic:zerg2.hydra PartitionCount:3 ReplicationFactor:2 Configs:Topic: my.topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0Topic: my.topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1Topic: my.topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
Apache Kafka - Scalable Message Processing and more!22
Kafka Connect
Apache Kafka - Scalable Message Processing and more!23
Kafka Connect Architecture
Apache Kafka - Scalable Message Processing and more!24
Source:Confluent
Kafka Connector Hub – Certified Connectors
Source:http://www.confluent.io/product/connectors
Apache Kafka - Scalable Message Processing and more!25
Kafka Connector Hub – Additional Connectors
Source:http://www.confluent.io/product/connectors
Apache Kafka - Scalable Message Processing and more!26
Kafka Streams
Apache Kafka - Scalable Message Processing and more!27
Kafka Streams
• Designed as a simple and lightweight library in Apache Kafka
• no external dependencies on systems other than Apache Kafka
• Leverages Kafka as its internal messaging layer
• agnostic to resource management and configuration tools
• Supports fault-tolerant local state
• Event-at-a-time processing (not microbatch) with millisecond latency
• Windowing with out-of-order data using a Google DataFlow-like model
Apache Kafka - Scalable Message Processing and more!28
Kafka Streams - Architecture
Apache Kafka - Scalable Message Processing and more!29
topology defines the stream processing computational logic for your application
topology is a graph of stream processors (nodes) that are connected by streams (edges)
source processor is a stream processor that does not have any upstream processors
sink processor is a special type of stream processor that does not have down-stream processors.
Source:Confluent
Kafka Streams - Processor Topology
Apache Kafka - Scalable Message Processing and more!30
topology defines the stream processing computational logic for your application
topology is a graph of stream processors(nodes) that are connected by streams (edges)
source processor is a stream processor that does not have any upstream processors. Consumes one or Kafka topics.
sink processor is a special type of stream processor that does not have down-stream processors. Produces to a single Kafka topic.
Source:Confluent
Kafka and ”Big Data” / ”Fast Data” Ecosystem
Apache Kafka - Scalable Message Processing and more!31
Kafka and the Big Data / Fast Data ecosystem
Kafka integrates with many popular products / frameworks
• Apache Spark Streaming
• Apache Flink
• Apache Storm
• Apache NiFi
• Streamsets
• Apache Flume
• Oracle Stream Analytics
• Oracle Service Bus
• Oracle GoldenGate
• Spring Integration Kafka Support
• …Stormbuilt-inKafkaSpouttoconsumeeventsfromKafka
Apache Kafka - Scalable Message Processing and more!32
Confluent Platform
Apache Kafka - Scalable Message Processing and more!33
Confluent Data Platform 3.1
Apache Kafka - Scalable Message Processing and more!34
Source:Confluent
Summary
Apache Kafka - Scalable Message Processing and more!35
WeatherData
SQL ImportHadoop ClusterdHadoop Cluster
Hadoop Cluster
Location
Social
Clickstream
Sensor Data
Billing &Ordering
CRM / Profile
MarketingCampaigns
CallCenter
MobileApps
Batch Analytics
Streaming Analytics
Event HubEvent
HubEvent Hub
NoSQL
ParallelProcessing
DistributedFilesystem
Stream AnalyticsNoSQL
Reference /Models
SQL
Search
Dashboard
BITools
Enterprise Data Warehouse
Search
Online&MobileApps
Customer Event Hub – mapping of technologies
Apache Kafka - Scalable Message Processing and more!36
Summary
• Kafka can scale to millions of messages per second, and more
• Easy to start with for a PoC
• A bit more to invest to setup production environment
• Monitoring is key
• Vibrant community and ecosystem
• Fast pace technology
• Confluent provides Kafka Distribution
Apache Kafka - Scalable Message Processing and more!37
Guido SchmutzTechnology Manager
Apache Kafka - Scalable Message Processing and more!38
@gschmutz guidoschmutz.wordpress.com