Distributed messaging with Apache Kafka
1
Distributed messaging with Apache Kafka
Saumitra Srivastav (@_saumitra_)
http://www.meetup.com/Bangalore-Apache-Kafka-Group/
2
Introduction
Kafka is a:
• distributed
• replicated
• persistent
• partitioned
• high throughput
• pub-sub
messaging system.
Incubated at LinkedIn. Written in Scala.
3
Demo Application
Twitter stream analytics
4
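[Architecture diagram: a StreamProducer reads from the Twitter Streaming API and publishes to a Kafka cluster (Broker-1, Broker-2, Broker-3). Downstream of the cluster: Solr-1 and Solr-2 (realtime search), Cassandra-1 and Cassandra-2 (data store for longer retention), and sentiment analysis.]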
5
Terminology
Topics: Categories under which feeds of messages are maintained.
Producers: Processes that publish messages to a Kafka topic.
Consumers: Processes that subscribe to topics and process the feed of published messages.
Brokers: Servers that form a Kafka cluster and act as a data transport channel between producers and consumers.
[Diagram: two producers, a Kafka cluster of three brokers, and two consumers]
6
Simplified View of a Kafka System
[Diagram: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1, Producer 2; Consumer 1, Consumer 2, Consumer 3]
7
Topics and Partitions
TOPIC – 1 (error log)
TOPIC – 2 (security log)
8
Partitions
• Each partition is an ordered, immutable sequence of messages.
• Messages are continuously appended to it.
• Each message in a partition is assigned a unique sequential id number called the offset.
• Any message in a partition can be accessed using this offset.
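The bullets above can be pictured with a small in-memory model. This is only a sketch: the `Partition` class and its `append` and `read` methods are hypothetical names made up for illustration, not part of any Kafka API, and a real partition lives on disk rather than in a Python list.

```python
class Partition:
    """Toy in-memory model of one partition: an append-only list of messages."""

    def __init__(self):
        self._log = []  # messages are only ever appended, never modified

    def append(self, message: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        if not 0 <= offset < len(self._log):
            raise IndexError(f"offset {offset} out of range")
        return self._log[offset]


partition = Partition()
# Offsets are handed out sequentially in append order.
offsets = [partition.append(m) for m in (b"first", b"second", b"third")]
second = partition.read(1)
```

Because the offset is just the position in the sequence, looking up a message needs nothing more than the offset itself.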
9
Partitions
• Partitions serve 2 purposes:
1. Scaling
2. Parallelism
• Scaling: A topic can be divided into multiple partitions, and each partition can be on a different server.
• Parallelism: A consumer can consume from multiple partitions at the same time (while maintaining the ordering guarantee within each partition).
10
Distribution & Replication
• The partitions of the log are distributed over the servers in the Kafka cluster.
• Each server handles data and requests for some number of partitions.
• Each partition is replicated for fault tolerance.
• Each partition has one server which acts as the leader.
• The leader handles all read and write requests for the partition.
• Followers keep replicating the leader.
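A rough way to see the leader/follower arrangement is the toy model below. It assumes synchronous copying on every write and picks the next replica as the new leader, both of which are simplifications; `Replica`, `ReplicatedPartition` and `fail_leader` are hypothetical names for illustration only. In Kafka itself, followers fetch from the leader on their own schedule and leader election is restricted to in-sync replicas.

```python
class Replica:
    """One copy of a partition's log, hosted on a broker."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []


class ReplicatedPartition:
    """Toy partition with one leader and the remaining replicas as followers."""

    def __init__(self, broker_ids):
        self.replicas = [Replica(b) for b in broker_ids]
        self.leader = self.replicas[0]

    def write(self, message):
        """All writes go to the leader; followers then copy the leader's log."""
        self.leader.log.append(message)
        for replica in self.replicas:
            if replica is not self.leader:
                replica.log = list(self.leader.log)
        return len(self.leader.log) - 1

    def read(self, offset):
        """All reads are served by the leader."""
        return self.leader.log[offset]

    def fail_leader(self):
        """Simulate the leader's broker dying; promote a surviving replica."""
        self.replicas.remove(self.leader)
        if not self.replicas:
            raise RuntimeError("no replicas left")
        self.leader = self.replicas[0]


partition = ReplicatedPartition(broker_ids=[1, 2, 3])  # replication factor 3
partition.write("m0")
partition.write("m1")
partition.fail_leader()  # broker 1 fails
partition.fail_leader()  # broker 2 fails
survivor = partition.leader.broker_id
still_readable = partition.read(1)
```

With three replicas, two failures still leave one complete copy of the log to serve reads.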
11
Producers
• Producers publish data to the topics of their choice.
• A producer can choose the partition of the topic to which a message should be assigned.
• The partition can be selected in a round robin manner for load balancing.
• Kafka doesn’t care about the serialization format. All it needs is a byte array.
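As an illustration of the last two bullets, the sketch below spreads messages over partitions in round robin order and turns each record into bytes before "sending" it. Everything here is an assumption for the example: `RoundRobinPartitioner`, `serialize`, the JSON encoding and the `tweet_id` field are made up, and the partitions are plain Python lists rather than a Kafka cluster.

```python
import itertools
import json


class RoundRobinPartitioner:
    """Hands out partition numbers 0..n-1 in a repeating cycle."""

    def __init__(self, num_partitions):
        self._cycle = itertools.cycle(range(num_partitions))

    def next_partition(self):
        return next(self._cycle)


def serialize(record: dict) -> bytes:
    """The producer picks the format (JSON here); the log only sees bytes."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Three in-memory lists stand in for the partitions of one topic.
partitions = {p: [] for p in range(3)}
partitioner = RoundRobinPartitioner(num_partitions=3)

for i in range(6):
    target = partitioner.next_partition()
    partitions[target].append(serialize({"tweet_id": i}))
```

Six messages over three partitions end up two per partition, which is the load-balancing effect the slide refers to.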
12
Consumers
• Other messaging systems basically follow 2 models:
  • Queuing
  • Publish-Subscribe
• Kafka uses the concept of a consumer group, which generalizes both of these models.
• Consumers label themselves with a consumer group name.
• Each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
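The delivery rule in the last bullet can be sketched as follows. The group names, consumer names and the simple round robin assignment are assumptions made for the example; Kafka's actual assignment strategies are configurable and more involved.

```python
PARTITIONS = [0, 1, 2, 3]
GROUPS = {
    "group-a": ["consumer-1", "consumer-2"],
    "group-b": ["consumer-3"],
}


def assign(partitions, consumers):
    """Give each partition to exactly one consumer of a group, round robin."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def receivers(partition, groups, partitions):
    """For a message in `partition`, find the one receiving consumer per group."""
    result = {}
    for group, consumers in groups.items():
        owners = assign(partitions, consumers)
        result[group] = next(c for c, owned in owners.items() if partition in owned)
    return result


group_a_assignment = assign(PARTITIONS, GROUPS["group-a"])
who_gets_it = receivers(2, GROUPS, PARTITIONS)
```

Putting every consumer in a single group makes the topic behave like a queue, while giving each consumer its own group makes it behave like publish-subscribe.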
13
Consumers
14
Consumer Groups
[Diagram: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1, Producer 2; Consumer 1, Consumer 2 and Consumer 3 split between Consumer-Group A and Consumer-Group B]
15
Consumer groups
[Diagram: Zookeeper; Broker 1, Broker 2 and Broker 3, each hosting partitions of Topic-1 (P0, P3, P5, P2, P4 are labelled); Producer 1, Producer 2; Consumer 1, Consumer 2 and Consumer 3 split between Consumer-Group A and Consumer-Group B]
16
Message Persistence
• Unlike in other messaging systems, messages are not deleted on consumption.
• Messages are retained for a configurable period of time, after which they are deleted (even if they have NOT been consumed).
• Consumers can re-consume any chunk of older messages using the message offset.
• Kafka performance is effectively constant with respect to data size, so retaining a large amount of data is not an issue.
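A small sketch of time-based retention, under stated assumptions: `RetainedLog`, `purge` and `read_from` are hypothetical names, time is passed in explicitly as a number of seconds to keep the example deterministic, and messages are dropped one by one, whereas Kafka removes whole log segment files.

```python
class RetainedLog:
    """Toy log that keeps messages for a fixed time, consumed or not."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0   # offset of the oldest message still retained
        self._entries = []     # (timestamp, message) pairs in append order

    def append(self, message, now):
        self._entries.append((now, message))
        return self.base_offset + len(self._entries) - 1

    def purge(self, now):
        """Delete messages older than the retention period."""
        while self._entries and now - self._entries[0][0] > self.retention_seconds:
            self._entries.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        """Return every retained message from `offset` onwards."""
        if offset < self.base_offset:
            raise LookupError(f"offset {offset} has already been deleted")
        return [m for _, m in self._entries[offset - self.base_offset:]]


log = RetainedLog(retention_seconds=60)
log.append(b"a", now=0)
log.append(b"b", now=30)
log.append(b"c", now=50)

first_pass = log.read_from(0)  # reading does not remove anything
replay = log.read_from(1)      # so an older chunk can be read again
log.purge(now=70)              # only b"a" is now older than 60 seconds
after_purge = log.read_from(1)
```

Reading is independent of deletion here: the only thing that removes a message is its age.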
17
Demo: Running a multi-broker Kafka cluster
18
Guarantees
1. Ordering guarantee
• Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
• A consumer instance sees messages in the order they are stored in the log.
2. At least once delivery
3. Fault tolerance: For a topic with replication factor N, up to N-1 server failures will not cause any data loss.
4. No corruption of data:
• Over the network
• On the disk
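One way to picture "at least once" delivery is a consumer that saves its offset only after it has processed a message. The sketch below is an illustration under that assumption, not Kafka's consumer API: `run_consumer` and `crash_at` are hypothetical, and the crash is simulated by returning early.

```python
def run_consumer(log, committed, handle, crash_at=None):
    """Process messages starting at `committed`; return the new committed offset.

    If `crash_at` is given, the consumer "crashes" right after handling the
    message at that offset, before the new offset is committed.
    """
    position = committed
    while position < len(log):
        handle(log[position])   # step 1: process the message
        if position == crash_at:
            return committed    # crash: progress for this message is lost
        position += 1
        committed = position    # step 2: commit only after processing
    return committed


log = ["m0", "m1", "m2"]
seen = []

# First run crashes after processing m1 but before committing it.
committed = run_consumer(log, 0, seen.append, crash_at=1)
after_crash = committed

# The restarted consumer resumes from the last committed offset.
committed = run_consumer(log, committed, seen.append)
```

No message is skipped, but `m1` is handled twice, so handlers in this model need to tolerate duplicates.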
19
Demo: Consumer/Producer Java API
20
Misc Design features
1. Stateless broker
• Each consumer maintains its own state (offset)
2. Load balancing
3. Asynchronous send
4. Push/pull model instead of Push/Push
5. Consumer Position
6. Offline Data Load
7. Simple API
8. Low Overhead
9. Batch send and receive
10. No message caching in JVM
11. Rely on file system buffering
• Mostly sequential access patterns
12. Zero-copy transfer: file->socket
21
Use Cases
1. Messaging
2. Website Activity Tracking
3. Metrics
4. Log Aggregation
5. Stream Processing
22
Thanks
Website: http://kafka.apache.org/
Doc: http://kafka.apache.org/documentation.html
Mailing Lists: [email protected]
Questions?