From a Kafkaesque Story to the Promised Land
Ran Silberman, 7/7/2013
Open Source Paradigm
The Cathedral & the Bazaar by Eric S. Raymond, 1999: the struggle between top-down and bottom-up design.
Challenges of a data platform[1]
• High throughput
• Horizontal scale to address growth
• High availability of data services
• No data loss
• Satisfy real-time demands
• Enforce structured data with schemas
• Process Big Data and enterprise data
• Single Source of Truth (SSOT)
SLAs of the data platform
All data flows from the real-time servers through a central Data Bus.
• Offline customers (BI DWH): SLA: 1. 98% of data delivered in < 1/2 hr; 2. 99.999% in < 4 hrs
• Real-time customers (real-time dashboards): SLA: 1. 98% of messages in < 500 msec; 2. no send takes > 2 sec
Legacy data flow in LivePerson
Real-time servers send data through an ETL chain (Sessionize, Modeling, Schema View) into the BI DWH (Oracle); customers view reports from the DWH.
1st phase - move to Hadoop
The ETL chain (Sessionize, Modeling, Schema View) moves to Hadoop: real-time servers write into HDFS, Hadoop processes the data, and an MR job transfers it to the BI DWH (now Vertica); customers still view reports from the DWH.
2. Move to Kafka
Real-time servers now produce events to Kafka (Topic-1), which feeds HDFS and Hadoop; the MR job keeps transferring data to the BI DWH (Vertica), where customers view reports.
3. Integrate with new producers
New real-time servers join as producers on a second topic (Topic-2) alongside Topic-1; everything still flows through Kafka into Hadoop, and the MR job keeps feeding the BI DWH (Vertica).
4. Add real-time BI
A Storm topology consumes directly from the Kafka topics, adding a real-time BI path next to the offline Hadoop path into the BI DWH (Vertica).
5. Standardize the data model using Avro
Messages on the Kafka topics are standardized on Avro, and Camus loads them from Kafka into HDFS; the MR job continues to feed the BI DWH (Vertica), and Storm continues to consume for real-time BI.
6. Define the Single Source of Truth (SSOT)
With Avro-encoded topics flowing through Kafka into both Hadoop (via Camus) and Storm, the data in Hadoop becomes the Single Source of Truth for all consumers.
Kafka[2] as Backbone for Data
• Central "Message Bus" (see the producer sketch below)
• Supports multiple topics (MQ style)
• Writes ahead to files (write-ahead log)
• Distributed & highly available
• Horizontal scale
• High throughput (tens of MB/sec per server)
• Service is agnostic to consumers' state
• Retention policy
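As a concrete illustration of talking to the bus, a minimal sketch using the 0.8-era Java producer API; the broker addresses and topic name are assumptions, not LivePerson's actual configuration:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    Properties props = new Properties();
    props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // assumed brokers
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("request.required.acks", "1"); // leader acknowledges before success

    Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
    producer.send(new KeyedMessage<String, String>("topic1", "hello data bus"));
    producer.close();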
Kafka Architecture
Kafka Architecture cont.
Producers 1-3 write to broker Nodes 1-3, coordinated through Zookeeper; three instances of Consumer 1, all members of the same consumer group (Group1), share the reading from the brokers.
Kafka Architecture cont.
Producers 1-2 write to two topics (Topic1, Topic2) whose partitions are spread across broker Nodes 1-4, again coordinated through Zookeeper; Consumers 1-3 each read from the topic partitions.
Kafka replays messages
Because of the retention policy, a consumer can re-read any message between a partition's min and max offsets by issuing a fetch request from an explicit offset:

fetchRequest = new FetchRequest(topic, partition, offset, size);

currentOffset is taken from Zookeeper; the special values -2 (earliest offset) and -1 (latest offset) request the ends of the log.
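A hedged sketch of such a replay with the 0.8 SimpleConsumer Java API; the host, port, client id, offset, and fetch size are illustrative assumptions:

    import kafka.api.FetchRequest;
    import kafka.api.FetchRequestBuilder;
    import kafka.javaapi.FetchResponse;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.message.MessageAndOffset;

    // Connect directly to a broker (assumed host/port)
    SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "replayClient");

    // Fetch from an explicit offset to replay older messages
    FetchRequest req = new FetchRequestBuilder()
            .clientId("replayClient")
            .addFetch("topic1", 0 /* partition */, 42L /* offset to replay from */, 100000 /* max bytes */)
            .build();
    FetchResponse resp = consumer.fetch(req);

    for (MessageAndOffset mo : resp.messageSet("topic1", 0)) {
        // process mo.message(), track mo.nextOffset()
    }
    consumer.close();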
Kafka API[3]
• Producer API
• Consumer API
  o High-level API: uses Zookeeper to access brokers and to save offsets
  o SimpleConsumer API: direct access to Kafka brokers
• Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer
Kafka API[3]
• Producer:
  List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
  messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
  producer.send(messages);
• Consumer (high-level; consumer is a ConsumerConnector):
  Map<String, List<KafkaStream<byte[], byte[]>>> streams =
      consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
  for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
      // do something with the message
  }
Kafka in Unit Testing
• Use the class KafkaServer
• Run an embedded server (sketch below)
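A minimal sketch of an embedded broker for tests, assuming the 0.8-era KafkaServerStartable wrapper around KafkaServer; the port and log directory are illustrative:

    import java.util.Properties;
    import kafka.server.KafkaConfig;
    import kafka.server.KafkaServerStartable;

    Properties props = new Properties();
    props.put("broker.id", "0");
    props.put("port", "9092");                        // assumed test port
    props.put("log.dir", "/tmp/kafka-test-logs");     // assumed scratch directory
    props.put("zookeeper.connect", "localhost:2181"); // requires a running Zookeeper

    // Start the embedded broker before the tests, shut it down after
    KafkaServerStartable server = new KafkaServerStartable(new KafkaConfig(props));
    server.startup();
    // ... run tests that produce/consume against localhost:9092 ...
    server.shutdown();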
Introducing Avro[5]
• Schema representation using JSON (example below)
• Supported types
  o Primitive types: boolean, int, long, string, etc.
  o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
• Data is serialized using its schema
• Avro files include a file header containing the schema
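A minimal example schema in JSON; the record name and namespace are illustrative, not LivePerson's actual model (the field names echo the message shown later in the deck):

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example.events",
      "fields": [
        {"name": "sessionId", "type": "string"},
        {"name": "timestamp", "type": "long"}
      ]
    }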
Add the Avro protocol to the story
1. Producer 1 registers schema 1.0 in the Schema Repo.
2. The producer creates a message according to schema 1.0 and encodes it with that schema.
3. The producer adds the schema revision to the message header and sends the message to Kafka (Topic 1, Topic 2).
4. The consumers (Camus/Storm) read the message and extract the schema version (1.0) from the header.
5. The consumers get schema 1.0 by version from the Schema Repo and decode the message with it.
A Kafka message is therefore a small header carrying the schema revision (1.0) followed by the Avro-encoded payload, e.g.:
{event1: {header: {sessionId: "102122", timestamp: "12346"}, ...}}
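A hedged sketch of the producer side of this protocol: prepend a revision marker to the Avro-encoded bytes before handing them to Kafka. The single-byte revision layout is an assumption for illustration; the actual wire format is not specified in the slides, and the schema variable is assumed to come from the repo:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    // Build a record against schema revision 1.0 (schema fetched from the Schema Repo)
    GenericRecord event = new GenericData.Record(schema);
    event.put("sessionId", "102122");
    event.put("timestamp", 12346L);

    // Encode: [revision byte][Avro binary payload] -- assumed header layout
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(1); // schema revision marker (assumption: one byte, revision 1)
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
    encoder.flush();
    byte[] kafkaPayload = out.toByteArray(); // send with the producer API shown earlier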
Kafka + Storm + Avro example
• Demonstrates Avro data passing from Kafka to Storm (wiring sketch below)
• Explains Avro revision evolution
• Requires Kafka and Zookeeper to be installed
• Uses the Storm artifact and the Kafka-Spout artifact in Maven
• A Maven plugin generates Java classes from the Avro schema
• https://github.com/ransilberman/avro-kafka-storm
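To sketch how the pieces wire together (not the repository's actual code): a topology that reads from Kafka with the 2013-era storm-kafka spout and hands tuples to a bolt for Avro decoding. Class names follow the storm-kafka artifact of that period; the Zookeeper address, topic, ZK root, and the AvroDecodeBolt are illustrative assumptions:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.ZkHosts;

    // Spout reads raw messages from the Kafka topic (assumed addresses/names)
    SpoutConfig spoutConfig = new SpoutConfig(
            new ZkHosts("localhost:2181"), // Zookeeper that knows the brokers
            "topic1",                      // Kafka topic to consume
            "/kafka-storm",                // ZK root for offset storage
            "avro-demo");                  // consumer id
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
    // AvroDecodeBolt is a hypothetical bolt that decodes the Avro payload
    builder.setBolt("avro-decode", new AvroDecodeBolt()).shuffleGrouping("kafka-spout");

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("avro-kafka-storm", new Config(), builder.createTopology());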
Producer machine resiliency
Each producer machine handles messages in two ways:
• Fast topic: the producer sends the message straight to Kafka. Consumed in real time by Storm.
• Consistent topic: the producer first persists the message to a local file on disk; a Kafka Bridge then sends it to Kafka. Consumed offline by Hadoop.
Challenges of Kafka
• Still not mature enough
• Not enough supporting tools (viewers, maintenance)
• Duplications may occur
• API not documented well enough
• Open source: supported by the community only
• Difficult to replay messages from a specific point in time
• Eventually consistent...
Eventually Consistent
Because it is a distributed system:
• No guarantee of delivery order
• No way to tell to which broker a message was sent
• Kafka does not guarantee that there are no duplications
• ...but eventually, all messages will arrive!
[Slide illustration: an event crossing a desert between where it is generated and its destination]
Major Improvements in Kafka 0.8[4]
• Partition replication
• Message send guarantee
• Consumer offsets are represented as sequential message numbers instead of byte positions (e.g., 1, 2, 3, ...)
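In the 0.8 producer the send guarantee is exposed as a config knob; the values below are Kafka 0.8's semantics, while the props object mirrors the producer sketch shown earlier:

    // request.required.acks controls the 0.8 send guarantee:
    //    0 = fire and forget (no acknowledgement)
    //    1 = the partition leader acknowledges the write
    //   -1 = all in-sync replicas acknowledge (strongest guarantee)
    props.put("request.required.acks", "-1");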
Addressing Data Challenges
• High throughput
  o Kafka, Hadoop
• Horizontal scale to address growth
  o Kafka, Storm, Hadoop
• High availability of data services
  o Kafka, Storm, Zookeeper
• No data loss
  o Highly available services, no ETL
Addressing Data Challenges cont.
• Satisfy real-time demands
  o Storm
• Enforce structured data with schemas
  o Avro
• Process Big Data and enterprise data
  o Kafka, Hadoop
• Single Source of Truth (SSOT)
  o Hadoop, no ETL
References
• [1] Satisfying New Requirements for Data Integration, by David Loshin
• [2] Apache Kafka
• [3] Kafka API
• [4] Kafka 0.8 Quick Start
• [5] Apache Avro
• [6] Storm
Thank you!