Overview of Zookeeper, Helix and Kafka (Oakjug)

Posted on 28-Jul-2015

Distributed system goodies: Zookeeper, Helix and Kafka

Chris Richardson

Author of POJOs in Action

Founder of the original CloudFoundry.com

@crichardson chris@chrisrichardson.net http://plainoldobjects.com http://microservices.io


Presentation goal

Talk about a collection of interesting technologies for building distributed systems



About Chris

Founder of a startup that’s creating a platform for developing event-driven microservices (http://bit.ly/trialeventuate)


Apache ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems



Distributed system use cases…

Name service

Lookup by name, e.g. service discovery: name => [host, port]*

Group membership

E.g. distributed cache

Cluster members need to talk amongst themselves

Clients need to discover the group members


…Use cases

Leader election

N servers, one of which needs to be the master

e.g. master/slave replication

Distributed locking and latches

e.g. cluster wide singleton



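Zookeeper server

[Diagram: each Zookeeper server keeps an in-memory DB, with snapshots in its datadir and txn logs in its logs directory; the servers shown are Leader, Follower, Follower]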


Zookeeper clients

Languages:

Ships with Java, C, Perl, and Python

Community: Scala, NodeJS, Go, Lua, …

Client connects to one of a list of servers

Client establishes a session

Survives TCP disconnects

Client-specified session timeout


Zookeeper data model

Hierarchical tree of named znodes

Znodes have binary data and children

Znodes can be ephemeral - live for as long as the client session

Clients can watch a node - get notified of changes


Zookeeper operations

create(path, data, mode)

Persistent or ephemeral?

Sequential: append parent’s counter value to name?



readData(path, watch?) : Object

writeData(path, data)

getChildren(path, watch?) : List[String]
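The sequential flag is easier to picture with a small model. The sketch below is a hypothetical in-memory stand-in, not the ZooKeeper client API: it assumes the parent keeps a counter that advances on every child creation, and that a sequential create appends that counter to the requested name as a 10-digit, zero-padded suffix. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory model of sequential znode naming; not the ZooKeeper API.
final class SequentialNameSketch {
    // Assumption: one counter per parent path, advanced on every child creation.
    private final Map<String, Integer> childCounters = new HashMap<>();

    // Returns the name the new znode would get.
    String create(String path, boolean sequential) {
        int slash = path.lastIndexOf('/');
        String parent = slash <= 0 ? "/" : path.substring(0, slash);
        int counter = childCounters.getOrDefault(parent, 0);
        childCounters.put(parent, counter + 1);
        // Sequential nodes get the parent's counter appended, zero-padded to 10 digits.
        return sequential ? path + String.format(Locale.ROOT, "%010d", counter) : path;
    }
}
```

With this model, creating /cer/foo (non-sequential) and then /cer/baz (sequential) yields /cer/baz0000000001, because the earlier child already advanced the parent's counter.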


Znode watches

readData/getChildren can establish a watch

client gets a one-time notification when changed
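The one-time nature of a watch can be sketched with a toy in-memory store. This is an illustration only, assuming string data and a callback that receives the changed path; it borrows the readData/writeData names from the operations slide but is not the ZooKeeper client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory model of one-shot watches; not the ZooKeeper API.
final class OneShotWatchSketch {
    private final Map<String, String> data = new HashMap<>();
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Reads the data and, if a watcher is given, registers it for the next change only.
    String readData(String path, Consumer<String> watcher) {
        if (watcher != null) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
        }
        return data.get(path);
    }

    // Writes the data, then fires and discards every watch registered on the path.
    void writeData(String path, String value) {
        data.put(path, value);
        List<Consumer<String>> fired = watches.remove(path);
        if (fired != null) {
            fired.forEach(w -> w.accept(path));
        }
    }
}
```

A client that wants to keep following a node has to re-register the watch each time it is notified, typically by reading the node again.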


Using the zkCli

$ bin/zkCli.sh -server $DOCKER_HOST_IP
[zk] create /cer x
Created /cer
[zk] create /cer/foo y
Created /cer/foo
[zk] get /cer/foo watch
y
[zk] set /cer/foo z
WatchedEvent state:SyncConnected type:NodeDataChanged path:/cer/foo


Creating an ephemeral sequential node

[zk] create -s -e /cer/baz aa
Created /cer/baz0000000001
[zk] ls /cer watch
[baz0000000001, foo]
[zk] exit

Once the session that created the node exits, the ephemeral node is deleted and a session still watching /cer is notified:

WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/cer
[zk] ls /cer watch
[foo]


Leader election example

[Diagram: znodes under /]

guidA_0: hostA, portA

guidB_1: hostB, portB

guidC_2: hostC, portC

Server A (guidA): val = 0

Server B (guidB): val = 1

Server C (guidC): val = 2

[Diagram: two arrows labeled "watches"; the server with the lowest value is the leader]
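The election rule in this example can be sketched in plain Java. The znode names follow the guidX_N pattern from the example. Having each server watch its immediate predecessor is an assumption, taken from the commonly described ZooKeeper election recipe; the slide itself only shows that watches exist. Nothing here talks to ZooKeeper.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of "lowest sequence number wins"; operates on a list of child znode names.
final class LeaderElectionSketch {
    // Parses the sequence number after the last '_' (e.g. guidA_0 -> 0).
    static int sequenceOf(String znodeName) {
        return Integer.parseInt(znodeName.substring(znodeName.lastIndexOf('_') + 1));
    }

    private static List<String> sortedBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(LeaderElectionSketch::sequenceOf));
        return sorted;
    }

    // The candidate with the lowest sequence number is the leader.
    static String leader(List<String> children) {
        if (children.isEmpty()) {
            throw new IllegalArgumentException("no candidates");
        }
        return sortedBySequence(children).get(0);
    }

    // Assumption: a non-leader watches the znode just ahead of it; the leader watches nothing.
    static Optional<String> nodeToWatch(List<String> children, String self) {
        List<String> sorted = sortedBySequence(children);
        int index = sorted.indexOf(self);
        if (index < 0) {
            throw new IllegalArgumentException(self + " is not a candidate");
        }
        return index == 0 ? Optional.<String>empty() : Optional.of(sorted.get(index - 1));
    }
}
```

If the candidate znodes are ephemeral, a crashed leader's znode disappears with its session, and re-running leader(...) on the remaining children picks the next lowest value.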


Apache Curator

Open source library developed by Netflix

Simplifies connection management

Simplifies error handling

Implements recipes

Three projects: client, framework, and recipes



Netflix Exhibitor

Supervisory process for managing a Zookeeper instance

Watches a ZK instance and makes sure it is running

Performs periodic backups

Performs periodic cleaning of the ZK log directory

A GUI explorer for viewing ZK nodes









About Helix


Built on Zookeeper


Typical distributed systems

Partitioning - e.g. use a PK (or other attribute) to choose server

Replication - for availability

State machines, e.g. master/slave replication

One replica is the master

Other replica is the slave


Use cases - master/slave replication

MySQL master/slave replication or MongoDB replica sets

N machines

1 master, N - 1 slaves

If the master dies then elect a new master


Use cases - Cassandra

Cluster consists of N nodes

Data consists of M partitions (aka vnodes)

Each partition has R replicas

Client can read/write any replica - no master/slave concept

Dynamic assignment of M*R partition replicas to N nodes


Use case - abstractly

Cluster:

Set of N nodes (machines)

One or more resources

A resource is partitioned and replicated

Resource has a state machine

e.g. offline/online, master/slave

State machine has constraints: 1 master replica, other replicas are slaves


Helix dynamically assigns partitions to nodes

Helix manages state transitions and notifies nodes
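To make the abstract model concrete, here is a minimal sketch of assigning partition replicas to nodes and reacting to a node failure. The round-robin placement and the "promote the first surviving standby" rule are assumptions chosen for brevity; they are not Helix's rebalancing algorithm, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: partition -> (node -> state), with one LEADER and STANDBY replicas.
final class AssignmentSketch {
    // Assumption: replica r of partition p is placed on node (p + r) mod N.
    static Map<Integer, Map<String, String>> assign(List<String> nodes, int partitions, int replicas) {
        if (replicas > nodes.size()) {
            throw new IllegalArgumentException("more replicas than nodes");
        }
        Map<Integer, Map<String, String>> assignment = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            Map<String, String> states = new LinkedHashMap<>();
            for (int r = 0; r < replicas; r++) {
                String node = nodes.get((p + r) % nodes.size());
                // Constraint from the state machine: exactly one leader per partition.
                states.put(node, r == 0 ? "LEADER" : "STANDBY");
            }
            assignment.put(p, states);
        }
        return assignment;
    }

    // Drops the failed node; where it held the leader, promotes a surviving standby.
    static void failNode(Map<Integer, Map<String, String>> assignment, String failed) {
        for (Map<String, String> states : assignment.values()) {
            String lostState = states.remove(failed);
            if ("LEADER".equals(lostState) && !states.isEmpty()) {
                String promoted = states.keySet().iterator().next();
                states.put(promoted, "LEADER");
            }
        }
    }
}
```

In a real deployment each state change would be delivered to the affected node as a transition such as STANDBY to LEADER, rather than applied by mutating a map.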


Leader/standby state machine





Example assignment

Node 1 Node 2 Node 3

Partition 1 LEADER

Partition 1 STANDBY

Partition 3 LEADER

Partition 2 LEADER

Partition 2 STANDBY

Partition 3 STANDBY

Partition 2 OFFLINE

Partition 1 OFFLINE

Partition 3 OFFLINE


Post-failure assignment

Node 1 Node 2 Node 3

Partition 1 LEADER

Partition 1 STANDBY

Partition 3 LEADER

Partition 2 LEADER

Partition 2 LEADER

Partition 3 STANDBY

Partition 2 STANDBY

Partition 1 STANDBY

Partition 3 OFFLINE

[X marks the failed node]


Helix cluster setup

val admin = new ZKHelixAdmin(ZK_ADDRESS)

admin.addStateModelDef(clusterName, STATE_MODEL_NAME, new StateModelDefinition(StateModelConfigGenerator.generateConfigForLeaderStandby()));


HelixControllerMain.startHelixController(ZK_ADDRESS, clusterName, nodeInfo.nodeId.id, HelixControllerMain.STANDALONE)


Adding an instance to the cluster

val ic = new InstanceConfig(nodeInfo.nodeId.id)
ic.setHostName(nodeInfo.host)
ic.setPort("" + nodeInfo.port)
ic.setInstanceEnabled(true)
admin.addInstance(clusterName, ic)

// Assign partitions to newly added nodes
admin.rebalance(clusterName, RESOURCE_NAME, NUM_REPLICAS)


Helix - connecting to the cluster

// Connect as a participant
manager = HelixManagerFactory.getZKHelixManager(clusterName, instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS)

// Supply a factory that creates the callbacks for state transitions
val stateModelFactory = new MyStateModelFactory
val stateMach = manager.getStateMachineEngine
stateMach.registerStateModelFactory(STATE_MODEL_NAME, stateModelFactory)


State transition callbacks

// The callbacks below are invoked by Helix on state transitions
class MyStateModel(partitionName: String) extends StateModel {

  def onBecomeStandbyFromOffline(message: Message, context: NotificationContext) { … }

  def onBecomeLeaderFromStandby(message: Message, context: NotificationContext) { … }
}

class MyStateModelFactory extends StateModelFactory[StateModel] {
  // partitionName has the form <resourceName>_<partitionNumber>
  def createNewStateModel(partitionName: String) = new MyStateModel(partitionName)
}


More about Helix

Spectators

Non-participants - don’t have resources/partitions assigned to them

Get notified of changes to cluster

Property store

Write-through cache of properties in Zookeeper


Intra-cluster communication








Kafka concepts - topic

Clients publish messages to a topic

A topic has a name

A topic is a partitioned log

Topics live on disk

Messages have an offset within a partition

Messages are kept for a retention period


Kafka is clustered

Kafka cluster consists of N machines

Each topic partition has R replicas

1 machine is the leader (think master) for the topic partition

Clients publish/consume to/from leader

R - 1 machines are followers (think slaves)

Followers consume messages from the leader

Messages are committed when all in-sync replicas have written them to the log

Producers can optionally wait for a message to be committed

Consumers only ever see committed messages
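The commit rule can be expressed as a small calculation over the replicas' log positions. The sketch below is a simplification that treats every replica as in-sync and represents each replica by its log end offset (the offset its next message would get); the class and method names are hypothetical.

```java
// Toy model of the commit rule: a message is committed once every replica has it.
final class CommitSketch {
    // Everything below the smallest log end offset has been written by all replicas.
    static long committedUpTo(long[] replicaLogEndOffsets) {
        if (replicaLogEndOffsets.length == 0) {
            throw new IllegalArgumentException("no replicas");
        }
        long min = Long.MAX_VALUE;
        for (long end : replicaLogEndOffsets) {
            min = Math.min(min, end);
        }
        return min;
    }

    // Consumers would only be served offsets for which this returns true.
    static boolean isCommitted(long offset, long[] replicaLogEndOffsets) {
        return offset < committedUpTo(replicaLogEndOffsets);
    }
}
```

For example, with a leader at log end offset 5 and followers at 3 and 4, offsets 0 to 2 are committed and offset 3 is not yet visible to consumers.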


Kafka producers

Publish message to a topic

Message = (key, body)

Hash of key determines topic partition

Carefully choose key to preserve ordering, e.g. stock ticker symbol => all prices for same symbol end up in same partition

Makes request to topic partition’s leader
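Why the key choice preserves ordering is easiest to see in code. The sketch below assumes the partition is the key's hash modulo the partition count, using Java's String.hashCode as a stand-in; Kafka's own partitioner may use a different hash function, so the concrete partition numbers are not meaningful, only the property that equal keys map to the same partition.

```java
// Toy key-based partitioner; the hash function is an assumption, not Kafka's.
final class KeyPartitionerSketch {
    // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Every price update keyed by the same ticker symbol lands in the same partition and is therefore read back in publish order; updates for different symbols may land in different partitions, with no ordering guarantee between them.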


Kafka consumer

Consumes the messages from the partitions of one or more topics

Makes a fetch request to a topic partition’s leader

specifies the partition offset in each request

gets back a chunk of messages

Scale by having N topic partitions, N consumers
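The offset-based fetch described above can be modeled with an in-memory list. This is a sketch of the idea only, assuming one partition, string messages, and offsets that are simply list indexes; it does not use the Kafka client.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single topic partition as an append-only log.
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns the offset it was assigned.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns up to maxMessages starting at the given offset (empty if nothing new).
    List<String> fetch(long offset, int maxMessages) {
        if (offset < 0 || maxMessages < 0) {
            throw new IllegalArgumentException("offset and maxMessages must be non-negative");
        }
        int from = (int) Math.min(offset, messages.size());
        int to = Math.min(from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }
}
```

The consumer, not the log, remembers where it is: after a fetch it advances its own offset by the number of messages received and sends that value with the next request.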


Kafka consumers - between a rock and a hard place

Simple Kafka consumer

Very flexible

BUT you are responsible for contacting the leader of each topic’s partitions and for storing offsets

High level consumer

Does a lot: stores offsets in Zookeeper, deals with leaders, ….

BUT it assumes that if you read a message it has been processed

More flexible consumer is on the way
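The alternative to "read means processed" is to advance the stored offset only after the handler succeeds. The sketch below models that with an in-memory list standing in for a partition; the Handler interface and consume method are hypothetical names, and a real consumer would persist the returned offset somewhere durable.

```java
import java.util.List;

// Toy model of committing an offset only after the message has been processed.
final class CommitAfterProcessingSketch {
    interface Handler {
        void handle(String message) throws Exception;
    }

    // Processes messages from the committed offset onward and returns the new
    // committed offset; stops at the first failure so that message is retried later.
    static int consume(List<String> log, int committedOffset, Handler handler) {
        int next = committedOffset;
        while (next < log.size()) {
            try {
                handler.handle(log.get(next));
            } catch (Exception e) {
                break; // do not advance past a message that was not processed
            }
            next++;
        }
        return next;
    }
}
```

The trade-off is at-least-once delivery: if the process dies after handling a message but before storing the new offset, that message is handled again on restart, so handlers need to tolerate duplicates.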


High-level consumer API

interface ConsumerConnector {
  static create(…. Zookeeper configuration…);

  public <K,V> Map<String, List<KafkaStream<K,V>>> createMessageStreams(Map<String, Integer> topicCountMap, Decoder<K> keyDecoder, Decoder<V> valueDecoder);

  public void commitOffsets();
}

class KafkaStream<K, V> {
  ConsumerIterator<K,V> iterator()
}

interface ConsumerIterator<K,V> {
  MessageAndMetadata<K, V> next()
  boolean hasNext()
}


Kafka at LinkedIn

1100 Kafka brokers organized into more than 60 clusters.


Over 800 billion messages per day

Over 175 terabytes of data

Over 650 terabytes of messages are consumed daily


13 million messages per second

2.75 gigabytes of data per second




Zookeeper, Helix and Kafka are excellent building blocks for distributed systems

