Apache Kafka DC Meetup: Replicating DB Binary Logs to Kafka

Apache Kafka DC: Replicating DB Binary Logs to Kafka. Mark Bittmann, 7 April 2016


Agenda

Meetup Intro

Tech Overview: Kafka and Binary Logs (binlogs)

Change Data Capture Overview

Demo: binlogs -> maxwell -> kafka -> HDFS/Spark/Zeppelin + Elastic

About Me

• Data Scientist who leans Computer Scientist

• Lead Data Scientist, Stackspace.io and b23.io

• PMC Member & Committer, Apache Metron (incubating)

• Contributed to Apache Spark, MLlib

• @_mbittmann_

All businesses are data businesses.

Tech Overview: Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

• Fast

• Scalable

• Durable

http://kafka.apache.org/documentation.html

Design Features

• Distributed => cluster-centric design offers strong durability and fault-tolerance guarantees

• Partitioned => messages spread over a cluster of machines for streams that might exceed capacity of a single machine

• Replicated => messages persisted on disk and replicated within the cluster to prevent data loss

http://kafka.apache.org/documentation.html
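
To make the three design features concrete, here is a minimal in-memory sketch of a partitioned, replicated append-only log. It is a toy model, not Kafka's implementation: the class name `PartitionedLog`, its methods, and the CRC32 hash are all assumptions made for illustration (Kafka's own partitioner and replication protocol work differently).

```python
import zlib


class PartitionedLog:
    """Toy model of a partitioned, replicated append-only log (hypothetical)."""

    def __init__(self, num_partitions=3, replication_factor=2):
        self.num_partitions = num_partitions
        # replicas[r][p] is replica r's copy of partition p.
        self.replicas = [
            [[] for _ in range(num_partitions)]
            for _ in range(replication_factor)
        ]

    def partition_for(self, key):
        # A stable hash sends the same key to the same partition every time,
        # which preserves per-key ordering. CRC32 is an illustrative choice.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key, value):
        partition = self.partition_for(key)
        # Write the record to every replica so one lost copy loses no data.
        for replica in self.replicas:
            replica[partition].append((key, value))
        offset = len(self.replicas[0][partition]) - 1
        return partition, offset

    def read(self, partition, offset, replica=0):
        # Readers pull everything from a given offset onward.
        return self.replicas[replica][partition][offset:]


log = PartitionedLog()
first_partition, first_offset = log.append("user-1", "created")
second_partition, second_offset = log.append("user-1", "updated")
```

Both records for `user-1` land in the same partition at consecutive offsets, and either replica can serve the read.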

Topics

http://kafka.apache.org/documentation.html

Producers/Consumers

consumer groups for queues

http://kafka.apache.org/documentation.html
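
A rough sketch of how consumer groups give queue semantics: within one group each partition is owned by exactly one consumer, while separate groups each see every partition. The function `assign_partitions` and the group and consumer names are hypothetical, and the round-robin rule is a simplification of what a real Kafka group coordinator does.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two independent groups reading the same four-partition topic.
# Group and consumer names are made up for this example.
groups = {
    "hdfs-writer": ["hdfs-1", "hdfs-2"],
    "es-indexer": ["es-1"],
}
partitions = [0, 1, 2, 3]

plan = {
    group: assign_partitions(partitions, members)
    for group, members in groups.items()
}
```

Here the two `hdfs-writer` consumers split the partitions between them (a queue), while `es-indexer` independently receives all four (publish-subscribe across groups).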

https://martin.kleppmann.com/2015/05/27/logs-for-data-infrastructure.html

The power of Kafka lies within what you build around it.

quick Kafka demo

Tech Overview: Binary Logs

The binary log contains a record of all changes to the databases, both data and structure.

https://mariadb.com/kb/en/mariadb/binary-log/

Typical Usage: Replication

http://www.cnblogs.com/fangwenyu/archive/2012/09/03/2669419.html

What does a binary log look like?

It looks like binary.

ROW based binlog

{"database":"bintest","table":"mytable","type":"delete","ts":1459958130,"xid":14261,"commit":true,"data":{"some_blob":"AMgyGQr/","some_text":"text object","id":98,"some_bool":0,"uuid":"fcb3a514-fc0f-11e5-841c-60f81dc2691c","some_value":0,"ts":"2016-04-06"}}
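
As a sketch of what a downstream consumer might do with the JSON event above, the snippet below parses it and pulls out a few fields. The helper `summarize` is hypothetical. It also assumes that `id` is the table's primary key, that the top-level `ts` is a Unix timestamp in seconds, and that the blob column is base64-encoded; none of these is stated on the slide.

```python
import base64
import json
from datetime import datetime, timezone

# The event exactly as shown on the slide.
raw = (
    '{"database":"bintest","table":"mytable","type":"delete",'
    '"ts":1459958130,"xid":14261,"commit":true,'
    '"data":{"some_blob":"AMgyGQr/","some_text":"text object","id":98,'
    '"some_bool":0,"uuid":"fcb3a514-fc0f-11e5-841c-60f81dc2691c",'
    '"some_value":0,"ts":"2016-04-06"}}'
)


def summarize(raw_event):
    """Extract routing and row details from one change event (hypothetical helper)."""
    event = json.loads(raw_event)
    row = event["data"]
    return {
        "source": f'{event["database"]}.{event["table"]}',
        "op": event["type"],
        # Assumption: "id" is the primary key of the table.
        "primary_key": row["id"],
        # Assumption: top-level "ts" is seconds since the Unix epoch.
        "changed_at": datetime.fromtimestamp(event["ts"], tz=timezone.utc),
        # Assumption: binary columns arrive base64-encoded.
        "blob": base64.b64decode(row["some_blob"]),
    }


summary = summarize(raw)
```

The result identifies the change as a delete of row 98 in `bintest.mytable`, which is enough for a consumer to route or apply it.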

Implementations

• MySQL/MariaDB/Aurora/Percona: binlog

• Oracle: GoldenGate

• PostgreSQL: logical decoding

• MongoDB: oplog

• CouchDB: changes feed

quick binlog demo

A database binary log looks a whole lot like a commit log.

Change Data Capture

A change in data means something happened, and when something happens many applications might want to know about it.
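
One way an application can "know about it" is to replay the stream of row changes into its own copy of the table. The sketch below is a minimal illustration, assuming events shaped like the row-based JSON shown earlier and an `id` primary key; `apply_change` and the sample events are made up for this example.

```python
def apply_change(table, event):
    """Apply one row-level change event to a dict keyed by primary key."""
    key = event["data"]["id"]  # assumption: "id" is the primary key
    if event["type"] in ("insert", "update"):
        # The event carries the full row, so the latest write wins.
        table[key] = event["data"]
    elif event["type"] == "delete":
        table.pop(key, None)
    return table


# Illustrative events only; not captured from a real binlog.
events = [
    {"type": "insert", "data": {"id": 98, "some_text": "text object"}},
    {"type": "update", "data": {"id": 98, "some_text": "edited"}},
    {"type": "insert", "data": {"id": 99, "some_text": "another row"}},
    {"type": "delete", "data": {"id": 98, "some_text": "edited"}},
]

downstream_copy = {}
for event in events:
    apply_change(downstream_copy, event)
```

After the replay only row 99 remains, matching what the source table would hold. Any number of consumers can run the same loop independently against the same log.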

Take a snapshot of your database.

Your database snapshot is out of date by the time it is done snapshotting.

https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html

https://martin.kleppmann.com/2015/05/27/logs-for-data-infrastructure.html


“Stupid data engineer, ain't no way I'm changing the web app.”

– all the developers

https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html

beta.stackspace.io

Demo

• Database: MySQL

• DB Client: Custom Python

• Binlog Replicator: Maxwell by Zendesk

• Data Stacks (AWS): Kafka/Zookeeper, Spark/YARN/HDFS/Zeppelin, Elasticsearch/Kibana, StreamSets