Introduction to Kafka Connect
Introduction to
Kafka Connect
Himani Arora
Software Consultant
Knoldus Software LLP
Topics Covered
What is Kafka Connect?
Source and Sinks
Motivation behind Kafka Connect
Use cases of Kafka Connect
Architecture
Demo
What is Kafka Connect?
Added in 0.9 release of Apache Kafka.
Tool for scalably and reliably streaming data between Apache Kafka and other data systems.
For a long time, companies did data processing as big batch jobs: CSV files dumped out of databases, log files collected at the end of the day.
But businesses operate in real time. So, rather than processing data at the end of the day, why not react to it continuously as the data arrives? This is where stream processing came into the picture, and this shift led to the popularity of Apache Kafka.
But even with Apache Kafka, building real-time data pipelines has required some effort.
And this is why Kafka Connect was announced as a new feature in the 0.9 release of Kafka.
It abstracts away the common problems every connector to Kafka needs to solve:
schema management
fault tolerance
delivery semantics
operations, monitoring etc.
What is Kafka Connect?
Schema management: The ability of the data pipeline to carry schema information where it is available.
In the absence of this capability, you end up having to recreate it downstream.
Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it.
Fault tolerance: Run several instances of a process and be resilient to failures
Delivery semantics: Provide strong guarantees when machines fail or processes crash
Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems.
It makes it simple to quickly define connectors that move large data sets into and out of Kafka.
Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency.
Sources and Sinks
Sources import data into Kafka,
and Sinks export data from Kafka.
An implementation of a Source or Sink is a Connector.
Users deploy connectors to enable data flows on Kafka
Some of the certified connectors using the Kafka Connect framework are:
Sources -> JDBC, Couchbase, Apache Ignite, Cassandra
Sinks -> HDFS, Apache Ignite, Solr
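As an illustration of how such a connector is wired up in standalone mode (a sketch only: the JDBC URL, topic prefix, and file paths below are made-up placeholders, and the exact keys vary by connector), the worker and connector configurations might look like:

```properties
# worker.properties -- standalone worker configuration (sketch)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Standalone mode stores source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets
```

```properties
# jdbc-source.properties -- JDBC source connector configuration (sketch)
name=jdbc-source-example
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=2
# Placeholder connection string -- substitute your own database
connection.url=jdbc:postgresql://localhost:5432/mydb
mode=incrementing
incrementing.column.name=id
topic.prefix=mydb-
```

Both files would then be passed to the standalone worker, e.g. `connect-standalone worker.properties jdbc-source.properties`, which runs the connector and its tasks in a single process.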
Motivation behind Kafka Connect
Why build another framework when there are already so many to choose from?
Most of the existing solutions do not integrate optimally with a stream data platform,
where streaming, event-based data is the lingua franca and Kafka is the common medium that serves as a hub for all data.
e.g. log and metric collection and processing frameworks like Flume and Logstash.
They do not handle integration with batch systems well, and are operationally complex for large data pipelines where an agent runs on each server.
Gobblin, Suro: ETL for data warehousing.
Specific use case. Works with a single sink.
Benefits of Kafka Connect
Broad copying by default
Streaming and batch
Scales to the application
Focus on copying data only
Accessible connector API
Quickly define connectors that copy vast quantities of data between systems
Support copying to and from both streaming and batch-oriented systems.
Scale down to a single process running one connector in a small production environment, and scale up to an organization-wide service for copying data between a wide variety of large-scale systems.
Focus on reliable, scalable data copying; leave transformation, enrichment, and other modifications of the data to downstream systems.
The API and runtime model for implementing new connectors should make it simple to develop them.
Architecture
Three major models :
Connector model
Worker model
Data model
Connector Model
The connector model defines how third-party developers create connector plugins which import or export data from another system.
The model has two key concepts: Connectors and Tasks.
Connectors are the largest logical unit of work in Kafka Connect and define where data should be copied to and from.
This might cover copying a whole database or collection of databases into Kafka.
A connector does not perform any copying itself; instead, it breaks the job into tasks and schedules them.
Tasks are responsible for producing or consuming sequences of Kafka ConnectRecords in order to copy data.
Connectors, tasks and workers
Image Source
Kafka Connect's core concept that users interact with is a connector.
Partitions are balanced evenly across tasks.
Each task reads from its partitions, translates the data to Kafka Connect's format, and decides the destination topic (and possibly partition) in Kafka.
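The framework performs this partition-to-task assignment internally; purely as an illustration of the even-balancing idea (this is not Kafka Connect's actual code), a round-robin split could be sketched as:

```python
def assign_partitions(partitions, num_tasks):
    """Round-robin partitions across tasks so each task gets an even share.
    Illustrative sketch only -- Kafka Connect does this inside the framework."""
    assignments = [[] for _ in range(num_tasks)]
    for i, partition in enumerate(partitions):
        assignments[i % num_tasks].append(partition)
    return assignments

# Four table "partitions" copied by two tasks: each task ends up with two.
print(assign_partitions(["users", "orders", "payments", "audit"], 2))
# -> [['users', 'payments'], ['orders', 'audit']]
```

Because the assignment is deterministic and even, adding a partition or a task only shifts a minimal amount of work.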
Worker and Data Model
The worker model represents the runtime in which connectors and tasks execute.
The worker model allows Kafka Connect to scale to the application.
The data model addresses the remaining requirements, like tight integration with Kafka and schema management.
This layer decouples the logical work (connectors) from the physical execution (workers executing tasks)
Workers are processes that execute connectors and tasks
Workers automatically coordinate with each other to distribute work and provide scalability and fault tolerance.
The data model handles all other concerns, like schema management and tight integration with Kafka.
Kafka Connect tracks offsets for each connector so that connectors can resume from their previous position in the event of failures or graceful restarts for maintenance.
It has two types of workers: standalone and distributed.
Worker and Data Model
Standalone mode is the simplest mode, where a single process is responsible for executing all connectors and tasks. Since it is a single process, it requires minimal configuration.
In distributed mode, you start many worker processes using the same group.id, and they automatically coordinate to schedule execution of connectors and tasks across all available workers.
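As a sketch of what the shared distributed-mode configuration looks like (the topic names below are illustrative choices, not mandated values), every worker in the group needs settings along these lines:

```properties
# Distributed worker configuration (sketch; topic names are illustrative)
bootstrap.servers=localhost:9092
# Workers with the same group.id form one Connect cluster
group.id=connect-cluster-1
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Distributed mode stores offsets, connector configs, and statuses in Kafka topics
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```

Unlike standalone mode, connectors are not passed on the command line here; they are submitted to any worker's REST API, and the group coordinates where they run.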
Balancing Work
A simple example: a cluster of three workers (processes launched via any mechanism you choose) running two connectors.
The worker processes have balanced the connectors and tasks across themselves.
Balancing Work
If a connector adds partitions, this causes it to regenerate task configurations.
Balancing Work
If one of the workers fails, the remaining workers rebalance the connectors and tasks so the work previously handled by the failed worker is moved to other workers.
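One way to observe such a rebalance (assuming a worker's REST API on the default port 8083 and a connector name of your own choosing) is to query the status endpoint before and after killing a worker; a sketch:

```shell
# Ask any surviving worker where each task of the connector now runs
curl http://localhost:8083/connectors/jdbc-source-example/status
```

The response reports the connector's state and, for each task, the worker_id it is currently assigned to, so tasks that belonged to the failed worker show up under the surviving ones.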
Questions
References
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines
http://docs.confluent.io/3.0.0/connect/intro.html
THANK YOU