Stream Processing with Big Data: Knowledgent Big Data Palooza Meet-Up
©2014 Knowledgent Group Inc. All Rights Reserved
Stream Processing with Big Data
Learn Apache Kafka
Kishore Veleti
Big Data Engineer
• Big Data Engineer at Knowledgent
• Background in enterprise application development using the Hadoop stack, Java, and PHP
• Worked on Healthcare, Banking, and Social Media applications
• Passionate about sharing knowledge
About Me
Tutorial
We will discuss:
• What is Apache Kafka?
• Apache Kafka Terminology
• Apache Kafka – about Topic & Partition
• Apache Kafka hands-on
• Apache Kafka is a publish-subscribe messaging system implemented as a distributed commit log
• It is written in Scala and Java
• It was built at LinkedIn to process activity stream data from its website
What is Apache Kafka?
• Messages in Kafka are delivered to subscribers in near real time
• A single message can be read by many subscribers
• Kafka persists messages to disk
• Messages are retained for a configured time period
• Subscribers/clients store the state of their own reads (their position in the log)
• Messages are easy to replay: a subscriber can move its position back and read again
What is Apache Kafka?
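The points above about retention, subscriber-held read state, and replay can be sketched with a toy in-memory log. This is a minimal illustration and not Kafka's implementation: `RetainedLog`, `retention_seconds`, and the subscriber names are hypothetical, and timestamps are passed in by hand so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy log that keeps messages for a fixed time window (hypothetical model)."""
    retention_seconds: float
    entries: list = field(default_factory=list)  # (offset, timestamp, payload)
    next_offset: int = 0

    def append(self, payload, now):
        # Every message gets the next sequential offset.
        self.entries.append((self.next_offset, now, payload))
        self.next_offset += 1

    def expire(self, now):
        # Drop messages older than the retention window.
        self.entries = [e for e in self.entries
                        if now - e[1] <= self.retention_seconds]

    def read_from(self, offset):
        # Return retained payloads at or after the requested offset.
        return [payload for (o, _, payload) in self.entries if o >= offset]


log = RetainedLog(retention_seconds=60)
log.append("login", now=0)
log.append("click", now=10)
log.append("logout", now=50)

# Each subscriber stores its own read position; the log does not track it.
positions = {"billing": 0, "analytics": 0}

billing_batch = log.read_from(positions["billing"])
positions["billing"] = log.next_offset  # billing has now read everything

# A second subscriber still sees every message, independently of billing.
analytics_batch = log.read_from(positions["analytics"])

# Replay: billing moves its position back and reads again.
positions["billing"] = 1
replayed = log.read_from(positions["billing"])

# Retention: at time 65 the message written at time 0 is too old.
log.expire(now=65)
still_retained = log.read_from(0)
```

Because the read position lives with the subscriber, replay is only a matter of choosing an earlier offset, as long as the messages have not yet aged out of the retention window.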
• Message: A unit of data to send
• Topic: A category in which Kafka maintains messages
• Partition: A logical division of a topic
• Producer: An API to publish messages to a Kafka topic
• Broker: A server in a Kafka cluster
• Cluster: A Kafka cluster comprises one or more brokers
• Consumer: An API to consume published messages for further processing
• Replication: Kafka replicates the log for each partition across servers
Apache Kafka Terminology
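To make the relationship between brokers, partitions, and replication concrete, here is a small sketch that spreads copies of each partition across brokers. The round-robin rule, the function name `place_replicas`, and the broker names are assumptions for illustration only; Kafka's actual replica assignment is more involved.

```python
def place_replicas(brokers, num_partitions, replication_factor):
    """Assign each partition to `replication_factor` distinct brokers.

    Illustrative round-robin placement, not Kafka's real algorithm.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        placement[partition] = [
            brokers[(partition + r) % len(brokers)]
            for r in range(replication_factor)
        ]
    return placement


cluster = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
placement = place_replicas(cluster, num_partitions=3, replication_factor=2)

for partition, replicas in placement.items():
    print(f"partition {partition}: {replicas}")
```

With two copies of every partition on different servers, the log for a partition remains available if one of those brokers is lost.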
[Diagram labels: Message, Topic, Partition, Producer, Broker, Consumer]
At a high level, producers send messages over the network to the Kafka cluster.
The Kafka cluster, in turn, serves them to consumers.
Apache Kafka Terminology & Big Picture
Let’s do a hands-on exercise with Kafka using what we have learned so far.
Apache Kafka Terminology & Big Picture
For each topic, Kafka maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to.
Each message in a partition is assigned a sequential ID number called the offset.
Apache Kafka: About Topic and Partition
[Figure: a topic with three partitions (Partition 1, Partition 2, Partition 3); writes are appended to each partition]
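The partition and offset description above can be sketched as an append-only list per partition. This is a toy model under simple assumptions: the class name `Partition` and its methods are hypothetical and only mirror the idea that offsets are sequential within a single partition.

```python
class Partition:
    """Toy append-only partition; offsets are list indexes."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # New messages only ever go at the end; existing ones are never changed.
        self._messages.append(message)
        return len(self._messages) - 1  # the offset assigned to this message

    def read(self, offset):
        return self._messages[offset]

    def __len__(self):
        return len(self._messages)


# A topic with three partitions, as in the figure.
topic = [Partition() for _ in range(3)]

first = topic[0].append("a")
second = topic[0].append("b")
other = topic[1].append("c")

print(first, second, other)
```

Offsets count up independently in each partition, so ordering is defined within a partition rather than across the whole topic.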
In Kafka, a Producer is an API to publish messages to a topic.
Apache Kafka: About Topic and Partition
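As a rough sketch of what publishing to a topic involves, the toy producer below picks a partition for each message and appends to it. The keyed messages, the `choose_partition` helper, and the use of CRC32 are assumptions made for this example; real Kafka clients use their own partitioning logic and send data over the network to brokers.

```python
import zlib


def choose_partition(key, num_partitions):
    # Deterministic hash so the same key always maps to the same partition.
    # CRC32 is an illustrative choice, not what Kafka clients use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyProducer:
    """In-memory stand-in for a producer writing to one topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        index = choose_partition(key, len(self.partitions))
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset


producer = ToyProducer(num_partitions=3)
p1, o1 = producer.send("user-42", "page_view")
p2, o2 = producer.send("user-42", "add_to_cart")
```

Routing every message with the same key to the same partition keeps those messages in the order they were sent, since ordering holds within a partition.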
In Kafka, a Consumer is an API to consume messages from topics
Apache Kafka: About Topic and Partition
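A consumer's job can be sketched as a loop that fetches messages in batches from its current position and processes them. `ToyConsumer`, `poll`, and `max_records` here are hypothetical names for an in-memory model; they are not the Kafka client API, and a real consumer would keep polling instead of stopping when the log is empty.

```python
class ToyConsumer:
    """Reads batches from a list that stands in for one partition."""

    def __init__(self, partition_log, start_offset=0):
        self.log = partition_log
        self.position = start_offset  # next offset to read

    def poll(self, max_records):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


partition_log = ["m0", "m1", "m2", "m3", "m4"]
consumer = ToyConsumer(partition_log)

processed = []
while True:
    batch = consumer.poll(max_records=2)
    if not batch:
        break  # nothing left to read in this toy example
    # "Process further": here, just upper-case each message.
    processed.extend(message.upper() for message in batch)
```

Starting a second `ToyConsumer` with `start_offset=consumer.position` would resume exactly where this one stopped.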
• Trading Systems: identifying risk in real time
• Change Data Capture: capturing changed data into a data lake environment
• Online Gaming: identifying the top scorers of a game
Apache Kafka Use Cases
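The online gaming use case can be illustrated with a few lines that keep running totals as score events arrive. The event shape `(player, points)` and the player names are made up for this sketch; in practice the events would be consumed from a Kafka topic instead of a list.

```python
from collections import Counter


def top_scorers(events, n):
    """Sum points per player over a stream of events and return the top n."""
    totals = Counter()
    for player, points in events:
        totals[player] += points
    return totals.most_common(n)


# Hypothetical score events, in arrival order.
events = [("ana", 30), ("raj", 50), ("ana", 40), ("li", 20), ("raj", 10)]
leaders = top_scorers(events, n=2)
print(leaders)
```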
Thank you!
Questions?