
© Hortonworks Inc. 2013

Apache Storm


What is Storm?

• Real-time stream processing framework

• Scalable

–Up to 1 million tuples per second per node

• Fault Tolerant

–Tasks reassigned on failure

• Guaranteed Processing

–At-least-once processing

–Exactly-once processing with some additional work (see Trident)

• Relatively language-agnostic

–Primarily JVM-based

–Thrift API for defining and submitting topologies

–JSON-based protocol for defining components in other languages


Motivation

• Process large amounts of incoming data in real time

• Classic use case is processing streams of tweets

–Calculate trending users

–Calculate reach of a tweet

• Data cleansing and normalization

• Personalization and recommendation

• Log processing


Lambda Architecture


Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/

• Most useful when

–Batch & speed layers do essentially the same computation

–Sample use case: KPI dashboard

• Less useful when

–Batch & speed layers do different computation

–Sample use case: real-time model scoring


Basic Concepts


• Tuple: the most fundamental data structure; a named list of values, each of which can be of any datatype

• Stream: an unbounded sequence of tuples

• Spout: generates streams

• Bolt: contains data processing, persistence, and alerting logic; can also emit tuples for downstream bolts

• Tuple tree: the first tuple plus all the tuples emitted by the bolts that processed it

• Topology: a group of spouts and bolts wired together into a workflow
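To make these concepts concrete, here is a minimal sketch of a spout written against the classic backtype.storm API these slides use. The class name, word list, and field name are illustrative, not from the slides.

import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Illustrative spout: emits an endless stream of one-field tuples.
public class RandomWordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"storm", "kafka", "hadoop"};
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // A tuple is a named list of values; this one has a single field, "word".
        // A real spout would read from an external source and sleep when idle.
        collector.emit(new Values(WORDS[random.nextInt(WORDS.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}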


Architecture

Nimbus (management server):

• Similar to the job tracker

• Distributes code around the cluster

• Assigns tasks

• Handles failures

Supervisor (worker nodes):

• Similar to the task tracker

• Runs bolts and spouts as 'tasks'

ZooKeeper:

• Cluster coordination

• Nimbus HA

• Stores cluster metrics

• Consumption-related metadata for Trident topologies


Relationship Between Supervisors, Workers, Executors & Tasks


Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Each supervisor machine in Storm has specific predefined ports (slots) to which worker processes are assigned.


Tuple Routing


• Shuffle grouping: sends each tuple to a bolt task in a random, round-robin sequence. When to use: atomic operations, e.g. math operations.

• Fields grouping: sends tuples to a bolt task based on one or more fields in the tuple. When to use: segmenting the incoming stream; counting tuples of a certain type.

• All grouping: sends a single copy of each tuple to all instances of a receiving bolt. When to use: sending a signal to all bolts, e.g. clear cache or refresh state; sending a ticker tuple telling bolts to save state.

• Custom grouping: implement your own grouping so tuples are routed based on custom logic (see the sketch below). When to use: for maximum flexibility to change the processing sequence or logic based on factors like data type, load, or seasonality.

• Direct grouping: the emitting component decides which bolt task will receive each tuple. When to use: depends on the application.

• Global grouping: sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID). When to use: global counts.

Stream groupings provide various ways to control how tuples are routed to bolts.
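As a sketch of the custom grouping entry above: the CustomStreamGrouping interface lets you choose the target tasks yourself. The routing rule here (hash of the first output value) is purely illustrative; it roughly re-implements what fields grouping does internally.

import java.util.Arrays;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

// Illustrative custom grouping: route on the hash of the first value.
public class FirstFieldGrouping implements CustomStreamGrouping {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;  // task IDs of the receiving bolt
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Mask off the sign bit so the index is never negative.
        int index = (values.get(0).hashCode() & Integer.MAX_VALUE) % targetTasks.size();
        return Arrays.asList(targetTasks.get(index));  // same value, same task
    }
}

It is wired in with customGrouping(), e.g. builder.setBolt("counter", new CounterBolt()).customGrouping("spout", new FirstFieldGrouping()); (CounterBolt is hypothetical).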


Topology creation example


TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", kafkaSpout);
builder.setBolt("normalizer", new HashTagNormalizer(), 2)
       .shuffleGrouping("spout");
builder.setBolt("enumerator", new HashTagEnumerator(), 2)
       .fieldsGrouping("normalizer", new Fields("hashtag"));
builder.setBolt("reporter", new ResultsReporter(), 1)
       .globalGrouping("enumerator");

Pipeline: Get Tweet -> Find Hashtags -> Count Hashtags -> Report Findings

• Kafka spout "spout": reads tweets from Kafka and emits them.

• Bolt "normalizer": removes non-alphanumeric characters, extracts hashtag values, and emits them.

• Bolt "enumerator": keeps track of how many instances of each hashtag have occurred.

• Bolt "reporter": regularly creates a report and uploads it to Amazon S3.
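The slides do not show the bolt implementations. Below is a hedged sketch of what HashTagNormalizer might look like; the input field name "tweet" and the tokenizing logic are assumptions, and only the output field "hashtag" is fixed by the fieldsGrouping above.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class HashTagNormalizer extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String tweet = input.getStringByField("tweet");  // assumed input field
        for (String token : tweet.split("\\s+")) {
            if (token.startsWith("#")) {
                // Strip non-alphanumeric characters and normalize case.
                String hashtag = token.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
                if (!hashtag.isEmpty()) {
                    collector.emit(input, new Values(hashtag));  // anchored emit
                }
            }
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));  // matches the fieldsGrouping
    }
}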


What happens on failure?

• Run everything with monitoring

–E.g. daemontools or monit

–Restarts Nimbus and Supervisors on failure

• Nimbus

–Stateless (all state is kept in ZooKeeper or on disk)

–Single Point of Failure, Sort Of

– Workers still function, but can’t be reassigned when a node fails

– Supervisors continue as normal

• Supervisor

–Stateless

• Entire Node

–Nimbus reassigns tasks on that machine after timeout


Guaranteed Processing

• Tuples from Spout are tagged with a message ID

• Each of these tuples can result in a tuple tree

• Once every tuple in the tuple tree is processed, the original tuple is considered to be processed.

• Requires two pieces from the user

–Explicitly anchoring an emitted tuple to the input tuple(s)

–Ack or fail every tuple (see the sketch after this list).

• If a tuple tree isn’t fully processed within the configured timeout, the tuple is considered failed.

• Spouts like the Kafka spout can replay tuples on failure, either as explicitly indicated by bolts or from timeouts.

–At least once processing!
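A minimal sketch of those two user obligations inside a bolt's execute() method, assuming collector is the OutputCollector saved in prepare(), as in the bolts sketched earlier:

@Override
public void execute(Tuple input) {
    try {
        // Anchored emit: the new tuple joins the input's tuple tree.
        collector.emit(input, new Values(input.getString(0).trim()));
        collector.ack(input);    // this node of the tuple tree is done
    } catch (Exception e) {
        collector.fail(input);   // tells the spout to replay the tuple now
    }
}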


What is Trident?

• Provides exactly-once processing semantics in Storm

• Core concept: process a group of tuples as a batch, rather than one tuple at a time as core Storm does.

• Higher-level API for defining topologies.

• Under the covers, all Trident topologies are automatically converted into spouts and bolts.
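A minimal sketch of the canonical Trident word count, following the example from the Trident tutorial; spout is assumed to emit a "sentence" field:

TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // State updates are transactional per batch: exactly-once counts.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                             new Fields("count"));

Split is a user-supplied Trident function:

public static class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for (String word : tuple.getString(0).split(" ")) {
            collector.emit(new Values(word));
        }
    }
}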


Parallelism

• Three basic variables: # Slots, # Workers, # Tasks

–No general way to choose them beyond profiling and adjusting.

• Can set the number of executors (threads)

• Can set the number of tasks

–Tasks are NOT parallel within an executor

–More than one task per executor is useful for rebalancing while the topology is running (see the sketch below)

• Number of workers

–Increase when bottlenecked on CPU and each worker has many tuples to process
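A hedged sketch of how these knobs are set in code; the component names and counts are illustrative:

Config conf = new Config();
conf.setNumWorkers(2);  // two worker JVM processes for the topology

// Parallelism hint = 2 executors (threads); 4 tasks spread across them,
// leaving headroom to rebalance up to 4 executors without redeploying.
builder.setBolt("counter", new WordCountBolt(), 2)
       .setNumTasks(4)
       .shuffleGrouping("spout");

The running topology can then be rebalanced from the CLI, e.g. storm rebalance my-topology -n 4 -e counter=4.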


Patterns – Streaming Joins

• Combine two or more data streams

• Unlike a database join, a streaming join has unbounded input, so its semantics must be defined per use case.

• Different types of joins for different use cases

• Partition the input streams the same way, using a fields grouping:

builder.setBolt("join", new MyJoiner(), parallelism)
       .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));


Patterns – Batching

• For efficiency

–E.g. Elasticsearch bulk API

• Hold on to tuples in an instance variable

• Process the buffered tuples as a batch

• Ack all of the buffered tuples

• When emitting, multi-anchor the emitted tuple to ensure reliability (sketched below)

–Anchoring to all batched tuples ensures every batched tuple is replayed on failure
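A hedged sketch of the pattern; bulkWrite() stands in for whatever bulk call is being batched for (e.g. the Elasticsearch bulk API), and size-based flushing is the simplest trigger (production code usually adds a tick-tuple or timer flush as well):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class BatchingBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 100;
    private OutputCollector collector;
    private List<Tuple> buffer;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<Tuple>();
    }

    @Override
    public void execute(Tuple input) {
        buffer.add(input);  // hold on to tuples in an instance variable
        if (buffer.size() >= BATCH_SIZE) {
            bulkWrite(buffer);  // process the whole batch at once
            // Ack only after persistence succeeds; if the worker dies first,
            // the unacked tuples time out and the spout replays them.
            for (Tuple t : buffer) {
                collector.ack(t);
            }
            // If this bolt emitted downstream, it would multi-anchor the
            // emitted tuple to every tuple in the buffer.
            buffer.clear();
        }
    }

    private void bulkWrite(List<Tuple> batch) {
        // Placeholder for the real bulk call.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream.
    }
}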


Patterns – Streaming Top N

• Simplest way: a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items

–Doesn't scale: the whole stream goes through one task

• Alternative: compute a partial top N over each partition of the stream

• Merge the per-partition top Ns to get the global top N

• Use fields grouping to get the partitioning

builder.setBolt("rank", new RankObjects(), parallelism)
       .fieldsGrouping("objects", new Fields("value"));
builder.setBolt("merge", new MergeObjects())
       .globalGrouping("rank");
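The slides don't show RankObjects; here is a hedged sketch of the partial-rank side. Counting per fields-grouped partition and emitting running counts is one simple way to feed the merger.

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Each task sees only its fields-grouped partition of the stream,
// so a per-task in-memory map is enough for a partial top N.
public class RankObjects extends BaseBasicBolt {
    private Map<Object, Long> counts = new HashMap<Object, Long>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        Object value = input.getValueByField("value");
        Long current = counts.get(value);
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(value, updated);
        // A production version would emit its top N periodically (e.g. on a
        // tick tuple) rather than a running count for every input.
        collector.emit(new Values(value, updated));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value", "count"));
    }
}

BaseBasicBolt acks each input automatically after execute() returns, which suits this simple per-tuple update.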
