
© Hortonworks Inc. 2013

Apache Storm


What is Storm?

• Real-time stream processing framework

• Scalable

–Up to 1 million tuples per second per node

• Fault Tolerant

–Tasks reassigned on failure

• Guaranteed Processing

–At-least-once processing

–Exactly-once processing with some additional work (see Trident)

• Relatively language-agnostic

–Primarily JVM-based

–Thrift API for defining and submitting topologies

–JSON-based protocol for defining components in other languages


Motivation

• Process large amounts of incoming data in real time

• Classic use case is processing streams of tweets

–Calculate trending users

–Calculate reach of a tweet

• Data cleansing and normalization

• Personalization and recommendation

• Log processing


Lambda Architecture


Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/

• Most useful when

–Batch & speed layers do essentially the same computation

–Sample use case: KPI dashboard

• Less useful when

–Batch & speed layers do different computation

–Sample use case: real-time model scoring


Basic Concepts


• Tuple: the most fundamental data structure; a named list of values, each of which can be of any datatype

• Stream: an unbounded sequence of tuples

• Spout: generates streams

• Bolt: contains data processing, persistence, and alerting logic; can also emit tuples for downstream bolts

• Tuple tree: the first tuple plus all the tuples emitted by the bolts that processed it

• Topology: a group of spouts and bolts wired together into a workflow
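To make these concepts concrete, here is a minimal sketch of a spout written against the classic backtype.storm API these slides use. The class name, word list, and field name are illustrative, not from the slides.

import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Illustrative spout: emits an endless stream of one-field tuples.
public class RandomWordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"storm", "kafka", "hadoop"};
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // A tuple is a named list of values; this one has a single field, "word".
        // A real spout would read from an external source and sleep when idle.
        collector.emit(new Values(WORDS[random.nextInt(WORDS.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}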


Architecture

Nimbus (management server):

• Similar to the job tracker

• Distributes code around the cluster

• Assigns tasks

• Handles failures

Supervisor (worker nodes):

• Similar to the task tracker

• Runs bolts and spouts as 'tasks'

ZooKeeper:

• Cluster coordination

• Nimbus HA

• Stores cluster metrics

• Consumption-related metadata for Trident topologies


Relationship Between Supervisors, Workers, Executors & Tasks


Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Each supervisor machine in Storm has specific predefined ports (slots) to which worker processes are assigned.


Tuple Routing


• Shuffle grouping: sends each tuple to a bolt task in a random, round-robin sequence. When to use: atomic operations, e.g. math operations.

• Fields grouping: sends tuples to a bolt task based on one or more fields in the tuple. When to use: segmenting the incoming stream; counting tuples of a certain type.

• All grouping: sends a single copy of each tuple to all instances of a receiving bolt. When to use: sending a signal to all bolts, e.g. clear cache or refresh state; sending a ticker tuple telling bolts to save state.

• Custom grouping: implement your own grouping so tuples are routed based on custom logic (see the sketch below). When to use: for maximum flexibility to change the processing sequence or logic based on factors like data type, load, or seasonality.

• Direct grouping: the emitting component decides which bolt task will receive each tuple. When to use: depends on the application.

• Global grouping: sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID). When to use: global counts.

Stream groupings provide various ways to control how tuples are routed to bolts.
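As a sketch of the custom grouping entry above: the CustomStreamGrouping interface lets you choose the target tasks yourself. The routing rule here (hash of the first output value) is purely illustrative; it roughly re-implements what fields grouping does internally.

import java.util.Arrays;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

// Illustrative custom grouping: route on the hash of the first value.
public class FirstFieldGrouping implements CustomStreamGrouping {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;  // task IDs of the receiving bolt
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Mask off the sign bit so the index is never negative.
        int index = (values.get(0).hashCode() & Integer.MAX_VALUE) % targetTasks.size();
        return Arrays.asList(targetTasks.get(index));  // same value, same task
    }
}

It is wired in with customGrouping(), e.g. builder.setBolt("counter", new CounterBolt()).customGrouping("spout", new FirstFieldGrouping()); (CounterBolt is hypothetical).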


Topology creation example


TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", kafkaSpout);
builder.setBolt("normalizer", new HashTagNormalizer(), 2)
       .shuffleGrouping("spout");
builder.setBolt("enumerator", new HashTagEnumerator(), 2)
       .fieldsGrouping("normalizer", new Fields("hashtag"));
builder.setBolt("reporter", new ResultsReporter(), 1)
       .globalGrouping("enumerator");

Pipeline: Get Tweet -> Find Hashtags -> Count Hashtags -> Report Findings

• Kafka spout "spout": reads tweets from Kafka and emits them.

• Bolt "normalizer": removes non-alphanumeric characters, extracts hashtag values, and emits them.

• Bolt "enumerator": keeps track of how many instances of each hashtag have occurred.

• Bolt "reporter": regularly creates a report and uploads it to Amazon S3.
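The slides do not show the bolt implementations. Below is a hedged sketch of what HashTagNormalizer might look like; the input field name "tweet" and the tokenizing logic are assumptions, and only the output field "hashtag" is fixed by the fieldsGrouping above.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class HashTagNormalizer extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String tweet = input.getStringByField("tweet");  // assumed input field
        for (String token : tweet.split("\\s+")) {
            if (token.startsWith("#")) {
                // Strip non-alphanumeric characters and normalize case.
                String hashtag = token.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
                if (!hashtag.isEmpty()) {
                    collector.emit(input, new Values(hashtag));  // anchored emit
                }
            }
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));  // matches the fieldsGrouping
    }
}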


What happens on failure?

• Run everything with monitoring

–E.g. daemontools or monit

–Restarts Nimbus and Supervisors on failure

• Nimbus

–Stateless (all state is kept in ZooKeeper or on disk)

–Single Point of Failure, Sort Of

– Workers still function, but can’t be reassigned when a node fails

– Supervisors continue as normal

• Supervisor

–Stateless

• Entire Node

–Nimbus reassigns tasks on that machine after timeout


Guaranteed Processing

• Tuples from Spout are tagged with a message ID

• Each of these tuples can result in a tuple tree

• Once every tuple in the tuple tree is processed, the original tuple is considered to be processed.

• Requires two pieces from the user

–Explicitly anchoring an emitted tuple to the input tuple(s)

–Ack or fail every tuple (see the sketch after this list).

• If a tuple tree isn’t fully processed within the configured timeout, the tuple is considered failed.

• Spouts like the Kafka spout can replay tuples on failure, either as explicitly indicated by bolts or from timeouts.

–At least once processing!
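A minimal sketch of those two user obligations inside a bolt's execute() method, assuming collector is the OutputCollector saved in prepare(), as in the bolts sketched earlier:

@Override
public void execute(Tuple input) {
    try {
        // Anchored emit: the new tuple joins the input's tuple tree.
        collector.emit(input, new Values(input.getString(0).trim()));
        collector.ack(input);    // this node of the tuple tree is done
    } catch (Exception e) {
        collector.fail(input);   // tells the spout to replay the tuple now
    }
}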


What is Trident?

• Provides exactly-once processing semantics in Storm

• Core concept: process a group of tuples as a batch, rather than one tuple at a time as core Storm does.

• Higher-level API for defining topologies.

• Under the covers, all Trident topologies are automatically converted into spouts and bolts.
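A minimal sketch of the canonical Trident word count, following the example from the Trident tutorial; spout is assumed to emit a "sentence" field:

TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // State updates are transactional per batch: exactly-once counts.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                             new Fields("count"));

Split is a user-supplied Trident function:

public static class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for (String word : tuple.getString(0).split(" ")) {
            collector.emit(new Values(word));
        }
    }
}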


Parallelism

• Three basic variables: # Slots, # Workers, # Tasks

–No general way to choose them beyond profiling and adjusting.

• Can set the number of executors (threads)

• Can set the number of tasks

–Tasks are NOT parallel within an executor

–More than one task per executor is useful for rebalancing while the topology is running (see the sketch below)

• Number of workers

–Increase when bottlenecked on CPU and each worker has many tuples to process
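A hedged sketch of how these knobs are set in code; the component names and counts are illustrative:

Config conf = new Config();
conf.setNumWorkers(2);  // two worker JVM processes for the topology

// Parallelism hint = 2 executors (threads); 4 tasks spread across them,
// leaving headroom to rebalance up to 4 executors without redeploying.
builder.setBolt("counter", new WordCountBolt(), 2)
       .setNumTasks(4)
       .shuffleGrouping("spout");

The running topology can then be rebalanced from the CLI, e.g. storm rebalance my-topology -n 4 -e counter=4.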


Patterns – Streaming Joins

• Combine two or more data streams

• Unlike a database join, a streaming join has unbounded input, so its semantics must be defined per use case.

• Different types of joins for different use cases

• Partition the input streams the same way, using a fields grouping:

builder.setBolt("join", new MyJoiner(), parallelism)
       .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));


Patterns – Batching

• For efficiency

–E.g. Elasticsearch bulk API

• Hold on to tuples in an instance variable

• Process the buffered tuples as a batch

• Ack all of the buffered tuples

• When emitting, multi-anchor the emitted tuple to ensure reliability (sketched below)

–Anchoring to all batched tuples ensures every batched tuple is replayed on failure
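A hedged sketch of the pattern; bulkWrite() stands in for whatever bulk call is being batched for (e.g. the Elasticsearch bulk API), and size-based flushing is the simplest trigger (production code usually adds a tick-tuple or timer flush as well):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class BatchingBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 100;
    private OutputCollector collector;
    private List<Tuple> buffer;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<Tuple>();
    }

    @Override
    public void execute(Tuple input) {
        buffer.add(input);  // hold on to tuples in an instance variable
        if (buffer.size() >= BATCH_SIZE) {
            bulkWrite(buffer);  // process the whole batch at once
            // Ack only after persistence succeeds; if the worker dies first,
            // the unacked tuples time out and the spout replays them.
            for (Tuple t : buffer) {
                collector.ack(t);
            }
            // If this bolt emitted downstream, it would multi-anchor the
            // emitted tuple to every tuple in the buffer.
            buffer.clear();
        }
    }

    private void bulkWrite(List<Tuple> batch) {
        // Placeholder for the real bulk call.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream.
    }
}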


Patterns – Streaming Top N

• Simplest way: a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items

–Doesn't scale: the whole stream goes through one task

• Alternative: compute a partial top N over each partition of the stream

• Merge the per-partition top Ns to get the global top N

• Use fields grouping to get the partitioning

builder.setBolt("rank", new RankObjects(), parallelism)
       .fieldsGrouping("objects", new Fields("value"));
builder.setBolt("merge", new MergeObjects())
       .globalGrouping("rank");
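The slides don't show RankObjects; here is a hedged sketch of the partial-rank side. Counting per fields-grouped partition and emitting running counts is one simple way to feed the merger.

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Each task sees only its fields-grouped partition of the stream,
// so a per-task in-memory map is enough for a partial top N.
public class RankObjects extends BaseBasicBolt {
    private Map<Object, Long> counts = new HashMap<Object, Long>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        Object value = input.getValueByField("value");
        Long current = counts.get(value);
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(value, updated);
        // A production version would emit its top N periodically (e.g. on a
        // tick tuple) rather than a running count for every input.
        collector.emit(new Values(value, updated));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value", "count"));
    }
}

BaseBasicBolt acks each input automatically after execute() returns, which suits this simple per-tuple update.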
