Hadoop Summit Europe 2014: Apache Storm Architecture
© Hortonworks Inc. 2011
P. Taylor Goetz, Apache Storm Committer, @ptgoetz
Apache Storm Architecture and Integration
Real-Time Big Data
Shedding Light on Data
Shedding Light on Big Data
Shedding Light on Big Data In Real Time
What is Storm?
Storm is Streaming
Key enabler of the Lambda Architecture
Storm is Fast
Clocked at 1M+ messages per second per node
Storm is Scalable
Thousands of workers per cluster
Storm is Fault Tolerant
Failure is expected, and embraced
Storm is Reliable
Guaranteed message delivery
Exactly-once semantics
Conceptual Model
Tuple
{…}
• Core Unit of Data
• Immutable Set of Key/Value Pairs
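To make "immutable set of key/value pairs" concrete, here is a minimal plain-Java sketch. `TupleSketch` and its methods are hypothetical stand-ins written for this illustration; they are not Storm's own `Tuple` interface.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a tuple: named fields mapped to values, fixed at construction.
final class TupleSketch {
    private final Map<String, Object> fields;

    TupleSketch(Map<String, Object> fields) {
        // Defensive copy, so later changes to the caller's map cannot leak in.
        this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
    }

    Object getValueByField(String name) {
        return fields.get(name);
    }

    // Read-only view: calling put() on it throws UnsupportedOperationException.
    Map<String, Object> asMap() {
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("word", "storm");
        source.put("count", 1);
        TupleSketch tuple = new TupleSketch(source);
        source.put("count", 99); // does not affect the tuple
        System.out.println(tuple.getValueByField("word") + " " + tuple.getValueByField("count")); // storm 1
    }
}
```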
Streams
{…} {…} {…} {…} {…} {…} {…}
Unbounded Sequence of Tuples
Spouts
• Source of Streams
• Wraps a streaming data source and emits Tuples
{…} {…} {…} {…} {…} {…} {…}
Spout API
public interface ISpout extends Serializable {
    // Lifecycle API
    void open(Map conf,
              TopologyContext context,
              SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();

    // Core API
    void nextTuple();

    // Reliability API
    void ack(Object msgId);
    void fail(Object msgId);
}
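The interplay between `nextTuple()`, `ack()` and `fail()` can be sketched without any Storm dependency. The class below is a hypothetical simplification: it borrows the method names from `ISpout` but uses `long` message ids and collects emitted values in a list instead of using a `SpoutOutputCollector`. It assumes a failed message is simply re-queued for replay.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, Storm-free sketch of the nextTuple/ack/fail contract.
final class ReplayingSpoutSketch {
    private final Queue<String> source = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>(); // msgId -> in-flight message
    private final List<String> emitted = new ArrayList<>();
    private long nextId = 0;

    ReplayingSpoutSketch(List<String> messages) {
        source.addAll(messages);
    }

    // Core API: emit at most one message per call; returns its msgId, or -1 when idle.
    long nextTuple() {
        String message = source.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextId++;
        pending.put(msgId, message);
        emitted.add(message);
        return msgId;
    }

    // Reliability API: the tuple tree completed, so forget the message.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Reliability API: the tuple tree failed, so queue the message for replay.
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            source.add(message);
        }
    }

    List<String> emitted() { return emitted; }
    int pendingCount() { return pending.size(); }

    public static void main(String[] args) {
        ReplayingSpoutSketch spout = new ReplayingSpoutSketch(Arrays.asList("a", "b"));
        long first = spout.nextTuple();  // emits "a"
        long second = spout.nextTuple(); // emits "b"
        spout.ack(first);
        spout.fail(second);              // "b" goes back on the queue
        spout.nextTuple();               // replays "b"
        System.out.println(spout.emitted()); // [a, b, b]
    }
}
```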
Bolts
• Core functions of a streaming computation
• Receive tuples and do stuff
• Optionally emit additional tuples
• Write to a data store
• Read from a data store
• Perform arbitrary computation
• (Optionally) Emit additional streams
{…} {…} {…} {…} {…} {…} {…}
Bolt API
public interface IBolt extends Serializable {
    // Lifecycle API
    void prepare(Map stormConf,
                 TopologyContext context,
                 OutputCollector collector);
    void cleanup();

    // Core API
    void execute(Tuple input);
}
Bolt Output API
public interface IOutputCollector extends IErrorReporter {
    // Core API
    List<Integer> emit(String streamId,
                       Collection<Tuple> anchors,
                       List<Object> tuple);
    void emitDirect(int taskId,
                    String streamId,
                    Collection<Tuple> anchors,
                    List<Object> tuple);

    // Reliability API
    void ack(Tuple input);
    void fail(Tuple input);
}
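A rough picture of what a bolt's `execute()` step typically does (receive, emit anchored output, ack) can be given in plain Java. Everything here is hypothetical: `RecordingCollector` is a simplified stand-in for the output collector, and tuples are reduced to strings so the sketch runs without Storm on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Storm-free sketch of a bolt's execute() step.
final class SplitBoltSketch {
    // Simplified stand-in for the collector: it only records emits and acks.
    static final class RecordingCollector {
        final List<String> emitted = new ArrayList<>();
        final List<String> acked = new ArrayList<>();

        void emit(String anchor, String value) {
            emitted.add(anchor + " -> " + value); // remember which input the output descends from
        }

        void ack(String input) {
            acked.add(input);
        }
    }

    private final RecordingCollector collector;

    SplitBoltSketch(RecordingCollector collector) {
        this.collector = collector; // in Storm this would arrive via prepare(...)
    }

    // Split a sentence into words, anchor each word to the input, then ack the input.
    void execute(String sentence) {
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                collector.emit(sentence, word);
            }
        }
        collector.ack(sentence);
    }

    public static void main(String[] args) {
        RecordingCollector collector = new RecordingCollector();
        new SplitBoltSketch(collector).execute("hello storm");
        System.out.println(collector.emitted); // [hello storm -> hello, hello storm -> storm]
        System.out.println(collector.acked);   // [hello storm]
    }
}
```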
Topologies
• DAG of Spouts and Bolts
• Data Flow Representation
• Streaming Computation
• Storm executes spouts and bolts as individual Tasks that run in parallel on multiple machines.
Stream Groupings
Stream Groupings determine how Storm routes Tuples between tasks in a topology
Shuffle
Randomized round-robin.
LocalOrShuffle
Randomized round-robin (with a preference for intra-worker Tasks).
Fields Grouping
Ensures all Tuples with the same field value(s) are always routed to the same task
(this is a simple hash of the field values, modulo the number of tasks).
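The "hash of the field values, modulo the number of tasks" rule can be sketched in a few lines of Java. This is an illustration only: `FieldsGroupingSketch` and `chooseTask` are hypothetical names, and using `List.hashCode()` as the hash is an assumption made for the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of fields grouping: same field values, same task.
final class FieldsGroupingSketch {
    // Pick a task index from the grouping field values: hash, then modulo.
    static int chooseTask(List<Object> fieldValues, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even when the hash is negative.
        return Math.floorMod(fieldValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        int first = chooseTask(Arrays.<Object>asList("storm"), numTasks);
        int second = chooseTask(Arrays.<Object>asList("storm"), numTasks);
        System.out.println(first == second); // true
    }
}
```

Because the result depends on `numTasks`, changing the number of tasks would route the same field values to a different task in this sketch.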
Physical View
ZooKeeper, Nimbus
Supervisor Supervisor Supervisor Supervisor
Worker* Worker* Worker* Worker*
Topology Deployment
Components: ZooKeeper, Nimbus, Supervisors, Topology Submitter
$ bin/storm jar
1. Topology Submitter uploads topology:
   • topology.jar
   • topology.ser
   • conf.ser
2. Nimbus calculates assignments and sends to ZooKeeper
3. Supervisor nodes receive assignment information via ZooKeeper watches.
4. Supervisor nodes download topology from Nimbus:
   • topology.jar
   • topology.ser
   • conf.ser
5. Supervisors spawn workers (JVM processes) to start the topology
Fault Tolerance
• Workers heartbeat back to Supervisors and Nimbus via ZooKeeper, as well as locally.
• If a worker dies (fails to heartbeat), the Supervisor will restart it.
• If a worker dies repeatedly, Nimbus will reassign the work to other nodes in the cluster.
• If a supervisor node dies, Nimbus will reassign the work to other nodes.
• If Nimbus dies, topologies will continue to function normally, but won’t be able to perform reassignments.
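Heartbeat-based failure detection, as described above, boils down to comparing the time of the last heartbeat against a timeout. The sketch below is a hypothetical illustration in plain Java; the class name, the timeout value and the use of caller-supplied timestamps are all assumptions, not Storm's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of heartbeat-based failure detection.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a worker reported in at the given time.
    void heartbeat(String workerId, long nowMillis) {
        lastHeartbeat.put(workerId, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout are treated as dead.
    List<String> deadWorkers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(5000);
        monitor.heartbeat("worker-1", 0);
        monitor.heartbeat("worker-2", 4000);
        // At t=6000, worker-1 has been silent for 6000 ms and would be restarted.
        System.out.println(monitor.deadWorkers(6000)); // [worker-1]
    }
}
```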
Parallelism
Scaling a Distributed Computation
Worker (JVM) → Executor (Thread) → Task
• 1 Worker, Parallelism = 1
• 1 Worker, Parallelism = 2
• 1 Worker, Parallelism = 2, NumTasks = 2
• 3 Workers, Parallelism = 1, NumTasks = 1
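The Worker/Executor/Task hierarchy can be illustrated by spreading a component's tasks over its executor threads. The sketch below is hypothetical: it treats the parallelism setting as the number of executors and uses a round-robin assignment purely for illustration, which is an assumption rather than a description of Storm's scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spread task ids over executor threads.
final class TaskAssignmentSketch {
    // Returns one list of task ids per executor, assigned round-robin.
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            executors.get(task % numExecutors).add(task);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Four tasks on two executors: each thread runs two tasks.
        System.out.println(assign(4, 2)); // [[0, 2], [1, 3]]
    }
}
```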
Internal Messaging
Worker Mechanics
Worker Internal Messaging (diagram components)
• Worker Port
• Worker Receive Thread
• Receive Buffer: List<List<Tuple>>
• Executor Thread*: Inbound Queue, Task(s) (Spout/Bolt), Outbound Queue, Router Send Thread
• Transfer Buffer: List<List<Tuple>>
• Worker Transfer Thread
• To Other Workers
Reliable Processing
At Least Once
• Bolts may emit Tuples Anchored to one received. Tuple “B” is a descendant of Tuple “A”.
{A} → {B}
• Multiple Anchorings form a Tuple tree (bolts not shown), here Tuples {A} through {H}.
• Bolts can Acknowledge that a tuple has been processed successfully.
• Acks are delivered via a system-level bolt (the Acker Bolt).
• Bolts can also Fail a tuple to trigger a spout to replay the original.
• Any failure in the Tuple tree will trigger a replay of the original tuple.
How to track a large-scale tuple tree efficiently?
A single 64-bit integer.
XOR Magic
long a = random.nextLong(), b = random.nextLong(), c = random.nextLong(); // random is a java.util.Random
(a ^ a) == 0
(a ^ a ^ b) != 0 // for any nonzero b
(a ^ a ^ b ^ b) == 0
(a ^ (a ^ b) ^ c ^ (b ^ c)) == 0
Acks can arrive asynchronously, in any order
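The XOR identities above suggest how one 64-bit value can track a whole tuple tree: XOR each tuple id in once when the tuple is created and once more when it is acked, and the value returns to zero when every tuple has been acked. The sketch below is a hypothetical illustration of that idea; `AckTrackerSketch` and its method names are invented for this example.

```java
import java.util.Random;

// Hypothetical sketch of XOR-based tracking of a tuple tree.
final class AckTrackerSketch {
    private long ackVal = 0L;

    // XOR the id in when a tuple is created (anchored into the tree).
    void anchored(long tupleId) {
        ackVal ^= tupleId;
    }

    // XOR the same id in again when that tuple is acked.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // Zero means every id has been seen exactly twice: the tree is complete.
    boolean treeComplete() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        Random random = new Random();
        long a = random.nextLong(), b = random.nextLong(), c = random.nextLong();

        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.anchored(a);
        tracker.anchored(b);
        tracker.anchored(c);

        // Acks may arrive in any order; XOR is commutative and associative.
        tracker.acked(c);
        tracker.acked(a);
        System.out.println("after two acks: " + tracker.treeComplete());
        tracker.acked(b);
        System.out.println("after all acks: " + tracker.treeComplete());
    }
}
```

With random 64-bit ids, a false "complete" result requires an unlucky collision that cancels to zero, which is why random ids are used rather than small counters.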
Trident
High-level abstraction built on Storm’s core primitives.
Built-in support for:
• Merges and Joins
• Aggregations
• Groupings
• Functions
• Filters
Stateful, incremental processing on top of any persistence store.
Trident is Storm
Fluent, Stream-Oriented API
TridentTopology topology = new TridentTopology();
FixedBatchSpout spout = new FixedBatchSpout(…);
Stream stream = topology.newStream("words", spout);

// MyFunction and MyFilter are user-defined functions
stream.each(…, new MyFunction())
      .groupBy()
      .each(…, new MyFilter())
      .persistentAggregate(…);
Trident
• Micro-Batch Oriented: tuples are processed as Tuple Micro-Batches
• Trident Batches are Ordered (Batch #1, then Batch #2)
• Trident Batches can be Partitioned: a Partition Operation splits a Tuple Micro-Batch into Partition A, Partition B, Partition C, Partition D
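One way to picture what "batches are ordered" can mean in practice: even if batches finish processing out of order, their results are released strictly by batch id. The sketch below is a hypothetical illustration of that idea in plain Java; the class, the starting batch id of 1 and the hold-back strategy are assumptions, not Trident's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: release batches for commit strictly in batch-id order.
final class OrderedCommitSketch {
    private final TreeSet<Long> finished = new TreeSet<>();
    private long nextToCommit = 1;

    // Record that a batch finished processing and return the
    // batch ids that may be committed now, in order.
    List<Long> batchFinished(long batchId) {
        finished.add(batchId);
        List<Long> committed = new ArrayList<>();
        // Drain every consecutive batch starting at the next expected id.
        while (finished.remove(nextToCommit)) {
            committed.add(nextToCommit);
            nextToCommit++;
        }
        return committed;
    }

    public static void main(String[] args) {
        OrderedCommitSketch committer = new OrderedCommitSketch();
        System.out.println(committer.batchFinished(2)); // [] (batch 1 is still outstanding)
        System.out.println(committer.batchFinished(1)); // [1, 2]
        System.out.println(committer.batchFinished(3)); // [3]
    }
}
```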
Trident Operation Types
1. Local Operations (Functions/Filters)
2. Repartitioning Operations (Stream Groupings, etc.)
3. Aggregations
4. Merges/Joins
Trident Topologies
each (Function) → each (Filter) → shuffle → partition persist
Partitioning operations define the boundaries between bolts, and thus network transfer and parallelism
• Bolt 1: each (Function), each (Filter)
• Partitioning Operation: shuffle, i.e. shuffleGrouping()
• Bolt 2: partition persist
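The rule that partitioning operations mark the boundaries between bolts can be sketched as a simple pass over a list of operation names. This is a hypothetical illustration: the operation names and the set of names treated as repartitioning are assumptions made for the example, not Trident's planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: split a chain of operations into bolts at repartitioning steps.
final class BoltBoundarySketch {
    // Assumed names of repartitioning operations for this sketch.
    private static final Set<String> REPARTITIONING =
            new HashSet<>(Arrays.asList("shuffle", "partitionBy", "global", "broadcast"));

    static List<List<String>> groupIntoBolts(List<String> operations) {
        List<List<String>> bolts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : operations) {
            if (REPARTITIONING.contains(op)) {
                // A repartitioning step closes the current bolt (a network hop follows).
                if (!current.isEmpty()) {
                    bolts.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(op); // local operations stay in the same bolt
            }
        }
        if (!current.isEmpty()) {
            bolts.add(current);
        }
        return bolts;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("each(Function)", "each(Filter)", "shuffle", "partitionPersist");
        System.out.println(groupIntoBolts(ops)); // [[each(Function), each(Filter)], [partitionPersist]]
    }
}
```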
Trident Batch Coordination
Components: Trident Spout, Master Batch Coordinator, User Logic
next batch
{…} {…} {…} {…}
commit
Controlling Deployment
How do you control where spouts and bolts get deployed in a cluster?
• Pluggable Schedulers
• Isolation Scheduler
Wait… Nimbus, Supervisor, Schedulers…
Doesn’t that sound kind of like resource negotiation?
Storm on YARN
HADOOP 2.0: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
MapReduce (batch) | Tez (interactive) | Apache STORM (streaming)
YARN (cluster resource management)
HDFS2 (redundant, reliable storage)
• Batch and real-time on the same cluster
• Security and Multi-tenancy
• Elasticity
Storm on YARN
Storm’s resource management system maps very naturally to the YARN model.
Storm components:
  Nimbus: Resource Management, Scheduling
  Supervisor: Node and Process management
  Workers: Run topology tasks
YARN components:
  YARN RM: Resource Management
  Storm AM: Manage Topology
  Containers: Run topology tasks
  YARN NM: Process Management
• High Availability
• Detect and scale around bottlenecks
• Optimize for available resources
Shameless Plug
https://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book
Thank You!
Contributions welcome.
Join the Storm community at: http://storm.incubator.apache.org
P. Taylor Goetz @ptgoetz