Apache Storm Basics
Apache Storm: Parallel Real-Time Computation
What’s Storm?
• It’s a distributed real-time computation system.
• It’s free and open source.
Storm Applications
• Real-time analytics
• Online machine learning
• Distributed RPC
• Others
Storm Qualities
• Broad set of use cases
• Scalable
• Guarantees no data loss
• Robust / fault tolerant
• Programming-language agnostic
Storm Architecture
Streams
• A stream is an unbounded sequence of tuples.
• Streams are defined with a schema that names the fields in the stream’s tuples.
Spouts
• A spout is a source of streams for a given topology.
• It reads data from an external source and emits it into the topology as tuples.
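As a sketch, a spout that emits random sentences might look like the following (this assumes the `org.apache.storm` storm-client dependency and won't run standalone; `RandomSentenceSpout` is an illustrative name, and a real spout would read from an external source such as Kafka or a queue):

```java
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Emit one sentence per call; Storm calls this method in a loop.
        collector.emit(new Values(SENTENCES[random.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The stream's schema: a single field named "sentence".
        declarer.declare(new Fields("sentence"));
    }
}
```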
Bolts
• A bolt is the processing element in the topology.
• Bolts can do simple stream transformations like: filtering, aggregations, functions, joins, etc.
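A minimal sketch of such a transformation (same storm-client dependency assumption; `SplitSentenceBolt` is an illustrative name) — a bolt that splits each sentence tuple into one tuple per word:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Read the "sentence" field and emit one tuple per word.
        String sentence = tuple.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```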
Topologies
• A topology contains all the logic for the real-time application.
• A topology is a graph of spouts and bolts that are connected by stream groupings.
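Wiring such a graph together can be sketched with Storm's `TopologyBuilder` (again assuming the storm-client dependency; the spout and bolt class names here are hypothetical placeholders):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout with 2 tasks feeding two bolts; the stream groupings
        // (shuffle, fields) decide how tuples flow between task sets.
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2); // run the topology across two worker JVMs

        StormSubmitter.submitTopology("word-count", conf,
                                      builder.createTopology());
    }
}
```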
Tasks
• Each spout or bolt executes as one or more tasks across the cluster.
• Each task corresponds to one thread of execution.
• Stream groupings define how to send tuples from one set of tasks to another set of tasks.
Stream Groupings
• A stream grouping defines for a given bolt which streams it should receive as input.
• A stream grouping also defines how the stream’s tuples are partitioned among the bolt tasks.
Shuffle Grouping
• Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
Fields Grouping
• The stream is partitioned by the fields specified in the grouping.
• If the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task.
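A self-contained sketch of this idea (no Storm dependency; Storm's actual fields grouping works along these lines, hashing the grouping fields modulo the number of target tasks):

```java
public class FieldsGroupingDemo {
    // Pick a task for a tuple based on its "user-id" field.
    // Math.floorMod keeps the result non-negative even when hashCode() < 0.
    static int taskFor(String userId, int numTasks) {
        return Math.floorMod(userId.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // Same "user-id" value -> same task, every time.
        System.out.println("user-42 -> task " + taskFor("user-42", numTasks));
        System.out.println("user-42 -> task " + taskFor("user-42", numTasks));
    }
}
```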
Global Grouping
• The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
Workers
• Topologies execute across one or more worker processes.
• Each worker process is a physical JVM and executes a subset of all the tasks for the topology.
• If the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks.
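The slide's arithmetic as a one-line sketch (tasks are divided evenly across worker JVMs):

```java
public class WorkerMath {
    // combined parallelism / number of workers = tasks per worker
    static int tasksPerWorker(int combinedParallelism, int numWorkers) {
        return combinedParallelism / numWorkers;
    }

    public static void main(String[] args) {
        System.out.println(tasksPerWorker(300, 50)); // prints 6
    }
}
```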
A Basic Storm Topology
A (not so) Basic Storm Topology
Demo
Thanks!