Storm: The Real-Time Layer - GlueCon 2012

74
Storm Dan Lynn [email protected] @danklynn The Real-Time Layer Your Big Data’s Been Missing

description

Created by Nathan Marz at Twitter, Storm promises to help companies augment their batch-based big data processing systems with real-time computation.

Transcript of Storm: The Real-Time Layer - GlueCon 2012

Page 1: Storm: The Real-Time Layer  - GlueCon 2012

Storm

Dan [email protected]

@danklynn

The Real-Time Layer Your BigData’s Been Missing

Page 2: Storm: The Real-Time Layer  - GlueCon 2012

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & [email protected]

@danklynn

Page 3: Storm: The Real-Time Layer  - GlueCon 2012

Turn Partial Contacts Into Full Contacts

Page 4: Storm: The Real-Time Layer  - GlueCon 2012

Your data is old

Page 5: Storm: The Real-Time Layer  - GlueCon 2012

Old data is confusing

Page 6: Storm: The Real-Time Layer  - GlueCon 2012

First, there was the spreadsheet

Page 7: Storm: The Real-Time Layer  - GlueCon 2012

Next, there was SQL

Page 8: Storm: The Real-Time Layer  - GlueCon 2012

Then, there wasn’t SQL

Page 9: Storm: The Real-Time Layer  - GlueCon 2012

Batch Processing=

Stale Data

Page 10: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Computation

Page 11: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Page 12: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Workers

QueueMessagesMessagesMessagesMessagesMessagesMessagesMessages

Page 13: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Workers

QueueMessagesMessagesMessagesMessagesMessagesMessagesMessages

Message routing can be complex

Page 14: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Page 15: Storm: The Real-Time Layer  - GlueCon 2012

Brittle

Page 16: Storm: The Real-Time Layer  - GlueCon 2012

Hard to scale

Page 17: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Page 18: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Page 19: Storm: The Real-Time Layer  - GlueCon 2012

Queues / Workers

Page 20: Storm: The Real-Time Layer  - GlueCon 2012

Queues / WorkersRouting must bereconfigured whenscaling out

Page 21: Storm: The Real-Time Layer  - GlueCon 2012

Storm

Page 22: Storm: The Real-Time Layer  - GlueCon 2012

StormDistributed and fault-tolerant real-time computation

Page 23: Storm: The Real-Time Layer  - GlueCon 2012

StormDistributed and fault-tolerant real-time computation

Page 24: Storm: The Real-Time Layer  - GlueCon 2012

StormDistributed and fault-tolerant real-time computation

Page 25: Storm: The Real-Time Layer  - GlueCon 2012

StormDistributed and fault-tolerant real-time computation

Page 26: Storm: The Real-Time Layer  - GlueCon 2012

Key Concepts

Page 27: Storm: The Real-Time Layer  - GlueCon 2012

TuplesOrdered list of elements

Page 28: Storm: The Real-Time Layer  - GlueCon 2012

TuplesOrdered list of elements

("search-01384", "e:[email protected]")

Page 29: Storm: The Real-Time Layer  - GlueCon 2012

StreamsUnbounded sequence of tuples

Page 30: Storm: The Real-Time Layer  - GlueCon 2012

StreamsUnbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple

Page 31: Storm: The Real-Time Layer  - GlueCon 2012

SpoutsSource of streams

Page 32: Storm: The Real-Time Layer  - GlueCon 2012

SpoutsSource of streams

Page 33: Storm: The Real-Time Layer  - GlueCon 2012

SpoutsSource of streams

Tuple Tuple Tuple Tuple Tuple Tuple

Page 34: Storm: The Real-Time Layer  - GlueCon 2012

Spouts can talk with

some images from http://commons.wikimedia.org

Page 35: Storm: The Real-Time Layer  - GlueCon 2012

Spouts can talk with

some images from http://commons.wikimedia.org

•Queues

Page 36: Storm: The Real-Time Layer  - GlueCon 2012

Spouts can talk with

some images from http://commons.wikimedia.org

•Queues

•Web logs

Page 37: Storm: The Real-Time Layer  - GlueCon 2012

Spouts can talk with

some images from http://commons.wikimedia.org

•Queues

•Web logs

•API calls

Page 38: Storm: The Real-Time Layer  - GlueCon 2012

Spouts can talk with

some images from http://commons.wikimedia.org

•Queues

•Web logs

•API calls

•Event data

Page 39: Storm: The Real-Time Layer  - GlueCon 2012

BoltsProcess tuples and create new streams

Page 40: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

Tuple Tuple Tuple Tuple Tuple Tuple

some images from http://commons.wikimedia.org

TupleTuple

TupleTuple

TupleTuple

TupleTuple

TupleTuple

TupleTuple

Page 41: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

Page 42: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

•Apply functions / transforms

Page 43: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

•Apply functions / transforms•Filter

Page 44: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

•Apply functions / transforms•Filter•Aggregation

Page 45: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

•Apply functions / transforms•Filter•Aggregation•Streaming joins

Page 46: Storm: The Real-Time Layer  - GlueCon 2012

Bolts

some images from http://commons.wikimedia.org

•Apply functions / transforms•Filter•Aggregation•Streaming joins•Access DBs, APIs, etc...

Page 47: Storm: The Real-Time Layer  - GlueCon 2012

TopologiesA directed graph of Spouts and Bolts

Page 48: Storm: The Real-Time Layer  - GlueCon 2012

This is a Topology

some images from http://commons.wikimedia.org

Page 49: Storm: The Real-Time Layer  - GlueCon 2012

This is also a topology

some images from http://commons.wikimedia.org

Page 50: Storm: The Real-Time Layer  - GlueCon 2012

TasksProcesses which execute Streams or Bolts

Page 51: Storm: The Real-Time Layer  - GlueCon 2012

Running a Topology

$ storm jar my-code.jar com.example.MyTopology arg1 arg2

Page 52: Storm: The Real-Time Layer  - GlueCon 2012

Storm Cluster

Nathan Marz

Page 53: Storm: The Real-Time Layer  - GlueCon 2012

Storm Cluster

Nathan Marz

If this wereHadoop...

Page 54: Storm: The Real-Time Layer  - GlueCon 2012

Storm Cluster

Nathan Marz

Job Tracker

If this wereHadoop...

Page 55: Storm: The Real-Time Layer  - GlueCon 2012

Storm Cluster

Nathan MarzTask Trackers

If this wereHadoop...

Page 56: Storm: The Real-Time Layer  - GlueCon 2012

Storm Cluster

Nathan MarzCoordinates everything

But it’s not Hadoop

Page 57: Storm: The Real-Time Layer  - GlueCon 2012

Example:Streaming Word Count

Page 58: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

Page 59: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

Page 60: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

public static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }

    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }

    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }}

SplitSentence.java

Page 61: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

public static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }

    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }

    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }}

SplitSentence.java

splitsentence.py

Page 62: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

public static class SplitSentence extends ShellBolt implements IRichBolt {            public SplitSentence() {        super("python", "splitsentence.py");    }

    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word"));    }

    @Override    public Map<String, Object> getComponentConfiguration() {        return null;    }}

SplitSentence.java

Page 63: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

java

Page 64: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

public static class WordCount extends BaseBasicBolt {    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override    public void execute(Tuple tuple, BasicOutputCollector collector) {        String word = tuple.getString(0);        Integer count = counts.get(word);        if(count==null) count = 0;        count++;        counts.put(word, count);        collector.emit(new Values(word, count));    }

    @Override    public void declareOutputFields(OutputFieldsDeclarer declarer) {        declarer.declare(new Fields("word", "count"));    }}

WordCount.java

Page 65: Storm: The Real-Time Layer  - GlueCon 2012

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

java

Groupings control how tuples are routed

Page 66: Storm: The Real-Time Layer  - GlueCon 2012

Shuffle groupingTuples are randomly distributed across all of the

tasks running the bolt

Page 67: Storm: The Real-Time Layer  - GlueCon 2012

Fields groupingGroups tuples by specific named fields and routes

them to the same task

Page 68: Storm: The Real-Time Layer  - GlueCon 2012

Fields groupingGroups tuples by specific named fields and routes

them to the same task

Analogous to Hadoop’s

partitioning behavior

Page 69: Storm: The Real-Time Layer  - GlueCon 2012

Distributed RPC

Page 70: Storm: The Real-Time Layer  - GlueCon 2012

Before Distributed RPC,time-sensitive queries relied on a

pre-computed index

Page 71: Storm: The Real-Time Layer  - GlueCon 2012

What if you didn’t need an index?

Page 72: Storm: The Real-Time Layer  - GlueCon 2012

Distributed RPC

Page 73: Storm: The Real-Time Layer  - GlueCon 2012

Try it out!

@stormprocessor

https://github.com/nathanmarz/storm-starter

http://github.com/nathanmarz/storm

Huge thanks to Nathan Marz - @nathanmarz