Storm - As deep into real-time data processing as you can get in 30 minutes.

Posted on 10-May-2015


My slides from GlueCon 2013

Transcript of Storm - As deep into real-time data processing as you can get in 30 minutes.

Storm

Dan Lynn (dan@fullcontact.com)

@danklynn

As deep into real-time data processing as you can get in 30 minutes.

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & Co-Founder (dan@fullcontact.com)

@danklynn

Turn Partial Contacts Into Full Contacts

Storm

Storm: Distributed and fault-tolerant real-time computation

THE HARD WAY

Queues

Workers

THE HARD WAY

Key Concepts

Tuples: Ordered list of elements

("search-01384", "e:dan@fullcontact.com")

Streams: Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple

Spouts: Source of streams


Spouts can talk with:

• Queues
• Web logs
• API calls
• Event data

(some images from http://commons.wikimedia.org)
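For concreteness, a minimal spout sketch (a hypothetical example, not code from the talk) that emits random sentences in the style of the word-count example later on:

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Map;
import java.util.Random;

// Hypothetical spout: emits one random sentence per nextTuple() call.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Random random = new Random();
    private final String[] sentences = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // avoid a busy loop when there is nothing better to do
        collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}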

Bolts: Process tuples and create new streams

• Apply functions / transforms
• Filter
• Aggregation
• Streaming joins
• Access DBs, APIs, etc.

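As a sketch of the filtering case (hypothetical, not from the talk), a bolt that drops short words and re-emits everything else:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical filter bolt: only re-emits words longer than three characters.
public class LongWordFilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        if (word.length() > 3) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}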

Topologies: A directed graph of Spouts and Bolts

This is a Topology


This is also a topology


Tasks: Execute Spouts or Bolts

Running a Topology

$ storm jar my-code.jar com.example.MyTopology arg1 arg2
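A sketch of what a main class like com.example.MyTopology typically does (the class name comes from the command above; the wiring and worker count here are assumptions):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class MyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... setSpout / setBolt calls go here ...

        Config conf = new Config();
        conf.setNumWorkers(4); // how many worker JVMs to ask the cluster for

        // Submit under the name passed on the command line (e.g. arg1 above).
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}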

Storm Cluster

(cluster diagram, credit: Nathan Marz)

If this were Hadoop...

• Nimbus would be the Job Tracker
• the Supervisors would be the Task Trackers
• ZooKeeper coordinates everything

But it's not Hadoop

Example: Streaming Word Count

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);

builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");

builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));


Streaming Word Count

public static class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}

SplitSentence.java


splitsentence.py (the Python script invoked by the ShellBolt above; its source is not reproduced in the transcript)



Streaming Word Count

public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

WordCount.java

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);

builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");

builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

java

Groupings control how tuples are routed

Shuffle grouping: Tuples are randomly distributed across all of the tasks running the bolt.

Fields grouping: Groups tuples by specific named fields and routes them to the same task. Analogous to Hadoop's partitioning behavior.
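These are the same builder calls as in the word-count wiring above, repeated here with comments added to show the routing:

builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");                 // any "split" task can get any sentence

builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));  // all tuples with the same "word"
                                                      // land on the same "count" task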

Trending Topics

Twitter Trending Topics

(tweets)
   ↓
TwitterStreamingTopicSpout, parallelism = 1 (unless you use Gnip)
   ↓ (word)
RollingCountsBolt, parallelism = n
   ↓ (word, count)
IntermediateRankingsBolt, parallelism = n
   ↓ (rankings)
TotalRankingsBolt, parallelism = 1
   ↓ (rankings)
RankingsReportBolt, parallelism = 1
   ↓ (JSON rankings)
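A hedged sketch of how this topology might be wired up: the component names come from the diagram, but the stream ids, groupings, and parallelism numbers here are assumptions, not the code from the live demo.

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("tweets", new TwitterStreamingTopicSpout(), 1);

builder.setBolt("rollingCounts", new RollingCountsBolt(), 4)
       .fieldsGrouping("tweets", new Fields("word"));       // same word -> same counter task

builder.setBolt("intermediateRankings", new IntermediateRankingsBolt(), 4)
       .fieldsGrouping("rollingCounts", new Fields("word"));

builder.setBolt("totalRankings", new TotalRankingsBolt(), 1)
       .globalGrouping("intermediateRankings");             // everything into the single ranker

builder.setBolt("report", new RankingsReportBolt(), 1)
       .globalGrouping("totalRankings");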

Live Coding!


Tips

Use a log aggregator: loggly.com, Graylog2, or logstash

"$topologyName-$buildNumber"

Rolling Deploys

1. Launch the new topology
2. Wait for it to be healthy
3. Kill the old one

Rolling Deploys

These are under active development

Tune your parallelism

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);

builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");

builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

java

see: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology

Supervisor
    Worker Process (JVM)
        Executor (thread)
            Task
            Task
        Executor (thread)
            Task
            Task
    Worker Process (JVM)
        Executor (thread)
            Task
            Task
        Executor (thread)
            Task
            Task

Parallelism hints control the number of Executors
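A short illustration of the knobs, reusing the word-count bolt from above (the numbers are arbitrary): the parallelism hint sets the executor count, setNumTasks sets the task count, and Config.setNumWorkers sets the number of worker JVMs.

Config conf = new Config();
conf.setNumWorkers(2);                          // 2 worker JVMs for the topology

builder.setBolt("count", new WordCount(), 12)   // parallelism hint: 12 executors
       .setNumTasks(24)                         // 24 tasks spread across those executors
       .fieldsGrouping("split", new Fields("word"));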

Anchor your tuples (or not)

Unanchored emit:

collector.emit(new Values(word, count));

Anchored emit:

collector.emit(tuple, new Values(word, count));
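A sketch (assumed, not from the talk) of the word-count bolt rewritten as a BaseRichBolt to show where anchoring and acking fit: anchored tuples are replayed from the spout if processing fails downstream. (With BaseBasicBolt, as in WordCount.java above, anchoring and acking happen automatically.)

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class AnchoredWordCount extends BaseRichBolt {
    private OutputCollector collector;
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);

        // Anchored: ties the new tuple to the input, so a downstream failure
        // causes the spout to replay the original tuple.
        collector.emit(input, new Values(word, count));

        // Tell Storm the input tuple has been fully processed.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}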

But Dan, you left out Trident!

if (storm == hadoop) { trident = pig / cascading}

A little taste of Trident

TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle()
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .parallelismHint(200)
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one"))
    .parallelismHint(20)
    .aggregate(new Count(), new Fields("reach"));

https://github.com/nathanmarz/storm/wiki/Trident-tutorial