Real time stream processing presentation at General Assemb.ly
Real Time Stream Processing
Varun Vijayaraghavan
Why real time?
We live in a world of continuous data.
● Server
● Page Views
● Social Media Events (Images, Comments, etc)
● Sensor Data
● Instant Messaging
● Market Data
Why real time?
● What’s true right now may not have been true in the past.
● Competition is fierce. We need to act before others to win.
● In some cases, “too late” can lead to losses you would rather avoid. :(
● Need to make decisions NOW!
What does such a system look like?
Continuous Data Source → Data Ingestion → Real Time Stream Processors → Instant Visualization / Insights / Alerts
(messages m1, m2, ...)
Properties of real time stream processing systems
● Fast! It needs to keep up.
● Scalable - expand your cluster as data and computation needs grow larger.
● Fault tolerant. It should be able to recover from failure.

Important practical concerns:
● Easy to set up and write applications with.
● Needs to be battle tested. (Any distributed system is extremely hard to get right!)
● Needs to have excellent monitoring and tooling capability.
Practical use cases
● “Real time” predictive analytics for news media (what I do :)
○ Analyze clicks, page views, user behavior, social streams, video views etc. to provide instant insights.
● Cloud based home security systems
○ Analyze sensor data like temperature, video streams, motion sensors etc. to determine threats.
● Trading shares
○ Use real time market data (and other sources) to make instant decisions (buy, sell, long, short etc.)
○ Check out reactivetrader.com
A small aside on parallelism

From Wikipedia: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.”

Simple example: think of crawling / scraping thousands of web pages one after another vs. doing them in parallel.

Two types of parallelism: task parallelism (Apache Storm) and data parallelism (Hadoop, Apache Spark) - although most distributed stream processing systems combine both.
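As a plain-Java illustration of the two styles (this is not Storm or Spark API, just a stdlib sketch): task parallelism runs different operations concurrently, while data parallelism applies the same operation to elements of the data set in parallel.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelismDemo {
    // Data parallelism: the SAME operation (uppercasing) applied to every
    // element, with elements processed on different threads.
    static List<String> uppercaseAll(List<String> pages) {
        return pages.parallelStream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = Arrays.asList("page1", "page2", "page3", "page4");

        // Task parallelism: DIFFERENT operations (counting vs joining)
        // run concurrently, each over the whole data set.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Integer> count = pool.submit(() -> pages.size());
        Future<String> joined = pool.submit(() -> String.join(",", pages));
        System.out.println("count=" + count.get() + " joined=" + joined.get());
        pool.shutdown();

        System.out.println(uppercaseAll(pages));
    }
}
```

The crawling example above is the data-parallel case: the same fetch-and-parse task applied to each page.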
Task Parallelism
Streaming Data Source
Task 1 Task 2
Task 3
Task 4
Data Parallelism
Streaming Data Source
Data 1 Data 2 Data n
Task Task Task
Apache Storm
● Apache Storm is a distributed, real time, stream processing system. It uses task based parallelism.
● From the storm wiki: “Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.”
● It’s fast (runs on the JVM), scalable, robust and fault tolerant. Also, pretty easy to set up and has great tooling.
Components of Storm

Spout: The data source (twitter posts, page views data etc).
Bolt: A processing task. (word-counter, category-classifier etc)
Topology: This is a graph that consists of one or more spouts (data source) and one or more bolts. Each bolt and spout can have several “workers” executing them in parallel.
Tuple: The “message” or “data” that’s passed between bolts and spouts. For instance: (“social-network”: “twitter”, “type”: “retweet”, “post”: “Hi Storm!”, “user_id”: “asdf123”, “post_id”: “1234”, “posted_at”: 2012-07-14T01:00:00)
Storm Architecture
Shuffle Grouping:
“Tuples” from the previous bolt or spout can be processed by any instance of the current bolt.
Fields Grouping:

“Tuples” from the previous spout or bolt will always go to the same instance of the bolt - determined by the value of the grouping field.
Example: If you group by user_id - all posts from that user will always be processed by the same instance.
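Fields grouping can be thought of as hash partitioning: the grouping field’s hash, modulo the number of bolt instances, picks the instance. A minimal plain-Java sketch of the idea (not Storm’s actual implementation):

```java
import java.util.Arrays;
import java.util.List;

public class FieldsGroupingSketch {
    // Picks which of numTasks bolt instances receives a tuple,
    // based only on the value of the grouping field (e.g. user_id).
    static int chooseTask(String groupingFieldValue, int numTasks) {
        return Math.floorMod(groupingFieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<String> userIds = Arrays.asList("asdf123", "qwer456", "asdf123");
        for (String userId : userIds) {
            System.out.println(userId + " -> task " + chooseTask(userId, numTasks));
        }
        // "asdf123" always hashes to the same task, so all of that user's
        // posts are processed by the same bolt instance.
    }
}
```

Because the mapping depends only on the field value, it is deterministic: the same user_id always lands on the same instance, which is what makes per-user state (counts, sessions) safe to keep locally in a bolt.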
Example: Word counts topology
RandomSentenceSpout
SplitSentenceBolt
WordCountBolt
PrinterBolt
Topology that takes a stream of sentences and keeps printing out the number of occurrences of each word.
Example: RandomSentence Spout

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random _rand;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        _rand = new Random();
    }

    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[]{
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        collector.emit(new Values(sentence)); // Sends the sentence to the next bolt
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // Declares what this spout emits
    }
}
Example: SplitSentence Bolt

public class SplitSentence extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector collector) {
String sentence = tuple.getStringByField("sentence");
String[] words = sentence.split(" ");
for(String word: words) {
collector.emit(new Values(word));
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
Example: WordCount Bolt

public class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count = count + 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
Example: Printer Bolt

public class PrinterBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        Integer count = tuple.getIntegerByField("count");
        System.out.println("Word count for " + word + ": " + count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt only prints - it emits nothing downstream
    }
}
Example: WordCount topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        // ...
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
        builder.setBolt("print", new PrinterBolt(), 8).shuffleGrouping("count");
        // ...
    }
}
Let’s run this!
…
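You don’t need a Storm cluster to see what the topology computes. As a sanity check, here is the same spout → split → count → print flow collapsed into plain Java (no Storm APIs, no parallelism):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSimulation {
    // Mirrors SplitSentence + WordCount: split each sentence into words
    // and accumulate a running count per word.
    static Map<String, Integer> countWords(String[] sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
        // Mirrors PrinterBolt: print the count for each word.
        countWords(sentences).forEach((word, count) ->
            System.out.println("Word count for " + word + ": " + count));
    }
}
```

What Storm adds on top of this loop is exactly the parallelism and fault tolerance discussed earlier: many spout and bolt instances, with fields grouping keeping each word’s count on one instance.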
Lambda Architecture
● Architecture proposed by the creator of Storm (nathanmarz)
● Two processing layers:
○ Speed layer
■ This is the real time processing layer, performed by a framework like Storm.
■ This helps you give real time insights, but is complex and can potentially drop data.
○ Batch layer
■ This is the “after the fact” processing layer.
■ This will give you “delayed” insights, but it can be made as reliable as possible.
Lambda Architecture

Continuous Data Source → Speed Layer (eg Storm) → Database
Continuous Data Source → Permanent Archive → Batch Layer (eg Hadoop) → Database
Database → Instant / Delayed visualization and alerts
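At query time the two layers are merged: the batch view is complete but stale, and the speed layer’s realtime view covers only the events since the last batch run. A schematic plain-Java sketch of that merge (the views here are hypothetical in-memory maps, not a real serving database):

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaQuerySketch {
    // Merge a (complete but stale) batch view with a (fresh but partial)
    // realtime view: the answer for each key is the sum of both.
    static Map<String, Integer> merge(Map<String, Integer> batchView,
                                      Map<String, Integer> realtimeView) {
        Map<String, Integer> merged = new HashMap<>(batchView);
        realtimeView.forEach((key, value) -> merged.merge(key, value, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> batch = new HashMap<>();
        batch.put("page_views", 1000); // computed by the batch layer (e.g. Hadoop)
        Map<String, Integer> speed = new HashMap<>();
        speed.put("page_views", 42);   // events seen since the last batch run
        System.out.println(merge(batch, speed)); // prints {page_views=1042}
    }
}
```

If the speed layer drops data, the error is bounded: the next batch run recomputes from the permanent archive and replaces the stale part of the answer.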
Apache Spark
● Apache Spark is a fast and general-purpose cluster computing system. It has MapReduce-like functionality.
● It is a batch based, data parallel system - the data is distributed to the workers in the cluster, and operations are applied to the partitions (similar to Hadoop).
● Spark Streaming uses microbatches (~1s of data).
● Spark Streaming + Spark = Lambda architecture with the same codebase.
Spark Streaming: Simple Example

// Initialize connections and streams
...
// Read the lines
val lines = ssc.socketTextStream("localhost", 9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD
wordCounts.print()
// Start the computation and wait for it to finish
ssc.start()
ssc.awaitTermination()
Questions?