Real time stream processing presentation at General Assemb.ly

Page 1: Real time stream processing presentation at General Assemb.ly

Real Time Stream Processing

Varun Vijayaraghavan

Page 2: Real time stream processing presentation at General Assemb.ly

Why real time?

We live in a world of continuous data.

Server

Page Views

Social Media Event (Image, Comments, etc)

Sensor Data

Instant Messaging

Market Data

Page 3: Real time stream processing presentation at General Assemb.ly

Why real time?

● What’s true right now may not have been true in the past.

● Competition is fierce. We need to act before others to win.

● In some cases, “too late” can lead to losses you would rather avoid. :(

● Need to make decisions NOW!

Page 4: Real time stream processing presentation at General Assemb.ly

What does such a system look like?

Continuous Data Source -> Data Ingestion -> Real Time Stream Processors -> Instant Visualization / Insights / Alerts

m1, m2 ...

Page 5: Real time stream processing presentation at General Assemb.ly

Properties of real time stream processing systems

● Fast! It needs to keep up.

● Scalable. Expand your cluster as data and computation needs grow larger.

● Fault tolerant. It should be able to recover from failure.

Important practical concerns:

● Easy to set up and write applications with.

● Needs to be battle tested. (Any distributed system is extremely hard to get right!)

● Needs to have excellent monitoring and tooling capability.

Page 6: Real time stream processing presentation at General Assemb.ly

Practical use cases

● “Real time” predictive analytics for news media (what I do :)

  ○ Analyze clicks, page views, user behavior, social streams, video views etc. to provide instant insights.

● Cloud based home security systems

  ○ Analyze sensor data like temperature, video streams, motion sensors etc. to determine threats.

● Trading shares

  ○ Use real time market data (and other sources) to make instant decisions (buy, sell, long, short etc.)

  ○ Check out reactivetrader.com

Page 7: Real time stream processing presentation at General Assemb.ly

A small aside on parallelism

From Wikipedia: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.”

Simple example: Think of crawling / scraping thousands of web pages one after another vs. doing them in parallel.

Two types of parallelism: Task Parallelism (Apache Storm) and Data Parallelism (Hadoop, Apache Spark). Most distributed stream processing systems, however, are a combination of both.
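The two styles can be contrasted in a minimal plain-Java sketch. It uses only the standard library (no Storm, Hadoop or Spark), and the class name, method names and sample "pages" are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ParallelismSketch {
    // Data parallelism: the SAME task (counting words) runs on different
    // pieces of the data at the same time.
    static int totalWords(List<String> pages) {
        return pages.parallelStream()
                .mapToInt(page -> page.split(" ").length)
                .sum();
    }

    // Task parallelism: DIFFERENT tasks (counting words, finding the longest
    // page) run at the same time.
    static String summary(List<String> pages) {
        CompletableFuture<Integer> wordCount =
                CompletableFuture.supplyAsync(() -> totalWords(pages));
        CompletableFuture<Integer> longestPage =
                CompletableFuture.supplyAsync(() ->
                        pages.stream().mapToInt(String::length).max().orElse(0));
        // join() waits for each task to finish before building the result
        return "words=" + wordCount.join() + ", longest=" + longestPage.join();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for crawled web pages
        List<String> pages = Arrays.asList("the cow jumped", "over the moon", "hi");
        System.out.println(summary(pages));
    }
}
```

In a real cluster the pieces of data and the tasks would live on different machines; here threads on one machine stand in for them.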

Page 8: Real time stream processing presentation at General Assemb.ly

Task Parallelism

Streaming Data Source

Task 1 Task 2

Task 3

Task 4

Page 9: Real time stream processing presentation at General Assemb.ly

Data Parallelism

Streaming Data Source

Data 1 Data 2 Data n

Task Task Task

Page 10: Real time stream processing presentation at General Assemb.ly

Apache Storm

● Apache Storm is a distributed, real time, stream processing system. It uses task based parallelism.

● From the Storm wiki: “Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation”.

● It’s fast (it runs on the JVM), scalable, robust and fault tolerant. Also, pretty easy to set up and has great tooling.

Page 11: Real time stream processing presentation at General Assemb.ly

Components of Storm

Spout: The data source (Twitter posts, page view data etc).

Bolt: A processing task (word-counter, category-classifier etc).

Topology: This is a graph that consists of one or more spouts (data sources) and one or more bolts. Each bolt and spout can have several “workers” executing them in parallel.

Tuple: The “message” or “data” that’s passed between bolts and spouts. For instance: (“social-network”: “twitter”, “type”: “retweet”, “post”: “Hi Storm!”, “user_id”: “asdf123”, “post_id”: “1234”, “posted_at”: “2012-07-14T01:00:00”)
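To make the tuple idea concrete, here is a minimal plain-Java sketch of a message made of named fields and their values. `TupleSketch` and `getValueByField` are hypothetical names for this illustration only; this is not Storm's own `Tuple` class.

```java
import java.util.Arrays;
import java.util.List;

class TupleSketch {
    private final List<String> fields; // field names, declared by the emitting component
    private final List<Object> values; // one value per field, in the same order

    TupleSketch(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("need exactly one value per field");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by the name of its field
    Object getValueByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }

    public static void main(String[] args) {
        TupleSketch post = new TupleSketch(
                Arrays.asList("social-network", "type", "post"),
                Arrays.<Object>asList("twitter", "retweet", "Hi Storm!"));
        System.out.println(post.getValueByField("post"));
    }
}
```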

Page 12: Real time stream processing presentation at General Assemb.ly

Storm Architecture

Shuffle Grouping:

“Tuples” from the previous bolt or spout are distributed randomly, so they can be processed by any instance of the current bolt.

Fields Grouping:

“Tuples” from the previous spout or bolt that share the same value of the grouping field will always go to the same instance of the current bolt.

Example: If you group by user_id, all posts from that user will always be processed by the same instance.
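The difference between the two groupings can be sketched in plain Java. This only illustrates the routing idea and assumes a simple hash-and-modulo scheme; it is not Storm's actual implementation, and `GroupingSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

class GroupingSketch {
    // Shuffle grouping: any instance of the bolt may receive the tuple.
    static int shuffleTarget(int numInstances) {
        return ThreadLocalRandom.current().nextInt(numInstances);
    }

    // Grouping by a field: the target instance is a function of the field's
    // value, so equal values always land on the same instance.
    static int fieldsTarget(String fieldValue, int numInstances) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(fieldValue.hashCode(), numInstances);
    }

    public static void main(String[] args) {
        int instances = 4;
        System.out.println("shuffle -> instance " + shuffleTarget(instances));
        // Both lines below print the same instance number
        System.out.println("user asdf123 -> instance " + fieldsTarget("asdf123", instances));
        System.out.println("user asdf123 -> instance " + fieldsTarget("asdf123", instances));
    }
}
```

Routing by user_id is what lets a bolt keep per-user state (such as a running count) in local memory without coordinating with other instances.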

Page 13: Real time stream processing presentation at General Assemb.ly

Example: Word counts topology

RandomSentenceSpout

SplitSentenceBolt

WordCountBolt

PrinterBolt

Topology that takes a stream of sentences and keeps printing out the number of occurrences of each word.
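Before looking at the Storm classes, the data flow of this topology can be simulated in a single thread of plain Java. This is only a sketch of what each stage does; `WordCountPipelineSketch` is a hypothetical name, and nothing here is distributed or uses Storm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountPipelineSketch {
    // Returns the lines the final "printer" stage would output, in order.
    static List<String> run(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();  // state of the counting stage
        List<String> printed = new ArrayList<>();
        for (String sentence : sentences) {             // spout: one sentence at a time
            for (String word : sentence.split(" ")) {   // split stage: sentence -> words
                int count = counts.merge(word, 1, Integer::sum); // count stage: running total
                printed.add(word + ": " + count);       // print stage
            }
        }
        return printed;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList("the cow", "the moon"))) {
            System.out.println(line);
        }
    }
}
```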

Page 14: Real time stream processing presentation at General Assemb.ly

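Example: RandomSentence Spout

// Storm 1.0+ package names; releases before 1.0 used backtype.storm.* instead
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random _rand;

    // Called once when the spout starts; keeps the collector used to emit tuples
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        _rand = new Random();
    }

    // Called repeatedly by Storm to ask the spout for the next tuple
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[]{
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        collector.emit(new Values(sentence)); // Sends the sentence to the next bolt
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // Declares what this spout emits
    }
}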

Page 15: Real time stream processing presentation at General Assemb.ly

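Example: SplitSentence Bolt

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentence extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            collector.emit(new Values(word)); // One output tuple per word
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}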

Page 16: Real time stream processing presentation at General Assemb.ly

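Example: WordCount Bolt

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCount extends BaseBasicBolt {
    // Running count per word, local to this bolt instance
    Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count = count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}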

Page 17: Real time stream processing presentation at General Assemb.ly

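Example: Printer Bolt

import org.apache.storm.topology.*;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PrinterBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        Integer count = tuple.getIntegerByField("count");
        System.out.println("Word count for " + word + ": " + count);
    }

    // Last bolt in the topology: it emits nothing, so it declares no fields
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}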

Page 18: Real time stream processing presentation at General Assemb.ly

Example: WordCount topology

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        //...
        TopologyBuilder builder = new TopologyBuilder();
        // The last argument is the parallelism hint for each component
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        // Group by "word" so each word is always counted by the same instance
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
        builder.setBolt("print", new PrinterBolt(), 8).shuffleGrouping("count");
        // ...
    }
}

Page 19: Real time stream processing presentation at General Assemb.ly

Let’s run this!

Page 20: Real time stream processing presentation at General Assemb.ly

Page 21: Real time stream processing presentation at General Assemb.ly

Lambda Architecture

● Architecture proposed by the creator of Storm, Nathan Marz (nathanmarz).

● Two processing layers:

  ○ Speed layer

    ■ This is the real time processing layer, performed by a framework like Storm.

    ■ This helps you give real time insights, but is complex and can potentially drop data.

  ○ Batch layer

    ■ This is the “after the fact” processing layer.

    ■ This will give you “delayed” insights, but it can be made as reliable as possible.
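One way to picture how the two layers work together is a query that combines both results. The sketch below is plain Java under stated assumptions: the batch layer has produced reliable counts up to its last run, the speed layer holds counts only for events since then, and all names (`LambdaQuerySketch`, `pageViews`, the sample pages and numbers) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class LambdaQuerySketch {
    // batchView: reliable counts computed by the batch layer up to its last run
    // speedView: approximate counts for events that arrived after that run
    static long pageViews(String page, Map<String, Long> batchView, Map<String, Long> speedView) {
        return batchView.getOrDefault(page, 0L) + speedView.getOrDefault(page, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("/home", 1000L);

        Map<String, Long> speedView = new HashMap<>();
        speedView.put("/home", 12L);
        speedView.put("/new", 3L); // page first seen after the last batch run

        System.out.println("/home: " + pageViews("/home", batchView, speedView));
        System.out.println("/new: " + pageViews("/new", batchView, speedView));
    }
}
```

When the next batch run finishes, its output replaces the batch view and the corresponding speed-layer counts can be discarded, which is how data dropped by the speed layer eventually gets corrected.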

Page 22: Real time stream processing presentation at General Assemb.ly

Lambda Architecture

Continuous Data Source

Speed Layer (eg Storm)

Permanent Archive

Batch Layer (eg Hadoop)

Database

Instant / Delayed visualization and alerts

Page 23: Real time stream processing presentation at General Assemb.ly

Apache Spark

● Apache Spark is a fast and general-purpose cluster computing system. It has MapReduce-like functionality.

● It is a batch based, data parallel system: the data is distributed to the workers in the cluster, and operations can be applied to it (similar to Hadoop).

● Spark Streaming uses microbatches (~1s of data).

● Spark Streaming + Spark = Lambda architecture with the same codebase.

Page 24: Real time stream processing presentation at General Assemb.ly

Spark Streaming: Simple Example

// Initialize connections and streams
...

// Read the lines
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD
wordCounts.print()

// Execute and close streams....
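The micro-batch idea behind this example can be sketched in plain Java, without Spark. The sketch assumes each incoming line carries an arrival time in milliseconds, cuts the stream into fixed-size batches, and counts words separately within each batch. `MicroBatchSketch`, `countPerBatch` and the sample data are hypothetical names and values for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MicroBatchSketch {
    // Returns: batch start time (ms) -> word counts for the lines in that batch
    static Map<Long, Map<String, Integer>> countPerBatch(
            List<Map.Entry<Long, String>> timedLines, long batchMillis) {
        Map<Long, Map<String, Integer>> batches = new TreeMap<>();
        for (Map.Entry<Long, String> line : timedLines) {
            // Round the arrival time down to the start of its batch
            long batchStart = (line.getKey() / batchMillis) * batchMillis;
            Map<String, Integer> counts =
                    batches.computeIfAbsent(batchStart, k -> new TreeMap<>());
            for (String word : line.getValue().split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> lines = new ArrayList<>();
        lines.add(new AbstractMap.SimpleEntry<>(100L, "a b a"));
        lines.add(new AbstractMap.SimpleEntry<>(900L, "b"));
        lines.add(new AbstractMap.SimpleEntry<>(1200L, "a"));
        // With 1-second batches, the first two lines share a batch
        System.out.println(countPerBatch(lines, 1000L));
    }
}
```

Counts are per batch, not running totals, which matches the "Count each word in each batch" comment in the Scala example.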

Page 25: Real time stream processing presentation at General Assemb.ly

Questions?