Real time stream processing presentation at General Assemb.ly
Real Time Stream Processing
Varun Vijayaraghavan
Why real time?
We live in a world of continuous data.
● Server
● Page Views
● Social Media Events (Images, Comments, etc)
● Sensor Data
● Instant Messaging
● Market Data
Why real time?
● What’s true right now may not have been true in the past.
● Competition is fierce. We need to act before others to win.
● In some cases, “too late” can lead to losses you would rather avoid. :(
● Need to make decisions NOW!
What does such a system look like?
Continuous Data Source → Data Ingestion → Real Time Stream Processors → Instant Visualization / Insights / Alerts
(messages m1, m2, ...)
Properties of real time stream processing systems
● Fast! It needs to keep up.
● Scalable - expand your cluster as data and computation needs grow larger.
● Fault tolerant. It should be able to recover from failure.

Important practical concerns:
● Easy to set up and write applications with.
● Needs to be battle tested. (Any distributed system is extremely hard to get right!)
● Needs to have excellent monitoring and tooling capability.
Practical use cases
● “Real time” predictive analytics for news media (what I do :)
○ Analyze clicks, page views, user behavior, social streams, video views etc. to provide instant insights.
● Cloud based home security systems
○ Analyze sensor data like temperature, video streams, motion sensors etc. to determine threats.
● Trading shares
○ Use real time market data (and other sources) to make instant decisions (buy, sell, long, short etc.)
○ Check out reactivetrader.com
A small aside on parallelism

From Wikipedia: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.”

Simple example: think of crawling / scraping thousands of web pages one after another vs. doing them in parallel.

Two types of parallelism: task parallelism (Apache Storm) and data parallelism (Hadoop, Apache Spark) - although most distributed stream processing systems combine both.
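As a plain-Java illustration of the two styles (this is not Storm or Spark API, just a stdlib sketch): task parallelism runs different operations concurrently, while data parallelism applies the same operation to elements of the data set in parallel.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelismDemo {
    // Data parallelism: the SAME operation (uppercasing) applied to every
    // element, with elements processed on different threads.
    static List<String> uppercaseAll(List<String> pages) {
        return pages.parallelStream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = Arrays.asList("page1", "page2", "page3", "page4");

        // Task parallelism: DIFFERENT operations (counting vs joining)
        // run concurrently, each over the whole data set.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Integer> count = pool.submit(() -> pages.size());
        Future<String> joined = pool.submit(() -> String.join(",", pages));
        System.out.println("count=" + count.get() + " joined=" + joined.get());
        pool.shutdown();

        System.out.println(uppercaseAll(pages));
    }
}
```

The crawling example above is the data-parallel case: the same fetch-and-parse task applied to each page.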
Task Parallelism
Streaming Data Source
Task 1 Task 2
Task 3
Task 4
Data Parallelism
Streaming Data Source
Data 1 Data 2 Data n
Task Task Task
Apache Storm
● Apache Storm is a distributed, real time, stream processing system. It uses task based parallelism.
● From the storm wiki: “Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.”
● It’s fast (runs on the JVM), scalable, robust and fault tolerant. Also, pretty easy to set up and has great tooling.
Components of Storm

Spout: The data source (twitter posts, page views data etc).
Bolt: A processing task. (word-counter, category-classifier etc)
Topology: This is a graph that consists of one or more spouts (data source) and one or more bolts. Each bolt and spout can have several “workers” executing them in parallel.
Tuple: The “message” or “data” that’s passed between bolts and spouts. For instance: (“social-network”: “twitter”, “type”: “retweet”, “post”: “Hi Storm!”, “user_id”: “asdf123”, “post_id”: “1234”, “posted_at”: 2012-07-14T01:00:00)
Storm Architecture
Shuffle Grouping:
“Tuples” from the previous bolt or spout can be processed by any instance of the current bolt.
Fields Grouping:

“Tuples” from the previous spout or bolt will always go to the same instance of the bolt - determined by the value of the grouping field.
Example: If you group by user_id - all posts from that user will always be processed by the same instance.
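Fields grouping can be thought of as hash partitioning: the grouping field’s hash, modulo the number of bolt instances, picks the instance. A minimal plain-Java sketch of the idea (not Storm’s actual implementation):

```java
import java.util.Arrays;
import java.util.List;

public class FieldsGroupingSketch {
    // Picks which of numTasks bolt instances receives a tuple,
    // based only on the value of the grouping field (e.g. user_id).
    static int chooseTask(String groupingFieldValue, int numTasks) {
        return Math.floorMod(groupingFieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<String> userIds = Arrays.asList("asdf123", "qwer456", "asdf123");
        for (String userId : userIds) {
            System.out.println(userId + " -> task " + chooseTask(userId, numTasks));
        }
        // "asdf123" always hashes to the same task, so all of that user's
        // posts are processed by the same bolt instance.
    }
}
```

Because the mapping depends only on the field value, it is deterministic: the same user_id always lands on the same instance, which is what makes per-user state (counts, sessions) safe to keep locally in a bolt.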
Example: Word counts topology
RandomSentenceSpout
SplitSentenceBolt
WordCountBolt
PrinterBolt
Topology that takes a stream of sentences and keeps printing out the number of occurrences of each word.
Example: RandomSentence Spout

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random _rand;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        _rand = new Random();
    }

    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[]{
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        collector.emit(new Values(sentence)); // Sends the sentence to the next bolt
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // Declares what this spout emits
    }
}
Example: SplitSentence Bolt

public class SplitSentence extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector collector) {
String sentence = tuple.getStringByField("sentence");
String[] words = sentence.split(" ");
for(String word: words) {
collector.emit(new Values(word));
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
Example: WordCount Bolt

public class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count = count + 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
Example: Printer Bolt

public class PrinterBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        Integer count = tuple.getIntegerByField("count");
        System.out.println("Word count for " + word + ": " + count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt only prints - it emits nothing downstream
    }
}
Example: WordCount topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        // ...
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
        builder.setBolt("print", new PrinterBolt(), 8).shuffleGrouping("count");
        // ...
    }
}
Let’s run this!
…
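You don’t need a Storm cluster to see what the topology computes. As a sanity check, here is the same spout → split → count → print flow collapsed into plain Java (no Storm APIs, no parallelism):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSimulation {
    // Mirrors SplitSentence + WordCount: split each sentence into words
    // and accumulate a running count per word.
    static Map<String, Integer> countWords(String[] sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
        // Mirrors PrinterBolt: print the count for each word.
        countWords(sentences).forEach((word, count) ->
            System.out.println("Word count for " + word + ": " + count));
    }
}
```

What Storm adds on top of this loop is exactly the parallelism and fault tolerance discussed earlier: many spout and bolt instances, with fields grouping keeping each word’s count on one instance.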
Lambda Architecture
● Architecture proposed by the creator of Storm (nathanmarz)
● Two processing layers:
○ Speed layer
■ This is the real time processing layer, performed by a framework like Storm.
■ This helps you give real time insights, but is complex and can potentially drop data.
○ Batch layer
■ This is the “after the fact” processing layer.
■ This will give you “delayed” insights, but it can be made as reliable as possible.
Lambda Architecture

Continuous Data Source → Speed Layer (eg Storm) → Database
Continuous Data Source → Permanent Archive → Batch Layer (eg Hadoop) → Database
Database → Instant / Delayed visualization and alerts
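At query time the two layers are merged: the batch view is complete but stale, and the speed layer’s realtime view covers only the events since the last batch run. A schematic plain-Java sketch of that merge (the views here are hypothetical in-memory maps, not a real serving database):

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaQuerySketch {
    // Merge a (complete but stale) batch view with a (fresh but partial)
    // realtime view: the answer for each key is the sum of both.
    static Map<String, Integer> merge(Map<String, Integer> batchView,
                                      Map<String, Integer> realtimeView) {
        Map<String, Integer> merged = new HashMap<>(batchView);
        realtimeView.forEach((key, value) -> merged.merge(key, value, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> batch = new HashMap<>();
        batch.put("page_views", 1000); // computed by the batch layer (e.g. Hadoop)
        Map<String, Integer> speed = new HashMap<>();
        speed.put("page_views", 42);   // events seen since the last batch run
        System.out.println(merge(batch, speed)); // prints {page_views=1042}
    }
}
```

If the speed layer drops data, the error is bounded: the next batch run recomputes from the permanent archive and replaces the stale part of the answer.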
Apache Spark
● Apache Spark is a fast and general-purpose cluster computing system. It has MapReduce-like functionality.
● It is a batch based, data parallel system - the data is distributed to the workers in the cluster, and operations are applied to the partitions (similar to Hadoop).
● Spark Streaming uses microbatches (~1s of data).
● Spark Streaming + Spark = Lambda architecture with the same codebase.
Spark Streaming: Simple Example

// Initialize connections and streams
...
// Read the lines
val lines = ssc.socketTextStream("localhost", 9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD
wordCounts.print()
// Start the computation and wait for it to finish
ssc.start()
ssc.awaitTermination()
Questions?