Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the...

43
Coroutines & Data Stream Processing An application of an almost forgotten concept in distributed computing Zbyněk Šlajchrt, [email protected], @slajchrt

Transcript of Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the...

Page 1: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Coroutines & Data Stream ProcessingAn application of an almost forgotten concept in distributed computing

Zbyněk Šlajchrt, [email protected], @slajchrt

Page 2: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

"Strange" Iterator ExampleCoroutinesMapReduce & CoroutinesClockworkConclusionQ&A

Agenda

Page 3: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Joining Multiple Iterators To One

Page 4: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Joining Iterators - A Traditional Approach

Page 5: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Joining Iterators - A Less-Traditional Approach

Page 6: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Joining Iterators To One - A Pipe

Page 7: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Joining Iterators To One - Coroutines

de.matthiasmann.continuations.CoIterator

Page 8: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Coroutines

Page 9: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

▪ A generalization of subroutines▪ AKA green threads, co-expressions, fibers,

generators

▪ Some may remember the old windows event loop▪ Behavior similar to that of subroutines ▪ Coroutines can call other coroutines▪ Execution may but may not later return to the point

of invocation▪ Often demonstrated on Producer/Consumer

scenario

Coroutines - Introduction

Page 10: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Coroutines Conceptual Example - Producer/Consumer (http://en.wikipedia.org/wiki/Coroutines)

Page 11: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Programming Language Support

▪ None of the top TIOBE languages (Java, C, C++, PHP, Basic)

▪ Go, Icon, Lua, Perl, Prolog, Ruby, Tcl, Simula, Python, Modula-2 …

Page 12: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Coroutines in Java

▪ Buffers, phases (BufferedJoinedIterator)

▪ Emulations by threads (PipeJoinedIterator)

▪ Byte-code manipulation (CoJoinedIterator)

▪ A need for JVM support, some future JSR?

▪ For some implementations see the references

▪ In this presentation I use Continuations library

developed by Matthias Mann (http://

www.matthiasmann.de/content/view/24/26/)

See: http://ssw.jku.at/General/Staff/LS/coro/CoroIntroduction.pdf

Page 13: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Coroutines Use Case Scenarios

▪ Iterators▪ Producer/Consumer chains▪ State machines▪ Visitors with loops instead of callbacks▪ Pull parsers▪ Loggers▪ Observers, listeners, notifications▪ Generally capable of converting PUSH algorithms

to PULL

Page 14: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

MapReduce & Coroutines

Page 15: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

▪ PULL, batch processing, offline▪ Two-phase computational mode: Map and Reduce▪ Map - filters, cleans or parses the input records▪ Reduce - aggregates the records obtained from Map▪ Easily distributable ▪ Inspired by functional programming▪ Benefits - scalable, thread-safe (no race conditions), simple

computational model▪ Rich libraries of algorithms - e.g. machine learning (Mahout)

Map-Reduce Overview

Page 16: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Map-Reduce Overview - Word Count example (http://en.wikipedia.org/wiki/MapReduce)

Page 17: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

▪ Could we write the same code for data streams?

▪ We could keep thinking in MR paradigm

▪ Perhaps, it would inherit the nice MR properties

▪ Sadly, streaming algorithms are inherently PUSH,

incremental or event-driven, i.e. callbacks instead of loops

▪ WordCount MR solution has 2 simple loops

▪ It is a typical Producer/Consumer problem

▪ Coroutines should help us!

Map-Reduce and Data Streams

Page 18: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

ClockworkAdoption of MR to stream data processing

See https://github.com/avast-open/clockwork

Page 19: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Clockwork - Adoption of MR for data stream processing

▪ Adoption of MR paradigm to data stream processing▪ Built on top of coroutines (Producer/Consumer)▪ Goes further than the original MR concepts▪ Easy for anyone familiar with Hadoop (or other) ▪ Open-sourced recently by AVAST▪ The core is practically production-ready (RC)▪ A lot of work to be done (networking, RPC)

Page 20: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Transformer 1Transformer 2Transformer 3

…Transformer NAccumulator

Execution:Input

Clockwork Execution Model

Page 21: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Reducer(loop)

Mapper(function)

Tank Partitioner

Transformer

Accumulator

Clockwork Execution Model - Transformers and Accumuators

Page 22: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Execution B

Execution A Execution A

Clockwork Distributed Execution

Page 23: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Reducer

Mapper▪ filtering▪ transforming▪ expanding

▪ aggregations▪ a coroutine component (loop)

Tank

Partitioner

▪ key-value storage▪ key-values storage▪ flushing buffer

▪ routing output to another nodes▪ partitioning▪ broadcasting

Clockwork Components

Page 24: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

WordCoun

ter

TableAccum

Construction:

Feeding:

Word Counter in Clockwork

Note: The program must run with -javaagent:Continuations.jaror transformed by means of the Continuation ANT task. See http://www.matthiasmann.de/content/view/24/26/

Page 25: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Word Counter Mapper And Reducer

Page 26: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

Counting

TableAccum

"Counting words cannot be easier"

words cannot be easier

WordCounter reducer instances (one coroutine instance per key)

Reducer Per-Key Instances

Page 27: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

WordCoun

ter

TableAccum

Word Counter on One Node

Page 28: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

Router

WordCoun

ter

TableAccum

WordSplitter

Router

f f

2*f

Distributed Word Counter- Many Mappers One Reducer

Page 29: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

WordCoun

ter

Router

WordSplitter

WordCoun

ter

Router

WordCoun

ter

TableAccum

f f

a*f0 < a <= 2

Distributed Word Counter- Many Mappers with Combiners One Reducer

Page 30: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

WordSplitter

WordCoun

ter

Partitioner

WordSplitter

WordCoun

ter

Partitioner

WordCoun

ter

TableAccum

f f

a*f0 < a <= 2

WordCoun

ter

TableAccum

Distributed Word Counter- Many Mappers with Combiners Many Reducers

Page 31: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Reduce-only Setup - The "Nerdiest" clock in the world

MinuteWheel

HourWheel

SecWheel

1-msec ticks

DayWheel

DummyAccum

Construction:

Feeding:

Page 32: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Reduce-only Setup - Reducer as a Clock Wheel

Page 33: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

HttpRequestDecoder

MyHandler

HttpResponseEncoder

ChannelWriter

Construction:

Feeding:

Map-only Setup - HTTP pipeline

Type-safe execution construction:

Page 34: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

HttpRequestDecoder

MyHandler

HttpResponseEncoder

ChannelWriter

Construction:

Feeding:

HttpChunkAggregator

Many-Maps-One-Reduce Setup - HTTP pipeline

Page 35: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Naive Bayes Classification Map Reduce Job (http://nickjenkin.com/blog/?p=85)

Page 36: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Naive Bayes Classification - Mapper

Page 37: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Naive Bayes Classification - Reducer

Page 38: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

InstanceMapper

InstanceReducer

StatAccumulator

learning - (weight, height, foot) -> sex

guessing - (weight, height, foot) -> ?

Naive Bayes Classification - Deployment

Page 39: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Conclusion

Page 40: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

▪ Distributed stream processing easier

▪ Some techniques known from offline MR can be

adopted more or less directly

▪ Requires incremental algorithms and models

▪ Many deployment options, flushing strategies

▪ A lot of work to be done: communication protocol,

machine learning and statistics algorithms,

management tools, documentation ...

Conclusion

Page 41: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

Thanks for your attention!

Q&A

Page 42: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

This presentation deals with the concept of coroutines and its applicability in the world of stream data processing. Although it is rarely used in the todays applications, the coroutines have been here since the early days of digital computing. Surprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big data processing. In contrast to the traditional map-reduce concept, which is designed for offline job processing, the coroutines&map-reduce hybrid is primarily targeted at real-time event processing. Clockwork, an open-source library developed at Avast, combines these two concepts and allows a programmer to write a real-time stream analysis as if he wrote a traditional map-reduce job for Hadoop, for instance. The presentation is focused mainly on coding and samples and will show how to program applications ranging from simple real-time statistics to more advanced tasks.

Abstract

Page 43: Coroutines & Data Stream Processing - JavaSurprisingly, coroutines can be nicely combined with the map-reduce paradigm that is used frequently in the world of cloud computing and big

References

1. http://www.matthiasmann.de/content/view/24/26/

2. http://ssw.jku.at/General/Staff/LS/coro/CoroIntroduction.pdf

3. http://en.wikipedia.org/wiki/Coroutine

4. http://nickjenkin.com/blog/?p=85

5. http://code.google.com/p/moa/

6. http://www2.research.att.com/~marioh/papers/vldb08-2.pdf

7. http://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759