Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big...
![Page 1: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/1.jpg)
Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas Behrend
![Page 2: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/2.jpg)
Data Stream
• A data stream is a sequence of data tuples.
• Think of standard tuples of relational databases, with time information (timestamps).
• Tuples are generated one after the other, or in batches.
• That means data is moving: continuously generated (assumed infinite!), potentially at a high pace.
• The system has to process data without first storing everything (how would that be possible anyway if the stream is infinite?!)
![Page 3: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/3.jpg)
Sensor Networks as Data Streams Origin • E.g., in Environmental Monitoring
StationStream(timestamp, humidity, solarRadiation, windSpeed, snowHeight)
• Various application
scenarios: – avalanche risk level
computation – insights for agriculture – air pollution (urban)
monitoring
![Page 4: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/4.jpg)
Sample Application • The Pothole Patrol • Detecting and reporting the surface conditions
of roads; using sensors in vehicles • Using 3-axis accelerometer+GPS + learning
Eriksson et al. The Pothole Patrol: Using a Mobile Sensor Network for Road Surface Monitoring. MobiSys 2008.
![Page 5: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/5.jpg)
Sample Application • Environmental monitoring • Sensor data management
and meta data sharing. • Across many different types
of measurement: (hydrology, alpine monitoring, atmospheric phenomena, earthquakes, …)
• Also higher level applications like putting sensors and interpretations on maps, computing statistics over streams.
http://www.swiss-experiment.ch
![Page 6: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/6.jpg)
Earthquake News on Twitter
![Page 7: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/7.jpg)
Earthquake News on Twitter
![Page 8: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/8.jpg)
Earthquake News on Twitter
![Page 10: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/10.jpg)
Classic Example: Stock Market
• Real-time analysis of stock market changes
• Computing statistics over streams, e.g., for decision support
• Opportunities for reacting in real time
• Even with fully automated means: algorithmic trading
![Page 11: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/11.jpg)
So Far: Databases/NoSQL Datastores
• Data is changing, yes, but mostly due to inserts and updates to stored data items
• Historic data is kept
• Queries operate on the full data (tables)
• MapReduce is the extreme case: write once, read many times
• Data warehousing, too: periodically loading data into a store for deep(er) analytics
• Data mining
![Page 12: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/12.jpg)
Traditional Data Management …
• At query time, data is accessed as a whole • Data is persistently stored • Queries are ad-hoc (mainly)
[Figure: a data base/store that receives inserts, updates, and deletes, and answers queries with results]
![Page 13: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/13.jpg)
Traditional Data Management vs. Data Stream Mgmt
Set of queries
DATA STREAM
• Data is moving! Continuously generated (assumed infinite!) • At high pace • Queries are (mainly) continuous (aka. standing). Registered
once, observed “forever”. • Answer to queries in (near) real-time required (often) • Probabilistic methods for efficiency or considering only part of
the stream (sliding window)
![Page 14: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/14.jpg)
DBMS vs. DSMS

| Database management system (DBMS) | Data stream management system (DSMS) |
|---|---|
| Persistent data (relations) | Volatile data streams |
| Random access | Sequential access |
| One-time queries | Continuous queries |
| (Theoretically) unlimited secondary storage | Limited main memory |
| Only the current state is relevant | Consideration of the order of the input |
| Relatively low update rate | Potentially extremely high update rate |
| Little or no time requirements | Real-time requirements |
| Assumes exact data | Assumes outdated/inaccurate data |
| Plannable query processing | Variable data arrival and data characteristics |

http://en.wikipedia.org/wiki/Data-stream_management_system
![Page 15: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/15.jpg)
Data Stream Model
• The stream of data items is unbounded (available memory is not)
• No way to store the entire stream (how could we, it's (probably) not ending)
• To compute query results, we need to devise algorithms with little memory consumption
![Page 16: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/16.jpg)
Overview of Data Stream Topics • Synopses:
– concise representations of stream content – tailored to tasks, e.g., counting distinct elements – usually not exact, but approximations (estimators) of
true values. – generally useful for representing data compactly – We will look at some of them today
• (Sliding) Windows:
– focus on a certain recent subset of the data
– computation of functions/joins over the content of window(s)
– We will look at the CQL language: think “SQL” for streaming data
![Page 17: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/17.jpg)
Data Stream Mining: Teasers
• I tell you integer numbers between 1 and N • I will tell all but one number
• After N-1 numbers I ask: which number was missing?
481 324 122 412 871 231 849 447 641 …
![Page 18: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/18.jpg)
![Page 19: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/19.jpg)
Data Stream Mining: Teasers (Cont’d) • Keep Boolean array of length N:
– Mark position for observed number – Size required: N – Computation at end: N to find missing number
• Much better: – keep sum of numbers: S – Missing number is N*(N+1)/2 - S
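The running-sum idea can be sketched in a few lines of Python. The function name `find_missing` and the sample stream are illustrative assumptions, not part of the lecture; the point is that only one number (the sum S) is kept while the stream passes by.

```python
from typing import Iterable


def find_missing(stream: Iterable[int], n: int) -> int:
    """Return the one number from 1..n that never appears in the stream.

    Only a single running sum is kept, instead of a Boolean array of length n.
    """
    s = 0
    for x in stream:
        s += x  # constant work per arriving item
    return n * (n + 1) // 2 - s


# Hypothetical stream: the numbers 1..10 in arbitrary order, with 7 left out.
print(find_missing([3, 9, 1, 10, 4, 6, 2, 8, 5], 10))  # prints 7
```

Compared with the Boolean array, no scan over N positions is needed at the end; the answer is a single subtraction.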
![Page 20: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/20.jpg)
![Page 21: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/21.jpg)
Counting Occurrences • Consider a stream of elements ai
…, a2, a84, a41, a2, a77, a231, a2, a4, a54, … • How often does a2 occur?
• How to implement? • Keep a counter for each id • Required space: #ids (=N) • Not feasible if N is very large
![Page 22: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/22.jpg)
Probabilistic Counting: Count-Min Sketch
Cormode, Muthukrishnan (2004). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55: 29–38.
• Keep a 2-dimensional array with h rows and r columns
• h hash functions h_j, each mapping to the range 0…(r-1)
• For an arriving item x: for each j, array[j, h_j(x)]++
[Figure: empty array with rows h1 to h4 and columns 0 to 5]
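A minimal Python sketch of this data structure follows. The class name, the default sizes, and the use of salted SHA-256 digests as stand-ins for the h hash functions are assumptions made for illustration; the slides do not prescribe a particular hash family. The `estimate` method returns the smallest of the counters an item maps to, which is what gives the sketch its name.

```python
import hashlib


class CountMinSketch:
    """Count-Min sketch with h rows (hash functions) and r columns."""

    def __init__(self, h: int = 4, r: int = 64) -> None:
        self.h = h
        self.r = r
        self.table = [[0] * r for _ in range(h)]

    def _index(self, j: int, item: str) -> int:
        # Assumption: salting SHA-256 with the row number j stands in for
        # h different hash functions h_j that map into 0..(r-1).
        digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.r

    def add(self, item: str) -> None:
        # For each j: array[j, h_j(x)]++
        for j in range(self.h):
            self.table[j][self._index(j, item)] += 1

    def estimate(self, item: str) -> int:
        # Colliding items can only inflate a counter, so the smallest of the
        # h counters is the tightest estimate the sketch can give.
        return min(self.table[j][self._index(j, item)] for j in range(self.h))


sketch = CountMinSketch(h=4, r=64)
for x in ["a", "b", "a", "a", "c", "a", "c"]:
    sketch.add(x)
print(sketch.estimate("a"))  # at least 4, the true count of "a"
```

The memory footprint is h * r counters, independent of how many distinct ids occur in the stream.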
![Page 23: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/23.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, ….
Consider the following 4 hash functions, for ease of use displayed by their values when applied to a, b, or c:

| x | h1(x) | h2(x) | h3(x) | h4(x) |
|---|---|---|---|---|
| a | 4 | 5 | 0 | 2 |
| b | 3 | 5 | 1 | 3 |
| c | 2 | 2 | 0 | 3 |
| … | … | … | … | … |

[Figure: empty array with rows h1 to h4 and columns 0 to 5]
![Page 24: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/24.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, …. Inserted so far: a

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 0 | 1 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 1 |
| h3 | 1 | 0 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 1 | 0 | 0 | 0 |
![Page 25: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/25.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, …. Inserted so far: a, b

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 1 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 2 |
| h3 | 1 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 1 | 1 | 0 | 0 |
![Page 26: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/26.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, …. Inserted so far: a, b, a

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 2 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 3 |
| h3 | 2 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 2 | 1 | 0 | 0 |
![Page 27: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/27.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, …. Inserted so far: a, b, a, a

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 3 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 4 |
| h3 | 3 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 3 | 1 | 0 | 0 |
![Page 28: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/28.jpg)
Count-Min Sketch: Insert Example
Data stream is a, b, a, a, c, a, c, …. Inserted so far: a, b, a, a, c

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 1 | 1 | 3 | 0 |
| h2 | 0 | 0 | 1 | 0 | 0 | 4 |
| h3 | 4 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 3 | 2 | 0 | 0 |

Imagine this continues for a while; we might then end up with …
![Page 29: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/29.jpg)
Count-Min Sketch: Counting
• How often did we see item a? • Recall the hash function values for a:
h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2
![Page 30: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/30.jpg)
Count-Min Sketch: Counting
Is this estimator generally underestimating or overestimating or can’t we say anything about that?
![Page 31: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/31.jpg)
Count-Min Sketch: Counting
• The estimate never underestimates the true count • The overestimation is probabilistically bounded
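To see why the estimate cannot undershoot, the insert example can be replayed in Python. The hash values for a, b, and c are the ones from the slides; processing exactly the seven listed stream items and the helper name `estimate` are assumptions of this sketch. The estimate is taken as the minimum over the counters an item hashes to.

```python
from collections import Counter

# Hash values from the slides: HASHES[x][j] is the column chosen by h_(j+1)(x).
HASHES = {
    "a": (4, 5, 0, 2),
    "b": (3, 5, 1, 3),
    "c": (2, 2, 0, 3),
}
ROWS, COLS = 4, 6  # four hash functions, columns 0..5

table = [[0] * COLS for _ in range(ROWS)]
stream = ["a", "b", "a", "a", "c", "a", "c"]

for x in stream:
    for j, col in enumerate(HASHES[x]):
        table[j][col] += 1  # array[j, h_j(x)]++


def estimate(x: str) -> int:
    # Take the minimum over the counters that x hashes to.
    return min(table[j][col] for j, col in enumerate(HASHES[x]))


true_counts = Counter(stream)
for x in sorted(HASHES):
    # Every counter for x was incremented once per occurrence of x, so the
    # minimum can never fall below the true count; collisions only add.
    print(x, "estimate:", estimate(x), "true:", true_counts[x])
```

Here a and c collide in row h3 (both map to column 0), so that counter reads 6, but the other rows keep the minimum at the true counts of 4 and 2.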
![Page 32: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/32.jpg)
Continuous Queries
![Page 33: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/33.jpg)
Data Stream Model • A stream S is a (possibly) infinite bag (multiset)
of elements <s,τ> where s is a tuple belonging to the schema of S and τ is the timestamp of the element.
• Think: tuples of a relational DBMS extended
with timestamp, streaming in.
![Page 34: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/34.jpg)
Data Streams: Example • Monitoring of highway traffic:
PosSpeedStr(vehicleId, speed, xPos, dir, hwy)
• E.g., for: – congestion prediction/warning – estimates of travel time – toll collection – tickets for speeding
![Page 35: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/35.jpg)
Data Streams: Example • Environmental Monitoring StationStream(humidity, solarRadiation, windSpeed,
snowHeight) • Various application
scenarios: – avalanche risk level
computation – insights for agriculture – air pollution (urban)
monitoring
![Page 36: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/36.jpg)
Continuous Queries • In contrast to ad-hoc, single time queries in
(relational) DBMS. • Queries over Streams are considered continuous:
registered once, run “forever”: – “want to stay updated on the avalanche risk, not just check once” • Also called standing queries or subscriptions (in
publish/subscribe context) • For instance:
– Compute average temperature. – Select all orders of stock “Apple” with quantity larger
than 100.
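One way to picture a standing query is a Python generator that is set up once and then emits a result whenever a matching tuple arrives. The field names `stock` and `quantity`, the function name, and the sample orders are assumptions made for this sketch; a real stream would not be a finite list.

```python
from typing import Any, Dict, Iterable, Iterator

Order = Dict[str, Any]


def large_apple_orders(orders: Iterable[Order]) -> Iterator[Order]:
    """Standing query: registered once, emits results as tuples arrive."""
    for order in orders:  # with a real stream this loop would never end
        if order["stock"] == "Apple" and order["quantity"] > 100:
            yield order


# Hypothetical orders standing in for an unbounded stream.
sample = [
    {"stock": "Apple", "quantity": 50},
    {"stock": "Apple", "quantity": 250},
    {"stock": "IBM", "quantity": 500},
    {"stock": "Apple", "quantity": 101},
]
for hit in large_apple_orders(sample):
    print(hit)
```

A selection like this needs no state at all. The average-temperature query is different: it has to remember something about past tuples.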
![Page 37: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/37.jpg)
![Page 38: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/38.jpg)
What and How can we Compute DB-Style Queries?
• How to compute average values over an infinite stream? Block forever?
• How to join infinite streams if join partners can
arbitrarily arrive (or not)?
• Idea: keep a window that turns a continuous (infinite) stream into a snapshot, i.e., a static relation
![Page 39: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/39.jpg)
Sliding Window Concept
• Focus attention on the latest values of the stream
• Allows computation of aggregates
• Joins are computed across windows laid over other (or the same) streams
[Figure: time axis with past data, current data, and future data; the current window defines the data]
![Page 40: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/40.jpg)
Sliding Window: Example
• Window of size W – based on time (=> time-
based) – or number of tuples inside
(count-based)
• Shifted every t by B
18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C ….
![Page 41: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/41.jpg)
Sliding Window Aggregates
• Output average for each window when it slides.
• Here: – 17.6°C – 26.3°C – 19.1°C
18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C ….
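The three averages can be reproduced with a count-based window. The slide does not state the window parameters; a window of W = 4 readings shifted by B = 3 readings is an assumption inferred from the numbers, and the function name is illustrative.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_averages(stream: Iterable[float], size: int, slide: int) -> Iterator[float]:
    """Yield the average of a count-based window of `size` items,
    re-evaluated every `slide` arrivals once the window has filled."""
    window = deque(maxlen=size)  # the oldest item drops out automatically
    since_last_output = 0
    for value in stream:
        window.append(value)
        since_last_output += 1
        if len(window) == size and since_last_output >= slide:
            yield sum(window) / size
            since_last_output = 0


temps = [18.3, 13.5, 27.0, 11.6, 29.6, 39.7, 24.2, 11.5, 12.7, 27.9]
averages = list(sliding_averages(temps, size=4, slide=3))
print([round(a, 1) for a in averages])  # [17.6, 26.3, 19.1], as on the slide
```

Only the current window content is held in memory, no matter how long the stream runs.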
![Page 42: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/42.jpg)
Sliding Window Joins
• Join is executed over individual window contents.
window 2
window 1 stream 1
stream 2
![Page 43: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/43.jpg)
Types of Sliding Windows
• Time-based window
– the window contains the tuples within a certain time range; e.g., tweets of the last 10 minutes, stock market values of the last 10 seconds
– the number of tuples in the window can change arbitrarily as the input rate changes
• Count-based window
– the window contains a fixed number of items at any time, say, the last 100 tweets or the last 10000 stock trades
– newly arriving items evict older ones (once the window is filled up), depending on the strategy (next slide)
![Page 44: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/44.jpg)
Types of Sliding Windows (Cont’d)
• Sliding window: the window moves on certain ticks/times, continuously or in blocks
• Tumbling window: a new window is created for each time range of size W (i.e., windows do not overlap)
• At each slide/“tumble”, a function can be applied to the window content and the result output
• This is also called a “trigger”.
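A time-based tumbling window with a trigger function might look as follows in Python. This is a sketch under stated assumptions: elements are (timestamp, value) pairs arriving in timestamp order, windows are aligned to multiples of the width, and time ranges without any tuples produce no output. The function and parameter names are illustrative.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def tumbling_windows(
    stream: Iterable[Tuple[float, float]],
    width: float,
    trigger: Callable[[List[float]], float],
) -> Iterator[Tuple[float, float]]:
    """Group (timestamp, value) pairs into non-overlapping time ranges of
    length `width` and apply `trigger` to each window when it closes."""
    current_start = None
    values: List[float] = []
    for ts, value in stream:
        start = (ts // width) * width  # start of the range this tuple falls in
        if current_start is not None and start != current_start:
            yield current_start, trigger(values)  # the window "tumbles"
            values = []
        current_start = start
        values.append(value)
    if values:
        yield current_start, trigger(values)  # flush the last, still open window


# Hypothetical readings: (timestamp in seconds, value).
readings = [(0, 1.0), (3, 2.0), (9, 3.0), (10, 4.0), (25, 5.0)]
print(list(tumbling_windows(readings, width=10, trigger=max)))
# [(0, 3.0), (10, 4.0), (20, 5.0)]
```

Passing a different `trigger`, such as `sum` or an averaging function, changes what is output at each tumble without touching the windowing logic.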
![Page 45: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/45.jpg)
Overview of DSMSs
• STREAM (Stanford University), Aurora (Brandeis/Brown/MIT), TelegraphCQ (UC Berkeley), Cayuga (Cornell), PIPES (Uni Marburg), …
• Large interest also from companies/startups: Oracle, Microsoft, IBM, StreamBase
• Lately, open-source systems for distributed big data streams: Yahoo! S4, Twitter Storm (will see in more detail later)
![Page 47: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/47.jpg)
STREAM • Stanford Stream Data Manager • “General purpose” DSMS for streams and stored
data
• CQL: Declarative query language to phrase
continuous queries (SQL like).
Arvind Arasu et al. : STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
![Page 48: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/48.jpg)
Continuous Query Language – CQL (within the STREAM framework)
SQL with:
– Streams
– Windows
– New semantics (stream)
• Three relation-to-stream operators: Istream, Dstream, Rstream
– Sampling
Slide based on material from Jennifer Widom.
A. Arasu, S. Babu, J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. http://ilpubs.stanford.edu:8090/758/1/2003-67.pdf
![Page 49: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/49.jpg)
Example Query 1 • Two streams:
– Orders (orderID, customer, cost) – Fulfillments (orderID, clerk)
• Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
SELECT sum(O.cost)
FROM Orders O [Range 1 Day], Fulfillments F [Range 1 Day]
WHERE O.orderID = F.orderID AND F.clerk = "Sue" AND O.customer = "Joe"
![Page 50: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/50.jpg)
Example Query 2 • Using a 10% sample of the fulfillments stream,
take the 5 most recent fulfillments for each clerk and return the maximum cost
SELECT F.clerk, max(O.cost)
FROM Orders O, Fulfillments F [PARTITION BY clerk ROWS 5] 10% SAMPLE
WHERE O.orderID = F.orderID
GROUP BY F.clerk
![Page 51: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/51.jpg)
CQL: Relations and Streams • T: discrete, ordered time domain
• A relation R is a mapping from the time domain T to a bag of tuples belonging to the schema of R.
• That is, R(t) varies over time
• A stream is a set of (tuple, timestamp) elements
![Page 52: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/52.jpg)
[Figure: how CQL maps between streams and relations]
• Streams → Relations: window specification
• Relations → Relations: any relational query language
• Relations → Streams: special operators Istream, Dstream, Rstream
Slide based on material from Jennifer Widom.
![Page 53: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/53.jpg)
Stream → Relation
• S [W] is a relation: at time T it contains all tuples in window W applied to stream S, up to time T.
• When W = ∞, it contains all tuples in stream S up to time T
• Ways to construct these windows “[W]”:
– Time-based
– Tuple-based
– Partitioned
Slide based on material from Jennifer Widom.
![Page 54: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/54.jpg)
Time-Based Window • S [Range T]
– S [Now] – S [Range Unbounded]
Examples: • PosSpeedStr [RANGE 30 Seconds] • PosSpeedStr [NOW] • PosSpeedStr [RANGE Unbounded]
Note: variable number of records in the window
Stream with vehicle data on highway: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
Slide based on material from Jennifer Widom.
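The three time-based window forms can be mimicked with one small Python function. This is only a sketch: stream elements are modelled as (tuple, timestamp) pairs with timestamps in seconds, the lower window boundary is assumed to be inclusive, and the function name and sample data are made up for illustration.

```python
import math
from typing import List, Tuple

Element = Tuple[dict, float]  # (tuple, timestamp in seconds)


def range_window(stream: List[Element], now: float, t_range: float) -> List[dict]:
    """Relation S [Range T] at time `now`: tuples with now - T <= ts <= now."""
    return [s for s, ts in stream if now - t_range <= ts <= now]


# Hypothetical PosSpeedStr elements (only two attributes shown).
pos_speed_str = [
    ({"vehicleId": 1, "speed": 60}, 0),
    ({"vehicleId": 2, "speed": 70}, 10),
    ({"vehicleId": 1, "speed": 66}, 40),
    ({"vehicleId": 3, "speed": 50}, 40),
]

print(len(range_window(pos_speed_str, 40, 30)))        # [RANGE 30 Seconds]: 3
print(len(range_window(pos_speed_str, 40, 0)))         # [NOW]: 2
print(len(range_window(pos_speed_str, 40, math.inf)))  # [RANGE Unbounded]: 4
```

The differing result sizes illustrate the note on the slide: a time-based window holds a variable number of records.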
![Page 55: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/55.jpg)
Tuple-Based Window • S [Rows N]
– If tuples form a partial order, ties are broken arbitrarily
– [Rows Unbounded]
Example: • PosSpeedStr [ROWS 1]
Stream with vehicle data on highway: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
Slide based on material from Jennifer Widom.
![Page 56: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/56.jpg)
Partitioned Windows
• S [Partition By A1,...,Ak Rows N]
1. Logically partition S into substreams by the attributes A1,...,Ak (compare to SQL GROUP BY)
2. Compute a tuple-based sliding window [Rows N] on each substream
3. Take the union of these windows
Example:
• PosSpeedStr [PARTITION BY vehicleId ROWS 1]
Stream with vehicle data on highway: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
Slide based on material from Jennifer Widom.
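The three steps of a partitioned window translate almost directly into Python. In this sketch the function name and the sample tuples are illustrative assumptions, and partitioning is done on a single attribute for brevity.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List


def partitioned_rows(stream: Iterable[dict], key: str, n: int) -> List[dict]:
    """Relation S [Partition By key Rows n] after the given input was seen."""
    substreams: Dict[object, Deque[dict]] = defaultdict(lambda: deque(maxlen=n))
    for tup in stream:
        # Steps 1 and 2: route the tuple to its substream and keep only the
        # last n tuples there (deque drops the oldest one automatically).
        substreams[tup[key]].append(tup)
    # Step 3: union of the per-substream windows.
    return [tup for window in substreams.values() for tup in window]


# Hypothetical PosSpeedStr tuples (only two attributes shown).
pos_speed_str = [
    {"vehicleId": 1, "speed": 60},
    {"vehicleId": 2, "speed": 70},
    {"vehicleId": 1, "speed": 66},
]
latest = partitioned_rows(pos_speed_str, "vehicleId", 1)
print(latest)  # the most recent tuple of each vehicle
```

With ROWS 1 the result is the latest position report per vehicle, which is what the example window on the slide expresses.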
![Page 57: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/57.jpg)
Relation → Relation
• The previous window transform yields a relation, so now we can apply any query expressed in SQL
– the only difference is that we now deal with time-varying relations
Example:
• SELECT DISTINCT vehicleId FROM PosSpeedStr [RANGE 30 Seconds]
  Computes the active vehicles
Slide based on material from Jennifer Widom.
![Page 58: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/58.jpg)
Relation → Stream
• Istream(R) contains a stream element (r,t) whenever r in R(t) \ R(t-1) (“insert stream”)
• Dstream(R) contains a stream element (r,t) whenever r in R(t-1) \ R(t) (“delete stream”)
• Rstream(R) contains a stream element (r,t) whenever r in R(t) (“relation stream”)
Bag (multiset) semantics
Slide based on material from Jennifer Widom.
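The three relation-to-stream operators can be read as simple differences between two consecutive snapshots of R. The Java sketch below illustrates this under a simplifying assumption: it uses sets rather than the bag semantics stated above, so duplicates are ignored. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of Istream, Dstream, and Rstream under set (not bag) semantics. */
class RelationToStream {
    /** Istream at time t: the tuples in R(t) \ R(t-1). */
    static <T> Set<T> istream(Set<T> previous, Set<T> current) {
        Set<T> inserted = new LinkedHashSet<>(current);
        inserted.removeAll(previous);
        return inserted;
    }

    /** Dstream at time t: the tuples in R(t-1) \ R(t). */
    static <T> Set<T> dstream(Set<T> previous, Set<T> current) {
        Set<T> deleted = new LinkedHashSet<>(previous);
        deleted.removeAll(current);
        return deleted;
    }

    /** Rstream at time t: every tuple in R(t). */
    static <T> Set<T> rstream(Set<T> current) {
        return new LinkedHashSet<>(current);
    }

    public static void main(String[] args) {
        Set<String> before = new LinkedHashSet<>(Arrays.asList("v1", "v2")); // R(t-1)
        Set<String> now = new LinkedHashSet<>(Arrays.asList("v2", "v3"));    // R(t)
        System.out.println("Istream: " + istream(before, now));
        System.out.println("Dstream: " + dstream(before, now));
        System.out.println("Rstream: " + rstream(now));
    }
}
```

Each returned tuple r corresponds to a stream element (r,t) for the time step t at which the two snapshots were taken.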
![Page 59: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/59.jpg)
Istream, Dstream, and Rstream
• Istream(R): contains all tuples of R that are new within the last time period, i.e., the insert stream
• Dstream(R): contains all tuples that were in R before the last time period and are no longer in R now, i.e., the delete stream
• Rstream(R): contains all tuples of R
Note: Istream and Dstream are expressible with Rstream and suitable selections/windows. How?
![Page 60: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/60.jpg)
Relation → Stream: Examples
SELECT Istream(*)
FROM PosSpeedStr [RANGE Unbounded]
WHERE speed > 65

SELECT Rstream(*)
FROM PosSpeedStr [NOW]
WHERE speed > 65
[NOW] is a sliding window that contains only the tuples arriving at the current instant in time
Slide based on material from Jennifer Widom.
![Page 61: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/61.jpg)
Query Results at Time T
• Use all relations at time T
• Use all streams up to T, converted to relations
• Compute relational results
• Convert result to streams if desired
Slide based on material from Jennifer Widom.
![Page 62: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/62.jpg)
Examples
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)
SELECT F.clerk, max(O.cost)
FROM O [∞], F [Rows 1000]
WHERE O.orderID = F.orderID
GROUP BY F.clerk
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
Slide based on material from Jennifer Widom.
![Page 63: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/63.jpg)
Examples (Cont’d)
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)
SELECT Istream(F.clerk, max(O.cost))
FROM O [∞], F [Rows 1000]
WHERE O.orderID = F.orderID
GROUP BY F.clerk
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
• Streamed result: new result (<clerk, max>, T), whenever <clerk, max> changes from T-1
Slide based on material from Jennifer Widom.
![Page 64: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/64.jpg)
Examples (Cont’d)
• What is the following query doing?
SELECT Istream(Avg(A))
FROM S [Range 5 seconds]
It computes the 5-second moving average at every time step, but output is generated only if the average changes (Istream!)
• To emit a result at every time step:
SELECT Rstream(Avg(A))
FROM S [Range 5 seconds]
• To emit a result every second:
SELECT Rstream(Avg(A))
FROM S [Range 5 seconds Slide 1 second]
Slide based on material from Jennifer Widom.
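The difference between the Istream and Rstream variants of the moving-average query can be made concrete with a small Java sketch. It rests on several assumptions that are not from the slides: time is measured in whole seconds, the window at time `now` covers timestamps from `now - range` to `now` inclusive, and an empty window yields no output (represented as `null`). The class `MovingAverage` and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of Avg(A) over S [Range r seconds] with Rstream and Istream output. */
class MovingAverage {
    private final long range;
    // Window contents as {timestamp, value} pairs in timestamp order.
    private final Deque<long[]> window = new ArrayDeque<>();
    private Double previousAverage = null;

    MovingAverage(long rangeSeconds) {
        this.range = rangeSeconds;
    }

    /** Adds a stream element; timestamps must be non-decreasing. */
    void insert(long timestamp, long value) {
        window.addLast(new long[] {timestamp, value});
    }

    /** Rstream: the average at time now (non-decreasing across calls), or null if the window is empty. */
    Double rstream(long now) {
        // Expire tuples that have fallen out of the time-based window.
        while (!window.isEmpty() && window.peekFirst()[0] < now - range) {
            window.removeFirst();
        }
        if (window.isEmpty()) {
            return null;
        }
        long sum = 0;
        for (long[] element : window) {
            sum += element[1];
        }
        return (double) sum / window.size();
    }

    /** Istream: the average at time now, but only if it differs from the previous time step. */
    Double istream(long now) {
        Double average = rstream(now);
        boolean changed = average != null && !average.equals(previousAverage);
        previousAverage = average;
        return changed ? average : null;
    }

    public static void main(String[] args) {
        MovingAverage avg = new MovingAverage(5);
        avg.insert(0, 10);
        System.out.println(avg.istream(0)); // first result is always new
        System.out.println(avg.istream(1)); // unchanged average, so no output
        avg.insert(2, 20);
        System.out.println(avg.istream(2)); // average changed
    }
}
```

Calling `rstream` at every time step reports the average each time, while `istream` stays silent as long as the average does not change.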
![Page 65: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/65.jpg)
Query Execution in STREAM
• When a continuous query is registered, generate a query execution plan
– New plan merged with existing plans
– Users can also create & manipulate plans directly
• Plans composed of three main components:
– Operators
– Queues (input and inter-operator)
– State (windows, operators requiring history)
• Global scheduler for plan execution
Slide based on material from Jennifer Widom.
![Page 66: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/66.jpg)
More Topics
• So far we have seen only the formal model and standard concepts of data stream management systems
• There is of course much more to it
• Implementation, optimization (e.g., equivalences), load shedding, ...
• This would fill an entire lecture by itself.
• Next: distributed data stream management systems
![Page 67: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/67.jpg)
Query Processing
• Many problems to be addressed conceptually resemble issues that arise in traditional RDBMS
• The goals of a DSMS are different in many aspects, though.
– Continuous queries
– Push-based data model
– Aim at real-time processing
– Need for memory-efficient algorithms
– Handle overload to guarantee real-time processing; load shedding
– Sharing of intermediate results (multi-query optimization)
![Page 68: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/68.jpg)
Implementation and Processing
• Query is compiled into a query execution plan (similar to what is known from RDBMS lectures)
• Recall the differences between DBMS and DSMS: data is actively streaming in.
• What does this imply for the implementation?
![Page 69: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/69.jpg)
Push vs. Pull
• Two fundamentally different ways in which operators (nodes in a query plan) interact
• Pull: The consuming operator actively retrieves results from the producer.
• Push: The producer pushes results to the consumer.
![Page 70: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/70.jpg)
Pull
• We all know that from DBMS (think JDBC or operator trees) or Java Iterators
ResultSet rset = statement.executeQuery("Select * from ….");
while (rset.next()) {
    rset.getInt(1); …
}
SELECT c.plate, p.lastname
FROM people p JOIN cars c ON p.id = c.owner
WHERE c.plate LIKE 'KL-%'
[Figure: operator tree for this query with SCAN people, SCAN cars, join (left.id = right.owner), σ (plate LIKE 'KL-%'), and π (plate, lastname); operator interface: “OPEN, NEXT, CLOSE”]
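The pull model can be illustrated with a selection operator written as a Java `Iterator`: nothing is computed until the consumer asks for the next tuple, which in turn pulls from the child. This is a minimal sketch; the class name `SelectIterator` and the sample plates are made up for the example, and OPEN and CLOSE are reduced to construction and garbage collection.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical pull-based selection: the consumer asks for the next tuple on demand. */
class SelectIterator<T> implements Iterator<T> {
    private final Iterator<T> child;      // producer, e.g. a table scan
    private final Predicate<T> predicate; // selection condition
    private T lookahead;                  // next matching tuple, valid only if ready
    private boolean ready = false;

    SelectIterator(Iterator<T> child, Predicate<T> predicate) {
        this.child = child;
        this.predicate = predicate;
    }

    @Override
    public boolean hasNext() {
        // Pull from the child until a tuple satisfies the predicate.
        while (!ready && child.hasNext()) {
            T candidate = child.next();
            if (predicate.test(candidate)) {
                lookahead = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        return lookahead;
    }

    public static void main(String[] args) {
        List<String> plates = Arrays.asList("KL-A 1", "B-XY 2", "KL-Z 9");
        // The list iterator plays the role of SCAN cars.
        Iterator<String> plan = new SelectIterator<>(plates.iterator(), p -> p.startsWith("KL-"));
        while (plan.hasNext()) {
            System.out.println(plan.next());
        }
    }
}
```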
![Page 71: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/71.jpg)
Push
• Stream processing is by design mainly data-driven
• Operators register at other operators
• When new tuples are generated, they are actively pushed to the registered operators
• This creates a directed acyclic graph (DAG) of operators, e.g., called a topology in the systems discussed later
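In the push model the control flow is reversed: a downstream operator registers at its producer, and the producer calls it for every new tuple. The Java sketch below illustrates this with a selection operator; `PushSelect` and `register` are hypothetical names, and real systems add queues, threads, and distribution on top of this idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical push-based selection: forwards matching tuples to registered consumers. */
class PushSelect<T> implements Consumer<T> {
    private final Predicate<T> predicate;
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    PushSelect(Predicate<T> predicate) {
        this.predicate = predicate;
    }

    /** A downstream operator registers itself at this operator. */
    void register(Consumer<T> consumer) {
        subscribers.add(consumer);
    }

    /** Called by the upstream producer whenever a new tuple arrives. */
    @Override
    public void accept(T tuple) {
        if (predicate.test(tuple)) {
            for (Consumer<T> subscriber : subscribers) {
                subscriber.accept(tuple); // actively push the result downstream
            }
        }
    }

    public static void main(String[] args) {
        PushSelect<Integer> speeding = new PushSelect<>(speed -> speed > 65);
        speeding.register(speed -> System.out.println("alert: " + speed));
        // The source drives the computation by pushing tuples into the plan.
        for (int speed : new int[] {60, 70, 80}) {
            speeding.accept(speed);
        }
    }
}
```

Because an operator can have several registered consumers, and can itself register at several producers, chaining such operators yields the directed acyclic graph mentioned above.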
![Page 72: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/72.jpg)
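STREAM: Simple Query Plan
[Figure: query plan for two queries Q1 and Q2 over Stream1, Stream2, and Stream3, built from two joins (⋈) and a selection (σ), with operator states State1 to State4]
Slide courtesy of Jennifer Widom.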
![Page 73: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/73.jpg)
Query Plans in STREAM
• Operators
– do the actual processing
– e.g., join, selection, window, …
• Queues
– connect operators
• Synopses
– store operator states, for instance the hash table of a hash-based join
[Figure: plan fragment with a selection (σ) and a join (⋈) with State1]
![Page 74: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/74.jpg)
Queues
• A queue connects a tuple-producing operator OP and its consuming operator OC
• Conceptually a FIFO buffer
• Elements are inserted and retrieved in timestamp order
• Shared queues: multiple consumers for one producer are possible
[Figure: operator OP connected to operator OC through a queue]
![Page 75: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/75.jpg)
Operator Decoupling
• Queues allow decoupling of operators
• Consumers read from queue
• Producers write to queue
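The decoupling can be sketched with a plain FIFO queue between a producer and a consumer: neither calls the other, so a scheduler is free to run them at different times and with different budgets. The Java code below is only an illustration; the class `QueueDecoupling`, its methods, and the `budget` parameter are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch: a producer OP and a consumer OC that interact only through a queue. */
class QueueDecoupling {
    /** Producer OP: writes its output tuples to the queue and returns immediately. */
    static void produce(int[] tuples, Queue<Integer> queue) {
        for (int tuple : tuples) {
            queue.add(tuple);
        }
    }

    /** Consumer OC: reads at most budget tuples from the queue, in FIFO order. */
    static List<Integer> consume(Queue<Integer> queue, int budget) {
        List<Integer> processed = new ArrayList<>();
        while (budget-- > 0 && !queue.isEmpty()) {
            processed.add(queue.remove());
        }
        return processed;
    }

    public static void main(String[] args) {
        Queue<Integer> queue = new ArrayDeque<>();
        produce(new int[] {1, 2, 3}, queue);        // scheduler runs OP
        System.out.println(consume(queue, 2));      // scheduler runs OC with a small budget
        System.out.println(queue.size() + " tuple(s) still buffered");
    }
}
```

Tuples that the consumer has not yet processed simply stay in the queue until the consumer is scheduled again.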
![Page 76: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/76.jpg)
Distributed DSMS
• Conceptually, distributed data stream management systems behave/look like centralized ones
• STREAM (seen before)
• Borealis (Brandeis U, Brown U, MIT)
• Global Sensor Networks (EPFL)
• …
Abadi et al.: The Design of the Borealis Stream Processing Engine. CIDR 2005: 277-289
Karl Aberer et al.: Infrastructure for Data Processing in Large-Scale Interconnected Sensor Networks. MDM 2007: 198-205
![Page 77: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/77.jpg)
Distributed DSMS (Cont’d)
• In the spirit of the beginning of the lecture on MapReduce / NoSQL, we look at very recent distributed DSMS for big data (stream) processing
– Yahoo! S4 (now Apache)
– Twitter (Apache) Storm
• Many concepts are generic, e.g., the operator interfaces and topologies.
![Page 78: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/78.jpg)
(Generic) Aims
• Guaranteed data processing
• Fault tolerance
• Horizontal scalability
• Enable high-level programming
• Sounds like MapReduce/Hadoop? Well …
![Page 79: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/79.jpg)
Apache Storm
• Sometimes referred to as “the realtime Hadoop”
• Fault-tolerant, distributed stream processing system. Developed by N. Marz (now Twitter) in 2011
• Widely used by companies
• Data stream operators can be placed on different nodes; operators of the same kind are replicated for scalability.
![Page 80: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/80.jpg)
Trident
• Guess what? There is a high-level abstraction on top of Storm.
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);
https://github.com/nathanmarz/storm/wiki/Trident-tutorial
![Page 81: Big Data Management and NoSQL Databasespages.iai.uni-bonn.de/behrend_andreas/lehre/FIM/SS... · Big Data Management and NoSQL Databases Lecture 13. Data Stream Management PD Dr. Andreas](https://reader034.fdocuments.us/reader034/viewer/2022050605/5facf9463d48d04192101cf1/html5/thumbnails/81.jpg)
Literature
• Arvind Arasu, Shivnath Babu, Jennifer Widom: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2): 121-142 (2006)
• Arvind Arasu et al.: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
• http://infolab.stanford.edu/~widom/cql-talk.pdf
• Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, Walker M. White: Cayuga: A General Purpose Event Monitoring System. CIDR 2007: 412-422
• Jürgen Krämer, Bernhard Seeger: Semantics and implementation of continuous sliding window queries over data streams. ACM Trans. Database Syst. 34(1) (2009)