STREAM The Stanford Data Stream Management System.
Uploaded by bartholomew-williams
STREAM
The Stanford Data Stream Management System
Presentation Structure
• Introduction
• CQL: Continuous Query Language
– Abstract Semantics
– Data Types
– Operators
• Query Plan & Execution
Introduction
• The system is designed for limited resource environments where streams may be rapid, and query loads may vary over time.
CQL: Continuous Query Language
• For simple continuous queries over streams, it can be sufficient to use a relational query language such as SQL.
• However, for more complex queries the intended meaning can quickly become unclear.
Abstract Semantics
• 2 Data Types:
– Streams
– Relations
• Defined on a discrete, ordered time domain
• 3 Types of Operators
Streams
A stream S is an unbounded bag (multiset) of pairs <s,t>, where s is a tuple and t is the timestamp that denotes the logical arrival time of tuple s on stream S.
Relations
A relation R is a time-varying bag of tuples. The bag of tuples at time t is denoted R(t), and we call R(t) an instantaneous relation.
Note that tuples in R(t) have no time-stamp.
Operator Diagram
Operator Classes
• A relation-to-relation operator takes one or more relations as input and produces a relation as output.
• A stream-to-relation operator takes a stream as input and produces a relation as output.
• A relation-to-stream operator takes a relation as input and produces a stream as output.
• Stream-to-stream operators are absent; they are composed from operators of the above classes.
Query Structure
• A continuous query Q is a tree of operators belonging to the aforementioned classes.
• The inputs of Q are the streams and relations that are input to the leaf operators.
• The output of Q is the output of the root operator.
• The output is either a stream or a relation, depending on the class of the root operator.
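The tree structure described above can be sketched in a few lines of Python. This is only an illustration: the names `Source`, `Op`, `CLASSES`, and `output_kind` are hypothetical and are not part of STREAM. The sketch checks that each operator receives the kind of input its class expects and reports whether the root produces a stream or a relation.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical table: operator class -> (kind consumed, kind produced).
CLASSES = {
    "relation-to-relation": ("relation", "relation"),
    "stream-to-relation": ("stream", "relation"),
    "relation-to-stream": ("relation", "stream"),
}


@dataclass
class Source:
    """A named input stream or relation feeding a leaf operator."""
    name: str
    kind: str  # "stream" or "relation"


@dataclass
class Op:
    """One operator node in the query tree."""
    name: str
    op_class: str
    inputs: List[Union["Op", Source]] = field(default_factory=list)


def output_kind(node: Union[Op, Source]) -> str:
    """Return 'stream' or 'relation' for the output of this node."""
    if isinstance(node, Source):
        return node.kind
    consumed, produced = CLASSES[node.op_class]
    for child in node.inputs:
        if output_kind(child) != consumed:
            raise TypeError(f"{node.name} expects {consumed} inputs")
    return produced


# A window, then a filter, then a relation-to-stream operator at the root.
plan = Op("istream", "relation-to-stream", [
    Op("select", "relation-to-relation", [
        Op("window", "stream-to-relation", [Source("S", "stream")]),
    ]),
])
print(output_kind(plan))  # stream
```

Because the root operator here is relation-to-stream, the query output is a stream; dropping the root would leave a relation-to-relation root and therefore a relation as output.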
Output Timestamp
• At time t, an operator of Q logically depends on its inputs up to t.
• The operator therefore produces new outputs corresponding to t:
– tuples of S with timestamp t, if the output is a stream S, or
– the instantaneous relation R(t), if the output is a relation R
Relation-to-Relation Operators
• CQL uses SQL constructs to express its relation-to-relation operators
• i.e. SELECT ... FROM …
Class Operator Diagram
Name                 Operator Type         Description
select               relation-to-relation  Filters elements based on predicate(s)
project              relation-to-relation  Duplicate-preserving projection
binary-join          relation-to-relation  Joins two input relations
mjoin                relation-to-relation  Multiway join from [22]
union                relation-to-relation  Bag union
except               relation-to-relation  Bag difference
intersect            relation-to-relation  Bag intersection
antisemijoin         relation-to-relation  Antisemijoin of two input relations
aggregate            relation-to-relation  Performs grouping and aggregation
duplicate-eliminate  relation-to-relation  Performs duplicate elimination
Stream-to-Relation Operators
• Based on a sliding window principle.
• 3 Types of Windows:
– Tuple-based window
– Time-based window
– Partitioned window
Tuple-based Window
• A tuple-based sliding window on a stream S takes an integer N > 0 as a parameter and produces a relation R. At time t, R(t) contains the N tuples of S with the largest timestamps ≤ t (or all tuples of S up to t if there are fewer than N).
• Example: R(14) [Rows 5]
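A minimal Python sketch of a tuple-based window, assuming a stream is a list of (tuple, timestamp) pairs and that tuples stamped exactly t count as part of R(t). The function name `rows_window` and the sample data are hypothetical; how ties at the window boundary are broken is not specified here, so the sketch simply keeps arrival order.

```python
from typing import Any, List, Tuple

Element = Tuple[Any, int]  # (tuple s, timestamp t)


def rows_window(stream: List[Element], n: int, t: int) -> List[Any]:
    """Return R(t) for S [Rows n]: the n most recent tuples up to time t."""
    if n <= 0:
        raise ValueError("n must be > 0")
    # sorted() is stable, so tuples with equal timestamps keep arrival order.
    seen = [s for s, ts in sorted(stream, key=lambda e: e[1]) if ts <= t]
    return seen[-n:]


# Hypothetical sample stream; the letters stand in for tuples.
S = [("a", 1), ("b", 3), ("c", 5), ("d", 8),
     ("e", 9), ("f", 12), ("g", 14), ("h", 15)]
print(rows_window(S, 5, 14))  # ['c', 'd', 'e', 'f', 'g']
```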
Time-based window
• A time-based sliding window on a stream S takes a time interval w as a parameter and produces a relation R. At time t, R(t) contains all tuples of S with timestamps between t-w and t.
• Example: R(9) [Range 4]
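The time-based window can be sketched the same way. This assumes both bounds are inclusive, which is one reading of "between t-w and t"; `range_window` and the sample stream are hypothetical names and data.

```python
from typing import Any, List, Tuple

Element = Tuple[Any, int]  # (tuple s, timestamp t)


def range_window(stream: List[Element], w: int, t: int) -> List[Any]:
    """Return R(t) for S [Range w]: tuples with t - w <= timestamp <= t."""
    return [s for s, ts in stream if t - w <= ts <= t]


# Hypothetical sample stream; the letters stand in for tuples.
S = [("a", 1), ("b", 3), ("c", 5), ("d", 8),
     ("e", 9), ("f", 12), ("g", 14), ("h", 15)]
print(range_window(S, 4, 9))  # ['c', 'd', 'e']
```

Unlike the tuple-based window, the size of the result here depends on how many tuples happened to arrive in the interval, so it can be empty or arbitrarily large.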
Partitioned Window
• A partitioned sliding window on a stream S takes an integer N and a set of attributes {A1, ..., Ak} of S as parameters, and is specified by following S with [Partition By A1,...,Ak Rows N]. It logically partitions S into different substreams, one per combination of values of A1, ..., Ak.
• HINT: Rows N is then applied as a tuple-based window on each substream.
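A short sketch of the partitioned window, assuming tuples are dictionaries, that tuples stamped exactly t are included, and that the output relation is the union of the per-substream windows. The name `partitioned_window` and the sample attributes `sym` and `p` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Element = Tuple[Row, int]  # (tuple s, timestamp t)


def partitioned_window(stream: List[Element], attrs: List[str],
                       n: int, t: int) -> List[Row]:
    """Return R(t) for S [Partition By attrs Rows n]."""
    substreams = defaultdict(list)
    for s, ts in sorted(stream, key=lambda e: e[1]):
        if ts <= t:
            # Tuples that agree on every partitioning attribute share a key.
            substreams[tuple(s[a] for a in attrs)].append(s)
    result = []
    for tuples in substreams.values():
        result.extend(tuples[-n:])  # tuple-based window per substream
    return result


S = [({"sym": "A", "p": 1}, 1), ({"sym": "B", "p": 2}, 2),
     ({"sym": "A", "p": 3}, 3), ({"sym": "A", "p": 4}, 4)]
window = partitioned_window(S, ["sym"], 2, 4)
print(sorted(row["p"] for row in window))  # [2, 3, 4]
```

The oldest "A" tuple falls out of its substream's window while the single "B" tuple stays, which a plain [Rows 2] window over the whole stream would not preserve.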
Relation-to-Stream Operators
• 3 Relation-to-stream operators:
• Istream (for Insert Stream)
• Dstream (for Delete Stream)
• Rstream (for Relation Stream)
R-to-S Operators
• IS: Istream applied to a relation R contains <s,t> whenever tuple s is in R(t) − R(t − 1)
– i.e., whenever s is inserted into R at time t.
• DS: Dstream applied to a relation R contains <s,t> whenever tuple s is in R(t − 1) − R(t)
– i.e., whenever s is deleted from R at time t.
• RS: Rstream applied to a relation R contains <s,t> whenever tuple s is in R(t)
– i.e., every current tuple in R is streamed at every time instant.
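The three definitions translate almost directly into bag arithmetic. The sketch below uses Python's `collections.Counter` as the bag type and assumes that a relation is given as a dictionary from time instants to lists of tuples, with R(t) empty for any instant not listed (in particular before the first one). The function names and sample relation are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple

Relation = Dict[int, List[Hashable]]  # time instant -> bag of tuples R(t)
StreamOut = List[Tuple[Hashable, int]]


def bag(relation: Relation, t: int) -> Counter:
    """R(t) as a multiset; instants that are not listed count as empty."""
    return Counter(relation.get(t, []))


def istream(relation: Relation, times: Iterable[int]) -> StreamOut:
    """<s,t> for every s in R(t) - R(t-1) (bag difference)."""
    return [(s, t) for t in times
            for s in (bag(relation, t) - bag(relation, t - 1)).elements()]


def dstream(relation: Relation, times: Iterable[int]) -> StreamOut:
    """<s,t> for every s in R(t-1) - R(t) (bag difference)."""
    return [(s, t) for t in times
            for s in (bag(relation, t - 1) - bag(relation, t)).elements()]


def rstream(relation: Relation, times: Iterable[int]) -> StreamOut:
    """<s,t> for every s in R(t)."""
    return [(s, t) for t in times for s in bag(relation, t).elements()]


R = {0: ["a"], 1: ["a", "b"], 2: ["b"]}
print(istream(R, range(3)))  # [('a', 0), ('b', 1)]
print(dstream(R, range(3)))  # [('a', 2)]
print(rstream(R, range(3)))  # [('a', 0), ('a', 1), ('b', 1), ('b', 2)]
```

`Counter` subtraction drops non-positive counts, which matches bag difference: a tuple appears in the result only as many times as it was actually inserted or deleted.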
Example 1 CQL Query
• Select Istream(*) From S [Rows Unbounded] Where S.A > 10
– S[Rows Unbounded] (stream-to-relation)
– S.A > 10 (relation-to-relation)
– IStream(*) (relation-to-stream)
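To see how the three steps of Example 1 compose, here is a small self-contained sketch that evaluates the query instant by instant. It assumes tuples with timestamp equal to t are visible at t and that the relation is empty before the first instant; `Row`, `rows_unbounded`, `select_a_gt_10`, and `example1` are hypothetical names, and this naive re-evaluation is not how STREAM executes the query.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Row(NamedTuple):
    A: int


Element = Tuple[Row, int]  # (tuple s, timestamp t)


def rows_unbounded(stream: List[Element], t: int) -> List[Row]:
    """S [Rows Unbounded]: every tuple that has arrived up to time t."""
    return [s for s, ts in stream if ts <= t]


def select_a_gt_10(relation: List[Row]) -> List[Row]:
    """Where S.A > 10."""
    return [s for s in relation if s.A > 10]


def example1(stream: List[Element], times: Iterable[int]) -> List[Element]:
    """Istream over the filtered relation, for consecutive instants."""
    out = []
    previous = Counter()  # relation before the first instant: empty
    for t in times:
        current = Counter(select_a_gt_10(rows_unbounded(stream, t)))
        out.extend((s, t) for s in (current - previous).elements())
        previous = current
    return out


S = [(Row(5), 1), (Row(12), 2), (Row(20), 2), (Row(7), 3), (Row(11), 4)]
print(example1(S, range(1, 5)))
# [(Row(A=12), 2), (Row(A=20), 2), (Row(A=11), 4)]
```

The result is simply the input stream with the tuples failing the predicate removed, each keeping its original timestamp.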
Example 2 CQL Query
• Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
– S1 [Rows 1000] (Stream-to-Relation)
– S2 [Range 2 Minutes] (Stream-to-Relation)
– S1.A = S2.A (Relation-to-relation)
– S1.A > 10 (Relation-to-Relation)
– * (Relation-to-Relation)
Example 3 CQL Query
• Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A
– S [Now] (Stream-To-Relation)
– R (Stream-To-Relation)
  • assumes [Rows Unbounded]
– S.A = R.A (Relation-To-Relation)
– RStream(S.A, R.B) (Relation-To-Stream)
Query Plans & Execution
• When a continuous query is to be executed within STREAM, a query plan is compiled from it.
• Query plans are composed of:
– Operators (to perform the actual processing)
– Queues (buffer tuples as they move between operators)
– Synopses (which I will not discuss)
Operators
• In order to allow processing, each timestamped tuple is additionally flagged for 'insertion' or 'deletion' (+ or −)
• Streams only include + elements, while relations may include both + and − elements
• Each query plan operator reads from one or more input queues, processes the input based on its semantics, and writes any output to an output queue.
Queues
• A queue in a query plan connects its "producing" plan operator OP to its "consuming" operator OC. At any time a queue contains a (possibly empty) collection of elements representing a portion of a stream or relation.
• Furthermore, the system requires all queues to enforce non-decreasing timestamps, to allow for all possible operations. (Very Important)
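A minimal sketch of such a queue, assuming elements are (tuple, timestamp, sign) triples and that an out-of-order element is simply rejected. The class `PlanQueue` and its methods are hypothetical and do not reflect STREAM's actual implementation.

```python
from collections import deque
from typing import Any, Optional, Tuple

QueueElement = Tuple[Any, int, str]  # (tuple, timestamp, "+" or "-")


class PlanQueue:
    """FIFO buffer between a producing and a consuming operator."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self._last_ts: Optional[int] = None

    def enqueue(self, tup: Any, ts: int, sign: str = "+") -> None:
        if sign not in ("+", "-"):
            raise ValueError("sign must be '+' or '-'")
        # Enforce non-decreasing timestamps; equal timestamps are allowed.
        if self._last_ts is not None and ts < self._last_ts:
            raise ValueError(f"timestamp {ts} is older than {self._last_ts}")
        self._items.append((tup, ts, sign))
        self._last_ts = ts

    def dequeue(self) -> Optional[QueueElement]:
        """Return the oldest element, or None if the queue is empty."""
        return self._items.popleft() if self._items else None


q = PlanQueue()
q.enqueue(("a",), 1)
q.enqueue(("b",), 1)
q.enqueue(("a",), 2, "-")
print(q.dequeue())  # (('a',), 1, '+')
```

With this invariant in place, a consumer that has read an element stamped t knows it will never later see an element stamped earlier than t on that queue.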
Queue Diagram
Query Plan (Example 2)
• Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
Query Plan (Example 1)
• Select Istream(*) From S [Rows Unbounded] Where S.A > 10
• Q1: Stream Queue
• SW: all of Q1 copied
• Q2: Relation
• Sel: on S.A > 10
• Q3: Relation
• I-S: R-to-S
• Q4: Stream
Query Plan (Example 3)
• Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A
Query Plan Scheduling
• When a query plan is executed, a scheduler selects operators in the plan to execute in turn.
• The semantics of each operator depends only on the timestamps of the elements it processes, not on system or "wall-clock" time.
• Thus, the order of execution has no effect on the data in the query result, although it can affect other properties such as latency and resource utilization.
Execution Example
• The first seq-window (now just called SW1) reads (s,r,+)
• SW1 stores the tuple in its own buffer
• If the buffer then holds more than 1000 elements, SW1 removes the oldest element, called s'.
• SW1 writes to q3 (s,r,+) and (s',r,-)
• SW2 works similarly.
• Binary-Join (now called BJ3) reads (s,r,+) from q3
• Stores it in buffer 1
• Joins tuple with all elements of buffer 2
• Outputs (st,r,+) for each matching tuple t in buffer 2, where st is the concatenation of s and t
Execution (Part 2)
• BJ3 processes all its input queues in non-decreasing timestamp order.
• The Select Operator simply checks its input elements against its predicate and outputs those that pass.
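The execution steps above can be sketched as two small operators working on (tuple, timestamp, sign) elements. This is an illustrative simplification: `seq_window` and `BinaryJoin` are hypothetical names, tuples are plain Python tuples so that concatenation is `+`, and the handling of − elements in the join (remove from the buffer, emit matching − results) is an assumption that mirrors the + case.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Element = Tuple[tuple, int, str]  # (tuple, timestamp, "+" or "-")


def seq_window(elements: Iterable[Element], capacity: int) -> Iterator[Element]:
    """[Rows capacity]: emit each arrival as '+', each eviction as '-'."""
    buffer: deque = deque()
    for s, ts, _sign in elements:  # input is a stream: only '+' elements
        buffer.append(s)
        yield (s, ts, "+")
        if len(buffer) > capacity:
            yield (buffer.popleft(), ts, "-")  # oldest element s'


class BinaryJoin:
    """Keeps one buffer per input and joins each arrival with the other."""

    def __init__(self, pred: Callable[[tuple, tuple], bool]) -> None:
        self.buffers: Tuple[List[tuple], List[tuple]] = ([], [])
        self.pred = pred

    def process(self, side: int, element: Element) -> List[Element]:
        s, ts, sign = element
        own, other = self.buffers[side], self.buffers[1 - side]
        if sign == "+":
            own.append(s)
        else:
            own.remove(s)
        out = []
        for t in other:
            left, right = (s, t) if side == 0 else (t, s)
            if self.pred(left, right):
                out.append((left + right, ts, sign))  # concatenated tuple
        return out


stream = [((1,), 1, "+"), ((2,), 2, "+"), ((3,), 3, "+")]
print(list(seq_window(stream, 2)))
# [((1,), 1, '+'), ((2,), 2, '+'), ((3,), 3, '+'), ((1,), 3, '-')]

join = BinaryJoin(lambda left, right: left[0] == right[0])
join.process(1, ((1, "x"), 1, "+"))       # nothing to join with yet
print(join.process(0, ((1,), 2, "+")))    # [((1, 1, 'x'), 2, '+')]
```

A caller would still be responsible for feeding the join its two inputs in non-decreasing timestamp order; the sketch does not merge the queues itself.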