1 Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo.

1

Continuous Query Languages for DSMS

CS240B Notesby

Carlo Zaniolo

3

CQLs for DSMS

Most of DSMS projects use SQL for continuous queries—for good reasons, since Many applications span data streams and DB tables A CQL based on SQL will be easier to learn & use Moreover: the fewer the differences the better!

But DSMS were designed for persistent data and transient queries---not for persistent queries on transient data

Adaptation of SQL and its enabling technology presents many research challenges

Lack of expressive power—even worse now since only nonblocking operators are allowed.

4

Continuous Query Graph: many components—arbitrary DAGs

Source σ ∑1 Sink∑2

Source

SinkO2

SinkO3

O1

Source1

U

Source2 σ

Sink

Source1

U

Source2 σ

∑1 Sink

∑2 Sink

5

Relational Algebra Operators

Stored data Selection, Projection Union

Join (including X) on tables

Set Difference

Aggregates: Traditional Blocking aggregates OLAP functions on windows or

unlimited preceding

Data Streams ... same Union by Sort-Merging on timestamps Join of Stream with table Window joins on streams (timestamps

merged into 1 column) No stream difference (blocking—diff of

stream with table OK).

Aggregates: No blocking aggregate OLAP functions on windows or

unlimited preceding Slides, and tumbles.

6

Bolts and Nuts

create stream bids(bid#, item, offer, Time)

create stream mybids as (select bid#, offer, Time from bids where item=bolt union select bid#, offer, Time from bids where item=nut)

Result same as: select bid#, offer, Time where item= bolt or item=nut

7

Joins We could create a stream called interesting bids by say joining bids with

the ‘interesting_items’ table.

We next find the bolt bids for which there was a nut bid offered in the last 5 minutes for the same price.

create stream selfjoinbids as (select S1.bid#, S1.offer, S2.bid#, S2.Time from bids as S1, bids as S2 [window of 5 minutes] where S1.item=bolt and S2.item=nut and S1.offer=S2.offer)

The window condition implies that S1.Time >= S2.Time and S2.Time >= S1.Time-5 minutes.

Windows on both streams are used very often.

\

8

Processing Unions

Union: When tuples are present at all inputs, select one with minimal timestamp and

Production: add this tuple to the output, andConsumption: remove it from the input.

Source1

U

Source2 σ

Sink

σ

9

Window Joins

Window Join of Stream A and Stream B:When tuples are present at both inputs, and the timestamp of A is less or equal than that of B, then perform the following operations (symmetric operations are performed if timestamp of B is less or equal than that of A):

Production: compute the join of the tuple in A with the tuples in W(B) and add the resulting tuples to output buffer (these tuple have the same timestamp a the tuple in A)

Consumption: the current tuple in A is removed from the input and added to the window buffer W(A) (from which the expired tuples are also removed)

SourceA

join

SourceB σ

Sink

σ A

B

10

Relational Algebra Operators

Stored data Selection, Projection Union

Join (including X) on tables

Set Difference

Aggregates: Traditional Blocking

aggregates OLAP functions on windows

or unlimited preceding

Data Streams ... same Union by Sort-Merging on timestamps

Join of Stream with table Window joins on streams (timestamps

merged into 1 column) No stream difference (blocking—diff of

stream with table OK).

Aggregates: No blocking aggregate OLAP functions on windows

or unlimited preceding Slides, and tumbles.

Including UDAs

11

User-Defined Aggregates:Max Power via Min SQL Extensions

Windows (logical, physical, slides, tumbles,…): flexible synopses that solve the blocking problem for aggregates DSMS only support these constructs on built-in

aggregates ESL is the first to support the complete integration

of these two User Defined Aggregates (UDAs) —the key to

power and extensibility, and And thus can support data mining, XML, sequences not supported by other DSMS

One framework for aggregates and windows, whether they are built-ins or user-defined, and independent on the language used to define them.

12

Defining Traditional Aggregates

Specification consists of 3 blocks of code--- Written in an external PL (as DBMS and other DSMS do), or

In SQL itself (SQL becomesTuring Complete!) INITIALIZE

Executed upon the arrival of the first tuple ITERATE

Executed upon the arrival of each subsequent tuples (an incremental computation suitable for streams)

TERMINATE Executed after the end of the relation/stream has been reached

Invocation:SELECT myavg(start_price) FROM OpenAuction

13

The UDA AVG in SQL

AGGREGATE avg(Next Int) : Real{ TABLE state(tsum Int, cnt Int);

INITIALIZE : {INSERT INTO state VALUES (Next, 1);

}ITERATE : {

UPDATE stateSET tsum=tsum+Next, cnt=cnt+1;

}TERMINATE : {

INSERT INTO RETURNSELECT tsum/cnt FROM state;

}}

“INSERT INTO RETURN” in TERMINATE a blocking UDA

14

NonBlocking UDA: AVG of last 200 Values

AGGREGATE myavg(Next Int) : Real {TABLE state(tsum Int, cnt Int);

INITIALIZE : {INSERT INTO state VALUES (Next, 1);

}ITERATE : {

UPDATE state SET tsum=tsum+Next, cnt=cnt+1;

INSERT INTO RETURN SELECT tsum/cnt FROM state

WHERE cnt %200 =0; UPDATE state SET tsum=Next, cnt=1

WHERE cnt %200 =1 }

TERMINATE : { }}

Empty TERMINATE Denotes a non-blocking UDA

15

UDAs in ESL

In ESL user-defined Aggregates (UDAs) can be defined directly in SQL, rather than in a PL Native extensibility in SQL via UDAs

(which can also be defined in a PL for better performance)

No impedance mismatch Access to DB tables from UDAs Data Independence and optimization Good ease of use and performance Turing completeness & nb-completeness.

16

Data Intensive Applications & UDAs

Complex Applications can expressed concisely, with good performance

ATLAS: a single-user DBMS developed at UCLA. Support for SQL with UDAs On top of Berkeley-DB record manager.

Data Mining Algorithms in ATLAS Decision Tree Classifiers: 18 lines of codes APriori: 40 lines of codes Modest overhead: <50% w.r.t procedural UDA Data Stream Applications in ESL Data Stream Mining, approximate aggregates,

sketches, histograms, …

17

SQL:2003 OLAP FunctionsAggregates on Windows

CREATE STREAM LastTenAvgSELECT sellerID,

AVG(price) OVER(PARTITION BY sellerID ROWS 9 PRECEDING),

Current_timeFROM ClosedPrice;

CREATE STREAM ClosedAuction (/*auction closings */ itemID, /*id of the item in this auction.*/

buyerID /*buyer of this item.*/) Final price real /*final price of the item */, Current_time) order by … source …

Auctions

For each seller, show the average selling price over the last 10 items sold (physical window)

18

Optimizing Window AVG in ESL

WINDOW AGGREGATE avg(Next Real) : Real{ TABLE state(tsum Int, cnt Real);

TABLE inwindow(wnext Real);

INITIALIZE : { INSERT INTO state VALUES (Next, 1)}

ITERATE : { UPDATE state SET tsum=tsum+Next, cnt=cnt+1; INSERT INTO RETURN SELECT tsum/cnt FROM state}

EXPIRE: { /*if there are expired tuples, take the oldest */ UPDATE state SET cnt= cnt-1, tsum = tsum – (select wnext FROM inwindow WHERE oldest(inwindow)) }

}

•For each expired tuple decrease the count by one and the sum by the expired value—works for logical & physical windows

19

MAX

System maintains inwindow Remove dominated (less & older) values The oldest is always the max.

WINDOW AGGREGATE max (Next Real) : Real

{ TABLE inwindow(wnext real);

INITIALIZE : { etc.} /*system adds new tuples to inwindow*/

ITERATE : { DELETE FROM inwindow WHERE wnext <Next;

INSERT INTO RETURN SELECT wnext FROM inwindow WHERE oldest(inwindow) }

EXPIRE: { } /*expired tuples removed automatically*/

}

20

For Each Aggregate two versions

The traditional Base aggregate with terminate

The Window aggregate with inwindow and expire.

These definitions will take care of both logical and physical windows.

But there are more complications: slides and

tumbles.

21

Slides and Tumbles

CREATE STREAM LastTenAvgSELECT sellerID,

max(price) OVER(RANGE 10 MINUTE PRECEDING SLIDE 2 MINUTE),

Current_timeFROM ClosedPrice;

Every two minutes, show the average selling price over the last 10 minutes (logical window)

Here the window is W=10 and the slide is

S=2. Tumble: When S ≥ W

22

SLIDEs

The slide constructs divides a window into panes, results only returned at the end of each pane

Slide is conducive to optimization. Combine summaries into the desired aggregation E.g.: MAX(1, 2, 3, 4)= MAX(MAX(1,2), MAX(3,4)) = 4

I.e., for MAX, we can perform MAX on subsets of numbers as local summaries, then combine them together to get the true MAX

Proposed before: but what constructs should be used to integrate these concepts into the language?

window

slide/pane

window

Summary Tuples

23

Slides &Tumbles--Examples Tumble – where the SLIDE size is equal or larger

than the window size E.g. Once every 50 tuples, compute and return average

over the last 10 tuples Easy to optimize

Skip the first 40 tuples of every 50 tuples, and compute the blocking base version of the aggregate on the last 10

Slide – where slide size is smaller than the window size E.g. Once every 10 tuples, compute and return average

over the last 50 tuples Naïve implementation--not optimized

Perform incremental maintenance on every incoming tuple Ignore RETURN statements for most incoming tuples Only invoke RETURN once every 10 tuples

24

Pane-based SLIDE Optimization

Two-level cascading aggregates using two existing aggregates Perform sub-aggregation inside each pane using the base

aggregate No need for incremental maintenance here Computed with a blocking aggregate once for each pane

Combine the summary tuples using the window aggregate that returns on every incoming tuple (non-blocking)

With incremental maintenance here At any time, only the last un-finished pane needs to store data tuples all finished panes are reduced to one reusable summary tuple

window

Agg1 (base)

window

Agg2 (window)

25

Pane-based SLIDE optimizationExample: SUM with window size 50 tuples, and slide size 10

tuples First create a stream of summary tuples using base

aggregateCREATE STREAM temp AS ( SELECT itemID, base_max(sale_price) as s

OVER(PARTITION BY itemID ROWS 9 PRECEDING SLIDE 10)

FROM Auction); Then apply the window version of the aggregate

SELECT itemID, window_max(s) OVER(PARTITION BY itemID ROWS 4 PRECEDING)

FROM temp;

• This simple approach can be used to implement very complex aggregations (e.g. ensemble classifiers)• Applies uniformly to logical/physical windows defined in SQL or in an external language

26

Summary

{ Logical, Physical} x {tumble, slide, unlimited_preceding}

Six different types of calls, supported by two definitions

Both SQL or procedural languages can be used in the definition.

27

Window UDAs vs. Base UDAs

Base UDAs: called as traditional SQL-2 aggregates, with optional GROUP BY

Window UDAs: called with SQL:2003 OVER clause logical or physical windows optional PARTITION BY and SLIDE clauses in ESL

Clear semantics and optimization rules unify: UDAs—SQL or PL-defined, algebraic or not … window (logical & physical), slice, tumbles, etc. System and user roles in optimization.

28

Window UDAs: Physical Optimization

The Stream Mill System provides efficient support for: Management of new & expiring tuples in buffer Main memory & intelligent paging into disk Events caused by tuple expiration Users can access the buffer as the table called

inwindow

29

Conclusion

Language Technology: ESL a very powerful language for data stream and DB

applications Simple semantics and unified syntax conforming to

SQL:2003 standards Strong case for the DB-oriented approach to data

streams System Technology:

Some performance-oriented techniques well-developed—e.g., buffer management for windows

For others: work is still in progress—stay tuned for latest news

Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill

http://wis.cs.ucla.edu/stream-mill



30

*********

The EndTHANK YOU !

*****

31

References[1]ATLaS user manual. http://wis.cs.ucla.edu/atlas. [2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of

Lecture Notes in Computer Science. Springer, 1999. [4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous

queries over streams and relations. Technical report, Stanford University, 2002. [5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream

systems. In PODS, 2002. [9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N.

Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.

[10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995. [11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002. [12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for

internet databases. In SIGMOD, pages 379-390, May 2000. [13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream

database for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003. [14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record,

32(2):5-14, 2003.

[15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. [16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language

and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006

[17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in Data Stream Management Systems. ICDE 2007.

32

References (Cont.)[18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database

Sequences and Data Streams. VLDB 2004: 492-503[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously

adaptive continuous queries over streams. In SIGMOD, pages 49-61, 2002. [20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J. Rosenstein,

and R. Varma. Query processing, approximation, and resource management in a data stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.

[21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted relational query language, 1998.

[23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001.

[24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.

[25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998. [26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages

232-239, Taipei, Taiwan, March 1995. [27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM

SIGMOD 1994, pages 430-441. ACM Press, 1994. [28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996. [29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In

SIGMOD, pages 321-330, 6 1992. [30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics

in continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003. [31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In

Proceedings of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.

1 Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo.

Documents

Transcript of 1 Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo.