Post on 17-Jul-2015
Data Stream Management
Authors
Lukasz Golab & M. Tamer Özsu
Supervised by
Dr. Sakti Pramanik
Presented by
AKM Tauhidul Islam
Outline
• Introduction
o Motivation
o Problem Statement
o Definitions
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
• Discussion
Introduction
• Stream data: produced incrementally over time, rather than being available in full before its processing begins
• Examples:
o Stream types: transaction data streams, log streams
o Sources: credit card purchases, telecommunications, Web accesses, climate data, GPS tracking, sensor networks, IP networks
• Applications:
o Sensor networks, e.g., TinyDB
o Network traffic analysis, e.g., traffic statistics and critical condition detection
o Financial tickers: online analysis of stock prices to discover correlations and identify trends
o Transaction log analysis, e.g., Web click streams and telephone calls
Motivation
• Massive data sets:
o Huge numbers of users, e.g.,
• AT&T long-distance: ~300M calls/day
• AT&T IP backbone: ~10B IP flows/day
o Highly detailed measurements, e.g.,
• NOAA: satellite-based measurements of earth geodetics
o Huge number of measurement points, e.g.,
• Sensor networks with a huge number of sensors
• Near real-time analysis
o ISP: controlling service levels
o NOAA: tornado detection using weather radar
o Hospital: patient monitoring
• Traditional data feeds
o Simple queries (e.g., value lookup) needed in real time
o Complex queries (e.g., trend analyses) performed offline
Problem Statement
Property            DBMS                  DSMS
Data                Persistent relations  Streams, time windows
Data access         Random                Sequential, one-pass
Updates             Arbitrary             Append-only
Update rates        Relatively low        High, bursty
Processing model    Query-driven          Data-driven
Queries             One-time              Continuous
Query plans         Fixed                 Adaptive
Query optimization  One query             Multi-query
Query answers       Exact                 Exact or approximate
Latency             Relatively high       Low

Property            Data warehouse        SDW
Data                Historical            Recent and historical
Update frequency    Low                   High
Update propagation  Synchronous           Asynchronous
ETL process         Complex               Fast, lightweight
Fig : Comparison of Data Stream Management Systems and Streaming Data Warehouses with traditional database and warehouse systems
Definitions
• Non-blocking execution: query operator Q does not require its entire input before producing results
• Monotonicity: all previous results are preserved
o Q(τ) ⊆ Q(τ') for query operator Q, where τ <= τ'
o Q is monotonic only if it is non-blocking
• Delta: for an operator that does not satisfy the monotonicity property, the update to the result produced at time τ, expressed as negative / positive deltas
• Punctuation: special tuple containing a predicate that is guaranteed to be satisfied by the remainder of the data stream
• Heartbeat: punctuation that governs the timestamps of future tuples
• Average slowdown = tuple response time / shortest processing time
Outline
• Introduction
• Data Stream Management System (DSMS)
o Stream Data Models
o Query Language & Semantics
o Query Processing
o Query Optimization
• Streaming Data Warehouse (SDW)
• Discussion
DSMS
• Input buffer/monitor
o Captures streaming inputs
o May collect statistics on streams
o Random sampling
• Working storage
o Stores recent stream data
o Used for query processing
• Local storage
o Used for metadata
o Foreign key mapping
o Naming translation
• Query processor
o Converts queries into execution plans
o Changes plans for different workloads / input rates
o Contains buffers, operator queues
o Deploys scheduling methods
• Continuous query repository
• Results
o May be streamed to users or to other applications
o May be stored in an SDW for further analysis
Fig : i) Abstract reference architecture of a DSMS & ii) A traditional DBMS
Stream Data Models
• Base streams: produced by sources, append-only
• Derived streams: produced by continuous queries
• Streams have a fixed schema
o <timestamp, source IP addr, source port, destination IP addr, destination port, size>
• Data stream models
o Describe an underlying signal S : [1 ... N] -> R
o Aggregate model: each item carries the range value for one point of the signal
o Cash register model: each item carries a partial, non-negative range value
o Turnstile model: each item carries a partial range value, which may be negative
o Reset model: each item carries a range value that resets the previous value of the signal
• Stream windows: important from both the user and the query points of view
o Fixed window
o Sliding window
o Landmark window
o Jumping window: updates every k ticks or k arrivals
o Tumbling window: updates every k ticks or k arrivals, where k = window size
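The data stream models differ in how each arriving item updates the underlying signal S. A minimal Python sketch of three of them, assuming each stream item is an (index, value) pair; the helper name `rebuild_signal` and the model labels are hypothetical and chosen only for illustration:

```python
from collections import defaultdict


def rebuild_signal(items, model):
    """Rebuild the signal S from (index, value) items under one stream model."""
    signal = defaultdict(int)
    for index, value in items:
        if model == "cash_register":
            # Partial values are non-negative and accumulate.
            if value < 0:
                raise ValueError("cash register items must be non-negative")
            signal[index] += value
        elif model == "turnstile":
            # Partial values may be negative and accumulate.
            signal[index] += value
        elif model == "reset":
            # Each value replaces the previous one for this index.
            signal[index] = value
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(signal)


print(rebuild_signal([(1, 5), (2, 3), (1, 2)], "cash_register"))
print(rebuild_signal([(1, 5), (1, -2)], "turnstile"))
print(rebuild_signal([(1, 5), (1, 2)], "reset"))
```

The same input items give different signals depending on the model, which is why a query must know which model its stream follows.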
Query Language & Semantics
• Query algebra
o Stream-to-stream
o Mixed algebra
• Query operators: similar syntax to DBMS, very different semantics
• Relation-like query operators
o Selection, projection, union: stateless operators
o Join: window joins
o Aggregate operators
• DSMS-exclusive operators
o Buffered sort operator
o Random sampling operator
o User-defined aggregate functions (UDAF)
• Query languages
o GSQL
o CQL
o ESL
Query Operators
• Selections and (duplicate-preserving) projections are straightforward
o Local, per-element operators
o Duplicate-eliminating projection is like grouping
o Projection needs to include the ordering attribute
o No restriction for position-ordered streams
• Aggregate expressions:
o Distributive: sum, count, min, max
o Algebraic: average
o Holistic: count-distinct, median
Fig: Simple continuous query operators: i) Selection, ii) Count, iii) Negation
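The three aggregate classes differ in what must be kept to combine partial results. A minimal Python sketch, with hypothetical helper names (`summarize`, `merge`, `average`): distributive aggregates merge directly, the algebraic average is derived from sum and count, and the holistic median cannot be recovered from per-chunk medians.

```python
from statistics import median


def summarize(chunk):
    """Partial aggregate of one chunk of the stream."""
    return {"sum": sum(chunk), "count": len(chunk),
            "min": min(chunk), "max": max(chunk)}


def merge(a, b):
    """Combine two partial aggregates without revisiting the raw tuples."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}


def average(partial):
    """Algebraic aggregate: derived from the distributive sum and count."""
    return partial["sum"] / partial["count"]


chunks = [[1, 2, 3], [10, 20]]
total = merge(summarize(chunks[0]), summarize(chunks[1]))

# Holistic aggregate: per-chunk medians are not enough to recover the median.
median_of_medians = median(median(c) for c in chunks)
true_median = median(chunks[0] + chunks[1])
print(total, average(total), median_of_medians, true_median)
```

Here the median of the per-chunk medians differs from the median of all values, so a holistic aggregate needs more state than a fixed-size partial result.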
Query Operators
• Join operators are problematic on streams
o May need to join stream tuples that arrive arbitrarily far apart
o Operate on implicit / explicit windows
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
GROUP BY S1.timestamp/60 AS minute
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
AND |S1.timestamp - S2.timestamp| <= w
• SELECT * FROM S1 [RANGE w], S2 [RANGE w]
WHERE S1.attr = S2.attr
Fig: Simple continuous query operators: i) Join, ii) Sliding window join with state
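A sliding window join with state can be sketched in Python as follows. This is an illustrative symmetric join, not the implementation of any particular DSMS; it assumes tuples are (timestamp, attr) pairs that arrive in non-decreasing timestamp order across both streams, and the class name `WindowJoin` is hypothetical.

```python
from collections import deque


class WindowJoin:
    """Symmetric join of S1 and S2 on attr with |ts1 - ts2| <= w."""

    def __init__(self, w):
        self.w = w
        self.windows = {"S1": deque(), "S2": deque()}

    def insert(self, stream, ts, attr):
        other = "S2" if stream == "S1" else "S1"
        window = self.windows[other]
        # Expire tuples of the other stream that can no longer match.
        while window and window[0][0] < ts - self.w:
            window.popleft()
        new = (ts, attr)
        matches = [old for old in window if old[1] == attr]
        # Remember the new tuple so later arrivals on the other stream can probe it.
        self.windows[stream].append(new)
        # Results are always ordered (S1 tuple, S2 tuple).
        return [(new, old) if stream == "S1" else (old, new) for old in matches]


join = WindowJoin(w=10)
join.insert("S1", 1, "a")
print(join.insert("S2", 5, "a"))
```

Each window is only cleaned when the other stream delivers a tuple, so a silent stream lets state accumulate; this is one reason heartbeats are useful.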
Query Processing
• Declarative queries -> logical query plan -> physical plan
o Directed acyclic graphs (nodes -> operators, edges -> data flow)
• Queries sharing memory/streams are combined into a single plan
Fig: a) Query plan for two queries: 1) a join of streams S1 and S2 with a selection predicate on S1, and 2) an aggregate on S2. b) A continuous query with selection and tumbling window aggregation
• Scheduling
o FIFO, round robin: simple, not efficient
o Prioritizing operators with higher throughput: low latency
o Prioritizing operators with minimal processing time and selectivity: smaller queues
• Heartbeats & punctuations
o Typically issued by sources
o Reduce the amount of state needed by operators
o Prevent operators from doing unnecessary work
o Query plans can also issue heartbeats to avoid pipeline stalls and delayed results
SELECT minute, SUM(size) FROM s WHERE destination_port <= 80 GROUP BY timestamp/60 AS minute
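The selection plus tumbling-window aggregation in this query, and the role of heartbeats in releasing results, can be sketched in Python. This is a simplified illustration under assumptions: events are either ("tuple", timestamp, destination_port, size) or ("heartbeat", timestamp), timestamps are non-decreasing, and the function name `tumbling_sum` is hypothetical.

```python
def tumbling_sum(events, width=60, max_port=80):
    """Return (window, SUM(size)) for each closed window with qualifying tuples."""
    results = []
    current, total = None, 0
    for event in events:
        kind, ts = event[0], event[1]
        window = ts // width
        # Any tuple or heartbeat from a later window closes the open one.
        if current is not None and window > current:
            results.append((current, total))
            current, total = None, 0
        if kind == "tuple":
            port, size = event[2], event[3]
            if port <= max_port:  # selection predicate from the query
                if current is None:
                    current = window
                total += size
    return results


events = [
    ("tuple", 5, 80, 100),
    ("tuple", 30, 443, 999),
    ("tuple", 50, 22, 50),
    ("heartbeat", 61),
    ("tuple", 130, 80, 7),
    ("heartbeat", 185),
]
print(tumbling_sum(events))
```

Without the heartbeat at time 61, the result for the first minute would be delayed until the tuple at time 130 arrives.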
Query Processing Cont.
• Queries as views & negative tuples
o Negative tuples are implemented as a sign on tuples produced by explicit windows
o Explicit windows are time-based or count-based
o Generated negative tuples are processed by the cascading operators
o Negative tuples on aggregate operators
• Count: easy to compute
• Max/Min: memory intensive
o Twice as many tuples are processed
• Negative tuples can be avoided for monotonic operators
• Tag tuples with an expiration time
• Such operators are known as weak non-monotonic
Fig: a) Maintaining a view over a sliding window join using negative tuples b) Finding the maximum element in a sliding window
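The negative tuple approach can be sketched in Python. This is an illustrative sketch, assuming a count-based window that emits (-1, value) when a tuple expires and (+1, value) when one arrives; the names `signed_tuples`, `CountView`, and `MaxView` are hypothetical. It shows why a count needs only one counter while a maximum has to remember the window contents.

```python
from collections import Counter, deque


def signed_tuples(stream, size):
    """Emit (-1, v) expirations and (+1, v) arrivals for a count-based window."""
    window = deque()
    for value in stream:
        if len(window) == size:
            yield (-1, window.popleft())
        window.append(value)
        yield (+1, value)


class CountView:
    """COUNT over the window: a single counter is enough."""

    def __init__(self):
        self.count = 0

    def process(self, sign, value):
        self.count += sign
        return self.count


class MaxView:
    """MAX over the window: must remember every live value."""

    def __init__(self):
        self.values = Counter()

    def process(self, sign, value):
        self.values[value] += sign
        if self.values[value] == 0:
            del self.values[value]
        return max(self.values) if self.values else None


tuples = list(signed_tuples([5, 1, 4, 2], size=3))
count_view, max_view = CountView(), MaxView()
print([count_view.process(s, v) for s, v in tuples])
print([max_view.process(s, v) for s, v in tuples])
```

Once the window is full, every arrival is paired with an expiration, which is where the doubling of processed tuples comes from.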
Query Optimization
• Finds efficient query plans
• A DBMS focuses on minimizing I/O, while a DSMS tries to reduce cost per unit time
• Static analysis and query rewriting
o Ensures the query can be evaluated in a non-blocking fashion with limited memory
• S(A,B,C), T(D,E)
• π_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T)): yes
• π_A(σ_{A=D}(S × T)): no
• π_A(σ_{B<D ∧ A>10 ∧ D<20}(S × T)): yes, if no duplicates
o Common rules
• Evaluate inexpensive predicates before complex ones
• Perform selections before joins
o Rules for continuous query operators only
• Selections and explicit time-based windows commute
• Selections and explicit count-based windows don't commute
o Rewrite based on input constraints
• A join of unbounded streams can be rewritten as a window join if matching tuples arrive at most t time units apart
• Multi-query optimization
Fig : Separate and shared query plans for Q1 and Q2
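The rewrite rule about selections and explicit windows can be checked on a small example. A minimal Python sketch, assuming tuples are (timestamp, value) pairs in arrival order; the function names are hypothetical.

```python
def time_window(stream, now, w):
    """Tuples with timestamp in (now - w, now]."""
    return [t for t in stream if now - w < t[0] <= now]


def count_window(stream, n):
    """The n most recent tuples."""
    return stream[-n:]


def select(stream, predicate):
    return [t for t in stream if predicate(t)]


stream = [(1, 10), (2, 90), (3, 20), (4, 95)]
big = lambda t: t[1] > 50

# Time-based window: both orders give the same result.
print(select(time_window(stream, now=4, w=2), big))
print(time_window(select(stream, big), now=4, w=2))

# Count-based window: the order changes the result.
print(select(count_window(stream, 2), big))
print(count_window(select(stream, big), 2))
```

With a count-based window, filtering first changes which tuples count as the "last n", so the selection cannot be pushed below the window.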
Operator Optimization
• Join
o Need to remove expired tuples
o Expiration at every time tick is costly
o Periodic removal reduces that cost but increases join processing cost
o Probe the streams with fewer matches first
• Aggregation
o Synopses allow efficient re-computation
o Prefix synopses
• Suitable for subtractable aggregates
• E.g., sum, count
o Interval synopses
• Suitable for distributive aggregates
• E.g., min, max
• Need to access log b intervals
• Basic interval synopses require b accesses
o Holistic aggregates require additional information in synopses
o Algebraic aggregates are computed from derived information
• Avg = Sum / Count
Fig : i) Prefix synopses, ii) Interval synopses, iii) Basic interval synopses
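The difference between prefix synopses and basic interval synopses can be sketched in Python. This is an illustration under assumptions: the window is split into basic intervals, each already summarized by its sum or its maximum, and the function names are hypothetical. The log b variant of interval synopses is not shown.

```python
from itertools import accumulate


def prefix_synopsis(interval_sums):
    """prefix[i] holds the sum of basic intervals 0..i-1."""
    return [0] + list(accumulate(interval_sums))


def window_sum(prefix, start, end):
    """Sum of intervals start..end-1 using two synopsis accesses."""
    return prefix[end] - prefix[start]


def window_max(interval_maxes, start, end):
    """Max is not subtractable: scan every basic interval in the window."""
    return max(interval_maxes[start:end])


prefix = prefix_synopsis([4, 1, 7, 2, 5])
print(window_sum(prefix, 1, 4))
print(window_max([3, 9, 2, 6, 4], 1, 4))
```

Subtraction makes the sum independent of the window length, while the maximum over b basic intervals costs b accesses in this basic form.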
Query Optimization
• Load shedding & approximation
o Random sampling
o Semantic load shedding drops less important tuples
o Objective is to minimize the drop in accuracy
• Challenging for complex query plans with multiple streams and operators
• Load balancing
o Write out part of the stream if possible
• Adaptive query optimization
o Query cost per unit time may change
o Query plan is dynamically re-ordered based on speed, selectivity, and queue length
o Trade-off between the resulting adaptivity and the overhead of dynamic routing
• Distributed query optimization
o Parallelizing and distributing the system itself
• Split the query plan across nodes
• Partition the streams
o Shifting partial computation to the sources
• In-network processing reduces the communication overhead
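Load shedding by random sampling, listed at the top of this slide, can be sketched in Python. This is a minimal illustration, assuming each tuple is kept independently with probability `keep_rate` and that a SUM is estimated by scaling the kept total; the function name is hypothetical.

```python
import random


def shed_and_estimate_sum(values, keep_rate, seed=None):
    """Keep each value with probability keep_rate; return (kept, estimated sum)."""
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)
    kept = [v for v in values if rng.random() < keep_rate]
    # Scale up to compensate for the dropped tuples.
    return kept, sum(kept) / keep_rate


values = list(range(1, 101))
kept, estimate = shed_and_estimate_sum(values, keep_rate=0.5, seed=0)
print(len(kept), estimate, sum(values))
```

The estimate is approximate, and its error grows as the keep rate falls, which is the accuracy trade-off that load shedding tries to minimize.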
Outline
• Introduction
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
o Data ETL
o Update Propagation
o Data Expiration
o Update Scheduling
o Query Processing on SDW
• Discussion
SDW
• Data streams/feeds arrive periodically
• ETL process: data cleaning, standardization, and so on
• Table types
o Base tables: sourced directly from raw files
o Derived tables: materialized views over base or other derived tables
• Update scheduler selects the order in which files are applied
o Based on dependencies and workloads
Fig : Abstract reference architecture of a SDW
ETL
• Simple tasks: decompression, standardization
• Complex tasks
o Joining new data with relations of descriptive attributes
• Relations R are disk-based
• New data is buffered in main memory
• Mesh join: accesses blocks of R in sequential order; a tuple is removed from the buffer once it has been joined with all blocks of R
o Loading data into tables
• Tables are partitioned into timestamp ranges
• Loads affect a small number of recent partitions
Fig : Partitioning a table on a timestamp attribute
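Partitioning a table on a timestamp attribute can be sketched in Python. This is an illustration under assumptions: rows are (timestamp, payload) pairs, each partition covers a fixed-width timestamp range, and the function name `load_batch` is hypothetical. It shows that a new batch touches only the most recent partitions.

```python
from collections import defaultdict


def load_batch(partitions, batch, width):
    """Append rows to partitions keyed by timestamp // width; return touched keys."""
    touched = set()
    for ts, payload in batch:
        key = ts // width
        partitions[key].append((ts, payload))
        touched.add(key)
    return touched


partitions = defaultdict(list)
print(load_batch(partitions, [(5, "a"), (65, "b")], width=60))
print(load_batch(partitions, [(118, "c"), (121, "d")], width=60))
```

The second batch leaves partition 0 untouched, so anything derived from that partition does not need to be recomputed.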
Update Propagation
• Goals
o Propagate changes across layers of derived tables
o Avoid recomputing an entire derived table
o Efficiently identify partition dependencies
• Partition dependencies may not be obvious from the SQL specification
Fig : Updating a partitioned derived table
Fig : Partition dependency
Data Expiration
• Tuples may have variable lifetimes
• Tables can be partitioned on insertion and expiration timestamps
o Partitions may not have equal size
• One solution is to assign updates in round-robin fashion
Fig : Partitioning a table on two attributes: insertion and expiration timestamp
Update Scheduling
• External sources push new data
• Many data feeds and derived tables need updating
• A scheduler controls resource usage
• Goal: minimize data staleness
• A priority-weighted staleness metric selects the tables whose update reduces it the most
Fig : plot of the staleness of a SDW table over time
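One way to read the priority-weighted staleness heuristic is sketched below in Python. This is an assumption-laden illustration, not the scheduler from the paper: staleness is taken to be the gap between the current time and the newest data already loaded, and the table fields (`name`, `priority`, `fresh_until`, `pending`) are hypothetical.

```python
def pick_next_table(tables, now):
    """Pick the table with pending data and the largest priority * staleness."""
    candidates = [t for t in tables if t["pending"]]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda t: t["priority"] * (now - t["fresh_until"]))
    return best["name"]


tables = [
    {"name": "a", "priority": 1, "fresh_until": 10, "pending": True},
    {"name": "b", "priority": 5, "fresh_until": 90, "pending": True},
    {"name": "c", "priority": 10, "fresh_until": 50, "pending": False},
]
print(pick_next_table(tables, now=100))
```

Table c has the highest weighted staleness but no new data to apply, so updating it would not reduce staleness and it is skipped.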
Query Processing
• Overhead of partitioned tables
o Partitions that are too small are difficult to manage
o Partitions that are too big need to be recomputed as new data arrives
o Solution: use bigger partitions as data becomes older
• Data availability and concurrency control
o Tables are updated frequently
o Queries should not be blocked and should see consistent data
o Solution: multi-version concurrency control at the partition level
Discussion
• End-to-end data stream management
• A DSMS allows relational-like queries as well as pattern matching and event processing queries
• Query semantics differ from traditional ones
• SDW research problems were introduced recently
• Not covered in the lecture: data mining techniques, fault tolerance, and distributed processing
References
1. Lukasz Golab & M. Tamer Özsu. Data Stream Management.
2. Morton Lindeberg, University of Oslo. Data stream management system: introduction, concepts and issues.