Approximate Query Processing (AQP) in Data Streams
Zahid Irfan & Dr. Asim Karim (Advisor)
(zahidi, akarim @lums.edu.pk)
CS-509: Master of Science (CS) Project
Lahore University of Management Sciences, Lahore, Pakistan
8 May 2004
Acknowledgement
This work is primarily based on the research paper “One-Pass Wavelet Decompositions of Data Streams” by Gilbert, Kotidis, Muthukrishnan and Strauss, IEEE Transactions on Knowledge and Data Engineering, May/June 2003.
It also draws on work by Muthukrishnan and Piotr Indyk and, of course, on the Johnson-Lindenstrauss lemma.

AQP in Data Streams
Introduction
Streams and Streaming Models
Wavelet Transform & Embedded Vectors
Pseudo-Random Number Generator
Implementation Details
Test Results
Conclusions and Future Work

Introduction
Let's solve a puzzle: find the missing number in a randomly ordered sequence of the numbers 1…N, with no repetition and exactly one number left out.
Space requirement: O(1). Time complexity: O(N).
What about two missing numbers, three missing numbers, and so on?
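The single-missing-number puzzle can be solved in one pass with one accumulator. The following is a minimal sketch, assuming the stream holds the numbers 1..n with exactly one left out; the function name `findMissing` is chosen here for illustration and does not come from the slides.

```cpp
#include <cstdint>
#include <vector>

// Finds the single missing value in a stream holding the numbers 1..n
// in arbitrary order, each at most once, with exactly one left out.
// One pass, one accumulator: the expected sum minus the observed sum.
std::uint64_t findMissing(const std::vector<std::uint64_t>& stream, std::uint64_t n) {
    std::uint64_t remaining = n * (n + 1) / 2;  // sum of 1..n
    for (std::uint64_t value : stream) {
        remaining -= value;  // each arrival is consumed once and then forgotten
    }
    return remaining;
}
```

The vector only stands in for the stream; the function never looks back at an earlier item, which is what keeps the space constant.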

Data Streams
Data stream: “a sequence of digitally encoded signals used to represent information in transmission”.
The input stream is the sequence a[i], which arrives sequentially, item by item.

Data Streams Applications
Network data monitoring, applied to traffic flow analysis.
World Wide Web: website hits, statistics, etc.
Online transaction processing systems.
Large databases: query processing.

Stream Models
Time Series Model
  Comprises values of the same quantity over different time intervals.
  Typical examples: daily closing values of a stock exchange; traffic on an IP link at time intervals.

Stream Models
Cash Register Model
  Positive updates arrive over a period of time.
  Typical examples: a cash register (naturally), cricket scores, and website hits or other statistics.

Stream Models
Turnstile Model
  Fully dynamic model: updates are both negative and positive, e.g. passengers in an airport.
Relative Hardness
  Turnstile > Cash Register > Time Series.
  “Depends and varies from application to application”.
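The three stream models differ only in what an arrival is allowed to do to the underlying signal a. The sketch below assumes the usual formulation from the streaming literature, in which an update is a pair (i, c) meaning "add c to a[i]"; the slides do not spell this out, and all function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Time series: the next arrival is simply the next value a[i].
void timeSeriesArrive(std::vector<long>& a, long value) {
    a.push_back(value);
}

// Cash register: an arrival (i, c) with c > 0 adds c to a[i].
// Returns false and leaves a unchanged if the update is not positive.
bool cashRegisterUpdate(std::vector<long>& a, std::size_t i, long c) {
    if (c <= 0 || i >= a.size()) return false;
    a[i] += c;
    return true;
}

// Turnstile: an arrival (i, c) may carry either sign.
bool turnstileUpdate(std::vector<long>& a, std::size_t i, long c) {
    if (i >= a.size()) return false;
    a[i] += c;
    return true;
}
```

A turnstile stream accepts every update a cash register accepts, plus the negative ones, which is one way to read the hardness ordering on the slide.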

Wavelet Transform
Wavelets: a hierarchical mathematical tool for the decomposition of signals and functions.
Types of wavelets: Haar wavelets, Daubechies wavelets, and many more.

Haar Wavelet Example

| Resolution | Averages | Detail Coefficients |
|---|---|---|
| 3 | D = [2, 2, 0, 2, 3, 5, 4, 4] | - |
| 2 | [2, 1, 4, 4] | [0, -1, -1, 0] |
| 1 | [1.5, 4] | [0.5, 0] |
| 0 | [2.75] | [-1.25] |

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
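The worked example can be reproduced with a short routine. This is a sketch under the convention the example appears to use (pairwise average (l + r) / 2 and detail (l - r) / 2, input length a power of two); the function name `haarDecompose` is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Unnormalized Haar decomposition of a signal whose length is a power of two.
// Each pass replaces adjacent pairs (l, r) by their average (l + r) / 2 and
// records the detail (l - r) / 2. Output order: overall average first, then
// details from the coarsest level to the finest.
std::vector<double> haarDecompose(std::vector<double> averages) {
    std::vector<double> details;  // kept coarsest-first
    while (averages.size() > 1) {
        std::vector<double> nextAverages, levelDetails;
        for (std::size_t i = 0; i + 1 < averages.size(); i += 2) {
            nextAverages.push_back((averages[i] + averages[i + 1]) / 2.0);
            levelDetails.push_back((averages[i] - averages[i + 1]) / 2.0);
        }
        // Coarser levels are computed later, so they go in front.
        details.insert(details.begin(), levelDetails.begin(), levelDetails.end());
        averages = nextAverages;
    }
    averages.insert(averages.end(), details.begin(), details.end());
    return averages;
}
```

Applied to D = [2, 2, 0, 2, 3, 5, 4, 4], the routine returns the decomposition listed on the slide.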

Wavelet in <,> Space
Haar wavelet coefficients can be written as inner products with basis vectors.
Example: a vector A with N = 4 has 4 coefficients.
W1 = 1/4 · [1 1 1 1], W2 = 1/4 · [1 1 -1 -1], W3 = 1/2 · [1 -1 0 0], W4 = 1/2 · [0 0 1 -1]
1st coefficient = <A,W1> (average coefficient)
2nd coefficient = <A,W2> (detail coefficient)
3rd coefficient = <A,W3> (detail coefficient)
4th coefficient = <A,W4> (detail coefficient)
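The inner-product view can be checked numerically. The sketch below uses the standard Haar basis for N = 4, scaled so that each inner product yields an average or a half-difference: W1 = 1/4 · [1 1 1 1], W2 = 1/4 · [1 1 -1 -1], W3 = 1/2 · [1 -1 0 0], W4 = 1/2 · [0 0 1 -1]. The scaling and all names are assumptions made for this illustration.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<double, 4>;

double innerProduct(const Vec4& a, const Vec4& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Haar basis for N = 4 (assumed scaling: averages and half-differences).
const std::array<Vec4, 4> kHaarBasis = {{
    {{0.25, 0.25, 0.25, 0.25}},    // W1: overall average
    {{0.25, 0.25, -0.25, -0.25}},  // W2: coarse detail
    {{0.5, -0.5, 0.0, 0.0}},       // W3: fine detail, left half
    {{0.0, 0.0, 0.5, -0.5}},       // W4: fine detail, right half
}};

// The k-th coefficient is <a, Wk>.
Vec4 haarCoefficients(const Vec4& a) {
    Vec4 c{};
    for (std::size_t k = 0; k < c.size(); ++k) c[k] = innerProduct(a, kHaarBasis[k]);
    return c;
}
```

For A = [2, 1, 4, 4] this gives [2.75, -1.25, 0.5, 0], matching what pairwise averaging and differencing of the same four values produces.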

Embedding Vectors
Any n-point metric space can be embedded into an O(log2 n)-dimensional Euclidean space and L1 metric with 1+є distortion.
The embedding of a vector v is f(v) = < <v, r1>, <v, r2>, …, <v, rk> >.

Johnson-Lindenstrauss Lemma
Johnson-Lindenstrauss (JL) Lemma, simply stated: <a,b> ~ <a,rj>*<b,rj>, averaged over j = 1…k, with k << N.
rj is a random vector whose entries are 1 or -1 with equal probability.
Implications: a vector in R^N can be represented in k-dimensional space.
Benefits: approximate queries… ??
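The estimate <a,b> ~ <a,rj>*<b,rj> can be tried directly by storing k random sign vectors and averaging the k products. This is only a sketch: the use of `std::mt19937`, the seed parameter and the function names are choices made here, not details from the slides, and k is assumed to be at least 1.

```cpp
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& x, const Vec& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Draws k random sign vectors of length n (entries +1 or -1, equally likely).
std::vector<Vec> makeSignVectors(std::size_t k, std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<Vec> r(k, Vec(n));
    for (Vec& row : r)
        for (double& entry : row) entry = (rng() & 1u) ? 1.0 : -1.0;
    return r;
}

// Estimates <a,b> as the average over j of <a,r_j> * <b,r_j>.
double estimateInnerProduct(const Vec& a, const Vec& b, const std::vector<Vec>& r) {
    double sum = 0.0;
    for (const Vec& rj : r) sum += dot(a, rj) * dot(b, rj);
    return sum / static_cast<double>(r.size());
}
```

Storing the k vectors explicitly costs k · N values, which is larger than the data itself; the sketch is only useful once the vectors can be regenerated instead of stored.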

AQP & JL-Lemma
<a,b> ~ <a,rj>*<b,rj>
Approximate queries can be answered by choosing a special b.
To query the ith value, choose b = [0…010…0].
For a range query (i, j), choose b = [0…01…10…0], where b[x] = 1 for i <= x <= j.
What's the catch? Each rj also has size N. So where do we store the random vectors?

Pseudo-Random Generator
The solution to the large space overhead is to generate the random vectors on the fly, such as:
for (i = 0; i < k; i++) {
    srand(i);  /* seed i always reproduces random vector i */
    for (j = 0; j < N; j++) {
        r = (rand() & 1) ? 1 : -1;  /* entry j of random vector i */
    }
}
This solution works, but there is a more elegant solution to this problem: the Reed-Muller codes extractor.

Reed-Muller Generator
The matrix values represent RM codes.
RM(x, y) = $\lfloor y / 2^{x-1} \rfloor \bmod 2$
Replacing 0 → 1 and 1 → -1, we get wavelet basis vectors.

Reed-Muller PR Generator
Benefits of the Reed-Muller pseudo-random generator:
Values are generated on the fly.
Every value is computed independently, without reference to previous values.
It most nearly imitates the wavelet basis vectors; hence the sketch contains most of the energy of the signal.

Lessons so far !!
Things learnt so far:
There is a way to embed the N data values into a sketch of size k << N.
JL-Lemma: <a,b> ~ <a,r><b,r>.
Reed-Muller codes are excellent imitators of both wavelet basis vectors and random vectors.
Query processing is possible thanks to the JL-Lemma.

Implementation Details
Implementation trivia:
Implemented in Visual C++ 6.0.
Design follows the classes-and-objects paradigm.
Test results and graphs were produced in MS Excel.

Data Flow Diagram
[Diagram components: Dataset, Dataset Generator, Data Stream Generator, Wavelets-based Decomposition, Reed-Muller Generator, Sketch, Query Processing Engine.]

Dataset Generator
A synthetic data set was generated using random distributions.
Normal distribution:
  Calling telephone number, 9497000~9497999 (1000 lines)
  Receiving telephone number
Exponential distribution:
  Call time, 0~512 minutes

Data Streamer
The data streaming class offers methods that imitate a real-time data stream by continuously presenting the program with data.
Type DataStreamer::getData();

Pseudo Random Generator
This class calculates the Reed-Muller based pseudo-random numbers.
type PseudoRandomGenerator::getRandom (int X, int Y);
Uses the formula
$G(x, y) = \lfloor y / 2^{x-1} \rfloor \bmod 2$
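A possible shape for the generator is sketched below. It assumes the slide's formula reads G(x, y) = floor(y / 2^(x-1)) mod 2, i.e. bit (x - 1) of y, which is a reconstruction from the garbled slide text rather than a confirmed definition. The free functions here are illustrative and are not the project's `PseudoRandomGenerator::getRandom`.

```cpp
// Assumed reading of the slide's formula: bit (x - 1) of y,
// valid for 1 <= x <= 32 and any unsigned y.
int reedMullerBit(unsigned x, unsigned y) {
    return static_cast<int>((y >> (x - 1)) & 1u);
}

// Maps the bit to a sign as the slides describe: 0 -> +1, 1 -> -1.
int reedMullerSign(unsigned x, unsigned y) {
    return reedMullerBit(x, y) == 0 ? 1 : -1;
}
```

Under this reading, row x = 1 over y = 0..3 gives the signs +1, -1, +1, -1 and row x = 2 gives +1, +1, -1, -1. Each value depends only on (x, y), so nothing has to be stored or generated in order.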

Data Decomposition
The data is decomposed into a sketch by calculating the dot product of the data stream with O(log N) random vectors.
The sketch is stored in main memory to be used by the query processing engine.
Sketch[j] += Data[i] * Random(i, j);
Here i = 1…N and j = 1…k.
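The update rule Sketch[j] += Data[i] * Random(i, j) can be sketched as a one-pass routine. `randomSign` below is a hypothetical stand-in for Random(i, j): a hash-style mixer that returns +1 or -1 from (i, j) alone. It is not the Reed-Muller generator used in the project, and the function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): a +1/-1 value computed from
// (i, j) alone, so no random vector is ever stored.
int randomSign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = i * 0x9e3779b97f4a7c15ULL + j + 1;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return (z & 1ULL) ? 1 : -1;
}

// Folds one stream item Data[i] into all k sketch entries; the item
// itself does not need to be kept afterwards.
void updateSketch(std::vector<double>& sketch, std::uint64_t i, double value) {
    for (std::size_t j = 0; j < sketch.size(); ++j)
        sketch[j] += value * randomSign(i, j);
}
```

Because the update is linear, a later arrival with the opposite value at the same position cancels the earlier one, so the same routine also handles turnstile-style negative updates.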

Query Processing Engine
The Query Processing Engine uses the sketch and a new vector b.
It uses the same JL-Lemma as before: <a,b> ~ <a,rj>*<b,rj>.
Setting various values of b yields, in theory, any sort of query.

Point Query Processing
Point query: asks for any single value in the whole data stream.
Point query algorithm:
Prepare b[i] = {0 for i != q, 1 for i = q}, where q is the queried position, and generate <b,r>:
QuerySketch[j] += B[i] * Random(i, j);
Result = (DataSketch * QuerySketch) / k, where k is the sketch size.

Range Query Processing
Range query: specifies the low and high positions between which the query is to be processed. Even multiple ranges can be specified.
Query algorithm:
Prepare b[i] = {1 for low <= i <= high, 0 otherwise} and generate <b,r>:
QuerySketch[j] += B[i] * Random(i, j);
Result = (DataSketch * QuerySketch) / k, where k is the sketch size.
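Both query types can share one routine, since a point query is a range of length one. The sketch below makes three assumptions: `randomSign` is a hypothetical stand-in for Random(i, j), the data sketch was built with the same function, and the result is averaged over the k sketch entries, which is the normalization that applies to +1/-1 random vectors. All names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j); must match the one used
// when the data sketch was built.
int randomSign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = i * 0x9e3779b97f4a7c15ULL + j + 1;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return (z & 1ULL) ? 1 : -1;
}

// Builds the data sketch: sketch[j] = sum over i of data[i] * Random(i, j).
std::vector<double> buildSketch(const std::vector<double>& data, std::size_t k) {
    std::vector<double> sketch(k, 0.0);
    for (std::size_t i = 0; i < data.size(); ++i)
        for (std::size_t j = 0; j < k; ++j)
            sketch[j] += data[i] * randomSign(i, j);
    return sketch;
}

// Estimates the sum of data[low..high] from the sketch alone (k >= 1).
// b is 1 on [low, high] and 0 elsewhere, so its sketch entry j is just
// the sum of Random(i, j) over the range.
double rangeQuery(const std::vector<double>& dataSketch,
                  std::uint64_t low, std::uint64_t high) {
    double total = 0.0;
    for (std::size_t j = 0; j < dataSketch.size(); ++j) {
        double querySketch = 0.0;
        for (std::uint64_t i = low; i <= high; ++i) querySketch += randomSign(i, j);
        total += dataSketch[j] * querySketch;
    }
    return total / static_cast<double>(dataSketch.size());
}

// A point query is a range query over a single position.
double pointQuery(const std::vector<double>& dataSketch, std::uint64_t q) {
    return rangeQuery(dataSketch, q, q);
}
```

The answers are estimates: they are exact only in special cases, such as a signal with a single non-zero entry queried at that position, and otherwise improve as k grows.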

AQP Test
Time complexity analysis
Query processing accuracy versus data size
Query processing accuracy versus sketch size

Time Complexity
The following times were found to be linear in the size of the data:
Sketching time
Query processing time

Time Complexity (Sketching)
[Figure: sketching time (seconds, axis 0 to 120) versus data size (10,000 to 10,000,000); sketch size assumed to be log N.]

Time Complexity (Query)
[Figure: querying time (seconds, axis 0 to 120) versus data size (10,000 to 10,000,000); sketch size assumed to be log N.]

Accuracy versus Data Size
Accuracy of the query versus data size, measured as PSNR (dB) versus data size.
Data size is increased by powers of 2.
Sketch size is assumed to be log N.
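The slides report accuracy as PSNR in dB without giving the formula. A common definition is PSNR = 10 · log10(peak² / MSE); the sketch below uses it and takes the peak to be the largest absolute exact value, which is an assumption and may differ from what the project measured. Inputs are assumed to be non-empty and of equal length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// PSNR in dB between exact query answers and their approximations.
// Assumed peak: the largest absolute exact value.
double psnrDb(const std::vector<double>& exact, const std::vector<double>& approx) {
    double squaredError = 0.0;
    double peak = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        const double diff = exact[i] - approx[i];
        squaredError += diff * diff;
        peak = std::max(peak, std::fabs(exact[i]));
    }
    const double mse = squaredError / static_cast<double>(exact.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();  // no error at all
    return 10.0 * std::log10(peak * peak / mse);
}
```

Higher values mean the approximate answers are closer to the exact ones.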

PSNR (dB) versus Data Size
[Figure: PSNR (dB, axis 100 to 120) versus data size (10 to 5,000,010); sketch size assumed to be log N.]

Accuracy versus Sketch Size
Accuracy of the query versus sketch size, measured as PSNR (dB) versus sketch size.
Data size is held constant at N = 32768.
Sketch size is varied.

PSNR (dB) versus Sketch Size
[Figure: PSNR (dB, axis 100 to 150) versus sketch size (0 to 120); data size N = 32768.]

Conclusions
Space complexity reduction: a prohibitively large data stream is summarized in sub-linear space.
Time complexity reduction: a one-pass data stream algorithm.
Scalability to multiple dimensions.

Applications and Future Work
Data mining of streams.
Multimedia and databases: trying it with video coding might be fun or a disaster.
Graph theory problems: MST, matching, etc. need to be solved in the streaming model.
Computational geometry.
Earth observation data streams or weather data streams.
Solve any problem that can be modeled as a data stream.

References
S. Acharya, P. B. Gibbons, V. Poosala and S. Ramaswamy, “Join Synopses for Approximate Query Answering”, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
J. M. Hellerstein, P. J. Haas and H. J. Wang, “Online Aggregation”, In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997.
Y. E. Ioannidis and V. Poosala, “Histogram-Based Approximation of Set-Valued Query Answers”, In Proceedings of the 25th International Conference on Very Large Databases, 1999.
K. Chakrabarti, M. Garofalakis, R. Rastogi and K. Shim, “Approximate Query Processing Using Wavelets”, In Proceedings of the 26th International Conference on Very Large Databases, Egypt, 2000.
F. Olken, “Random Sampling in Databases”, PhD Thesis, University of California at Berkeley, 1993.
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, “One-Pass Wavelet Decompositions of Data Streams”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003.
A. Ta-Shma, D. Zuckerman and S. Safra, “Extractors from Reed-Muller Codes”, In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, 2001.

Questions & Answers
Thanks to the following for their sincere help in this project: Dr. Asim Karim, Dr. Sarmad Abbasi, Dr. Asim Loan, Dr. Sohaib A. Khan and all my friends, especially Laeeq Aslam and Aimal Tariq Rextin.