Probabilistic Threshold Range Aggregate Query
Processing over Uncertain Data
Wenjie Zhang
University of New South Wales & NICTA, Australia
Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
Outline
DB@UNSW
Background and Preliminaries
Probabilistic Threshold Range Aggregate Query
Exact query processing
Approximate query processing: Simple Sampling & Double Sampling
Experiments
Conclusion
Applications
Many applications involve data that is imperfect due to:
  data randomness and incompleteness
  limitation of equipment
  delay or loss in data transfer
  … …
Applications:
  Sensor networks
  Environmental surveillance
  Moving objects
  Data cleaning and integration
  … …
Applications
Sensor Networks: Sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism.
(Figures are borrowed from Jian et al., SIGMOD08.)
Applications
Mobile Equipment / Moving Objects
A mobile object reports its location only periodically, so its exact location is often uncertain.
Applications
Data Quality
Social Data Collection: errors and estimation inherent in customer surveys and sampling
Outline
Background and Preliminaries
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Query
Conclusion
Modeling Uncertainty ( cont. )
Uncertain Objects Model
1. Continuous case: described using a probability density function (PDF) $f_U$ such that $\int_{u \in U} f_U(u)\,du = 1$. E.g., uniform distribution, normal distribution.
Modeling Uncertainty ( cont. )
Uncertain Objects Model
2. Discrete case: described using a set of instances; each instance $u$ has an occurrence probability $p_u$, with $\sum_{u \in U} p_u = 1$.
Possible World Semantics
Given a set of uncertain objects $U_1, U_2, \ldots, U_n$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of $n$ instances, one instance per uncertain object.
The probability of a possible world is
$P(W) = \prod_{i=1}^{n} P(u_i)$
Let $\Omega$ be the set of all possible worlds; clearly,
$\sum_{W \in \Omega} P(W) = 1$
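As a small illustration of the possible world semantics above, the Python sketch below enumerates every possible world for two toy uncertain objects and checks that the world probabilities sum to 1. The instance names and probabilities are made up for the example, and the product formula treats the objects as independent.

```python
from itertools import product
from math import isclose, prod

# Each uncertain object is a list of (instance, occurrence probability) pairs.
# The names and values are illustrative only.
objects = [
    [("u1a", 0.5), ("u1b", 0.5)],
    [("u2a", 0.25), ("u2b", 0.25), ("u2c", 0.5)],
]

def possible_worlds(objs):
    """Yield (world, probability), choosing one instance per object."""
    for combo in product(*objs):
        world = tuple(instance for instance, _ in combo)
        # P(W) is the product of the chosen instances' probabilities.
        yield world, prod(p for _, p in combo)

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)
print(len(worlds), total)
```

The number of worlds is the product of the per-object instance counts, which is why query processing avoids enumerating them explicitly.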
Probabilistic Queries:
Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
Aggregate Queries [BDJR05, MJ07, CG07]
Join Queries [CSP06, AW07]
Top-k queries [SIC07, YLSK08, RDS07, HJZL08]
Nearest Neighbor Queries [KKR07, CCMC08]
Skyline Queries [PJLY07]
… …
Related Work
Range Queries [TCXNKP05, BPS06, AY08]
Given a rectangle r and a probabilistic threshold t, find all objects that appear in r with probability at least t.
Appearance probability: $\int_{x \in o.region \cap r} o.pdf(x)\,dx$
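For intuition, the sketch below evaluates the appearance probability in the simplest continuous case: it assumes each object's uncertainty region is an axis-aligned rectangle with a uniform pdf, so the integral reduces to overlap area divided by region area. The function names and the toy objects are illustrative assumptions, not part of the cited work.

```python
def overlap_area(a, b):
    """Area of the intersection of two rectangles (x_lo, y_lo, x_hi, y_hi)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def appearance_probability_uniform(region, r):
    """Integral of a uniform pdf over (region ∩ r): overlap area / region area."""
    area = (region[2] - region[0]) * (region[3] - region[1])
    return overlap_area(region, r) / area

def range_query(objects, r, t):
    """Names of objects whose appearance probability in r is at least t."""
    return [name for name, region in objects.items()
            if appearance_probability_uniform(region, r) >= t]

# Toy uncertainty regions (assumed values).
objects = {
    "o1": (0.0, 0.0, 2.0, 2.0),   # fully inside r
    "o2": (3.0, 3.0, 5.0, 5.0),   # a quarter of the region overlaps r
    "o3": (6.0, 6.0, 8.0, 8.0),   # disjoint from r
}
r = (0.0, 0.0, 4.0, 4.0)
print(range_query(objects, r, 0.25))
```

For non-uniform pdfs the integral generally has to be evaluated numerically, which is what makes exact processing costly.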
Outline
Introduction
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Aggregate Query (PTRA)
Conclusion
Contribution
Formally define PTRA query
aU-Tree structure for exact PTRA query
singleSample and doubleSample techniques for approximate answer.
Problem Statement
Given a set of uncertain objects and a range query q, return the number of uncertain objects whose appearance probability in q is no less than the threshold p_q.
Problem Definition
Assume the threshold is 0.5. If the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a and b).
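A brute-force version of the PTRA count can be sketched as follows, using the discrete model. The coordinates and probabilities for a, b and c are invented so that the outcome matches the example above (the original figure is not recoverable), and `ptra_count` is a hypothetical name.

```python
def appearance_probability(instances, q):
    """Sum the probabilities of the instances that fall inside rectangle q."""
    x_lo, y_lo, x_hi, y_hi = q
    return sum(p for (x, y), p in instances
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi)

def ptra_count(objects, q, p_q):
    """Number of objects whose appearance probability in q is at least p_q."""
    return sum(1 for instances in objects.values()
               if appearance_probability(instances, q) >= p_q)

# Toy data: each object is a list of ((x, y), probability) instances.
objects = {
    "a": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],    # probability 1.0 in q
    "b": [((2.0, 3.0), 0.75), ((9.0, 9.0), 0.25)],  # probability 0.75 in q
    "c": [((3.0, 3.0), 0.25), ((8.0, 8.0), 0.75)],  # probability 0.25 in q
}
q = (0.0, 0.0, 4.0, 4.0)
print(ptra_count(objects, q, 0.5))
```

This scan touches every instance of every object, which is the cost the index and the sampling techniques aim to reduce.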
Exact Query Processing ( aU-Tree)
Main idea: add aggregate information on U-tree
Advantage: stop at intermediate level if pruned or fully covered by the query
Disadvantage: otherwise, still need to drill down to the leaf nodes.
  For a large portion of uncertain objects, appearance probability needs to be computed
  Expensive for a massive number of instances per object!
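The stop-early idea can be sketched with a much simplified aggregate tree. This is not the aU-Tree itself: it assumes each node stores only a bounding rectangle over all instances below it plus an object count, whereas the U-tree family keeps tighter probabilistic bounds. All names here (`Node`, `count_query`, `stats`) are illustrative, and the threshold is assumed to satisfy 0 < p_q <= 1.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    mbr: tuple                      # (x_lo, y_lo, x_hi, y_hi) over all instances below
    count: int                      # aggregate: number of objects below this node
    children: List["Node"] = field(default_factory=list)
    objects: List[list] = field(default_factory=list)   # leaf only

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def disjoint(a, b):
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]

def appearance_probability(instances, q):
    return sum(p for (x, y), p in instances
               if q[0] <= x <= q[2] and q[1] <= y <= q[3])

def count_query(node, q, p_q, stats):
    if disjoint(node.mbr, q):
        return 0                    # pruned: every object has probability 0
    if contains(q, node.mbr):
        return node.count           # fully covered: every object has probability 1
    if node.children:
        return sum(count_query(c, q, p_q, stats) for c in node.children)
    total = 0                       # partially covered leaf: evaluate each object
    for instances in node.objects:
        stats["evaluated"] += 1
        if appearance_probability(instances, q) >= p_q:
            total += 1
    return total

leaf1 = Node((0, 0, 2, 2), 2, objects=[[((1, 1), 1.0)],
                                       [((0.5, 1.5), 0.5), ((2, 2), 0.5)]])
leaf2 = Node((3, 3, 9, 9), 2, objects=[[((3, 3), 0.75), ((9, 9), 0.25)],
                                       [((3.5, 3.5), 0.25), ((8, 8), 0.75)]])
leaf3 = Node((10, 10, 12, 12), 1, objects=[[((11, 11), 1.0)]])
root = Node((0, 0, 12, 12), 5, children=[leaf1, leaf2, leaf3])

stats = {"evaluated": 0}
answer = count_query(root, (0, 0, 4, 4), 0.5, stats)
print(answer, stats["evaluated"])
```

In this toy tree only the objects in the partially covered leaf need their appearance probability computed; the other two leaves are answered from the aggregate or pruned.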
singleSample
Sampling the instances of the uncertain objects.
If m' out of m sampled instances are inside the query region, then the approximate appearance probability is m'/m.
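The m'/m estimate can be sketched as below. The slides do not say how instances are drawn, so this version assumes sampling with replacement in proportion to occurrence probability; the synthetic object (10,000 equally likely instances) and the function name are assumptions for illustration.

```python
import random

def single_sample_estimate(instances, q, m, rng):
    """Draw m instances by occurrence probability and return m'/m."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    sampled = rng.choices(points, weights=weights, k=m)
    x_lo, y_lo, x_hi, y_hi = q
    hits = sum(1 for x, y in sampled
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi)   # this is m'
    return hits / m

rng = random.Random(7)
# One synthetic object with 10,000 equally likely instances in [0, 10] x [0, 10].
instances = [((rng.uniform(0, 10), rng.uniform(0, 10)), 1 / 10000)
             for _ in range(10000)]
q = (0.0, 0.0, 5.0, 10.0)

exact = sum(p for (x, y), p in instances
            if q[0] <= x <= q[2] and q[1] <= y <= q[3])
estimate = single_sample_estimate(instances, q, m=1000, rng=rng)
print(exact, estimate)
```

The estimate inspects m instances rather than all of them, trading a small error for a tenfold reduction in work in this example.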
doubleSample
Single Sampling is expensive when there is a massive number of objects!
Sampling the uncertain objects as well.
Naive: uniformly sample objects from all uncertain objects.
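A minimal sketch of the naive variant follows. It assumes objects are drawn uniformly without replacement and that the count over the sample is scaled by N/n to estimate the full answer; that scaling step and all names are assumptions, since the slides only state that objects are sampled uniformly.

```python
import random

def in_range(pt, q):
    return q[0] <= pt[0] <= q[2] and q[1] <= pt[1] <= q[3]

def double_sample_naive(objects, q, p_q, n, m, rng):
    """Sample n objects, estimate each with m instance samples, scale by N/n."""
    sampled_objects = rng.sample(objects, n)
    qualifying = 0
    for instances in sampled_objects:
        points = [pt for pt, _ in instances]
        weights = [p for _, p in instances]
        drawn = rng.choices(points, weights=weights, k=m)
        estimate = sum(1 for pt in drawn if in_range(pt, q)) / m
        if estimate >= p_q:
            qualifying += 1
    return qualifying * len(objects) / n

# Toy data: 500 objects certainly inside q and 500 certainly outside.
objects = ([[((1.0, 1.0), 1.0)] for _ in range(500)]
           + [[((9.0, 9.0), 1.0)] for _ in range(500)])
q = (0.0, 0.0, 4.0, 4.0)

approx = double_sample_naive(objects, q, 0.5, n=100, m=20, rng=random.Random(1))
full = double_sample_naive(objects, q, 0.5, n=len(objects), m=20, rng=random.Random(1))
print(approx, full)
```

With n equal to the number of objects the estimate coincides with the true count for this toy data; smaller n trades accuracy for speed.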
doubleSample: Accuracy
• Note: saying that the "appearance probability" of each object follows a uniform distribution means that the spatial location is uniformly distributed.
• Using the Chernoff-Hoeffding bound.
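The exact form of the bound used is not shown on the slide. As a reference point, the standard Hoeffding inequality for the mean of m independent samples bounded in [0, 1] states that the deviation from the expectation exceeds eps with probability at most 2·exp(-2·m·eps²). The sketch below turns that into a sample-size calculation; the function names are illustrative.

```python
import math

def hoeffding_failure_bound(m, eps):
    """Upper bound on P(|sample mean - true mean| >= eps) for m samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps * eps)

def hoeffding_sample_size(eps, delta):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

m = hoeffding_sample_size(eps=0.05, delta=0.05)
print(m, hoeffding_failure_bound(m, 0.05))
```

Note that the required sample size depends only on the error and confidence targets, not on the number of instances or objects being sampled.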
doubleSample: Our Approach
Skew!
Aim: select K disjoint groups covering all objects with the minimum "skew"; i.e., objects in each group follow a "uniform" distribution. (Then do uniform sampling of objects in each group.)
The optimization problem is NP-hard.
Observation:
  Min-skew is a good heuristic for constructing such a grouping.
  aU-tree groups objects with a similar principle to the min-skew.
doubleSample: Our Approach
Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard!
  Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K.
  Feed the min-skew algorithm with the subtrees at level L.
  (Note: if at a level L the number of nodes = K, then these K subtrees are chosen.)
Step 2: sample objects in each subtree.
Step 3: sample instances in each sampled object.
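Steps 2 and 3 can be sketched as a stratified estimate once the groups are fixed. The sketch below does not implement Step 1 (the min-skew grouping over aU-tree subtrees); it takes the groups as given, and it assumes a fixed per-group sampling rate and per-group scaling by group size over sample size, neither of which is specified on the slides.

```python
import random

def estimate_probability(instances, q, m, rng):
    """Step 3: estimate one object's appearance probability from m instances."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    drawn = rng.choices(points, weights=weights, k=m)
    hits = sum(1 for x, y in drawn if q[0] <= x <= q[2] and q[1] <= y <= q[3])
    return hits / m

def double_sample_grouped(groups, q, p_q, rate, m, rng):
    """Step 2: sample objects uniformly inside each group, then scale per group."""
    total = 0.0
    for group in groups:
        if not group:
            continue
        n_g = max(1, round(rate * len(group)))
        sampled = rng.sample(group, n_g)
        hits = sum(1 for instances in sampled
                   if estimate_probability(instances, q, m, rng) >= p_q)
        total += hits * len(group) / n_g
    return total

# Two homogeneous toy groups: one entirely inside q, one entirely outside.
inside_group = [[((1.0, 1.0), 1.0)] for _ in range(100)]
outside_group = [[((9.0, 9.0), 1.0)] for _ in range(100)]
q = (0.0, 0.0, 4.0, 4.0)

estimate = double_sample_grouped([inside_group, outside_group], q, 0.5,
                                 rate=0.1, m=20, rng=random.Random(3))
print(estimate)
```

Because each toy group is perfectly homogeneous, sampling only 10% of the objects still recovers the exact count, which illustrates why low-skew groups are the goal of Step 1.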
Experiments
Algorithms:
exact, singleSample, doubleSample
Data set:
LB: 53k objects in Long Beach County
CA: 62k objects in California
Synthetic aircraft dataset in 3D
10k instances for each point, following a Uniform or constrained-Gaussian distribution
Setting: C++, P4 2.8GHz, 2G memory, Debian Linux, page size 8K
Conclusion
Definition of PTRA
aU-Tree technique
Sampling technique
Future work: any approach with theoretical guarantee?