Copyright ©2004 Carlos Guestrin www.cs.cmu.edu/~guestrin VLDB 2004
Efficient Data Acquisition in Sensor Networks
Presented by Kedar Bellare
(Slides adapted from Carlos Guestrin)
Papers
A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, W. Hong. "Model-Driven Data Acquisition in Sensor Networks." In the 30th International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada, August 2004.
A. Deshpande, C. Guestrin, W. Hong, S. Madden. "Exploiting Correlated Attributes in Acquisitional Query Processing." In the 21st International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.
Model-driven Data Acquisition in Sensor Networks
Amol Deshpande [1,4], Carlos Guestrin [2,4], Sam Madden [3,4], Joe Hellerstein [1,4], Wei Hong [4]
[1] UC Berkeley, [2] Carnegie Mellon University, [3] MIT, [4] Intel Research - Berkeley
Analogy: sensor net as a database
TinyDB: at every time step, an SQL-style query is distributed to the network, and the query answer (or raw data) is collected
Declarative interface: sensor nets are not just for PhDs; decreases deployment time
Data aggregation: can reduce communication
Limitations of the existing approach
TinyDB: at every time step, distribute the SQL-style query and collect data; the whole process must be redone every time the query changes
Query distribution: every node must receive the query (even when only an approximate answer is needed)
Data collection: every node must wake up at every time step; data loss is ignored; no quality guarantees; data-inefficient, since correlations are ignored
Sensor net data is correlated
Spatial-temporal correlation: observing one sensor gives information about other sensors (and about future values)
Inter-attribute correlation: observing one attribute gives information about other attributes
Data is not i.i.d., so we should not ignore missing data
Model-driven data acquisition: overview
[Figure: an SQL-style query with desired confidence is posed against a probabilistic model; the model produces a data gathering plan, is conditioned on the new observations, and answers subsequent queries from the posterior belief.]
Strengths of model-based data acquisition: observe fewer attributes; exploit correlations; reuse information between queries; directly deal with missing data; answer more complex (probabilistic) queries
Benefits of Statistical Models
More robust interpretation of sensor net readings: account for biases in spatial sampling; identify faulty sensors; extrapolate missing sensor values
More efficient data acquisition: fewer attributes to observe; reuse of information between queries; exploit correlations, acquiring data only when the model cannot answer the query with acceptable confidence
More complex queries: probabilistic queries
Issues introduced by Models
Optimization problem: given a query and a model, choose the data acquisition plan that best refines the answer
Two dependencies: the statistical benefit of acquiring a reading AND the system costs
Any non-trivial statistical model can capture the first dependency (observing one node improves model-driven estimates for nearby nodes)
The connectivity of the wireless sensor net affects the second dependency
Probabilistic models and queries
User's perspective:
Query: SELECT nodeId, temp ± 0.5°C, conf(.95) FROM sensors WHERE nodeId in {1..8}
The system selects and observes a subset of nodes. Observed nodes: {3, 6, 8}
Query result:
Node:  1     2     3     4     5     6     7     8
Temp:  17.3  18.1  17.4  16.1  19.2  21.3  17.5  16.3
Conf:  98%   95%   100%  99%   95%   100%  98%   100%
Probabilistic models: illustration
Node 0 is the interface between the user and the sensor net
No need to query the entire network
The model chooses to observe voltage even though the query is about temperature
Probabilistic models (contd.)
Why did the model choose to observe voltage instead of temperature? Two reasons: a correlation in value between the two attributes, and a cost differential between sensing them.
Probabilistic models and queries
Joint distribution P(X1, ..., Xn), learned from historical data
Probabilistic query, example: value of X2 ± ε with prob. > 1-δ
If the probability is below 1-δ, observe attributes
Example: observe X1 = 18; under the posterior P(X2 | X1 = 18) the probability is higher, so the query can be answered
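The conditioning step in this example can be sketched with a bivariate Gaussian; the means, variances, and covariance below are illustrative values, not parameters from the paper:

```python
import numpy as np

# Illustrative prior over (X1, X2): temperatures at two nearby sensors.
mu = np.array([16.0, 18.0])
Sigma = np.array([[4.0, 3.0],
                  [3.0, 4.0]])

def condition(mu, Sigma, obs_idx, obs_val):
    """Condition a bivariate Gaussian on observing one coordinate."""
    i, j = obs_idx, 1 - obs_idx
    mu_post = mu[j] + Sigma[j, i] / Sigma[i, i] * (obs_val - mu[i])
    var_post = Sigma[j, j] - Sigma[j, i] * Sigma[i, j] / Sigma[i, i]
    return mu_post, var_post

# Observe X1 = 18: the posterior over X2 shifts up and tightens.
m, v = condition(mu, Sigma, 0, 18.0)   # mean 19.5, variance 1.75
```

The posterior variance (1.75) is smaller than the prior variance (4.0), which is exactly why observing a correlated attribute can push a query over its confidence threshold.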
Dynamic models: filtering
Joint distribution at time t; condition on the observations at time t
Fewer observations are needed in future queries
Example: Kalman filter, learned from historical data
Kalman Filtering
A transition model is maintained for each hour of the day, indexed by mod(t, 24)
The evolution of the system over time, from p(X_1^t, ..., X_n^t | o^{1..t}) to p(X_1^{t+1}, ..., X_n^{t+1} | o^{1..t}), is computed using simple marginalization:

p(X_1^{t+1}, ..., X_n^{t+1} | o^{1..t}) = ∫ p(X_1^{t+1}, ..., X_n^{t+1} | X_1^t, ..., X_n^t) p(X_1^t, ..., X_n^t | o^{1..t}) dx_1^t ... dx_n^t

Then obtain the posterior distribution p(X_1^{t+1}, ..., X_n^{t+1} | o^{1..t+1}) by conditioning on the observations, including those made at time t+1
Kalman Filtering (contd.)
The transition model is learned by first computing the joint density over consecutive time steps, p(X_1^{t+1}, ..., X_n^{t+1}, X_1^t, ..., X_n^t), and then applying the conditioning rule to obtain the transition model p(X_1^{t+1}, ..., X_n^{t+1} | X_1^t, ..., X_n^t)
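The marginalization (predict) and conditioning (update) cycle above can be sketched in one dimension; the transition coefficient `a` and the noise variances `q` and `r` below are illustrative stand-ins for parameters that would be learned from historical data, one model per hour of the day:

```python
# One predict/update cycle of a scalar Kalman filter (illustrative
# parameters, not learned values from the paper's deployment).

def predict(mu, var, a=0.9, q=0.5):
    # Marginalization: p(X^{t+1} | o^{1..t}) from p(X^t | o^{1..t})
    # under the linear-Gaussian transition X^{t+1} = a*X^t + noise.
    return a * mu, a * a * var + q

def update(mu, var, obs, r=0.25):
    # Conditioning: fold in a noisy observation made at time t+1.
    k = var / (var + r)                       # Kalman gain
    return mu + k * (obs - mu), (1.0 - k) * var

mu, var = 18.0, 2.0                           # belief at time t
mu_pred, var_pred = predict(mu, var)          # belief at t+1, no new obs
mu_post, var_post = update(mu_pred, var_pred, 17.0)
```

If no observation arrives at t+1, the predicted belief is simply carried forward, which is how the filter lets future queries get away with fewer observations.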
Supported queries
Value query: X_i ± ε with prob. at least 1-δ
SELECT and range query: X_i ∈ [a,b] with prob. at least 1-δ (e.g., which sensors have temperature greater than 25°C?)
Aggregation: average ± ε over a subset of attributes with prob. > 1-δ; aggregation and selection can be combined (e.g., the probability that more than 10 sensors have temperature greater than 25°C)
Queries require the solution of integrals: many are computed in closed form, some require numerical integration or sampling
Probabilistic queries
Range queries: P(X_i ∈ [a_i, b_i])
First marginalize the multivariate Gaussian, done simply by dropping the entries being marginalized out from the mean vector and covariance matrix
Compute the confidence of the query using the error function
If the confidence is too low, make observations to improve it
Conditioning a Gaussian on the values o of some attributes gives another Gaussian:

μ_{Y|o} = μ_Y + Σ_{Yo} Σ_{oo}^{-1} (o - μ_o)
Σ_{Y|o} = Σ_{YY} - Σ_{Yo} Σ_{oo}^{-1} Σ_{oY}
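The error-function computation for a one-dimensional range predicate might look like the following sketch (the query numbers are illustrative):

```python
import math

def range_prob(mu, sigma, a, b):
    """P(a <= X_i <= b) for X_i ~ N(mu, sigma^2), via the error function."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(b) - cdf(a)

# Is the temperature in [15, 20] with confidence at least 0.95?
p = range_prob(mu=17.5, sigma=1.0, a=15.0, b=20.0)
answerable = max(p, 1.0 - p) >= 0.95   # confident either way?
```

Taking the max of p and 1-p mirrors the slides' R_i(s): the query is answerable once we are confident the value is in the range or confident it is not.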
Probabilistic queries (contd.)
Value query: easy to estimate; also determine the confidence interval for the given error bound, and observe attributes if needed
The posterior mean can be obtained directly from the mean vector conditioned on the observations o:

x̄_i = ∫ x_i p(x_i | o) dx_i
Probabilistic queries (contd.)
Average aggregates: if we are interested in the average over a set of attributes A, define the random variable Y = (1/|A|) Σ_{i∈A} X_i
The pdf of Y is given by:

p(Y = y | o) = ∫ ... ∫ p(x_1, ..., x_n | o) 1[y = (1/|A|) Σ_{i∈A} x_i] dx_1 ... dx_n

where 1[·] is the indicator function
Once p(Y = y | o) is defined, simply pose a value query for the random variable Y
Other complex aggregate queries can be answered similarly by constructing new random variables
The sum of jointly Gaussian variables is itself Gaussian: the expected mean is the mean of the expected values, and the variance is the weighted sum of the variances of the X_i plus the covariances of X_i and X_j
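Because Y is a linear function of a jointly Gaussian vector, its mean and variance follow from the weight vector alone; the three-sensor numbers below are made up for illustration:

```python
import numpy as np

def average_posterior(mu, Sigma, A):
    """Mean and variance of Y = (1/|A|) * sum_{i in A} X_i when X is
    jointly Gaussian with mean vector mu and covariance matrix Sigma."""
    w = np.zeros(len(mu))
    w[A] = 1.0 / len(A)            # Y = w . X is linear in X
    return float(w @ mu), float(w @ Sigma @ w)

# Illustrative 3-sensor posterior (made-up numbers).
mu = np.array([17.0, 18.0, 21.0])
Sigma = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.2],
                  [0.1, 0.2, 2.0]])
m, v = average_posterior(mu, Sigma, [0, 1])   # average of X_0 and X_1
```

The variance term w·Σ·w expands to exactly the weighted sum of variances plus pairwise covariances described above.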
Model-driven data acquisition: overview
[Figure: the same pipeline as before: SQL-style query with desired confidence, probabilistic model, data gathering plan, conditioning on new observations, posterior belief.]
What sensors do we observe? How do we collect the observations?
Acquisition costs
Attributes have different acquisition costs
Exploit correlation through the probabilistic model
Must also consider the networking cost: which route through the network is cheaper?
Network model and plan format
Assume a known (quasi-static) network topology
Define the traversal using a (1.5-approximate) TSP tour
Ct(S) is the expected cost of the TSP tour (under lossy communication)
Cost of collecting a subset S of sensor values: C(S) = Ca(S) + Ct(S)
Goal: find a subset S that is sufficient to answer the query at minimum cost C(S)
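A sketch of the cost function C(S) = Ca(S) + Ct(S). The paper uses a 1.5-approximate TSP tour for Ct; as a simplifying assumption, this sketch substitutes a nearest-neighbour tour from the root node, and the positions and acquisition costs are made up:

```python
import math

def plan_cost(S, acquire_cost, pos, root=0):
    """C(S) = Ca(S) + Ct(S): acquisition cost plus traversal cost.
    Traversal is approximated with a nearest-neighbour tour from the
    root (the paper uses a 1.5-approximate TSP instead)."""
    Ca = sum(acquire_cost[i] for i in S)
    todo, cur, Ct = set(S), root, 0.0
    while todo:
        nxt = min(todo, key=lambda j: math.dist(pos[cur], pos[j]))
        Ct += math.dist(pos[cur], pos[nxt])
        todo.remove(nxt)
        cur = nxt
    Ct += math.dist(pos[cur], pos[root])   # return to the root
    return Ca + Ct

# Toy layout: root at the origin, two sensors to visit.
pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0)}
acquire = {1: 2.0, 2: 2.0}
c = plan_cost([1, 2], acquire, pos)   # 4.0 acquisition + (1 + 1 + sqrt(2)) travel
```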
Choosing observation plan
Is a subset S sufficient? Consider X_i ∈ [a,b] with prob. > 1-δ
If we observe S = s: R_i(s) = max{ P(X_i ∈ [a,b] | s), 1 - P(X_i ∈ [a,b] | s) }
The value of S is not known in advance: R_i(S) = ∫ P(s) R_i(s) ds
Optimization problem: find the subset S that reaches the required confidence at minimum cost C(S)
Observation Plan
The general optimization problem is NP-hard. Two algorithms:
Exhaustive search (exponential)
Greedy search:
Begin with an empty observation plan
Compute the benefit R and cost C for each candidate attribute
If the desired confidence is reached, choose the attribute with minimum cost; else add the attribute with the maximum benefit/cost ratio
Repeat until the desired confidence is reached
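The greedy loop above can be sketched as follows. This is a simplified version (it always takes the best benefit/cost ratio while below target; the slide's version also prefers the cheapest attribute once a single addition suffices), and the toy benefit model, gains, and costs are illustrative stand-ins for the paper's R_i(S) and C(S):

```python
def greedy_plan(attrs, benefit, cost, target):
    """Greedily build an observation plan: while the confidence target
    is unmet, add the attribute with the highest marginal benefit per
    unit cost."""
    plan = []
    while benefit(plan) < target:
        remaining = [a for a in attrs if a not in plan]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda a: (benefit(plan + [a]) - benefit(plan)) / cost[a])
        plan.append(best)
    return plan

# Toy benefit model: each attribute independently adds some confidence.
gain = {"x1": 0.4, "x2": 0.3, "x3": 0.2}
cost = {"x1": 4.0, "x2": 1.0, "x3": 1.0}
conf = lambda S: 0.5 + sum(gain[a] for a in S)
plan = greedy_plan(list(gain), conf, cost, target=0.95)
```

Here the expensive x1 loses on benefit/cost ratio despite its larger gain, so the plan observes the two cheap attributes instead.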
BBQ system
[Figure: the pipeline instantiated. Queries: value, range, average. Model: multivariate Gaussians learned from historical data, equivalent to a Kalman filter. Conditioning and query answering: simple matrix operations. Planning: exhaustive or greedy search, with a factor-1.5 TSP approximation for the traversal.]
Exploiting correlated attributes
Extension of the single plan to a conditional plan
Useful when the cost of acquisition is non-negligible and correlations exist between attributes
Queries of the form: multi-predicate range queries
Query evaluation can become cheaper by observing additional attributes: if the additional attributes are low-cost, a tuple can be rejected with high confidence without the expensive acquisition, giving substantial performance gains
Conditional Plans
Conditional Plans (contd.)
Simple binary decision trees: each interior node n_j specifies a binary conditioning predicate (which depends on only a single attribute's value)
Choose the conditional plan with minimum expected cost
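The minimum-expected-cost criterion can be made concrete with a tiny recursive evaluator; the tuple encoding of a plan here is an assumed illustration, not the paper's plan format:

```python
def expected_cost(plan):
    """Expected cost of a conditional plan encoded as a binary tree:
    a leaf is (cost,); an interior node is
    (test_cost, p_true, true_branch, false_branch)."""
    if len(plan) == 1:
        return plan[0]
    test_cost, p, t_branch, f_branch = plan
    return (test_cost
            + p * expected_cost(t_branch)
            + (1.0 - p) * expected_cost(f_branch))

# Toy plan: observe a cheap attribute (cost 1); with probability 0.7 the
# tuple can be rejected outright (cost 0), otherwise the expensive
# attribute must be acquired (cost 10).
plan = (1.0, 0.7, (0.0,), (10.0,))
ec = expected_cost(plan)   # 1 + 0.7*0 + 0.3*10
```

The optimizer would compare this expected cost (4.0) against the unconditional plan's cost (10.0) and keep the cheaper tree.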
Cost of Conditional Plans
[Figure: the optimal plan, its traversal cost, and the expected plan cost.]
Issues in Conditional Plans
Need to estimate P(Tj | t); the naïve method, scanning the historical data for each computation, is expensive
Cost model: only the acquisition cost is taken into account; transmission cost and the requirement that the plan fit in RAM should also be added to the cost model
The authors only focus on limiting plan sizes
Architecture
Optimal Conditional Plan
The problem is hard:
Even if we are given the conditional probabilities (by an oracle), the complexity is #P-hard (reduction from 3-SAT)
Even if we only optimize the plan with respect to a set of d tuples D, the problem is NP-complete (reduction from the complexity of binary decision trees)
Exhaustive search: depth-first search with caching and pruning
Also heuristic solutions using greedy search
Example: Intel Berkeley Lab deployment
[Figure: floor plan of the Intel Berkeley lab (server room, lab, kitchen, copy/elec, phone/quiet rooms, storage, conference room, offices) with 54 numbered sensor nodes.]
Experimental results
Redwood trees and Intel Lab datasets
Models learned from data: a static model, and a dynamic model (Kalman filter with time-indexed transition probabilities)
Evaluated on a wide range of queries
Cost versus Confidence level
Obtaining approximate values
Query: true temperature value ± ε with confidence 95%
Approximate range queries
Query: Temperature in [T1,T2] with confidence 95%
Comparison to other methods
Intel Lab traversals
BBQ system: extensions
[Figure: the pipeline as before: value/range/average queries, multivariate Gaussians learned from historical data, equivalent to a Kalman filter, simple matrix operations, exhaustive or greedy search with a factor-1.5 TSP approximation.]
Extensions: more complex queries, other probabilistic models, more advanced planning, outlier detection, dynamic networks, continuous queries, ...
Conclusions
Model-driven data acquisition: observe fewer attributes; exploit correlations; reuse information between queries; directly deal with missing data; answer more complex (probabilistic) queries
A basis for future sensor network systems
Discussion Questions
What models other than the multivariate Gaussian can be used? If other models are used, will their solutions be available in closed form?
Model-driven techniques are suitable only if the test data resembles the training data. Will the solution adapt if the test region differs from the training region?
The optimization problem is hard and expensive to compute, even with heuristics. Will it work for real-time data analysis?
Outlier detection is not supported by model-driven acquisition. Is there any way to do it for model-based sensor networks?
If the required confidence on the query is low, might some nodes never be queried at all?