Copyright ©2004 Carlos Guestrin guestrin VLDB 2004 Efficient Data Acquisition in Sensor Networks...

Copyright ©2004 Carlos Guestrin www.cs.cmu.edu/~guestrin VLDB 2004 Efficient Data Acquisition in Sensor Networks Presented By Kedar Bellare (Slides adapted from Carlos Guestrin)

Page 1:

Efficient Data Acquisition in Sensor Networks

Presented by Kedar Bellare

(Slides adapted from Carlos Guestrin)

Page 2:

Papers

A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, W. Hong. "Model-Driven Data Acquisition in Sensor Networks," In the 30th International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada, August 2004.

A. Deshpande, C. Guestrin, W. Hong, S. Madden. "Exploiting Correlated Attributes in Acquisitional Query Processing" In the 21st International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.

Page 3:

Model-driven Data Acquisition in Sensor Networks

Amol Deshpande (1,4), Carlos Guestrin (4,2), Sam Madden (4,3), Joe Hellerstein (1,4), Wei Hong (4)

(1) UC Berkeley, (2) Carnegie Mellon University, (3) MIT, (4) Intel Research - Berkeley

Page 4:

Analogy: Sensor net as a database

[Figure: TinyDB receives an SQL-style query, distributes the query to the network, and collects the query answer or data at every time step.]

Declarative interface:
- Sensor nets are not just for PhDs
- Decrease deployment time

Data aggregation:
- Can reduce communication

Page 5:

Limitations of existing approach

[Figure: TinyDB distributes an SQL-style query and collects data at every time step; the process is redone every time the query changes.]

Query distribution:
- Every node must receive query (even when approximate answer needed)

Data collection:
- Every node must wake up at every time step
- Data loss ignored
- No quality guarantees
- Data inefficient – ignoring correlations

Page 6:

Sensor net data is correlated

Spatial-temporal correlation

Inter-attribute correlation

Data is not i.i.d., so we shouldn't ignore missing data

Observing one sensor gives information about other sensors (and future values)

Observing one attribute gives information about other attributes

Page 7:

Model-driven data acquisition: overview

[Figure: an SQL-style query with desired confidence is posed against a probabilistic model; the system produces a data gathering plan, conditions on the new observations, and obtains a posterior belief that can serve a new query.]

Strengths of model-based data acquisition:
- Observe fewer attributes
- Exploit correlations
- Reuse information between queries
- Directly deal with missing data
- Answer more complex (probabilistic) queries

Page 8:

Benefits of Statistical Models

More robust interpretation of sensor net readings
- Account for biases in spatial sampling
- Identify faulty sensors
- Extrapolate missing values of sensors

More efficient data acquisition
- Fewer attributes to observe
- Reuse of information between queries
- Exploit correlations – acquire data only when the model is not able to answer the query with acceptable confidence

More complex queries
- Probabilistic queries

Page 9:

Issues introduced by Models

Optimization problem
- Given query and model, choose data acquisition plan to best refine answer

Two dependencies – statistical benefit of acquiring a reading AND system costs
- Any non-trivial statistical model can capture the first dependency
- Improving model-driven estimates for nearby nodes
- Connectivity of the wireless sensor net affects the second dependency

Page 10:

Probabilistic models and queries

User's perspective:

Query
SELECT nodeId, temp ± 0.5°C, conf(.95)
FROM sensors
WHERE nodeId in {1..8}

System selects and observes subset of nodes
Observed nodes: {3,6,8}

Query result

Node    1      2      3      4      5      6      7      8
Temp.   17.3   18.1   17.4   16.1   19.2   21.3   17.5   16.3
Conf.   98%    95%    100%   99%    95%    100%   98%    100%

Page 11:

Probabilistic models - Illustration

Node 0 – Interface between user and sensor net

No need to query entire network

Model chooses to observe voltage even though query is temperature

Page 12:

Probabilistic models (Contd.)

Why did model choose to observe voltage instead of temperature?

Correlation in Value

Cost differential

Page 13:

Probabilistic models and queries

Joint distribution P(X1,…,Xn)

Probabilistic query example: value of X2 ± ε with prob. > 1-δ

Prob. below 1-δ? Observe attributes

Example: observe X1=18 and condition to obtain P(X2|X1=18)

Higher prob., could answer query

Learn from historical data
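The X1=18 example can be sketched numerically for two jointly Gaussian sensors. All numbers below (prior means, variances, covariance, and the ±0.5 error bound) are made-up values for illustration, and the helper names are hypothetical.

```python
import math

def condition_bivariate(mu1, mu2, var1, var2, cov12, x1_obs):
    """Mean and variance of X2 given X1 = x1_obs for a bivariate Gaussian."""
    mean = mu2 + cov12 / var1 * (x1_obs - mu1)
    var = var2 - cov12 ** 2 / var1
    return mean, var

def prob_within(mean, var, eps):
    """P(|X - mean| <= eps) for X ~ N(mean, var), via the error function."""
    return math.erf(eps / math.sqrt(2.0 * var))

# Assumed prior over two strongly correlated temperature sensors.
mu1, mu2 = 20.0, 20.0
var1, var2, cov12 = 4.0, 4.0, 3.6

# Confidence of reporting the prior mean of X2 within +/- 0.5 degrees.
prior_conf = prob_within(var=var2, mean=mu2, eps=0.5)

# Observe X1 = 18 and condition: the belief over X2 shifts and tightens.
post_mean, post_var = condition_bivariate(mu1, mu2, var1, var2, cov12, 18.0)
post_conf = prob_within(post_mean, post_var, eps=0.5)

print(f"prior conf {prior_conf:.3f}, posterior mean {post_mean:.2f}, "
      f"posterior conf {post_conf:.3f}")
```

With these assumed numbers a single observation is still not enough to reach a 95% target, which is the situation in which the system would go on to observe more attributes.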

Page 14:

Dynamic models: filtering

Joint distribution at time t

Condition on observations

Fewer obs. in future queries

Example: Kalman filter

Learn from historical data

Page 15:

Kalman Filtering

Transition model maintained for each hour of the day – mod(t,24):

$$p(X_1^{t+1},\dots,X_n^{t+1} \mid X_1^{t},\dots,X_n^{t})$$

Evolution of system over time from $p(X_1^{t},\dots,X_n^{t} \mid o^{1 \dots t})$ to $p(X_1^{t+1},\dots,X_n^{t+1} \mid o^{1 \dots t})$

Compute using simple marginalization:

$$p(X_1^{t+1},\dots,X_n^{t+1} \mid o^{1 \dots t}) = \int p(X_1^{t+1},\dots,X_n^{t+1} \mid x_1^{t},\dots,x_n^{t})\, p(x_1^{t},\dots,x_n^{t} \mid o^{1 \dots t})\, dx_1^{t} \cdots dx_n^{t}$$

Next obtain posterior distribution for observations including that at (t+1)
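For Gaussian beliefs, the marginalization and conditioning steps reduce to matrix operations. The sketch below assumes a linear-Gaussian transition x(t+1) = A x(t) + noise with covariance Q, and exact (noise-free) readings of the observed attributes; the matrices A and Q, the per-hour lookup table, and all numbers are hypothetical.

```python
import numpy as np

def predict(mu, sigma, A, Q):
    """Marginalize out time t: push the belief through the transition model."""
    return A @ mu, A @ sigma @ A.T + Q

def condition(mu, sigma, obs_idx, obs_val):
    """Condition the belief on exact readings of the attributes in obs_idx."""
    obs_idx = list(obs_idx)
    rest = [i for i in range(len(mu)) if i not in obs_idx]
    gain = sigma[np.ix_(rest, obs_idx)] @ np.linalg.inv(sigma[np.ix_(obs_idx, obs_idx)])
    new_mu = mu.copy()
    new_mu[obs_idx] = obs_val
    new_mu[rest] = mu[rest] + gain @ (np.asarray(obs_val, dtype=float) - mu[obs_idx])
    new_sigma = np.zeros_like(sigma)  # observed attributes have zero variance
    new_sigma[np.ix_(rest, rest)] = (
        sigma[np.ix_(rest, rest)] - gain @ sigma[np.ix_(obs_idx, rest)]
    )
    return new_mu, new_sigma

# One assumed transition model per hour of the day, selected by t mod 24.
models = {hour: (np.eye(2), 0.1 * np.eye(2)) for hour in range(24)}

mu = np.array([20.0, 20.0])
sigma = np.array([[4.0, 3.6], [3.6, 4.0]])
t = 13

A, Q = models[t % 24]
mu_pred, sigma_pred = predict(mu, sigma, A, Q)                    # belief at t+1 given o^{1..t}
mu_post, sigma_post = condition(mu_pred, sigma_pred, [0], [18.0])  # add the reading at t+1
print(mu_post, sigma_post)
```

Prediction inflates the covariance by Q, and conditioning on sensor 0 then shrinks the variance of the unobserved sensor 1, which is why fewer observations are needed in later queries.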

Page 16:

Kalman Filtering (Contd.)

Transition model is learned by first computing the joint density

$$p(X_1^{t},\dots,X_n^{t}, X_1^{t+1},\dots,X_n^{t+1})$$

Then use conditioning rule to compute the transition model

Page 17:

Supported queries

Value query
- Xi ± ε with prob. at least 1-δ

SELECT and Range query
- Xi ∈ [a,b] with prob. at least 1-δ
- which sensors have temperature greater than 25°C?

Aggregation
- average ± ε of subset of attribs. with prob. > 1-δ
- combine aggregation and selection
- probability that > 10 sensors have temperature greater than 25°C?

Queries require solution to integrals
- Many queries computed in closed-form
- Some require numerical integration/sampling

Page 18:

Probabilistic queries

Range queries – $P(X_i \in [a_i, b_i])$
- First marginalize multivariate gaussian – done by dropping entries being marginalized
- Compute confidence of query using error function
- If confidence is too low, make an observation to improve confidence
- Conditioning a gaussian on value of some attributes gives another gaussian:

$$\mu_{Y|o} = \mu_Y + \Sigma_{Yo}\Sigma_{oo}^{-1}(o - \mu_o)$$

$$\Sigma_{Y|o} = \Sigma_{YY} - \Sigma_{Yo}\Sigma_{oo}^{-1}\Sigma_{oY}$$
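The first two steps of a range query (marginalize, then evaluate the error function) can be sketched as follows. The function names, the example belief, and the threshold delta are assumptions made for illustration.

```python
import math
import numpy as np

def marginal(mu, sigma, i):
    """Marginal of attribute i: keep its mean and variance, drop everything else."""
    return float(mu[i]), float(sigma[i, i])

def range_probability(mean, var, a, b):
    """P(a <= X <= b) for X ~ N(mean, var), computed with the error function."""
    scale = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mean) / scale) - math.erf((a - mean) / scale))

# Assumed current belief over two attributes.
mu = np.array([18.0, 25.0])
sigma = np.array([[1.0, 0.8], [0.8, 4.0]])

mean_i, var_i = marginal(mu, sigma, 1)
p = range_probability(mean_i, var_i, 20.0, 30.0)

delta = 0.05
# Confident either that the value is in range or that it is not;
# otherwise an observation would be needed.
confident = p >= 1.0 - delta or p <= delta
print(f"P(X_1 in [20, 30]) = {p:.4f}, confident: {confident}")
```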

Page 19:

Probabilistic queries (Contd.)

Value Query

$$\bar{x}_i = \int x_i \, p(x_i \mid o) \, dx_i$$

- Easy to estimate
- Also determine confidence interval for given error bound and observe attributes if needed
- Posterior mean can be obtained directly from mean vector conditioned on observed o

Page 20:

Probabilistic queries (Contd.)

Average aggregates
- If we are interested in average over attributes A
- Define random variable $Y = \sum_{i \in A} X_i / |A|$
- The pdf of Y is given by:

$$p(Y = y \mid o) = \int p(x_1,\dots,x_n \mid o)\, 1\!\left[\sum_{i \in A} x_i / |A| = y\right] dx_1 \cdots dx_n$$

  where 1[·] is the indicator function
- Once P(Y=y|o) is defined simply define a value query for random variable Y
- Other complex aggregate queries can be similarly answered by constructing new random variables
- PDF of sum of gaussians is a gaussian where:
  - expected mean is the mean of expected values
  - variance is weighted sum of variances of Xi plus covariances of Xi and Xj
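For a Gaussian belief, the mean and variance of the average follow from the fact that Y is a linear combination of the attributes. A minimal sketch, with a made-up three-attribute belief and a hypothetical helper name:

```python
import numpy as np

def average_distribution(mu, sigma, subset):
    """Mean and variance of Y = sum_{i in subset} X_i / |subset| for X ~ N(mu, sigma)."""
    subset = list(subset)
    w = np.zeros(len(mu))
    w[subset] = 1.0 / len(subset)   # weight 1/|A| on each attribute in A
    mean_y = float(w @ mu)          # mean of the expected values
    var_y = float(w @ sigma @ w)    # weighted variances plus covariances
    return mean_y, var_y

# Assumed posterior over three temperature attributes.
mu = np.array([18.0, 20.0, 22.0])
sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 4.0]])

mean_y, var_y = average_distribution(mu, sigma, [0, 1])
print(mean_y, var_y)
```

Once the mean and variance of Y are known, the average query is answered like a value query on the single Gaussian variable Y.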

Page 21:

Model-driven data acquisition: overview

[Figure: an SQL-style query with desired confidence is posed against a probabilistic model; the system produces a data gathering plan, conditions on the new observations, and obtains a posterior belief.]

What sensors do we observe?

How do we collect observations?

Page 22:

Acquisition costs

- Attributes have different acquisition costs
- Exploit correlation through probabilistic model
- Must consider networking cost

[Figure: example network of nodes 1 to 6, asking which nodes are cheaper to reach.]

Page 23:

Network model and plan format

- Assume known (quasi-static) network topology
- Define traversal using (1.5-approximate) TSP
- Ct(S) is expected cost of TSP (lossy communication)

[Figure: example network of nodes 1 to 12.]

Cost of collecting subset S of sensor values:

C(S) = Ca(S) + Ct(S)

Goal: Find subset S that is sufficient to answer query at minimum cost C(S)

Page 24:

Choosing observation plan

Is a subset S sufficient? $X_i \in [a,b]$ with prob. > 1-δ

If we observe S = s: $R_i(s) = \max\{ P(X_i \in [a,b] \mid s),\ 1 - P(X_i \in [a,b] \mid s) \}$

Value of S is unknown: $R_i(S) = \int P(s)\, R_i(s)\, ds$

Optimization problem:
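A sketch of how this optimization problem can be written, assuming the sufficiency requirement is expressed through the expected benefit R_i(S) defined on this slide, the cost is C(S) = Ca(S) + Ct(S) from the network model slide, and δ denotes the allowed probability of error:

```latex
\min_{S} \; C(S) \quad \text{subject to} \quad R_i(S) \ge 1 - \delta
```

This is only a restatement of the stated goal (find a sufficient subset at minimum cost), not necessarily the exact form used by the authors.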

Page 25:

Observation Plan

General optimization problem is NP-hard

Two algorithms
- Exhaustive search – exponential
- Greedy search
  - Begin with empty observation plan
  - Compute benefit R and cost C for added attribute
  - If confidence reached, choose attribute with minimum cost
  - Else add the attribute which has maximum benefit/cost ratio
  - Repeat until you reach desired confidence
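The greedy steps above can be sketched as follows. The benefit and cost functions are supplied by the caller; the toy versions below (a fixed per-attribute gain and additive costs with no traversal term) are stand-ins invented for illustration, and reading "benefit/cost ratio" as R(S ∪ {a}) / C(S ∪ {a}) is an assumption.

```python
def greedy_plan(candidates, benefit, cost, target):
    """Greedily grow an observation plan until benefit(plan) >= target."""
    plan = []
    remaining = set(candidates)
    while remaining and benefit(plan) < target:
        # Benefit R and cost C of the plan extended by each remaining attribute.
        scored = [(benefit(plan + [a]), cost(plan + [a]), a) for a in sorted(remaining)]
        reaching = [s for s in scored if s[0] >= target]
        if reaching:
            # Confidence reached: take the cheapest extension that reaches it.
            _, _, best = min(reaching, key=lambda s: s[1])
            return plan + [best]
        # Otherwise add the attribute with the best benefit/cost ratio.
        _, _, best = max(scored, key=lambda s: s[0] / s[1])
        plan.append(best)
        remaining.remove(best)
    return plan

# Hypothetical toy model: each attribute removes a fraction of the remaining doubt.
GAIN = {"temp3": 0.6, "volt3": 0.5, "temp6": 0.3}
COST = {"temp3": 10.0, "volt3": 2.0, "temp6": 5.0}

def toy_benefit(plan):
    doubt = 0.5
    for attr in plan:
        doubt *= 1.0 - GAIN[attr]
    return 1.0 - doubt

def toy_cost(plan):
    return sum(COST[attr] for attr in plan)

chosen = greedy_plan(GAIN, toy_benefit, toy_cost, target=0.85)
print(chosen, toy_benefit(chosen), toy_cost(chosen))
```

In this toy run the cheap voltage reading is picked first because of its high benefit/cost ratio, echoing the earlier illustration where the model observes voltage to answer a temperature query.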

Page 26:

BBQ system

[Figure: BBQ system overview. An SQL-style query with desired confidence is posed against a probabilistic model; the system produces a data gathering plan, conditions on the new observations, and obtains a posterior belief.]

Annotations on the figure:
- Value, Range, Average
- Multivariate Gaussians; learn from historical data
- Equivalent to Kalman filter; simple matrix operations
- Simple matrix operations
- Exhaustive or greedy search; factor 1.5 TSP approximation

Page 27:

Exploiting correlated attributes

Extension of single plan to conditional plan
- Useful when cost of acquisition non-negligible
- Correlations exist between one or more attributes

Queries of the form multi-predicate range queries

Query evaluation can become cheaper by observing additional attributes
- If additional attributes are low-cost
- Reject tuple with high confidence without expensive acquisition – substantial performance gains

Page 28:

Conditional Plans

Page 29:

Conditional Plans (Contd.)

Simple binary decision trees

Each interior node n_j specifies binary conditioning predicate (depends on only single attribute value)

Choose conditional plan with minimum expected cost
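A conditional plan of this kind can be sketched as a small binary tree whose expected cost is estimated by averaging over historical tuples. The query (light > 300 AND temp > 25), the attribute costs, and the four historical tuples below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    accept: bool                      # final decision for the tuple

@dataclass
class Node:
    attr: str                         # single attribute acquired at this node
    test: Callable[[float], bool]     # binary conditioning predicate
    if_true: "Plan"
    if_false: "Plan"

Plan = Union[Leaf, Node]

def run(plan, tup, costs):
    """Walk the tree for one tuple; pay for each attribute the first time it is read."""
    acquired, total = set(), 0.0
    while isinstance(plan, Node):
        if plan.attr not in acquired:
            total += costs[plan.attr]
            acquired.add(plan.attr)
        plan = plan.if_true if plan.test(tup[plan.attr]) else plan.if_false
    return plan.accept, total

def expected_cost(plan, history, costs):
    """Average acquisition cost of the plan over historical tuples."""
    return sum(run(plan, t, costs)[1] for t in history) / len(history)

bright = lambda v: v > 300
warm = lambda v: v > 25
temp_first = Node("temp", warm, Node("light", bright, Leaf(True), Leaf(False)), Leaf(False))
light_first = Node("light", bright, Node("temp", warm, Leaf(True), Leaf(False)), Leaf(False))
# Conditional plan: read the cheap hour first, then pick the predicate order.
conditional = Node("hour", lambda h: 6 <= h < 18, temp_first, light_first)

costs = {"hour": 1.0, "light": 10.0, "temp": 10.0}
history = [
    {"hour": 2, "light": 0, "temp": 26},
    {"hour": 3, "light": 5, "temp": 27},
    {"hour": 12, "light": 500, "temp": 22},
    {"hour": 14, "light": 600, "temp": 28},
]

naive_cost = expected_cost(temp_first, history, costs)
cond_cost = expected_cost(conditional, history, costs)
print(naive_cost, cond_cost)
```

Both plans return the same answers on these tuples; the conditional plan is cheaper on average because the low-cost attribute lets it test first the predicate most likely to reject the tuple.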

Page 30:

Cost of Conditional Plans

Optimal plan

Traversal cost

Expected plan cost

Page 31:

Issues in Conditional Plans

Need to estimate P(Tj|t)
- Naïve method is to scan historical data for each computation – expensive

Cost model
- Only acquisition cost taken into account
- Transmission cost
- Size of plan to fit in RAM
- Add into the cost model

Authors only focus on limiting plan sizes

Page 32:

Architecture

Page 33:

Optimal Conditional Plan

Problem is hard
- Even if we are given conditional probabilities (by oracle) complexity is #P-hard – reduction from 3-SAT
- Even if we try to optimize our plan with respect to set of d tuples D problem is NP-complete – reduction from complexity of binary decision trees

Exhaustive search
- Depth first search
- With caching and pruning

Also heuristic solutions using greedy

Page 34:

Example: Intel Berkeley Lab deployment

[Figure: floor plan of the Intel Berkeley Lab (areas labeled SERVER, LAB, KITCHEN, COPY, ELEC, PHONE, QUIET, STORAGE, CONFERENCE, OFFICE) with numbered sensor nodes.]

Page 35:

Experimental results

Redwood trees and Intel Lab datasets

Learned models from data
- Static model
- Dynamic model – Kalman filter, time-indexed transition probabilities

Evaluated on a wide range of queries

[Figure: Intel Berkeley Lab floor plan with numbered sensor nodes.]

Page 36:

Cost versus Confidence level

Page 37:

Obtaining approximate values

Query: True temperature value ± epsilon with confidence 95%

Page 38:

Approximate range queries

Query: Temperature in [T1,T2] with confidence 95%

Page 39:

Comparison to other methods

Page 40:

Intel Lab traversals

Page 41:

BBQ system

[Figure: BBQ system overview. An SQL-style query with desired confidence is posed against a probabilistic model; the system produces a data gathering plan, conditions on the new observations, and obtains a posterior belief.]

Annotations on the figure:
- Value, Range, Average
- Multivariate Gaussians; learn from historical data
- Equivalent to Kalman filter; simple matrix operations
- Simple matrix operations
- Exhaustive or greedy search; factor 1.5 TSP approximation

Extensions
- More complex queries
- Other probabilistic models
- More advanced planning
- Outlier detection
- Dynamic networks
- Continuous queries
- …

Page 42:

Conclusions

Model-driven data acquisition
- Observe fewer attributes
- Exploit correlations
- Reuse information between queries
- Directly deal with missing data
- Answer more complex (probabilistic) queries

Basis for future sensor network systems

Page 43:

Discussion Questions

What other models apart from the multivariate gaussian can be used? If other models are used, will their solution be in closed form?

Model-driven techniques are suitable only if the test data is the same as the training data. Will the solution be adaptable if the test region is different from the training region?

The optimization problem is hard and expensive to compute even with heuristics. Will it work for real-time data analysis?

Outlier detection is not supported for model-driven acquisition. Is there any way to do it for model-based sensor networks?

If the confidence needed on queries is generally low, might some nodes never be queried at all?