
Transcript of "Surveillance in an Abruptly Changing World via Multiarmed Bandits" (...vaibhav/talks/2014c.pdf)

Page 1

Surveillance in an Abruptly Changing World via Multiarmed Bandits

Vaibhav Srivastava

Department of Mechanical & Aerospace Engineering

Princeton University

December 15, 2014

Joint work with: Paul Reverdy and Naomi Leonard

IEEE Conference on Decision and Control, Los Angeles, CA

Vaibhav Srivastava (Princeton University) Surveillance via Multi-armed Bandits December 15, 2014 1 / 12

Surveillance and Environmental Monitoring

Picture Credit: http://www.kevindemarco.com

Underwater Search: Underwater Robotic Testbed at Princeton University

1. Repetitive search for objects of interest, e.g., a certain type of algae

2. Events of interest may arrive according to some process

3. Noisy measurements and exploration-exploitation trade-off

4. Environmental features are correlated

5. Travel may be costly


Incomplete Literature Review

Environmental Monitoring and Surveillance

A. Singh, A. Krause, C. Guestrin, and W. J. Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research, 34(2):707–755, 2009

G. A. Hollinger et al. Underwater data collection using robotic sensor networks. IEEE Journal on Selected Areas in Communications, 30(5):899–911, 2012

N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. E. Davis. Collective motion, sensor networks, and ocean sampling. Proc of the IEEE, 95(1):48–74, 2007

N. Sydney and D. A. Paley. Multivehicle coverage control for a nonstationary spatiotemporal field. Automatica, 50(5):1381–1390, 2014

R. Graham and J. Cortés. Adaptive information collection by robotic sensor networks for spatial estimation. IEEE Trans on Automatic Control, 57(6):1404–1419, 2012

Multi-armed Bandit Problems

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In AISTATS, pages 592–600, April 2012

A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008


Stochastic Multi-armed Bandits

N options with unknown mean rewards m_i

the obtained reward is corrupted by noise

the distribution of the noise is known, ~ N(0, σ_s^2)

can play only one option at a time

Pic Credit: Microsoft Research

Objective: maximize expected cumulative reward until time T

Equivalently: minimize the cumulative regret

Cum. Regret = Σ_{t=1}^T (m_max − m_{i_t}),

where m_max = max reward and i_t = arm picked at time t

Prototypical example of the exploration-exploitation trade-off
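For concreteness, the cumulative-regret objective above can be sketched in a few lines. This is a toy setup with arm means of our own choosing, not the talk's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N options with unknown mean rewards m_i,
# observations corrupted by N(0, sigma_s^2) noise.
m = np.array([1.0, 2.0, 5.0, 3.0])   # true mean rewards (unknown to the agent)
sigma_s = 1.0
T = 1000

def pull(i):
    """Play option i once; reward is m_i plus Gaussian noise."""
    return m[i] + rng.normal(0.0, sigma_s)

def cumulative_regret(arm_choices):
    """Sum over t of (m_max - m_{i_t}) for a given sequence of arm picks."""
    return float(np.sum(m.max() - m[np.asarray(arm_choices)]))

# Example: a uniformly random policy incurs regret growing linearly in T.
random_arms = rng.integers(0, len(m), size=T)
random_regret = cumulative_regret(random_arms)
```

A policy that always played the best arm would have zero regret, which is why regret minimization and reward maximization are equivalent objectives.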


Page 2


Spatially Embedded Gaussian Multi-armed Bandits

reward at option i ~ N(m_i, σ_s^2)

prior on rewards m ~ N(μ_0, Σ_0)

spatial structure captured through Σ_0, e.g., σ_ij^0 = σ_0 exp(−d_ij/λ)

value of option i at time t: Q_i^t = the (1 − 1/(K t))-upper credible limit = μ_i^t (exploit) + σ_i^t Φ^{-1}(1 − 1/(K t)) (explore)

Inference Algorithm:

Λ^t μ^t = r_t φ^t / σ_s^2 + Λ^{t−1} μ^{t−1},  Λ^t = φ^t (φ^t)^T / σ_s^2 + Λ^{t−1},  Σ^t = (Λ^t)^{-1},

where φ^t is the indicator vector of the option chosen at time t

Upper Credible Limit (UCL) Algorithm:

- pick the option with maximum value at each time

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In AISTATS, pages 592–600, April 2012

P. Reverdy, V. S., and N. E. Leonard. Modeling human decision making in generalized Gaussian multiarmed bandits. Proc of the IEEE, 102(4):544–571, 2014

V. S., P. Reverdy, and N. E. Leonard. Correlated and dynamic multiarmed bandit problems: Bayesian algorithms and regret analysis. 2014. In prep
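The inference step and the UCL decision rule above can be sketched as follows. This is a minimal rendering under our own naming (`Lambda`, `phi`, `ucl_step`); the default K = sqrt(2*pi*e) is the constant used in the Reverdy-Srivastava-Leonard analysis, but treat the whole block as an illustration rather than the talk's code:

```python
import numpy as np
from statistics import NormalDist

def ucl_step(t, mu, Lambda, K=float(np.sqrt(2 * np.pi * np.e))):
    """Pick the arm maximizing the (1 - 1/(K t))-upper credible limit."""
    Sigma = np.linalg.inv(Lambda)                  # Sigma^t = (Lambda^t)^{-1}
    sigma = np.sqrt(np.diag(Sigma))                # per-arm posterior std dev
    alpha = NormalDist().inv_cdf(1 - 1 / (K * t))  # Phi^{-1}(1 - 1/(K t))
    Q = mu + sigma * alpha                         # exploit + explore
    return int(np.argmax(Q))

def bayes_update(i, r, mu, Lambda, sigma_s):
    """Rank-one Gaussian update after observing reward r at arm i."""
    N = len(mu)
    phi = np.zeros(N)
    phi[i] = 1.0                                   # indicator of the chosen arm
    # Lambda^t = phi phi^T / sigma_s^2 + Lambda^{t-1}
    Lambda_new = np.outer(phi, phi) / sigma_s**2 + Lambda
    # Lambda^t mu^t = r phi / sigma_s^2 + Lambda^{t-1} mu^{t-1}
    mu_new = np.linalg.solve(Lambda_new, r * phi / sigma_s**2 + Lambda @ mu)
    return mu_new, Lambda_new
```

Because the prior covariance Σ_0 couples nearby arms, a single observation updates the posterior means of correlated neighbours as well, which is what makes the spatial embedding pay off.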



Gaussian Multiarmed Bandits with Abrupt Changes
Sliding-Window Approach: Description

the mean rewards switch to unknown values at unknown times

the switched rewards may have the same correlation scale

the number of switches until time T is upper bounded by ζ_T

Sliding-Window UCL algorithm

estimate the mean using observations at times {(t − t_w)^+ + 1, . . . , t};

select the arm i with the maximum value of

Q_i^{t,t_w} := μ_i^{t,t_w} + σ_i^{t,t_w} Φ^{-1}(1 − 1/(K min{t_w, t})),

an adaptation of the frequentist algorithm by Garivier and Moulines
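The windowed estimation step can be sketched as below. For simplicity this sketch treats arms independently (the talk's algorithm uses the correlated Bayesian update from the UCL slide); the function name and history format are our own:

```python
import numpy as np

def sliding_window_estimates(history, t, t_w, N):
    """Per-arm sample means from the window {(t - t_w)^+ + 1, ..., t}.

    history: list of (time, arm, reward) triples; N: number of arms.
    Returns (means, counts); arms unseen in the window get mean 0.
    """
    lo = max(t - t_w, 0)                    # (t - t_w)^+ per the slide
    counts = np.zeros(N)
    sums = np.zeros(N)
    for (s, i, r) in history:
        if lo < s <= t:                     # keep only in-window observations
            counts[i] += 1
            sums[i] += r
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    return means, counts
```

Discarding samples older than t_w is what lets the estimate track abrupt switches: stale rewards from before a change simply fall out of the window.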


Page 3

Gaussian Multiarmed Bandits with Abrupt Changes
Sliding-Window Approach: Analysis

Sliding-Window UCL algorithm

estimate the mean using observations at times {(t − t_w)^+ + 1, . . . , t};

select the arm i with the maximum value of

Q_i^{t,t_w} := μ_i^{t,t_w} + σ_i^{t,t_w} Φ^{-1}(1 − 1/(K min{t_w, t})),

Analysis of Sliding-Window UCL algorithm

for ζ_T = O(T^ν), ν ∈ [0, 1), and t_w = ⌈√(T log T / ζ_T)⌉:
E[n_i^T] ∈ O(T^{(1+ν)/2} √(log T));

for ζ_T ≤ γT, for some γ ∈ [0, 1), and t_w = ⌈√(−(log γ)/γ)⌉:
E[n_i^T] ∈ O(T √(−γ log γ)).


Gaussian Multiarmed Bandits with Abrupt Changes
Block Allocation Strategy: Description of Blocks

Block allocation to reduce travel cost

Divide sampling times into frames {1, . . . , L + 1}

the L-th frame ends at 2^{k_w}, where k_w is the equivalent of the width of the time-window

the k-th frame is subdivided into blocks of length k, k ∈ {1, . . . , L}

the (L + 1)-th frame contains times {2^{k_w} + 1, . . . , T}

the (L + 1)-th frame is subdivided into blocks of length k_w

[Figure: frame structure on the time axis 2^0, 2^1, 2^2, 2^3, . . . , 2^{k_w}, T; frame f_k spans (2^{k−1}, 2^k] and is split into blocks of length k with the r-th block starting at τ_{k(r−1)}; the last frame, from 2^{k_w} to T, is split into blocks of length k_w]
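The frame/block schedule can be enumerated programmatically. The code below is one plausible reading of the description above, with our own naming; in particular, the convention that frame f_k opens just after 2^{k−1} (with frame f_1 covering {1, 2}) is our assumption:

```python
def block_boundaries(T, k_w):
    """Return (start, end) intervals of all blocks for horizon T.

    Frames f_1..f_{k_w} end at 2^1, ..., 2^{k_w}; frame f_k is split into
    blocks of length k.  The last frame {2^{k_w}+1, ..., T} is split into
    blocks of length k_w.
    """
    blocks = []
    for k in range(1, k_w + 1):
        start = 0 if k == 1 else 2 ** (k - 1)   # frame f_k covers (start, 2^k]
        end = 2 ** k
        t = start + 1
        while t <= min(end, T):
            blocks.append((t, min(t + k - 1, end, T)))
            t += k
    t = 2 ** k_w + 1
    while t <= T:                               # last frame: blocks of length k_w
        blocks.append((t, min(t + k_w - 1, T)))
        t += k_w
    return blocks
```

Early frames use short blocks (frequent re-decisions while little is known); later frames use blocks of length k_w, so the number of arm-to-arm transitions, and hence travel cost, grows much more slowly than T.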

Gaussian Multiarmed Bandits with Abrupt Changes
Block Allocation Strategy: Description of Algorithm

Block Sliding-Window UCL algorithm

At the beginning of the r-th block in the k-th frame, i.e., at time τ_{kr}:

perform the estimation using the observations collected in the time-window {(τ_{kr} − 2^{k_w})^+ + 1, . . . , τ_{kr}};

select the arm i with the maximum value of

Q_i^{τ_{kr}, k_w} := μ_i^{τ_{kr}, k_w} + σ_i^{τ_{kr}, k_w} Φ^{-1}(1 − 1/(K min{2^{k_w}, τ_{kr}})),

for the duration of the block

Block SW-UCL achieves the same order of performance as SW-UCL
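The commit-for-a-block behaviour is what caps travel cost: the arm is chosen once at the block's start and held until the block ends. A minimal sketch (our own function names; `choose_arm` stands in for the Q-maximization step above):

```python
def play_schedule(blocks, choose_arm):
    """Play out a block schedule, deciding only at block boundaries.

    blocks: list of (start, end) time intervals;
    choose_arm(t): picks an arm at a block's start time (e.g., by maximizing Q).
    Returns the list of (time, arm) plays.
    """
    plays = []
    for (start, end) in blocks:
        i = choose_arm(start)                    # decision made only at tau_kr
        plays.extend((t, i) for t in range(start, end + 1))
    return plays
```

The number of possible transitions is bounded by the number of blocks, independent of how noisy the per-step UCL values are within a block.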


Numerical Illustration

Environment: 5 × 5 square grid

Reward at the optimal arm: m* = 10

Reward at other arms: m_j = m_i exp(−0.3 d_ij), d_ij = distance

Assumed correlation scale: ρ_ij = exp(−0.3 d_ij)

σ_s^2 = 1 and σ_0^2 = 10

Number of changes: ζ_T = ⌊√T⌋

[Figure: expected number of selections of suboptimal arms and expected number of transitions among arms vs. horizon length (10^2 to 10^5). Black line: SW-UCL; Red line: Adaptive SW-UCL; Green line: Block SW-UCL]
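The grid environment above is straightforward to reconstruct. In this sketch the helper names are ours, and two details are our assumptions: the centre cell is taken as the optimal arm, and the reward decay is read as m_j = m* exp(−0.3 d_{i*,j}) from the optimum:

```python
import numpy as np

# 5 x 5 grid of arms; pairwise Euclidean distances d_ij.
side = 5
coords = np.array([(x, y) for x in range(side) for y in range(side)], float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

i_star = 12                              # centre cell as the optimal arm (assumed)
m = 10.0 * np.exp(-0.3 * d[i_star])      # m* = 10, others decay with distance
rho = np.exp(-0.3 * d)                   # assumed correlation scale rho_ij

sigma_s2, sigma_02 = 1.0, 10.0           # noise and prior variances from the slide
Sigma0 = sigma_02 * rho                  # prior covariance Sigma_0
T = 10**4
zeta_T = int(np.floor(np.sqrt(T)))       # number of changes: floor(sqrt(T))
```

With this prior, sampling one grid cell is informative about its neighbours, which is exactly the correlation structure the algorithms exploit.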

Page 4

How Important Is the Correlation Scale?

Beluga Underwater Testbed; virtual reward surface

Experiment and Video Credit: Peter Landgren


Conclusions and Future Directions

Conclusions

1. A multiarmed bandit framework for surveillance problems

2. Arrival of events of interest ⇒ abrupt changes in the reward surface

3. Exploration-exploitation trade-off and the role of the correlation scale

4. Block allocation to reduce travel cost

Future Directions

1. Extension to multiple vehicles

2. Environmental partitioning strategies catered to addressing the exploration-exploitation trade-off

3. Extensions to continuously changing environments
