
Transcript of "Surveillance in an Abruptly Changing World via Multiarmed Bandits" (...vaibhav/talks/2014c.pdf)

Page 1

Surveillance in an Abruptly Changing World via Multiarmed Bandits

Vaibhav Srivastava

Department of Mechanical & Aerospace Engineering

Princeton University

December 15, 2014

Joint work with: Paul Reverdy and Naomi Leonard

IEEE Conference on Decision and Control, Los Angeles, CA

Vaibhav Srivastava (Princeton University) Surveillance via Multi-armed Bandits December 15, 2014 1 / 12

Surveillance and Environmental Monitoring

Picture Credit: http://www.kevindemarco.com

Underwater Search: Underwater Robotic Testbed at Princeton University

1. Repetitive search for objects of interest, e.g., a certain type of algae

2. Events of interest may arrive according to some process

3. Noisy measurements and exploration-exploitation trade-off

4. Environmental features are correlated

5. Travel may be costly


Incomplete Literature Review

Environmental Monitoring and Surveillance

A. Singh, A. Krause, C. Guestrin, and W. J. Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research, 34(2):707–755, 2009

G. A. Hollinger et al. Underwater data collection using robotic sensor networks. IEEE Journal on Selected Areas in Communications, 30(5):899–911, 2012

N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. E. Davis. Collective motion, sensor networks, and ocean sampling. Proc of the IEEE, 95(1):48–74, 2007

N. Sydney and D. A. Paley. Multivehicle coverage control for a nonstationary spatiotemporal field. Automatica, 50(5):1381–1390, 2014

R. Graham and J. Cortés. Adaptive information collection by robotic sensor networks for spatial estimation. IEEE Trans on Automatic Control, 57(6):1404–1419, 2012

Multi-armed Bandit Problems

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In AISTATS, pages 592–600, April 2012

A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008


Stochastic Multi-armed Bandits

N options with unknown mean rewards m_i

the obtained reward is corrupted by noise

the distribution of the noise is known, ~ N(0, σ_s^2)

can play only one option at a time

Pic Credit: Microsoft Research

Objective: maximize expected cumulative reward until time T

Equivalently: minimize the cumulative regret

Cum. Regret = Σ_{t=1}^T (m_max − m_{i_t}),

where m_max = max reward and i_t = arm picked at time t

Prototypical example of the exploration-exploitation trade-off
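For concreteness, the cumulative-regret objective above can be sketched in a few lines. This is a toy setup with arm means of our own choosing, not the talk's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N options with unknown mean rewards m_i,
# observations corrupted by N(0, sigma_s^2) noise.
m = np.array([1.0, 2.0, 5.0, 3.0])   # true mean rewards (unknown to the agent)
sigma_s = 1.0
T = 1000

def pull(i):
    """Play option i once; reward is m_i plus Gaussian noise."""
    return m[i] + rng.normal(0.0, sigma_s)

def cumulative_regret(arm_choices):
    """Sum over t of (m_max - m_{i_t}) for a given sequence of arm picks."""
    return float(np.sum(m.max() - m[np.asarray(arm_choices)]))

# Example: a uniformly random policy incurs regret growing linearly in T.
random_arms = rng.integers(0, len(m), size=T)
random_regret = cumulative_regret(random_arms)
```

A policy that always played the best arm would have zero regret, which is why regret minimization and reward maximization are equivalent objectives.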


Page 2


Spatially Embedded Gaussian Multi-armed Bandits

reward at option i ~ N(m_i, σ_s^2)

prior on rewards m ~ N(μ_0, Σ_0)

spatial structure captured through Σ_0, e.g., σ_ij^0 = σ_0 exp(−d_ij/λ)

value of option i at time t: Q_i^t = the (1 − 1/(K t))-upper credible limit = μ_i^t (exploit) + σ_i^t Φ^{-1}(1 − 1/(K t)) (explore)

Inference Algorithm:

Λ^t μ^t = r_t φ^t / σ_s^2 + Λ^{t−1} μ^{t−1},  Λ^t = φ^t (φ^t)^T / σ_s^2 + Λ^{t−1},  Σ^t = (Λ^t)^{-1},

where φ^t is the indicator vector of the option chosen at time t

Upper Credible Limit (UCL) Algorithm:

- pick the option with maximum value at each time

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In AISTATS, pages 592–600, April 2012

P. Reverdy, V. S., and N. E. Leonard. Modeling human decision making in generalized Gaussian multiarmed bandits. Proc of the IEEE, 102(4):544–571, 2014

V. S., P. Reverdy, and N. E. Leonard. Correlated and dynamic multiarmed bandit problems: Bayesian algorithms and regret analysis. 2014. In prep
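The inference step and the UCL decision rule above can be sketched as follows. This is a minimal rendering under our own naming (`Lambda`, `phi`, `ucl_step`); the default K = sqrt(2*pi*e) is the constant used in the Reverdy-Srivastava-Leonard analysis, but treat the whole block as an illustration rather than the talk's code:

```python
import numpy as np
from statistics import NormalDist

def ucl_step(t, mu, Lambda, K=float(np.sqrt(2 * np.pi * np.e))):
    """Pick the arm maximizing the (1 - 1/(K t))-upper credible limit."""
    Sigma = np.linalg.inv(Lambda)                  # Sigma^t = (Lambda^t)^{-1}
    sigma = np.sqrt(np.diag(Sigma))                # per-arm posterior std dev
    alpha = NormalDist().inv_cdf(1 - 1 / (K * t))  # Phi^{-1}(1 - 1/(K t))
    Q = mu + sigma * alpha                         # exploit + explore
    return int(np.argmax(Q))

def bayes_update(i, r, mu, Lambda, sigma_s):
    """Rank-one Gaussian update after observing reward r at arm i."""
    N = len(mu)
    phi = np.zeros(N)
    phi[i] = 1.0                                   # indicator of the chosen arm
    # Lambda^t = phi phi^T / sigma_s^2 + Lambda^{t-1}
    Lambda_new = np.outer(phi, phi) / sigma_s**2 + Lambda
    # Lambda^t mu^t = r phi / sigma_s^2 + Lambda^{t-1} mu^{t-1}
    mu_new = np.linalg.solve(Lambda_new, r * phi / sigma_s**2 + Lambda @ mu)
    return mu_new, Lambda_new
```

Because the prior covariance Σ_0 couples nearby arms, a single observation updates the posterior means of correlated neighbours as well, which is what makes the spatial embedding pay off.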



Gaussian Multiarmed Bandits with Abrupt Changes
Sliding-Window Approach: Description

the mean rewards switch to unknown values at unknown times

the switched rewards may have the same correlation scale

the number of switches until time T is upper bounded by ζ_T

Sliding-Window UCL algorithm

estimate the mean using observations at times {(t − t_w)^+ + 1, . . . , t};

select the arm i with the maximum value of

Q_i^{t,t_w} := μ_i^{t,t_w} + σ_i^{t,t_w} Φ^{-1}(1 − 1/(K min{t_w, t})),

an adaptation of the frequentist algorithm by Garivier and Moulines
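The windowed estimation step can be sketched as below. For simplicity this sketch treats arms independently (the talk's algorithm uses the correlated Bayesian update from the UCL slide); the function name and history format are our own:

```python
import numpy as np

def sliding_window_estimates(history, t, t_w, N):
    """Per-arm sample means from the window {(t - t_w)^+ + 1, ..., t}.

    history: list of (time, arm, reward) triples; N: number of arms.
    Returns (means, counts); arms unseen in the window get mean 0.
    """
    lo = max(t - t_w, 0)                    # (t - t_w)^+ per the slide
    counts = np.zeros(N)
    sums = np.zeros(N)
    for (s, i, r) in history:
        if lo < s <= t:                     # keep only in-window observations
            counts[i] += 1
            sums[i] += r
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    return means, counts
```

Discarding samples older than t_w is what lets the estimate track abrupt switches: stale rewards from before a change simply fall out of the window.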


Page 3

Gaussian Multiarmed Bandits with Abrupt Changes
Sliding-Window Approach: Analysis

Sliding-Window UCL algorithm

estimate the mean using observations at times {(t − t_w)^+ + 1, . . . , t};

select the arm i with the maximum value of

Q_i^{t,t_w} := μ_i^{t,t_w} + σ_i^{t,t_w} Φ^{-1}(1 − 1/(K min{t_w, t})),

Analysis of Sliding-Window UCL algorithm

for ζ_T = O(T^ν), ν ∈ [0, 1), and t_w = ⌈√(T log T / ζ_T)⌉:
E[n_i^T] ∈ O(T^{(1+ν)/2} √(log T));

for ζ_T ≤ γT, for some γ ∈ [0, 1), and t_w = ⌈√(−(log γ)/γ)⌉:
E[n_i^T] ∈ O(T √(−γ log γ)).


Gaussian Multiarmed Bandits with Abrupt Changes
Block Allocation Strategy: Description of Blocks

Block allocation to reduce travel cost

Divide sampling times into frames {1, . . . , L + 1}

the L-th frame ends at 2^{k_w}, where k_w is the equivalent of the width of the time-window

the k-th frame is subdivided into blocks of length k, k ∈ {1, . . . , L}

the (L + 1)-th frame contains times {2^{k_w} + 1, . . . , T}

the (L + 1)-th frame is subdivided into blocks of length k_w

[Figure: frame structure on the time axis 2^0, 2^1, 2^2, 2^3, . . . , 2^{k_w}, T; frame f_k spans (2^{k−1}, 2^k] and is split into blocks of length k with the r-th block starting at τ_{k(r−1)}; the last frame, from 2^{k_w} to T, is split into blocks of length k_w]
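The frame/block schedule can be enumerated programmatically. The code below is one plausible reading of the description above, with our own naming; in particular, the convention that frame f_k opens just after 2^{k−1} (with frame f_1 covering {1, 2}) is our assumption:

```python
def block_boundaries(T, k_w):
    """Return (start, end) intervals of all blocks for horizon T.

    Frames f_1..f_{k_w} end at 2^1, ..., 2^{k_w}; frame f_k is split into
    blocks of length k.  The last frame {2^{k_w}+1, ..., T} is split into
    blocks of length k_w.
    """
    blocks = []
    for k in range(1, k_w + 1):
        start = 0 if k == 1 else 2 ** (k - 1)   # frame f_k covers (start, 2^k]
        end = 2 ** k
        t = start + 1
        while t <= min(end, T):
            blocks.append((t, min(t + k - 1, end, T)))
            t += k
    t = 2 ** k_w + 1
    while t <= T:                               # last frame: blocks of length k_w
        blocks.append((t, min(t + k_w - 1, T)))
        t += k_w
    return blocks
```

Early frames use short blocks (frequent re-decisions while little is known); later frames use blocks of length k_w, so the number of arm-to-arm transitions, and hence travel cost, grows much more slowly than T.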

Gaussian Multiarmed Bandits with Abrupt Changes
Block Allocation Strategy: Description of Algorithm

Block Sliding-Window UCL algorithm

At the beginning of the r-th block in the k-th frame, i.e., at time τ_{kr}:

perform the estimation using the observations collected in the time-window {(τ_{kr} − 2^{k_w})^+ + 1, . . . , τ_{kr}};

select the arm i with the maximum value of

Q_i^{τ_{kr}, k_w} := μ_i^{τ_{kr}, k_w} + σ_i^{τ_{kr}, k_w} Φ^{-1}(1 − 1/(K min{2^{k_w}, τ_{kr}})),

for the duration of the block

Block SW-UCL achieves the same order of performance as SW-UCL
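The commit-for-a-block behaviour is what caps travel cost: the arm is chosen once at the block's start and held until the block ends. A minimal sketch (our own function names; `choose_arm` stands in for the Q-maximization step above):

```python
def play_schedule(blocks, choose_arm):
    """Play out a block schedule, deciding only at block boundaries.

    blocks: list of (start, end) time intervals;
    choose_arm(t): picks an arm at a block's start time (e.g., by maximizing Q).
    Returns the list of (time, arm) plays.
    """
    plays = []
    for (start, end) in blocks:
        i = choose_arm(start)                    # decision made only at tau_kr
        plays.extend((t, i) for t in range(start, end + 1))
    return plays
```

The number of possible transitions is bounded by the number of blocks, independent of how noisy the per-step UCL values are within a block.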


Numerical Illustration

Environment: 5 × 5 square grid

Reward at the optimal arm: m* = 10

Reward at other arms: m_j = m_i exp(−0.3 d_ij), d_ij = distance

Assumed correlation scale: ρ_ij = exp(−0.3 d_ij)

σ_s^2 = 1 and σ_0^2 = 10

Number of changes: ζ_T = ⌊√T⌋

[Figure: expected number of selections of suboptimal arms and expected number of transitions among arms vs. horizon length (10^2 to 10^5). Black line: SW-UCL; Red line: Adaptive SW-UCL; Green line: Block SW-UCL]
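The grid environment above is straightforward to reconstruct. In this sketch the helper names are ours, and two details are our assumptions: the centre cell is taken as the optimal arm, and the reward decay is read as m_j = m* exp(−0.3 d_{i*,j}) from the optimum:

```python
import numpy as np

# 5 x 5 grid of arms; pairwise Euclidean distances d_ij.
side = 5
coords = np.array([(x, y) for x in range(side) for y in range(side)], float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

i_star = 12                              # centre cell as the optimal arm (assumed)
m = 10.0 * np.exp(-0.3 * d[i_star])      # m* = 10, others decay with distance
rho = np.exp(-0.3 * d)                   # assumed correlation scale rho_ij

sigma_s2, sigma_02 = 1.0, 10.0           # noise and prior variances from the slide
Sigma0 = sigma_02 * rho                  # prior covariance Sigma_0
T = 10**4
zeta_T = int(np.floor(np.sqrt(T)))       # number of changes: floor(sqrt(T))
```

With this prior, sampling one grid cell is informative about its neighbours, which is exactly the correlation structure the algorithms exploit.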

Page 4

How Important Is the Correlation Scale?

Beluga Underwater Testbed; virtual reward surface

Experiment and Video Credit: Peter Landgren


Conclusions and Future Directions

Conclusions

1. A multiarmed bandit framework for surveillance problems

2. Arrival of events of interest ⇒ abrupt changes in the reward surface

3. Exploration-exploitation trade-off and the role of the correlation scale

4. Block allocation to reduce travel cost

Future Directions

1. Extension to multiple vehicles

2. Environmental partitioning strategies catered to addressing the exploration-exploitation trade-off

3. Extensions to continuously changing environments
