Reactive Bandits with Attitude - Pedro A. OrtegaReactive Bandits with Attitude Pedro A. Ortega,...

1
Reactive Bandits with Attitude Pedro A. Ortega, Kee-Eung Kim and Daniel D. Lee GRASP Laboratory, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, U.S.A. Reactive Bandits Bandit Algorithms Learning Conclusions 2. The bandit can react to the player's strategy by choosing a location. 1. The possible locations depend on the attitude parameter. 1 0 -3 +3 0 The optimum switches between arms. In the limit, the arm with highest variance is chosen. 1 0 -0.09 -1 2e5 4e5 6e5 8e5 1 0 -0.09 2e5 4e5 6e5 8e5 1 0 1 0 UCB1's empirical frequencies do not converge to the optimum. 0.2 0.0 -0.2 -0.4 1 0 -0.09 2e4 4e4 6e4 8e4 1 0 True parameter is learned very quickly. The mixed strategy converges to the optimal strategy. Motivation Optimal Strategy

Transcript of Reactive Bandits with Attitude - Pedro A. OrtegaReactive Bandits with Attitude Pedro A. Ortega,...

Page 1: Reactive Bandits with Attitude - Pedro A. OrtegaReactive Bandits with Attitude Pedro A. Ortega, Kee-Eung Kim and Daniel D. Lee GRASP Laboratory, School of Engineering and Applied Sciences,

Reactive Bandits with AttitudePedro A. Ortega, Kee-Eung Kim and Daniel D. Lee

GRASP Laboratory, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, U.S.A.

Reactive Bandits

Bandit Algorithms

Learning

Conclusions

2. The bandit can react to the player's strategy by choosing a location.1. The possible locations

depend on the attitude parameter.

1

0

-3 +30

The optimum switchesbetween arms. In the limit, the arm withhighest variance is

chosen.

1

0

-0.09-12e5 4e5 6e5 8e5

1

0

-0.092e5 4e5 6e5 8e5

1

0

1

0

UCB1's empirical frequencies do not

converge to the optimum.

0.2

0.0

-0.2

-0.4

1

0

-0.092e4 4e4 6e4 8e4

1

0

True parameter is learned very quickly.

The mixed strategy converges to the optimal

strategy.

Motivation

Optimal Strategy