Page 1: Optimal Nudging. Presentation UD.

Optimal Nudging: A new approach to solving SMDPs

Reinaldo Uribe M.
Universidad de los Andes — Oita University

Colorado State University

Nov. 11, 2013

Page 2: Optimal Nudging. Presentation UD.

Snakes & Ladders

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Page 3: Optimal Nudging. Presentation UD.

Snakes & Ladders

Boring! (No skill required, only luck.)

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Page 4: Optimal Nudging. Presentation UD.

Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “go back,” to be decided before throwing the die.

Page 5: Optimal Nudging. Presentation UD.

Reinforcement Learning: Finding an optimal policy.

“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.

Optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy.

Probability of winning: pw = 0.97222...
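As an aside, the dynamic-programming step can be illustrated with a minimal value-iteration sketch (illustrative, not the presentation's code); the transition model P, mapping each state and action to (probability, next state, reward) triples, is an assumed interface.

```python
# Minimal value-iteration sketch for a finite episodic MDP such as
# Decision Snakes and Ladders.  P[s][a] is a list of
# (probability, next_state, reward) triples; terminal states have P[s] == {}.

def value_iteration(P, gamma=1.0, tol=1e-9):
    """Return state values V and a greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:                      # terminal state
                continue
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
        for s, actions in P.items() if actions
    }
    return V, policy
```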

Page 6: Optimal Nudging. Presentation UD.

We know a lot!

Markov Decision Process: States, Actions, Transition Probabilities, Rewards.

Policies and policy value.

Max winning probability ≠ max earnings.

Taking an action costs (in units different from rewards).

Different actions may have different costs.

Semi-Markov model with average rewards.

Page 12: Optimal Nudging. Presentation UD.

Better than optimal?

(Old optimal policy)

(Optimal policy) with average reward ρ = 0.08701

pw = 0.48673 (was 0.97222 — 50.06%)

d = 11.17627 (was 84.58333 — 13.21%)

This policy maximizes pw/d

Page 16: Optimal Nudging. Presentation UD.

So, how are average-reward optimal policies found?

Algorithm 1 Generic SMDP solver

Initialize
repeat forever
    Act
    Do RL to find value of current π (usually 1-step Q-learning)
    Update ρ

Average-adjusted Q-learning:

Qt+1(st, at) ← (1 − γt) Qt(st, at) + γt ( rt+1 − ρt ct+1 + max_a Qt(st+1, a) )
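For concreteness, here is a minimal sketch of that update (the tabular Q structure and the per-step inputs are assumptions, not part of the slide):

```python
from collections import defaultdict

# Tabular action values: Q[state][action] -> float.
Q = defaultdict(lambda: defaultdict(float))

def average_adjusted_q_update(Q, s, a, r, c, s_next, rho, gamma_t):
    """One average-adjusted Q-learning step:
    Q(s,a) <- (1 - gamma_t) Q(s,a) + gamma_t (r - rho*c + max_a' Q(s',a'))."""
    next_best = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] = (1.0 - gamma_t) * Q[s][a] + gamma_t * (r - rho * c + next_best)
```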

Page 17: Optimal Nudging. Presentation UD.

Generic Learning Algorithm

Table of algorithms: ARRL

Algorithm: Gain update

AAC (Jalali and Ferguson 1989): ρt+1 ← [ Σ_{i=0}^{t} r(si, πi(si)) ] / (t + 1)

R-Learning (Schwartz 1993): ρt+1 ← (1 − α) ρt + α ( rt+1 + max_a Qt(st+1, a) − max_a Qt(st, a) )

H-Learning (Tadepalli and Ok 1998): ρt+1 ← (1 − αt) ρt + αt ( rt+1 − Ht(st) + Ht(st+1) ), with αt+1 ← αt / (αt + 1)

SSP Q-Learning (Abounadi et al. 2001): ρt+1 ← ρt + αt min_a Qt(s, a)

HAR (Ghavamzadeh and Mahadevan 2007): ρt+1 ← [ Σ_{i=0}^{t} r(si, πi(si)) ] / (t + 1)
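As a sketch (function names are illustrative), the cumulative-average rows and the R-Learning row of the table amount to:

```python
# Cumulative-average gain (AAC / HAR rows): rho <- sum of rewards so far / (t + 1).
def gain_cumulative_average(rewards):
    return sum(rewards) / len(rewards)

# R-Learning gain (Schwartz 1993):
# rho <- (1 - alpha) rho + alpha (r + max_a Q(s', a) - max_a Q(s, a)).
def gain_r_learning(rho, alpha, r, Q, s, s_next):
    return (1.0 - alpha) * rho + alpha * (r + max(Q[s_next].values()) - max(Q[s].values()))
```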

Page 18: Optimal Nudging. Presentation UD.

Generic Learning Algorithm

Table of algorithms: SMDP RL

Algorithm: Gain update

SMART (Das et al. 1999): ρt+1 ← [ Σ_{i=0}^{t} r(si, πi(si)) ] / [ Σ_{i=0}^{t} c(si, πi(si)) ]

MAX-Q (Ghavamzadeh and Mahadevan 2001)
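The SMDP-style gain is the ratio of accumulated reward to accumulated cost; as a sketch:

```python
# SMART-style gain (Das et al. 1999): ratio of total reward to total cost so far.
def gain_ratio(total_reward, total_cost):
    return total_reward / total_cost if total_cost > 0 else 0.0
```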

Page 19: Optimal Nudging. Presentation UD.

Nudging

Algorithm 2 Nudged Learning

Initialize (π, ρ, Q)
repeat
    Set reward scheme to (r − ρc).
    Solve by any RL method.
    Update ρ.
until Qπ(sI) = 0

Note: ‘by any RL method’ refers to a well-studied problem for which better algorithms (both practical and with theoretical guarantees) exist.

ρ can (and will) be updated optimally.
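A minimal sketch of this outer loop, assuming a hypothetical inner solver solve_rl(reward_fn) that returns (policy, Q) for the nudged reward scheme, and a placeholder next_rho(...) standing in for the optimal nudging update developed later:

```python
def nudged_learning(solve_rl, next_rho, s_init, rho=0.0, tol=1e-6):
    """Outer loop of Nudged Learning (Algorithm 2), as sketched above."""
    while True:
        # Nudged reward scheme: every transition pays r - rho * c.
        policy, Q = solve_rl(lambda r, c: r - rho * c)
        if abs(Q[s_init][policy[s_init]]) < tol:   # Q^pi(s_I) = 0: rho is the gain
            return policy, rho
        rho = next_rho(rho, Q, s_init)             # e.g. the optimal nudge
```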

Page 21: Optimal Nudging. Presentation UD.

The w − l space. Definition

(Policy π has expected average reward vπ and expected average cost cπ. Let D be a bound on the absolute value of vπ.)

wπ = (D + vπ) / (2 cπ),   lπ = (D − vπ) / (2 cπ).

[Figure: policies plotted as points in the w–l plane; axes l and w, marked up to D.]
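The mapping itself is just two ratios; a small sketch:

```python
# Map a policy's expected average reward v and expected average cost c
# (with |v| <= D) to the w-l plane, as defined above.
def to_wl(v, c, D):
    w = (D + v) / (2.0 * c)
    l = (D - v) / (2.0 * c)
    return w, l
```

Note that w − l = v / c, so policies of equal gain lie on lines of constant w − l, and w + l = D / c, so cheaper policies sit farther from the origin.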

Page 22: Optimal Nudging. Presentation UD.

The w − l space. Value and Cost

(Policy π has expected average reward vπ and expected average cost cπ. Let D be a bound on the absolute value of vπ.)

wπ = (D + vπ) / (2 cπ),   lπ = (D − vπ) / (2 cπ).

[Figure: level sets of policy value (from −D to D, with −0.5D, 0 and 0.5D marked) and of policy cost in the w–l plane; axes l and w, marked up to D.]

Page 23: Optimal Nudging. Presentation UD.

The w − l space. Nudged value

(Policy π has expected average reward vπ and expected average cost cπ. Let D be a bound on the absolute value of vπ.)

wπ = (D + vπ) / (2 cπ),   lπ = (D − vπ) / (2 cπ).

[Figure: the policy cloud in the w–l plane with level sets of the nudged value, labeled −D/2, 0 and D/2.]

Page 24: Optimal Nudging. Presentation UD.

The w − l space. As a projective transformation.

Page 25: Optimal Nudging. Presentation UD.

The w − l space. As a projective transformation.

[Figure: policy value (between −D and D) plotted against episode length.]

Page 26: Optimal Nudging. Presentation UD.

The w − l space. As a projective transformation.

[Figure: the same plot of policy value against episode length, with the w and l axes overlaid to show the mapping as a projective transformation.]

Page 27: Optimal Nudging. Presentation UD.

Sample task: two states, continuous actions

State s1: actions a1 ∈ [0, 1], reward r1 = 1 + (a1 − 0.5)^2, cost c1 = 1 + a1.

State s2: actions a2 ∈ [0, 1], reward r2 = 1 + a2, cost c2 = 1 + (a2 − 0.5)^2.
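As a sketch (the assumption that the agent alternates s1 → s2 → s1 → ... is mine, not stated on the slide), a stationary policy is a pair (a1, a2) and its gain is the per-cycle reward over the per-cycle cost:

```python
def gain(a1, a2):
    """Gain of policy (a1, a2), assuming s1 and s2 are visited alternately."""
    r = (1 + (a1 - 0.5) ** 2) + (1 + a2)      # r1 + r2 over one cycle
    c = (1 + a1) + (1 + (a2 - 0.5) ** 2)      # c1 + c2 over one cycle
    return r / c
```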

Page 29: Optimal Nudging. Presentation UD.

Sample task: two states, continuous actions

Policy Space (Actions)

[Figure: the policy space is the unit square, with a1 and a2 each ranging over [0, 1].]

Page 30: Optimal Nudging. Presentation UD.

Sample task: two states, continuous actions

Policy Values and Costs

[Figure: policy value plotted against policy cost for all policies; both axes marked up to 4.]

Page 31: Optimal Nudging. Presentation UD.

Sample task: two states, continuous actions

Policy Manifold in w − l

[Figure: the policy manifold in the w–l plane; axes l and w, marked at D/2.]

Page 32: Optimal Nudging. Presentation UD.

And the rest...

Neat geometry, linear problems in w − l. Easily exploited using straightforward algebra / calculus.

Updating the average reward between iterations can be optimized.

Becomes finding the (or rather an) intersection between two conics.

Which can be solved in O(1) time.

Worst case, the uncertainty is cut in half.

Typically much better than that.

Little extra complexity added to methods that are already PAC.

Page 34: Optimal Nudging. Presentation UD.

Thank you! [email protected]

Untitled by Li Wei, School of Design, Oita University, 2009.