Transcript of "2-20 value iteration" - Swarthmore College (bryce/cs63/s17/slides/2-20_value_iteration.pdf)
MDPs and Value Iteration
2/20/17
Recall: State Space Search Problems
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a new state
• A set of goal states, often specified as a function
• A way to measure solution quality
What if actions aren't perfect?
• We might not know exactly which next state will result from an action.
• We can model this as a probability distribution over next states.
Search with Non-Deterministic Actions
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a probability distribution over next states (rather than a single new state)
• A set of terminal states (rather than goal states)
• A reward function that gives a utility for each state (rather than a separate measure of solution quality)
Markov Decision Processes (MDPs)
Named after the "Markov property": if you know the state, then you know the transition probabilities.
• We still represent states and actions.
• Actions no longer lead to a single next state.
• Instead they lead to one of several possible states, determined randomly.
• We're now working with utilities instead of goals.
• Expected utility works well for handling randomness.
• We need to plan for unintended consequences.
• Even an optimal agent may run forever!
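For example, if one action leads to an outcome worth +10 with probability .8 and an outcome worth -5 with probability .2, its expected utility is .8 · 10 + .2 · (-5) = 7; a rational agent compares actions by this quantity when outcomes are random.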
State Space Search vs. MDPs

State Space Search:
• States: S
• Actions: A_s
• Transition function: F(s, a) = s'
• Start ∈ S
• Goals ⊂ S
• Action Costs: C(a)

MDPs:
• States: S
• Actions: A_s
• Transition probabilities: P(s' | s, a)
• Start ∈ S
• Terminal ⊂ S
• State Rewards: R(s) (can also have action costs C(a))
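As a rough sketch of how the MDP column above might be represented in code (the class and field names here are illustrative assumptions, not something defined in the slides):

from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

State = Tuple[int, int]    # e.g. grid coordinates
Action = str               # e.g. 'u', 'd', 'l', 'r'

@dataclass
class MDP:
    states: List[State]                                                # S
    actions: Callable[[State], List[Action]]                           # A_s
    transitions: Callable[[State, Action], List[Tuple[float, State]]]  # P(s' | s, a)
    reward: Callable[[State], float]                                   # R(s)
    start: State                                                       # Start ∈ S
    terminals: Set[State]                                              # Terminal ⊂ S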
We can't rely on a single plan!
Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.

Instead of searching for a plan, we devise a policy.

A policy is a function that maps states to actions.
• For each state we could end up in, the policy tells us which action to take.
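In code, a policy can be as simple as a lookup table from states to actions (a minimal sketch with made-up state and action names):

policy = {'s0': 'left', 's1': 'right', 's2': 'right'}  # one action for every state we might reach
action = policy['s1']                                   # whichever state we land in, the policy says what to do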
A simple example: GridWorld

[Grid figure: a 4x3 GridWorld with a 'start' cell, a blocked cell, and terminal 'end' cells worth +1 at (3,2) and -1 at (3,1).]

If actions were deterministic, we could solve this with state space search.
• (3,2) would be a goal state
• (3,1) would be a dead end
A simple example: GridWorld

[Same GridWorld figure as above.]

• Suppose instead that the move we try to make only works correctly 80% of the time.
• 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
• If a move is impossible, we stay in place.
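For instance, using the (column, row) coordinates of the slides with (0,0) at the bottom left: from (2,0), attempting to move up lands in (2,1) with probability .8, in (1,0) with probability .1, and in (3,0) with probability .1.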
A simple example: GridWorld

[Same GridWorld figure as above.]

• Before, we had two equally-good alternatives.
• Which path is better when actions are uncertain?
• What should we do if we find ourselves in (2,1)?
Discount Factor
Specifies how impatient the agent is.

Key idea: reward now is better than reward later.

• Rewards in the future are exponentially decayed.
• Reward t steps in the future is discounted by γ^t.
• Why do we need a discount factor?

U = Σ_t γ^t · R_t
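For example, with γ = .9 a reward of +1 received three steps from now contributes .9³ ≈ .73 to U, while the same +1 received immediately contributes the full 1; the smaller γ is, the more impatient the agent.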
Value of a State
• To come up with an optimal policy, we start by determining a value for each state.
• The value of a state is reward now, plus discounted future reward:

V(s) = R(s) + γ · [future value]

• We assume we'll do the best thing in the future.
Future Value
• If we know the values of other states, we can calculate the expected value of each action:

E(s, a) = Σ_{s'} P(s' | s, a) · V(s')

• Future value is the expected value of the best action:

future value = max_a E(s, a)
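A minimal sketch of these two formulas in code, assuming a hypothetical transition function P(s, a) that returns (probability, next state) pairs and a dict V of current state values:

def expected_value(s, a, V, P):
    # E(s, a): probability-weighted value of the states that action a can lead to
    return sum(prob * V[ns] for prob, ns in P(s, a))

def future_value(s, actions, V, P):
    # Future value of s: expected value of the best available action
    return max(expected_value(s, a, V, P) for a in actions)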
Value Iteration
• The value of state s depends on the values of other states s'.
• The value of s' may depend on the value of s.

We can iteratively approximate the values using dynamic programming.
• Initialize all values to the immediate rewards.
• Update values based on the best next-state.
• Repeat until convergence (values don't change).
Value Iteration Pseudocode

values = {state: R(state) for each state}
until values don't change:
    prev = copy of values
    for each state s:
        initialize best_EV
        for each action:
            EV = 0
            for each next state ns:
                EV += prob * prev[ns]
            best_EV = max(EV, best_EV)
        values[s] = R(s) + gamma * best_EV
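Below is a runnable Python sketch of this pseudocode on the GridWorld example. The setup is an assumption read off the slides that follow: a 4x3 grid in (column, row) coordinates with (0,0) at the bottom left, a wall at (1,1), terminals at (3,2) = +1 and (3,1) = -1, zero reward elsewhere, discount .9, and the 80/10/10 slip model described earlier.

GAMMA = 0.9
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}                              # the blank cell in the slides' grid
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}      # the 'end' states
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT) if (x, y) not in WALLS]

MOVES = {'u': (0, 1), 'd': (0, -1), 'l': (-1, 0), 'r': (1, 0)}
PERP = {'u': ('l', 'r'), 'd': ('l', 'r'), 'l': ('u', 'd'), 'r': ('u', 'd')}

def R(s):
    # Reward: +1 / -1 in the terminal states, 0 everywhere else
    return TERMINALS.get(s, 0.0)

def move(s, a):
    # Deterministic effect of one move; stay in place if it would hit the wall or leave the grid
    x, y = s
    dx, dy = MOVES[a]
    nx, ny = x + dx, y + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return s
    return (nx, ny)

def transitions(s, a):
    # 80% intended direction, 10% each perpendicular direction
    side1, side2 = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, side1)), (0.1, move(s, side2))]

def value_iteration(tol=1e-6):
    values = {s: R(s) for s in STATES}                 # initialize to immediate rewards
    while True:
        prev = dict(values)                            # values from the previous iteration
        for s in STATES:
            if s in TERMINALS:                         # terminal values stay fixed at R(s)
                continue
            best_EV = max(sum(p * prev[ns] for p, ns in transitions(s, a))
                          for a in MOVES)
            values[s] = R(s) + GAMMA * best_EV
        if max(abs(values[s] - prev[s]) for s in STATES) < tol:
            return values

if __name__ == '__main__':
    V = value_iteration()
    for y in reversed(range(HEIGHT)):                  # print the top row first, like the slides
        print('  '.join('  wall' if (x, y) in WALLS else f'{V[(x, y)]:+.2f}'
                        for x in range(WIDTH)))

Run as a script, it should print values close to the converged grid shown at the end of these slides.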
Value Iteration on GridWorld
discount = .9

Initial values (wall cell marked):
   0      0      0     +1
   0    wall     0     -1
   0      0      0      0

V(3,0) = 0 + γ · max[ E((3,0), u), E((3,0), d), E((3,0), l), E((3,0), r) ]
V(2,1) = 0 + γ · max[ E((2,1), u), E((2,1), d), E((2,1), l), E((2,1), r) ]
V(2,2) = 0 + γ · max[ E((2,2), u), E((2,2), d), E((2,2), l), E((2,2), r) ]
Value Iteration on GridWorld
discount = .9

Values after one iteration:
   0      0     .72    +1
   0    wall     0     -1
   0      0      0      0

V(3,0) = γ · max[ .8·(-1) + .1·0 + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·0 + .1·0 + .1·(-1),
                  .8·0 + .1·(-1) + .1·0 ]

V(2,1) = γ · max[ .8·0 + .1·0 + .1·(-1),
                  .8·0 + .1·(-1) + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·(-1) + .1·0 + .1·0 ]

V(2,2) = γ · max[ .8·0 + .1·0 + .1·1,
                  .8·0 + .1·1 + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·1 + .1·0 + .1·0 ]
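Evaluating these: the best option for (3,0) and for (2,1) is still 0, since every action that risks sliding into -1 is dominated by one that does not, while the best option for (2,2) is .8 · 1 = .8, giving V(2,2) = .9 · .8 = .72, the value shown in the grid above.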
Value Iteration on GridWorld
discount = .9

Values after the next iteration:
   0    .5184  .7848   +1
   0    wall   .4284   -1
   0      0      0      0

Exercise: Continue value iteration.
What do we do with the values?
When values have converged, the optimal policy is to select the action with the highest expected value at each state.
• What should we do if we find ourselves in (2,1)?
Converged values:
  .64    .74    .85    +1
  .57    wall   .57    -1
  .49    .43    .48    .28
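As a final sketch, reusing the hypothetical STATES, TERMINALS, MOVES, and transitions helpers from the value iteration code above, extracting the greedy policy from converged values might look like this:

def greedy_policy(values):
    # In each non-terminal state, pick the action with the highest expected value
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        policy[s] = max(MOVES, key=lambda a: sum(p * values[ns] for p, ns in transitions(s, a)))
    return policy

With the converged values above, the greedy action in (2,1) works out to be up: .8 · .85 + .1 · .57 + .1 · (-1) ≈ .64 beats every alternative, so heading toward +1 is worth the 10% risk of slipping into -1.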