Transcript of "2-20 value iteration" - Swarthmore College (bryce/cs63/s17/slides/2-20_value_iteration.pdf)
MDPs and Value Iteration
2/20/17
Recall: State Space Search Problems
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a new state
• A set of goal states, often specified as a function
• A way to measure solution quality
What if actions aren't perfect?
• We might not know exactly which next state will result from an action.
• We can model this as a probability distribution over next states.
Search with Non-Deterministic Actions
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a probability distribution over next states (rather than a single new state)
• A set of terminal states (rather than goal states)
• A reward function that gives a utility for each state (rather than a separate measure of solution quality)
Markov Decision Processes (MDPs)
Named after the "Markov property": if you know the state, then you know the transition probabilities.
• We still represent states and actions.
• Actions no longer lead to a single next state.
• Instead they lead to one of several possible states, determined randomly.
• We're now working with utilities instead of goals.
• Expected utility works well for handling randomness.
• We need to plan for unintended consequences.
• Even an optimal agent may run forever!
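For example, if one action leads to an outcome worth +10 with probability .8 and an outcome worth -5 with probability .2, its expected utility is .8 · 10 + .2 · (-5) = 7; a rational agent compares actions by this quantity when outcomes are random.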
State Space Search vs. MDPs

State Space Search:
• States: S
• Actions: A_s
• Transition function: F(s, a) = s'
• Start ∈ S
• Goals ⊂ S
• Action Costs: C(a)

MDPs:
• States: S
• Actions: A_s
• Transition probabilities: P(s' | s, a)
• Start ∈ S
• Terminal ⊂ S
• State Rewards: R(s) (can also have action costs C(a))
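As a rough sketch of how the MDP column above might be represented in code (the class and field names here are illustrative assumptions, not something defined in the slides):

from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

State = Tuple[int, int]    # e.g. grid coordinates
Action = str               # e.g. 'u', 'd', 'l', 'r'

@dataclass
class MDP:
    states: List[State]                                                # S
    actions: Callable[[State], List[Action]]                           # A_s
    transitions: Callable[[State, Action], List[Tuple[float, State]]]  # P(s' | s, a)
    reward: Callable[[State], float]                                   # R(s)
    start: State                                                       # Start ∈ S
    terminals: Set[State]                                              # Terminal ⊂ S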
We can't rely on a single plan!
Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.

Instead of searching for a plan, we devise a policy.

A policy is a function that maps states to actions.
• For each state we could end up in, the policy tells us which action to take.
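In code, a policy can be as simple as a lookup table from states to actions (a minimal sketch with made-up state and action names):

policy = {'s0': 'left', 's1': 'right', 's2': 'right'}  # one action for every state we might reach
action = policy['s1']                                   # whichever state we land in, the policy says what to do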
A simple example: GridWorld

[Grid figure: a 4x3 GridWorld with a 'start' cell, a blocked cell, and terminal 'end' cells worth +1 at (3,2) and -1 at (3,1).]

If actions were deterministic, we could solve this with state space search.
• (3,2) would be a goal state
• (3,1) would be a dead end
A simple example: GridWorld

[Same GridWorld figure as above.]

• Suppose instead that the move we try to make only works correctly 80% of the time.
• 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
• If a move is impossible, we stay in place.
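For instance, using the (column, row) coordinates of the slides with (0,0) at the bottom left: from (2,0), attempting to move up lands in (2,1) with probability .8, in (1,0) with probability .1, and in (3,0) with probability .1.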
A simple example: GridWorld

[Same GridWorld figure as above.]

• Before, we had two equally-good alternatives.
• Which path is better when actions are uncertain?
• What should we do if we find ourselves in (2,1)?
Discount Factor
Specifies how impatient the agent is.

Key idea: reward now is better than reward later.

• Rewards in the future are exponentially decayed.
• Reward t steps in the future is discounted by γ^t.
• Why do we need a discount factor?

U = Σ_t γ^t · R_t
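For example, with γ = .9 a reward of +1 received three steps from now contributes .9³ ≈ .73 to U, while the same +1 received immediately contributes the full 1; the smaller γ is, the more impatient the agent.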
Value of a State
• To come up with an optimal policy, we start by determining a value for each state.
• The value of a state is reward now, plus discounted future reward:

V(s) = R(s) + γ · [future value]

• We assume we'll do the best thing in the future.
Future Value
• If we know the values of other states, we can calculate the expected value of each action:

E(s, a) = Σ_{s'} P(s' | s, a) · V(s')

• Future value is the expected value of the best action:

future value = max_a E(s, a)
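A minimal sketch of these two formulas in code, assuming a hypothetical transition function P(s, a) that returns (probability, next state) pairs and a dict V of current state values:

def expected_value(s, a, V, P):
    # E(s, a): probability-weighted value of the states that action a can lead to
    return sum(prob * V[ns] for prob, ns in P(s, a))

def future_value(s, actions, V, P):
    # Future value of s: expected value of the best available action
    return max(expected_value(s, a, V, P) for a in actions)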
Value Iteration
• The value of state s depends on the values of other states s'.
• The value of s' may depend on the value of s.

We can iteratively approximate the values using dynamic programming.
• Initialize all values to the immediate rewards.
• Update values based on the best next-state.
• Repeat until convergence (values don't change).
Value Iteration Pseudocode

values = {state: R(state) for each state}
until values don't change:
    prev = copy of values
    for each state s:
        initialize best_EV
        for each action:
            EV = 0
            for each next state ns:
                EV += prob * prev[ns]
            best_EV = max(EV, best_EV)
        values[s] = R(s) + gamma * best_EV
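Below is a runnable Python sketch of this pseudocode on the GridWorld example. The setup is an assumption read off the slides that follow: a 4x3 grid in (column, row) coordinates with (0,0) at the bottom left, a wall at (1,1), terminals at (3,2) = +1 and (3,1) = -1, zero reward elsewhere, discount .9, and the 80/10/10 slip model described earlier.

GAMMA = 0.9
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}                              # the blank cell in the slides' grid
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}      # the 'end' states
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT) if (x, y) not in WALLS]

MOVES = {'u': (0, 1), 'd': (0, -1), 'l': (-1, 0), 'r': (1, 0)}
PERP = {'u': ('l', 'r'), 'd': ('l', 'r'), 'l': ('u', 'd'), 'r': ('u', 'd')}

def R(s):
    # Reward: +1 / -1 in the terminal states, 0 everywhere else
    return TERMINALS.get(s, 0.0)

def move(s, a):
    # Deterministic effect of one move; stay in place if it would hit the wall or leave the grid
    x, y = s
    dx, dy = MOVES[a]
    nx, ny = x + dx, y + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return s
    return (nx, ny)

def transitions(s, a):
    # 80% intended direction, 10% each perpendicular direction
    side1, side2 = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, side1)), (0.1, move(s, side2))]

def value_iteration(tol=1e-6):
    values = {s: R(s) for s in STATES}                 # initialize to immediate rewards
    while True:
        prev = dict(values)                            # values from the previous iteration
        for s in STATES:
            if s in TERMINALS:                         # terminal values stay fixed at R(s)
                continue
            best_EV = max(sum(p * prev[ns] for p, ns in transitions(s, a))
                          for a in MOVES)
            values[s] = R(s) + GAMMA * best_EV
        if max(abs(values[s] - prev[s]) for s in STATES) < tol:
            return values

if __name__ == '__main__':
    V = value_iteration()
    for y in reversed(range(HEIGHT)):                  # print the top row first, like the slides
        print('  '.join('  wall' if (x, y) in WALLS else f'{V[(x, y)]:+.2f}'
                        for x in range(WIDTH)))

Run as a script, it should print values close to the converged grid shown at the end of these slides.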
Value Iteration on GridWorld
discount = .9

Initial values (wall cell marked):
   0      0      0     +1
   0    wall     0     -1
   0      0      0      0

V(3,0) = 0 + γ · max[ E((3,0), u), E((3,0), d), E((3,0), l), E((3,0), r) ]
V(2,1) = 0 + γ · max[ E((2,1), u), E((2,1), d), E((2,1), l), E((2,1), r) ]
V(2,2) = 0 + γ · max[ E((2,2), u), E((2,2), d), E((2,2), l), E((2,2), r) ]
Value Iteration on GridWorld
discount = .9

Values after one iteration:
   0      0     .72    +1
   0    wall     0     -1
   0      0      0      0

V(3,0) = γ · max[ .8·(-1) + .1·0 + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·0 + .1·0 + .1·(-1),
                  .8·0 + .1·(-1) + .1·0 ]

V(2,1) = γ · max[ .8·0 + .1·0 + .1·(-1),
                  .8·0 + .1·(-1) + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·(-1) + .1·0 + .1·0 ]

V(2,2) = γ · max[ .8·0 + .1·0 + .1·1,
                  .8·0 + .1·1 + .1·0,
                  .8·0 + .1·0 + .1·0,
                  .8·1 + .1·0 + .1·0 ]
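Evaluating these: the best option for (3,0) and for (2,1) is still 0, since every action that risks sliding into -1 is dominated by one that does not, while the best option for (2,2) is .8 · 1 = .8, giving V(2,2) = .9 · .8 = .72, the value shown in the grid above.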
Value Iteration on GridWorld
discount = .9

Values after the next iteration:
   0    .5184  .7848   +1
   0    wall   .4284   -1
   0      0      0      0

Exercise: Continue value iteration.
What do we do with the values?
When values have converged, the optimal policy is to select the action with the highest expected value at each state.
• What should we do if we find ourselves in (2,1)?
Converged values:
  .64    .74    .85    +1
  .57    wall   .57    -1
  .49    .43    .48    .28
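As a final sketch, reusing the hypothetical STATES, TERMINALS, MOVES, and transitions helpers from the value iteration code above, extracting the greedy policy from converged values might look like this:

def greedy_policy(values):
    # In each non-terminal state, pick the action with the highest expected value
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        policy[s] = max(MOVES, key=lambda a: sum(p * values[ns] for p, ns in transitions(s, a)))
    return policy

With the converged values above, the greedy action in (2,1) works out to be up: .8 · .85 + .1 · .57 + .1 · (-1) ≈ .64 beats every alternative, so heading toward +1 is worth the 10% risk of slipping into -1.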