Markov Decision Processes · bboots3/ACRL-Spring2019/...
Markov Decision Processes
§ An MDP is defined by:
  § A set of states s ∈ S
  § A set of actions a ∈ A
  § A transition function T(s, a, s')
    § Probability that a from s leads to s', i.e., P(s' | s, a)
    § Also called the model or the dynamics
  § A reward function R(s, a, s')
    § Sometimes just R(s) or R(s')
  § A start state
  § Maybe a terminal state
§ MDPs can be thought of as non-deterministic search problems
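A minimal Python sketch of how these pieces might be stored, assuming a small hypothetical MDP whose states and actions are strings. The `MDP` class, its method names, and the "cool"/"warm"/"done" example are illustrative inventions, not part of the slides. T(s, a, s') is kept as a list of (s', probability) pairs per q-state (s, a), and unlisted rewards default to 0.

```python
from dataclasses import dataclass, field


@dataclass
class MDP:
    """Container for the pieces that define an MDP (hypothetical helper class)."""
    states: list
    actions: dict            # state -> list of actions available in that state
    T: dict                  # (s, a) -> list of (s', P(s' | s, a)) pairs
    R: dict                  # (s, a, s') -> reward; missing entries mean 0
    start: str
    terminals: set = field(default_factory=set)

    def transitions(self, s, a):
        """Successor distribution for taking action a in state s."""
        return self.T[(s, a)]

    def reward(self, s, a, s_next):
        """R(s, a, s'), defaulting to 0 when no reward is listed."""
        return self.R.get((s, a, s_next), 0.0)

    def check(self):
        """Raise if some T(s, a, .) is not a probability distribution."""
        for (s, a), outcomes in self.T.items():
            total = sum(p for _, p in outcomes)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"T({s}, {a}, .) sums to {total}, not 1")


# A made-up three-state example: "fast" from "cool" is a gamble.
mdp = MDP(
    states=["cool", "warm", "done"],
    actions={"cool": ["slow", "fast"], "warm": ["slow"], "done": []},
    T={
        ("cool", "slow"): [("cool", 1.0)],
        ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
        ("warm", "slow"): [("cool", 0.5), ("done", 0.5)],
    },
    R={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "warm"): 2.0,
        ("warm", "slow", "cool"): 1.0,
    },
    start="cool",
    terminals={"done"},
)
mdp.check()  # passes: every listed distribution sums to 1
```

The non-determinism lives entirely in `T`: the same (s, a) can lead to several s', which is what separates this from an ordinary search problem.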
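MDP Search Trees
[Figure: search tree from state s through q-state (s, a) and transition (s, a, s') to successor s']
§ s is a state
§ (s, a) is a q-state
§ (s, a, s') is a transition, with T(s, a, s') = P(s' | s, a)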
Compare to Adversarial Search (Minimax)
§ Deterministic, zero-sum games:
  § Tic-tac-toe, chess, checkers
  § One player maximizes result
  § The other minimizes result
§ Minimax search:
  § A state-space search tree
  § Players alternate turns
  § Compute each node's minimax value: the best achievable utility against a rational (optimal) adversary
[Figure: minimax tree with a max root over two min nodes. Terminal values 8, 2, 5, 6 are part of the game; minimax values (min nodes 2 and 5, max root 5) are computed recursively]
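A short sketch of the recursive minimax computation, assuming trees are encoded as nested Python lists (inner lists are decision nodes, numbers are terminal values). It reproduces the tree in the figure: terminals 8, 2, 5, 6 under two min nodes and a max root. The function name `minimax` and the list encoding are illustrative assumptions.

```python
def minimax(node, maximizing=True):
    """Minimax value of a game tree in which the two players alternate turns.

    A node is either a number (a terminal utility) or a list of child nodes.
    """
    if isinstance(node, (int, float)):
        return node  # terminal value: part of the game
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


tree = [[8, 2], [5, 6]]  # max root over two min nodes
print(minimax(tree))     # 5: the min nodes are worth 2 and 5
```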
Worst-Case vs. Average Case
[Figure: max root over two min nodes; leaf values 10, 10, 9, 100]
Idea: Uncertain outcomes controlled by chance, not an adversary!
Expectimax Search
§ Why wouldn't we know what the result of an action will be?
  § Explicit randomness: rolling dice
  § Unpredictable opponents respond randomly
  § Actions can fail: when moving a robot, wheels might slip
§ Values should now reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes
§ Expectimax search: compute the average score under optimal play
  § Max nodes as in minimax search
  § Chance nodes are like min nodes but the outcome is uncertain
  § Calculate their expected utilities
  § I.e., take the weighted average (expectation) of children
§ MDPs and value iteration formalize this.
[Figure: expectimax trees with a max root over chance nodes; leaf values 10, 4, 5, 7 and 10, 10, 9, 100]
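The same recursion with chance nodes in place of min nodes, as a minimal sketch. The slides give the leaf values 10, 10, 9, 100 but no probabilities, so this assumes the leaves pair up as (10, 10) and (9, 100) and that each outcome is equally likely; the nested-list encoding and the name `expectimax` are also illustrative.

```python
def expectimax(node, is_max=True):
    """Expectimax value: max nodes pick the best child, chance nodes average.

    A terminal is a number; a max node is a list of chance nodes; a chance
    node is a list of (probability, child) pairs whose children are max nodes
    or terminals.
    """
    if isinstance(node, (int, float)):
        return node
    if is_max:
        return max(expectimax(child, is_max=False) for child in node)
    return sum(p * expectimax(child, is_max=True) for p, child in node)


# Leaves 10, 10 and 9, 100, with each outcome assumed equally likely.
tree = [[(0.5, 10), (0.5, 10)], [(0.5, 9), (0.5, 100)]]
print(expectimax(tree))  # 54.5: chance nodes are worth 10.0 and 54.5
```

With the same leaves, minimax assumes the worst and values the root at 10 (min nodes 10 and 9), while expectimax averages and prefers the branch worth 54.5.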
Optimal Quantities
§ The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
§ The value (utility) of a q-state (s, a):
  Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
§ The optimal policy:
  π*(s) = optimal action from state s
[Figure: search tree from state s through q-state (s, a) and transition (s, a, s') to successor s'; s is a state, (s, a) is a q-state, (s, a, s') is a transition]
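Value iteration, mentioned earlier as one way to formalize expectimax-style averaging, can be sketched as repeatedly setting V(s) to max over a of Q(s, a), with Q(s, a) = Σ over s' of T(s, a, s') [R(s, a, s') + γ V(s')]. This sketch assumes a discount factor γ = 0.9, which these slides do not introduce, and a hypothetical two-state MDP; the names `T`, `ACTIONS`, `GAMMA`, `q_value`, and `value_iteration` are illustrative. After convergence it reads off Q* and π*(s) = argmax over a of Q*(s, a).

```python
# Hypothetical MDP: T[(s, a)] lists (s', T(s, a, s'), R(s, a, s')) triples.
T = {
    ("s", "stay"): [("s", 1.0, 1.0)],
    ("s", "go"): [("end", 0.8, 5.0), ("s", 0.2, 0.0)],
}
ACTIONS = {"s": ["stay", "go"], "end": []}  # "end" is terminal: V*(end) = 0
GAMMA = 0.9  # assumed discount factor; not introduced on these slides


def q_value(V, s, a):
    """Expected reward plus discounted successor value under T(s, a, .)."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[(s, a)])


def value_iteration(tol=1e-9):
    """Repeat V(s) <- max_a Q(s, a) until the values stop changing."""
    V = {s: 0.0 for s in ACTIONS}
    while True:
        new_V = {s: max((q_value(V, s, a) for a in acts), default=0.0)
                 for s, acts in ACTIONS.items()}
        if max(abs(new_V[s] - V[s]) for s in V) < tol:
            return new_V
        V = new_V


V_star = value_iteration()
Q_star = {(s, a): q_value(V_star, s, a) for (s, a) in T}
pi_star = {s: max(acts, key=lambda a: Q_star[(s, a)])
           for s, acts in ACTIONS.items() if acts}
print(round(V_star["s"], 6), pi_star)  # 10.0 {'s': 'stay'}
```

Staying forever earns 1 + 0.9 + 0.9² + … = 10, which beats Q*(s, go) = 0.8 · 5 + 0.2 · 0.9 · 10 = 5.8, so the extracted policy is to stay.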
Deterministic Search
[Figure: MDP search tree (s → (s, a) → (s, a, s') → s') next to a deterministic search tree (s → a → s')]
Policies
§ For MDPs, the solution is an optimal policy π*: S → A
  § A policy π gives an action for each state
  § An optimal policy is one that maximizes expected utility if followed
§ In deterministic single-agent search problems, we want an optimal plan, just a sequence of actions, from start to a goal
Example: Traveling in Romania
§ State space:
  § Cities
§ Successor function:
  § Roads: Go to adjacent city with cost = distance
§ Start state:
  § Arad
§ Goal test:
  § Is state == Bucharest?
§ Solution?
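The four components above can be written down as a small search-problem object. This is a sketch on a fragment of the map only: the road lengths are the values usually quoted for this textbook example and should be treated as illustrative, and the names `ROADS`, `RomaniaProblem`, `successors`, `is_goal`, and `plan_cost` are hypothetical. A solution is a plan, i.e. a sequence of cities from Arad to Bucharest.

```python
# Fragment of the Romania road map (distances as in the usual textbook
# version of this example; treat them as illustrative). Roads are two-way.
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "Rimnicu Vilcea"): 80,
    ("Rimnicu Vilcea", "Pitesti"): 97, ("Pitesti", "Bucharest"): 101,
    ("Fagaras", "Bucharest"): 211,
}


class RomaniaProblem:
    """State space, successor function, start state and goal test."""

    def __init__(self, roads, start="Arad", goal="Bucharest"):
        self.start, self.goal = start, goal
        self.neighbors = {}
        for (a, b), dist in roads.items():
            self.neighbors.setdefault(a, {})[b] = dist
            self.neighbors.setdefault(b, {})[a] = dist

    def start_state(self):
        return self.start

    def successors(self, city):
        """(adjacent city, cost) pairs, where cost = road distance."""
        return list(self.neighbors.get(city, {}).items())

    def is_goal(self, city):
        return city == self.goal

    def plan_cost(self, plan):
        """Total distance of a plan: a list of cities beginning at the start."""
        if plan[0] != self.start:
            raise ValueError("a plan must begin at the start state")
        return sum(self.neighbors[a][b] for a, b in zip(plan, plan[1:]))


problem = RomaniaProblem(ROADS)
print(problem.plan_cost(["Arad", "Sibiu", "Fagaras", "Bucharest"]))                    # 450
print(problem.plan_cost(["Arad", "Sibiu", "Rimnicu Vilcea", "Pitesti", "Bucharest"]))  # 418
```

Both plans pass the goal test, but they differ in total cost, which is why the search algorithms below are judged on optimality as well as completeness.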
Searching with a Search Tree
§ Search:
  § Expand out potential plans (tree nodes)
  § Maintain a fringe of partial plans under consideration
  § Try to expand as few tree nodes as possible
General Tree Search
§ Important ideas:
  § Fringe
  § Expansion
  § Exploration strategy
§ Main question: which fringe nodes to explore?
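A minimal sketch of general tree search in which the only thing that changes between strategies is which fringe node is removed next. The graph is a small hypothetical acyclic example (its arcs match the A* example graph later in these slides, with costs dropped and children listed in an arbitrary left-to-right order); `tree_search` and its `strategy` flag are illustrative names. Taking from the end of the fringe (a stack) gives depth-first behavior; taking from the front (a queue) gives breadth-first behavior.

```python
from collections import deque

# Small acyclic example graph (state -> successors, listed left to right).
# Tree search does no cycle checking, so it relies on the graph being acyclic.
GRAPH = {"S": ["a"], "a": ["b", "e", "d"], "b": ["c"], "c": [],
         "e": ["d"], "d": ["G"], "G": []}


def tree_search(graph, start, goal, strategy):
    """General tree search; the fringe holds partial plans (paths from start)."""
    fringe = deque([[start]])
    while fringe:
        # Exploration strategy: which fringe node do we expand next?
        plan = fringe.pop() if strategy == "dfs" else fringe.popleft()
        state = plan[-1]
        if state == goal:
            return plan
        children = graph[state]
        if strategy == "dfs":
            children = reversed(children)  # push right-to-left so the leftmost child pops first
        for child in children:             # expansion
            fringe.append(plan + [child])
    return None


print(tree_search(GRAPH, "S", "G", "dfs"))  # ['S', 'a', 'e', 'd', 'G']
print(tree_search(GRAPH, "S", "G", "bfs"))  # ['S', 'a', 'd', 'G']
```

DFS commits to the leftmost branch and returns a four-arc path through e; BFS returns the three-arc path through d. On a graph with cycles the DFS variant could run forever, since nothing prevents revisiting states.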
Search Algorithm Properties
§ Complete: Guaranteed to find a solution if one exists?
§ Optimal: Guaranteed to find the least cost path?
§ Time complexity?
§ Space complexity?
§ Cartoon of search tree:
  § b is the branching factor
  § m is the maximum depth
  § solutions at various depths
§ Number of nodes in entire tree?
  § 1 + b + b^2 + … + b^m = O(b^m)
[Figure: search tree with branching factor b and m tiers: 1 node, b nodes, b^2 nodes, …, b^m nodes]
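The node count in the last bullet is a geometric series. For b ≥ 2 it has a closed form, and the last tier alone holds more than a fraction (b - 1)/b of all nodes:

```latex
1 + b + b^2 + \dots + b^m = \sum_{d=0}^{m} b^d = \frac{b^{m+1} - 1}{b - 1} = O(b^m) \qquad (b \ge 2)

\frac{b^m}{\sum_{d=0}^{m} b^d} = \frac{b^m (b - 1)}{b^{m+1} - 1} > \frac{b - 1}{b}
```

For example, with b = 10 and m = 3 the tree has 1 + 10 + 100 + 1000 = 1111 = (10^4 - 1)/9 nodes, 1000 of them in the last tier.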
Depth-First Search
[Figure: example state-space graph (states S, a, b, c, d, e, f, h, p, q, r, G) and the search tree DFS explores]
Strategy: expand a deepest node first
Depth-First Search (DFS) Properties
[Figure: search tree with branching factor b and m tiers: 1 node, b nodes, b^2 nodes, …, b^m nodes]
§ What nodes does DFS expand?
  § Some left prefix of the tree.
  § Could process the whole tree!
  § If m is finite, takes time O(b^m)
§ How much space does the fringe take?
  § Only has siblings on path to root, so O(bm)
§ Is it complete?
  § m could be infinite, so only if we prevent cycles
§ Is it optimal?
  § No, it finds the "leftmost" solution, regardless of depth or cost
Breadth-First Search
[Figure: the same example graph and the search tree BFS explores, one search tier at a time]
Strategy: expand a shallowest node first
Breadth-First Search (BFS) Properties
§ What nodes does BFS expand?
  § Processes all nodes above shallowest solution
  § Let depth of shallowest solution be s
  § Search takes time O(b^s)
§ How much space does the fringe take?
  § Has roughly the last tier, so O(b^s)
§ Is it complete?
  § s must be finite if a solution exists, so yes!
§ Is it optimal?
  § Only if costs are all 1 (more on costs later)
[Figure: search tree with branching factor b (1 node, b nodes, b^2 nodes, …, b^m nodes); the shallowest solution lies s tiers down, in a tier of b^s nodes]
Iterative Deepening
§ Idea: get DFS's space advantage with BFS's time / shallow-solution advantages
  § Run a DFS with depth limit 1. If no solution…
  § Run a DFS with depth limit 2. If no solution…
  § Run a DFS with depth limit 3. …
§ Isn't that wastefully redundant?
  § Generally most work happens in the lowest level searched, so not so bad!
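A sketch of iterative deepening with a recursive depth-limited DFS, on a small hypothetical acyclic graph. The function names and the `max_limit` cap are illustrative assumptions. A second helper counts how many nodes IDS generates on a uniform tree with branching factor b in the worst case (every pass runs to completion), to put a number on "not so bad".

```python
GRAPH = {"S": ["a"], "a": ["b", "e", "d"], "b": ["c"], "c": [],
         "e": ["d"], "d": ["G"], "G": []}


def depth_limited_dfs(graph, plan, goal, limit):
    """DFS below plan[-1] that never goes more than `limit` arcs deeper."""
    state = plan[-1]
    if state == goal:
        return plan
    if limit == 0:
        return None
    for child in graph[state]:
        found = depth_limited_dfs(graph, plan + [child], goal, limit - 1)
        if found is not None:
            return found
    return None


def iterative_deepening(graph, start, goal, max_limit=50):
    """Run DFS with depth limit 1, 2, 3, ... until a solution appears."""
    for limit in range(1, max_limit + 1):
        found = depth_limited_dfs(graph, [start], goal, limit)
        if found is not None:
            return found, limit
    return None, None


def ids_tree_nodes(b, d):
    """Worst-case nodes generated by IDS on a uniform tree (limits 1..d)."""
    return sum(sum(b ** k for k in range(limit + 1)) for limit in range(1, d + 1))


print(iterative_deepening(GRAPH, "S", "G"))                    # (['S', 'a', 'd', 'G'], 3)
print(ids_tree_nodes(10, 5), sum(10 ** k for k in range(6)))  # 123455 111111
```

Like BFS, IDS returns the shallowest solution (depth 3 here) while its fringe stays DFS-sized. With b = 10 and depth 5, all the shallower passes add only about 11% (123,455 vs 111,111 nodes).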
Uniform Cost Search
[Figure: the example graph with arc costs, the UCS search tree with each node labeled by its path cost, and the resulting cost contours]
Strategy: expand a cheapest node first
Uniform Cost Search (UCS) Properties
§ What nodes does UCS expand?
  § Processes all nodes with cost less than cheapest solution!
  § If that solution costs C* and arcs cost at least ε, then the "effective depth" is roughly C*/ε
  § Takes time O(b^(C*/ε)) (exponential in effective depth)
§ How much space does the fringe take?
  § Has roughly the last tier, so O(b^(C*/ε))
§ Is it complete?
  § Assuming best solution has a finite cost and minimum arc cost is positive, yes!
§ Is it optimal?
  § Yes! (Proof next via A*)
[Figure: search tree with branching factor b and C*/ε "tiers"; cost contours c ≤ 1, c ≤ 2, c ≤ 3]
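A minimal UCS sketch using Python's `heapq` as the priority queue, ordered by path cost g. The graph is an assumption: its arcs and costs are those of the small A* example graph later in these slides. A counter breaks ties so the heap never has to compare paths, and the goal test is applied when a node is expanded rather than when it is generated, which is what lets UCS return the cheapest path.

```python
import heapq
from itertools import count

# Example graph: state -> list of (successor, arc cost).
GRAPH = {
    "S": [("a", 1)],
    "a": [("b", 1), ("e", 8), ("d", 3)],
    "b": [("c", 1)],
    "c": [],
    "d": [("G", 2)],
    "e": [("d", 1)],
    "G": [],
}


def uniform_cost_search(graph, start, goal):
    """Expand the cheapest fringe node first (priority = path cost g)."""
    tie = count()  # tie-breaker so heapq never compares paths
    fringe = [(0, next(tie), [start])]
    while fringe:
        g, _, plan = heapq.heappop(fringe)
        state = plan[-1]
        if state == goal:  # goal test on expansion keeps the result optimal
            return g, plan
        for child, cost in graph[state]:
            heapq.heappush(fringe, (g + cost, next(tie), plan + [child]))
    return None


print(uniform_cost_search(GRAPH, "S", "G"))  # (6, ['S', 'a', 'd', 'G'])
```

Along the way UCS also expands b (cost 2) and c (cost 3), which lead nowhere, simply because they are cheap.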
Uniform Cost Issues
§ Remember: UCS explores increasing cost contours
§ The good: UCS is complete and optimal!
§ The bad:
  § Explores options in every "direction"
  § No information about goal location
§ We'll fix that with search heuristics!
[Figure: Start and Goal, with UCS cost contours c ≤ 1, c ≤ 2, c ≤ 3 around Start]
Search Heuristics
§ A heuristic is:
  § A function that estimates how close a state is to a goal
  § Designed for a particular search problem
  § Examples: Manhattan distance, Euclidean distance for pathfinding
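[Figure: pathfinding example with horizontal and vertical offsets of 10 and 5 to the goal; straight-line distance 11.2]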
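Both example heuristics are one-liners on grid coordinates. This sketch assumes states are (x, y) tuples and the function names are illustrative; with offsets of 10 and 5, as in the example figure, Manhattan distance is 15 and Euclidean distance is √125 ≈ 11.2.

```python
import math


def manhattan(p, goal):
    """Sum of horizontal and vertical offsets (grid moves only)."""
    return abs(p[0] - goal[0]) + abs(p[1] - goal[1])


def euclidean(p, goal):
    """Straight-line distance."""
    return math.hypot(p[0] - goal[0], p[1] - goal[1])


print(manhattan((0, 0), (10, 5)))            # 15
print(round(euclidean((0, 0), (10, 5)), 1))  # 11.2
```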
Example: Heuristic Function
[Figure: heuristic values h(x)]
Greedy Search
§ Expand the node that seems closest…
§ What can go wrong?
§ Strategy: expand a node that you think is closest to a goal state
  § Heuristic: estimate of distance to nearest goal for each state
§ Worst-case: like a badly-guided DFS
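A sketch of greedy search: the same priority-queue loop as UCS, but ordered by h(n) alone. The three-state graph (S→A cost 1, A→G cost 3, S→G cost 5), the heuristic values, and the name `greedy_search` are hypothetical, chosen so that h never overestimates; greedy still returns the cost-5 path because G has the smallest h, even though S→A→G costs 4.

```python
import heapq
from itertools import count

# Hypothetical graph: state -> list of (successor, arc cost).
GRAPH = {"S": [("A", 1), ("G", 5)], "A": [("G", 3)], "G": []}
H = {"S": 4, "A": 3, "G": 0}  # hypothetical heuristic (never overestimates here)


def greedy_search(graph, h, start, goal):
    """Expand the fringe node with the smallest h(n); path cost is tracked but ignored."""
    tie = count()
    fringe = [(h[start], next(tie), 0, [start])]
    while fringe:
        _, _, g, plan = heapq.heappop(fringe)
        state = plan[-1]
        if state == goal:
            return g, plan
        for child, cost in graph[state]:
            heapq.heappush(fringe, (h[child], next(tie), g + cost, plan + [child]))
    return None


print(greedy_search(GRAPH, H, "S", "G"))  # (5, ['S', 'G']), although S -> A -> G costs 4
```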
Combining UCS and Greedy
§ Uniform-cost orders by path cost, or backward cost g(n)
§ Greedy orders by goal proximity, or forward cost h(n)
§ A* Search orders by the sum: f(n) = g(n) + h(n)
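[Figure: example graph with arcs S→a (1), a→b (1), b→c (1), a→d (3), a→e (8), e→d (1), d→G (2) and heuristic values h(S) = 6, h(a) = 5, h(b) = 6, h(c) = 7, h(d) = 2, h(e) = 1, h(G) = 0]
Example: Teg Grenager
[Figure: search tree with each node labeled by g and h: S (g = 0, h = 6); a (g = 1, h = 5); b (g = 2, h = 6); c (g = 3, h = 7); d (g = 4, h = 2); G (g = 6, h = 0); e (g = 9, h = 1); d (g = 10, h = 2); G (g = 12, h = 0)]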
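A minimal A* tree-search sketch on this example. The arcs and h values are read off the example's g and h labels (for instance, a→e costs 9 - 1 = 8); `a_star` is an illustrative name, and it returns the solution cost, the path, and the order in which states were expanded.

```python
import heapq
from itertools import count

# Arcs and heuristic values of the example graph.
GRAPH = {
    "S": [("a", 1)],
    "a": [("b", 1), ("e", 8), ("d", 3)],
    "b": [("c", 1)],
    "c": [],
    "d": [("G", 2)],
    "e": [("d", 1)],
    "G": [],
}
H = {"S": 6, "a": 5, "b": 6, "c": 7, "d": 2, "e": 1, "G": 0}


def a_star(graph, h, start, goal):
    """A* tree search: always expand the fringe node with the smallest f = g + h."""
    tie = count()  # tie-breaker so heapq never compares paths
    fringe = [(h[start], next(tie), 0, [start])]
    expanded = []
    while fringe:
        _, _, g, plan = heapq.heappop(fringe)
        state = plan[-1]
        if state == goal:  # stop when a goal is expanded, not when it is generated
            return g, plan, expanded
        expanded.append(state)
        for child, cost in graph[state]:
            g_child = g + cost
            heapq.heappush(fringe, (g_child + h[child], next(tie), g_child, plan + [child]))
    return None


print(a_star(GRAPH, H, "S", "G"))  # (6, ['S', 'a', 'd', 'G'], ['S', 'a', 'd'])
```

A* never expands e (f = 9 + 1 = 10), because d and then G are reached with f = 6 first. Greedy search would expand e right after a, since h(e) = 1 is the smallest h on the fringe at that point.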
Is A* Optimal?
§ What went wrong?
  § Actual bad goal cost < estimated good goal cost
  § We need estimates to be less than actual costs!
[Figure: S→A (cost 1), A→G (cost 3), S→G (cost 5); h(S) = 7, h(A) = 6, h(G) = 0]
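Running an A* loop on this three-state graph shows what goes wrong. With the slide's value h(A) = 6, which overestimates the true remaining cost of 3, A* pops G through the direct arc (f = 5 + 0 = 5) before it ever expands A (f = 1 + 6 = 7). The second heuristic is a hypothetical admissible alternative, and the function name `a_star_cost` is illustrative.

```python
import heapq
from itertools import count

GRAPH = {"S": [("A", 1), ("G", 5)], "A": [("G", 3)], "G": []}


def a_star_cost(graph, h, start, goal):
    """Cost of the solution returned by A* tree search (f = g + h)."""
    tie = count()
    fringe = [(h[start], next(tie), 0, start)]
    while fringe:
        _, _, g, state = heapq.heappop(fringe)
        if state == goal:
            return g
        for child, cost in graph[state]:
            heapq.heappush(fringe, (g + cost + h[child], next(tie), g + cost, child))
    return None


slide_h = {"S": 7, "A": 6, "G": 0}       # h(A) = 6 overestimates the true cost 3
admissible_h = {"S": 4, "A": 3, "G": 0}  # hypothetical heuristic that never overestimates

print(a_star_cost(GRAPH, slide_h, "S", "G"))       # 5: the bad goal path S -> G
print(a_star_cost(GRAPH, admissible_h, "S", "G"))  # 4: the optimal path S -> A -> G
```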
Admissible Heuristics
§ A heuristic h is admissible (optimistic) if:
  0 ≤ h(n) ≤ h*(n)
  where h*(n) is the true cost to a nearest goal
§ Coming up with admissible heuristics is most of what's involved in using A* in practice.
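On a small, fully known graph, admissibility can be checked directly by computing h*(n) and comparing. This sketch (all names hypothetical) runs Dijkstra's algorithm backward from the goals over reversed arcs to get h*(n), then tests 0 ≤ h(n) ≤ h*(n) for every state. On real problems h* is exactly what we cannot afford to compute, so this is only a sanity check for toy examples.

```python
import heapq

INF = float("inf")

# Graph from the "Is A* Optimal?" example: state -> list of (successor, cost).
GRAPH = {"S": [("A", 1), ("G", 5)], "A": [("G", 3)], "G": []}


def true_costs(graph, goals):
    """h*(n): cheapest cost from n to any goal (Dijkstra on reversed arcs)."""
    reverse = {s: [] for s in graph}
    for s, arcs in graph.items():
        for t, cost in arcs:
            reverse[t].append((s, cost))
    dist = {g: 0 for g in goals}
    heap = [(0, g) for g in goals]
    heapq.heapify(heap)
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue  # stale queue entry
        for prev, cost in reverse[s]:
            if d + cost < dist.get(prev, INF):
                dist[prev] = d + cost
                heapq.heappush(heap, (d + cost, prev))
    return dist  # states that cannot reach a goal are missing (h* = infinity)


def is_admissible(h, graph, goals):
    """True if 0 <= h(n) <= h*(n) for every state n."""
    h_star = true_costs(graph, goals)
    return all(0 <= h[s] <= h_star.get(s, INF) for s in graph)


print(true_costs(GRAPH, {"G"}))                               # {'G': 0, 'S': 4, 'A': 3}
print(is_admissible({"S": 7, "A": 6, "G": 0}, GRAPH, {"G"}))  # False
print(is_admissible({"S": 4, "A": 3, "G": 0}, GRAPH, {"G"}))  # True
```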
Optimality of A* Tree Search
Assume:
§ A is an optimal goal node
§ B is a suboptimal goal node
§ h is admissible
Claim:
§ A will exit the fringe before B
Optimality of A* Tree Search: Blocking
Proof:
§ Imagine B is on the fringe
§ Some ancestor n of A is on the fringe, too (maybe A!)
§ Claim: n will be expanded before B
1. f(n) is less than or equal to f(A)
   § f(n) = g(n) + h(n) (definition of f-cost)
   § f(n) ≤ g(A) (admissibility of h)
   § g(A) = f(A) (h = 0 at a goal)
2. f(A) is less than f(B)
   § g(A) < g(B) (B is suboptimal)
   § f(A) < f(B) (h = 0 at a goal)
3. n expands before B
   § f(n) ≤ f(A) < f(B)
§ All ancestors of A expand before B
§ A expands before B
§ A* search is optimal
Properties of A*
UCS vs A* Contours
§ Uniform-cost expands equally in all "directions"
§ A* expands mainly toward the goal, but does hedge its bets to ensure optimality
[Figure: Start and Goal with the contours explored by UCS and by A*]
A* Applications
§ Video games
§ Pathing / routing problems
§ Resource planning problems
§ Robot motion planning
§ Language analysis
§ Machine translation
§ Speech recognition
§ …
Robotics Example
A*: Summary
§ A* uses both backward costs and (estimates of) forward costs
§ A* is optimal with admissible / consistent heuristics
§ Heuristic design is key