Markov Decision Processesbboots3/ACRL-Spring2019/... · § Idea: get DFS’s space advantage with...

Markov Decision Processes

§ An MDP is defined by:
  § A set of states s ∈ S
  § A set of actions a ∈ A
  § A transition function T(s, a, s’)
    § Probability that a from s leads to s’, i.e., P(s’ | s, a)
    § Also called the model or the dynamics
  § A reward function R(s, a, s’)
    § Sometimes just R(s) or R(s’)
  § A start state
  § Maybe a terminal state

§ MDPs can be thought of as non-deterministic search problems
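The ingredients listed above can be collected into a small container. The sketch below is only one possible encoding: the class name `MDP`, the dictionary layout, and the two-state "cool"/"hot" example with its probabilities and rewards are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Minimal MDP container (hypothetical layout, for illustration only)."""
    states: set
    actions: set
    transitions: dict   # (s, a) -> {s_next: P(s_next | s, a)}
    rewards: dict       # (s, a, s_next) -> R(s, a, s_next)
    start: str
    terminals: frozenset = frozenset()

    def T(self, s, a, s_next):
        # Transition function T(s, a, s') = P(s' | s, a); 0 if unlisted.
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s, a, s_next):
        # Reward function R(s, a, s'); 0 if unlisted.
        return self.rewards.get((s, a, s_next), 0.0)

    def is_well_formed(self):
        # Every listed (s, a) pair must define a probability distribution.
        return all(abs(sum(dist.values()) - 1.0) < 1e-9
                   for dist in self.transitions.values())


# Made-up example: a machine that can run "slow" or "fast".
toy = MDP(
    states={"cool", "hot"},
    actions={"slow", "fast"},
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
    },
    start="cool",
)
```

Storing T as a nested dictionary keeps P(s’ | s, a) lookups cheap and makes the "non-deterministic search" view explicit: each (s, a) pair has several possible successors rather than one.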

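MDP Search Trees

[Figure: search tree fragment in which state s leads via action a to the q-state (s, a), which leads via a transition (s, a, s’) to the successor state s’.]

§ s is a state
§ (s, a) is a q-state
§ (s, a, s’) is a transition, with T(s, a, s’) = P(s’ | s, a)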

Compare to Adversarial Search (Minimax)

§ Deterministic, zero-sum games:
  § Tic-tac-toe, chess, checkers
  § One player maximizes result
  § The other minimizes result

§ Minimax search:
  § A state-space search tree
  § Players alternate turns
  § Compute each node’s minimax value: the best achievable utility against a rational (optimal) adversary

[Figure: two-ply minimax tree. Terminal values 8, 2, 5, 6 are part of the game; the minimax values (min nodes 2 and 5, max root 5) are computed recursively.]
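The recursive computation can be sketched in a few lines. The nested-list encoding of the tree (a leaf is a number, an internal node is a list of children) and the function name `minimax` are assumptions made for this illustration; the terminal values 8, 2, 5, 6 are the ones shown in the figure.

```python
def minimax(node, maximizing=True):
    """Return the minimax value of a game tree given as nested lists."""
    if isinstance(node, (int, float)):
        return node                      # terminal value: part of the game
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Max root over two min nodes, as in the figure.
tree = [[8, 2], [5, 6]]
min_values = [minimax(child, maximizing=False) for child in tree]
root_value = minimax(tree)
print(min_values, root_value)            # [2, 5] 5
```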

Worst-Case vs. Average Case

[Figure: tree with a max root over min nodes and terminal values 10, 10, 9, 100.]

Idea: Uncertain outcomes controlled by chance, not an adversary!

Expectimax Search

§ Why wouldn’t we know what the result of an action will be?
  § Explicit randomness: rolling dice
  § Unpredictable opponents respond randomly
  § Actions can fail: when moving a robot, wheels might slip

§ Values should now reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes

§ Expectimax search: compute the average score under optimal play
  § Max nodes as in minimax search
  § Chance nodes are like min nodes but the outcome is uncertain
  § Calculate their expected utilities
  § I.e. take weighted average (expectation) of children

§ MDPs and value iteration formalize this.

[Figure: expectimax tree with a max root over chance nodes; values shown include 10, 4, 5, 7 and terminal values 10, 10, 9, 100.]
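As a rough illustration of the difference between worst-case and average-case values, the sketch below evaluates the terminal values 10, 10, 9, 100 both ways. The tree encoding, the function name `expectimax`, and the uniform 0.5/0.5 outcome probabilities are assumptions; the slides do not state the probabilities.

```python
def expectimax(node, is_max=True):
    """Value of a tree that alternates max nodes and chance nodes.

    A leaf is a number, a max node is a list of chance nodes, and a
    chance node is a list of (probability, child) pairs.
    """
    if isinstance(node, (int, float)):
        return node
    if is_max:
        return max(expectimax(child, is_max=False) for child in node)
    # Chance node: weighted average (expectation) of the children.
    return sum(p * expectimax(child, is_max=True) for p, child in node)


# Assumed uniform probabilities over each chance node's two outcomes.
tree = [
    [(0.5, 10), (0.5, 10)],
    [(0.5, 9), (0.5, 100)],
]
average_case = expectimax(tree)                      # max(10.0, 54.5)
worst_case = max(min(10, 10), min(9, 100))           # minimax on same leaves
print(average_case, worst_case)                      # 54.5 10
```

Under the minimax view the right branch is worth only 9, so the left branch wins; under the expectimax view the right branch averages 54.5 and becomes the better choice.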

Optimal Quantities

§ The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally

§ The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

§ The optimal policy: π*(s) = optimal action from state s

[Figure: MDP search tree fragment. s is a state, (s, a) is a q-state, and (s, a, s’) is a transition.]
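The three quantities above are linked by the standard Bellman optimality relations, written here with the T and R defined earlier. The discount factor γ does not appear on these slides and is an assumption of this sketch; take γ = 1 for the undiscounted case.

```latex
\begin{align*}
V^*(s)    &= \max_{a} Q^*(s, a) \\
Q^*(s, a) &= \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right] \\
\pi^*(s)  &= \arg\max_{a} Q^*(s, a)
\end{align*}
```

Read against the tree: a state (max node) takes the best of its q-states, and a q-state (chance node) averages over its transitions, exactly as in expectimax.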

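Deterministic Search

[Figure: the MDP tree fragment (s, a, (s, a), (s, a, s’), s’) shown next to its deterministic counterpart, in which action a leads from s directly to s’.]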

Policies

§ For MDPs, the solution is an optimal policy π*: S → A
  § A policy π gives an action for each state
  § An optimal policy is one that maximizes expected utility if followed

§ In deterministic single-agent search problems, we want an optimal plan: just a sequence of actions, from start to a goal

Example: Traveling in Romania

§ State space:
  § Cities

§ Successor function:
  § Roads: Go to adjacent city with cost = distance

§ Start state:
  § Arad

§ Goal test:
  § Is state == Bucharest?

§ Solution?

Searching with a Search Tree

§ Search:
  § Expand out potential plans (tree nodes)
  § Maintain a fringe of partial plans under consideration
  § Try to expand as few tree nodes as possible

General Tree Search

§ Important ideas:
  § Fringe
  § Expansion
  § Exploration strategy

§ Main question: which fringe nodes to explore?
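A minimal sketch of general tree search, with the exploration strategy passed in as a function that picks which fringe entry to expand next. The names `tree_search`, `deepest_first`, `shallowest_first`, and the small example tree are hypothetical; real implementations usually use a stack, queue, or priority queue instead of scanning a list.

```python
def tree_search(start, successors, is_goal, choose):
    """Generic tree search; `choose(fringe)` returns the index to expand."""
    fringe = [[start]]                 # each entry is a partial plan (a path)
    while fringe:
        path = fringe.pop(choose(fringe))
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in successors(state):  # expansion
            fringe.append(path + [nxt])
    return None


def deepest_first(fringe):
    # DFS-like strategy: longest partial plan first (ties: earliest added).
    return max(range(len(fringe)), key=lambda i: len(fringe[i]))


def shallowest_first(fringe):
    # BFS-like strategy: shortest partial plan first (ties: earliest added).
    return min(range(len(fringe)), key=lambda i: len(fringe[i]))


# Made-up tree with a deep goal on the left and a shallow one on the right.
tree = {"S": ["a", "b"], "a": ["c"], "c": ["G"], "b": ["G"], "G": []}


def successors(state):
    return tree[state]


def is_goal(state):
    return state == "G"


deep_plan = tree_search("S", successors, is_goal, deepest_first)
shallow_plan = tree_search("S", successors, is_goal, shallowest_first)
print(deep_plan, shallow_plan)   # ['S', 'a', 'c', 'G'] ['S', 'b', 'G']
```

The only thing that changes between the two runs is the answer to the main question: which fringe node to explore.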

Search Algorithm Properties

§ Complete: Guaranteed to find a solution if one exists?
§ Optimal: Guaranteed to find the least cost path?
§ Time complexity?
§ Space complexity?

§ Cartoon of search tree:
  § b is the branching factor
  § m is the maximum depth
  § solutions at various depths

§ Number of nodes in entire tree?
  § 1 + b + b^2 + … + b^m = O(b^m)

[Figure: search tree cartoon with m tiers: 1 node at the root, then b nodes, b^2 nodes, …, b^m nodes.]

Depth-First Search

Strategy: expand a deepest node first

[Figure: example state graph over the nodes S, a, b, c, d, e, f, h, p, q, r, G, and the corresponding search tree, with the nodes expanded by DFS highlighted.]

Depth-First Search (DFS) Properties

§ What nodes does DFS expand?
  § Some left prefix of the tree.
  § Could process the whole tree!
  § If m is finite, takes time O(b^m)

§ How much space does the fringe take?
  § Only has siblings on path to root, so O(b·m)

§ Is it complete?
  § m could be infinite, so only if we prevent cycles

§ Is it optimal?
  § No, it finds the “leftmost” solution, regardless of depth or cost

[Figure: search tree cartoon with m tiers: 1 node, b nodes, b^2 nodes, …, b^m nodes.]
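A short sketch of depth-first search that prevents cycles by refusing to revisit a state already on the current path. The function name `dfs` and the small graph (which contains the cycle S → a → S) are made up for illustration.

```python
def dfs(graph, state, goal, path=None):
    """Recursive depth-first search; returns the first path found or None."""
    path = (path or []) + [state]
    if state == goal:
        return path
    for nxt in graph.get(state, []):
        if nxt in path:
            continue                  # cycle prevention along the current path
        found = dfs(graph, nxt, goal, path)
        if found is not None:
            return found
    return None


# Made-up graph: the left branch is deeper, the right branch is shallower.
graph = {"S": ["a", "b"], "a": ["S", "c"], "c": ["G"], "b": ["G"]}
leftmost = dfs(graph, "S", "G")
print(leftmost)                       # ['S', 'a', 'c', 'G']
```

The result is the leftmost solution, three steps long, even though a two-step solution S → b → G exists. Only the current path lives on the call stack at any time, which is where the small space footprint comes from.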

Breadth-First Search

Strategy: expand a shallowest node first

[Figure: the same example state graph and search tree, with the tree grouped into search tiers expanded level by level.]

Breadth-First Search (BFS) Properties

§ What nodes does BFS expand?
  § Processes all nodes above shallowest solution
  § Let depth of shallowest solution be s
  § Search takes time O(b^s)

§ How much space does the fringe take?
  § Has roughly the last tier, so O(b^s)

§ Is it complete?
  § s must be finite if a solution exists, so yes!

§ Is it optimal?
  § Only if costs are all 1 (more on costs later)

[Figure: search tree cartoon: 1 node, b nodes, b^2 nodes, …, b^m nodes, with the first s tiers (down to the tier of b^s nodes) highlighted.]
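For comparison, a breadth-first sketch using a FIFO queue as the fringe. The function name `bfs` and the small graph are hypothetical; the path-membership check is one simple way to avoid walking around cycles.

```python
from collections import deque


def bfs(graph, start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    fringe = deque([[start]])         # FIFO queue of partial plans
    while fringe:
        path = fringe.popleft()       # shallowest partial plan first
        state = path[-1]
        if state == goal:
            return path
        for nxt in graph.get(state, []):
            if nxt not in path:
                fringe.append(path + [nxt])
    return None


# Made-up graph: the left branch is deeper, the right branch is shallower.
graph = {"S": ["a", "b"], "a": ["S", "c"], "c": ["G"], "b": ["G"]}
shallowest = bfs(graph, "S", "G")
print(shallowest)                     # ['S', 'b', 'G']
```

The queue holds every partial plan in the current tier, which is why the fringe grows to roughly the size of the last tier.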

Iterative Deepening

§ Idea: get DFS’s space advantage with BFS’s time / shallow-solution advantages
  § Run a DFS with depth limit 1. If no solution…
  § Run a DFS with depth limit 2. If no solution…
  § Run a DFS with depth limit 3. …

§ Isn’t that wastefully redundant?
  § Generally most work happens in the lowest level searched, so not so bad!
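The repeated depth-limited runs can be sketched as follows. The function names, the `max_depth` cap, and the example graph are assumptions made for this illustration.

```python
def depth_limited(graph, path, goal, limit):
    """DFS that gives up once `limit` further steps have been used."""
    state = path[-1]
    if state == goal:
        return path
    if limit == 0:
        return None
    for nxt in graph.get(state, []):
        if nxt not in path:
            found = depth_limited(graph, path + [nxt], goal, limit - 1)
            if found is not None:
                return found
    return None


def iterative_deepening(graph, start, goal, max_depth=50):
    """Run DFS with depth limit 1, 2, 3, ... until a solution appears."""
    for limit in range(1, max_depth + 1):
        found = depth_limited(graph, [start], goal, limit)
        if found is not None:
            return found, limit
    return None, None


# Made-up graph: the left branch is deeper, the right branch is shallower.
graph = {"S": ["a", "b"], "a": ["S", "c"], "c": ["G"], "b": ["G"]}
plan, depth_used = iterative_deepening(graph, "S", "G")
print(plan, depth_used)               # ['S', 'b', 'G'] 2
```

Each pass is a plain DFS, so memory stays proportional to the current path, while the increasing limit guarantees that the shallowest solution is found first.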

Uniform Cost Search

Strategy: expand a cheapest node first

[Figure: the example state graph with edge costs, and the corresponding search tree with the cumulative path cost at each node; nodes are expanded in order of increasing cost, forming cost contours.]

Uniform Cost Search (UCS) Properties

§ What nodes does UCS expand?
  § Processes all nodes with cost less than cheapest solution!
  § If that solution costs C* and arcs cost at least ε, then the “effective depth” is roughly C*/ε
  § Takes time O(b^(C*/ε)) (exponential in effective depth)

§ How much space does the fringe take?
  § Has roughly the last tier, so O(b^(C*/ε))

§ Is it complete?
  § Assuming best solution has a finite cost and minimum arc cost is positive, yes!

§ Is it optimal?
  § Yes! (Proof next via A*)

[Figure: search tree cartoon with branching factor b and C*/ε “tiers”, overlaid with cost contours c ≤ 1, c ≤ 2, c ≤ 3.]
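A sketch of uniform cost search with a priority queue keyed on path cost, run on a fragment of the Romania example. The function name and the one-directional edge dictionary are assumptions; the road distances are the ones from the commonly used textbook map and should be treated as illustrative.

```python
import heapq


def uniform_cost_search(graph, start, goal):
    """Expand the cheapest partial plan first; returns (cost, path) or None."""
    fringe = [(0, [start])]           # priority queue ordered by path cost
    while fringe:
        cost, path = heapq.heappop(fringe)
        state = path[-1]
        if state == goal:
            return cost, path
        for nxt, step in graph.get(state, {}).items():
            if nxt not in path:
                heapq.heappush(fringe, (cost + step, path + [nxt]))
    return None


# Fragment of the Romania map (edges listed in one direction only).
roads = {
    "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
    "Sibiu": {"Fagaras": 99, "Rimnicu Vilcea": 80},
    "Fagaras": {"Bucharest": 211},
    "Rimnicu Vilcea": {"Pitesti": 97},
    "Pitesti": {"Bucharest": 101},
}
best_cost, best_route = uniform_cost_search(roads, "Arad", "Bucharest")
print(best_cost, best_route)
# 418 ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest']
```

The route through Fagaras has fewer roads but costs 140 + 99 + 211 = 450, so it reaches the queue with a higher priority value and is never returned. The goal test happens when a plan is popped, not when it is pushed, which is what makes the cheaper four-road route win.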

Uniform Cost Issues

§ Remember: UCS explores increasing cost contours

§ The good: UCS is complete and optimal!

§ The bad:
  § Explores options in every “direction”
  § No information about goal location

§ We’ll fix that with search heuristics!

[Figure: cost contours c ≤ 1, c ≤ 2, c ≤ 3 spreading out evenly around Start, with Goal off to one side.]

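Search Heuristics

§ A heuristic is:
  § A function that estimates how close a state is to a goal
  § Designed for a particular search problem
  § Examples: Manhattan distance, Euclidean distance for pathfinding

[Figure: pathfinding example with distances labeled 10, 5, and 11.2.]

Example: Heuristic Function

[Figure: heuristic function h(x).]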
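The two example heuristics are easy to write down for grid or planar pathfinding. In the sketch below the coordinates are hypothetical; an offset of 10 by 5 was chosen because it reproduces the 10, 5, and 11.2 labels that survive from the figure, which is an interpretation rather than something the slides state.

```python
import math


def manhattan(p, q):
    """Sum of horizontal and vertical offsets between points p and q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


state, goal = (0, 0), (10, 5)         # assumed positions
print(manhattan(state, goal))                 # 15
print(round(euclidean(state, goal), 1))       # 11.2
```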

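Greedy Search

§ Expand the node that seems closest…

§ What can go wrong?

Greedy Search

§ Strategy: expand a node that you think is closest to a goal state
  § Heuristic: estimate of distance to nearest goal for each state

§ Worst-case: like a badly-guided DFS

[Figure: search tree cartoons with branching factor b, showing greedy search heading straight for a goal in the good case and wandering like DFS in the worst case.]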
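One way to see what can go wrong is to order the fringe by h alone and watch the path cost. The function name, graph, edge costs, and heuristic values below are all made up for this illustration.

```python
import heapq


def greedy_search(graph, h, start, goal):
    """Expand the node with the smallest heuristic value h; ignores path cost."""
    fringe = [(h[start], 0, [start])]           # (h, cost so far, path)
    while fringe:
        _, cost, path = heapq.heappop(fringe)
        state = path[-1]
        if state == goal:
            return cost, path
        for nxt, step in graph.get(state, {}).items():
            if nxt not in path:
                heapq.heappush(fringe, (h[nxt], cost + step, path + [nxt]))
    return None


# Made-up problem: "a" looks close to the goal but its last edge is expensive.
graph = {"S": {"a": 1, "b": 1}, "a": {"G": 10}, "b": {"c": 1}, "c": {"G": 1}}
h = {"S": 3, "a": 1, "b": 2, "c": 1, "G": 0}
greedy_cost, greedy_path = greedy_search(graph, h, "S", "G")
print(greedy_cost, greedy_path)       # 11 ['S', 'a', 'G']
```

Greedy search commits to a because h(a) is smallest and returns a plan of cost 11, although S → b → c → G costs only 3.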

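Combining UCS and Greedy

§ Uniform-cost orders by path cost, or backward cost g(n)
§ Greedy orders by goal proximity, or forward cost h(n)

§ A* Search orders by the sum: f(n) = g(n) + h(n)

[Figure: example graph (example credit: Teg Grenager) over the nodes S, a, b, c, d, e, G, with heuristic values h(S)=6, h(a)=5, h(b)=6, h(c)=7, h(d)=2, h(e)=1, h(G)=0 and edge costs 1, 1, 1, 1, 2, 3, 8.]

[Figure: the corresponding search tree, each node labeled with its backward cost g and heuristic h:]
  S: g=0, h=6
  a: g=1, h=5
  b: g=2, h=6
  c: g=3, h=7
  d: g=4, h=2
  G: g=6, h=0
  e: g=9, h=1
  d: g=10, h=2
  G: g=12, h=0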
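A sketch of A* tree search that orders the fringe by f(n) = g(n) + h(n). The heuristic values are those shown in the example; the edge assignments (S–a 1, a–b 1, b–c 1, a–d 3, a–e 8, e–d 1, d–G 2) are an assumption, reconstructed so that the path costs match the g-values in the search tree. The function name is hypothetical.

```python
import heapq


def a_star(graph, h, start, goal):
    """A* tree search; returns (path cost, path) or None."""
    fringe = [(h[start], 0, [start])]            # (f = g + h, g, path)
    while fringe:
        f, g, path = heapq.heappop(fringe)
        state = path[-1]
        if state == goal:
            return g, path
        for nxt, step in graph.get(state, {}).items():
            if nxt not in path:
                g_next = g + step
                heapq.heappush(fringe,
                               (g_next + h[nxt], g_next, path + [nxt]))
    return None


# Edge costs are a reconstruction (see lead-in); h values are from the example.
graph = {
    "S": {"a": 1},
    "a": {"b": 1, "d": 3, "e": 8},
    "b": {"c": 1},
    "d": {"G": 2},
    "e": {"d": 1},
}
h = {"S": 6, "a": 5, "b": 6, "c": 7, "d": 2, "e": 1, "G": 0}
cost, path = a_star(graph, h, "S", "G")
print(cost, path)                     # 6 ['S', 'a', 'd', 'G']
```

After a is expanded the fringe holds b (f = 8), d (f = 6), and e (f = 10). Greedy ordering would prefer e because h(e) = 1 is smallest; A* prefers d because the sum g + h is smallest, and reaches G with g = 6.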

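Is A* Optimal?

[Figure: graph with edges S → A (cost 1), A → G (cost 3), and S → G (cost 5), and heuristic values h(S) = 7, h(A) = 6, h(G) = 0.]

§ What went wrong?
  § Actual bad goal cost < estimated good goal cost
  § We need estimates to be less than actual costs!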

Admissible Heuristics

§ A heuristic h is admissible (optimistic) if:

    0 ≤ h(n) ≤ h*(n)

  where h*(n) is the true cost to a nearest goal

§ Coming up with admissible heuristics is most of what’s involved in using A* in practice.
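Admissibility can be checked mechanically on a small graph by computing the true cost-to-goal h*(n) and comparing it with h(n). The sketch below uses Dijkstra's algorithm on the reversed graph to obtain h*, which is a convenience chosen here and not something the slides prescribe. The function names are hypothetical; the graph (S → A cost 1, A → G cost 3, S → G cost 5) and heuristic values (h(S) = 7, h(A) = 6, h(G) = 0) are read off the "Is A* Optimal?" figure and should be treated as an assumption.

```python
import heapq


def true_costs_to_goal(graph, goal):
    """h*(n) for every state that can reach the goal (Dijkstra, reversed edges)."""
    reverse = {}
    for u, edges in graph.items():
        for v, cost in edges.items():
            reverse.setdefault(v, {})[u] = cost
    dist = {goal: 0}
    heap = [(0, goal)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                              # stale queue entry
        for v, cost in reverse.get(u, {}).items():
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist


def inadmissible_states(graph, h, goal):
    """States where 0 <= h(n) <= h*(n) fails."""
    h_star = true_costs_to_goal(graph, goal)
    return sorted(s for s in h_star if not (0 <= h[s] <= h_star[s]))


graph = {"S": {"A": 1, "G": 5}, "A": {"G": 3}}
h_bad = {"S": 7, "A": 6, "G": 0}
h_star = true_costs_to_goal(graph, "G")
print(h_star["S"], h_star["A"])                   # 4 3
print(inadmissible_states(graph, h_bad, "G"))     # ['A', 'S']
```

With h(A) = 6 the good route through A looks like it costs 1 + 6 = 7, so the direct route of cost 5 is taken first even though the true best cost is 4. Lowering the estimates to at most h* (for example h(A) = 3, h(S) = 4) removes both violations.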

Optimality of A* Tree Search

Assume:
§ A is an optimal goal node
§ B is a suboptimal goal node
§ h is admissible

Claim:
§ A will exit the fringe before B

Optimality of A* Tree Search: Blocking

Proof:
§ Imagine B is on the fringe
§ Some ancestor n of A is on the fringe, too (maybe A!)
§ Claim: n will be expanded before B
  1. f(n) is less than or equal to f(A)
       f(n) = g(n) + h(n)     (definition of f-cost)
       f(n) ≤ g(A)            (admissibility of h)
       g(A) = f(A)            (h = 0 at a goal)
  2. f(A) is less than f(B)
       g(A) < g(B)            (B is suboptimal)
       f(A) < f(B)            (h = 0 at a goal)
  3. n expands before B
       f(n) ≤ f(A) < f(B)
§ All ancestors of A expand before B
§ A expands before B
§ A* search is optimal

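Properties of A*

UCS vs A* Contours

§ Uniform-cost expands equally in all “directions”

§ A* expands mainly toward the goal, but does hedge its bets to ensure optimality

[Figure: two Start/Goal diagrams: UCS contours spread evenly around Start; A* contours stretch from Start toward Goal.]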

A* Applications

§ Video games
§ Pathing / routing problems
§ Resource planning problems
§ Robot motion planning
§ Language analysis
§ Machine translation
§ Speech recognition
§ …

Robotics Example

A*: Summary

§ A* uses both backward costs and (estimates of) forward costs

§ A* is optimal with admissible / consistent heuristics

§ Heuristic design is key
