Transcript of: Monte-Carlo Tree Search (Alan Fern)

Page 1:


Monte-Carlo Tree Search

Alan Fern

Page 2: Introduction

• Rollout does not guarantee optimality or near-optimality
  – It only guarantees policy improvement (under certain conditions)

• Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies?
  – With computation that does NOT depend on the number of states!
  – This was an open theoretical question until the late 90's.

• Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?

Page 3: Look-Ahead Trees

• Rollout can be viewed as performing one level of search on top of the base policy

• In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  – Can we generalize this to general stochastic MDPs?

• Sparse Sampling is one such algorithm
  – Strong theoretical guarantees of near-optimality

[Figure: a one-level look-ahead from the current state over actions a1, a2, ..., ak. Caption: "Maybe we should search for multiple levels."]

Page 4: Online Planning with Look-Ahead Trees

• At each state we encounter in the environment, build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action
  – Select the action with the highest Q-value estimate

• s = current state

• Repeat:
  – T = BuildLookAheadTree(s)    ;; sparse sampling or UCT
                                 ;; tree provides Q-value estimates for root actions
  – a = BestRootAction(T)        ;; action with best Q-value
  – Execute action a in the environment
  – s = the resulting state

(A Python sketch of this loop is given below.)
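For concreteness, here is a minimal Python sketch of this plan-act loop. The env interface and the build_lookahead_tree / best_root_action helpers are assumptions standing in for whatever tree-building routine (sparse sampling or UCT) is plugged in; this is an illustration, not code from the slides.

# Minimal sketch of online planning with a look-ahead tree (hypothetical interfaces).
def online_planning_loop(env, build_lookahead_tree, best_root_action, num_steps):
    """Repeatedly plan from the current state, act, and observe the resulting state.

    env                  -- environment with reset() / step(action) (assumed interface)
    build_lookahead_tree -- function: state -> tree (e.g., sparse sampling or UCT)
    best_root_action     -- function: tree -> root action with the best Q-value estimate
    """
    s = env.reset()                      # s = current state
    total_reward = 0.0
    for _ in range(num_steps):
        tree = build_lookahead_tree(s)   # plan from scratch at every step
        a = best_root_action(tree)       # action with best root Q-value
        s, r, done = env.step(a)         # execute action a in the environment
        total_reward += r
        if done:
            break
    return total_reward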

Page 5: Planning with Look-Ahead Trees

[Figure: a real-world state/action sequence; a look-ahead tree is built at each visited state. In each tree, the root s branches on actions a1 and a2, each action leads to sampled next states s11 ... s1w and s21 ... s2w, and the next level shows the rewards R(s11,a1), R(s11,a2), ..., R(s2w,a1), R(s2w,a2).]

Page 6: Expectimax Tree (depth 1)

Alternate max & expectation.

[Figure: the root s is a max node (max over actions a and b); under each action is an expectation node (weighted average over the next states s1, s2, ..., sn, weighted by their probability of occurring after taking that action in s).]

• After taking each action there is a distribution over next states (nature's moves)

• The value of an action depends on the immediate reward and the weighted value of those next states

Page 7: Expectimax Tree (depth 2)

Alternate max & expectation.

[Figure: the root max node s branches on actions a and b; under each action is an expectation node (weighted average over next states); each sampled next state is again a max node over a and b.]

• Max nodes: value equals the max of the values of the child expectation nodes

• Expectation nodes: value is the weighted average of the values of the children

• Compute values from the bottom up (leaf values = 0). Select the root action with the best value. (A small code sketch of this computation follows below.)
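To make the bottom-up computation concrete, here is a small Python sketch of exact finite-horizon expectimax. It assumes an explicit MDP model is available through hypothetical accessors actions(s), transitions(s, a) returning (probability, next_state) pairs, and reward(s, a); none of these names come from the slides.

# Exact expectimax to horizon H (a sketch; actions/transitions/reward are assumed model accessors).
def expectimax_value(s, H, actions, transitions, reward):
    """Return V*(s, H), computed by alternating max and expectation."""
    if H == 0:
        return 0.0                                   # leaf values = 0
    best = float("-inf")
    for a in actions(s):                             # max node: max over actions
        q = reward(s, a)                             # immediate reward
        for p, s_next in transitions(s, a):          # expectation node: probability-weighted average
            q += p * expectimax_value(s_next, H - 1, actions, transitions, reward)
        best = max(best, q)
    return best

def expectimax_action(s, H, actions, transitions, reward):
    """Return the root action with the best Q*(s, a, H)."""
    def q_value(a):
        return reward(s, a) + sum(
            p * expectimax_value(s_next, H - 1, actions, transitions, reward)
            for p, s_next in transitions(s, a))
    return max(actions(s), key=q_value)

This is exactly the computation the next slide shows blowing up: the recursion touches every next state at every level, so its cost grows with the size of the state space.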

Page 8: Exact Expectimax Tree

Alternate max & expectation.

[Figure: the exact expectimax tree. The root is a max node over the k actions; each action is followed by an expectation node over all n states; the alternation repeats down to horizon H, giving (kn)^H leaves. The root value is V*(s,H) and each root action's value is Q*(s,a,H).]

• In general we can grow the tree to any horizon H

• Its size depends on the size of the state space. Bad!

Page 9: Sparse Sampling Tree

[Figure: the same alternating tree, but each expectation node is approximated by an average over w sampled next states (the sampling width), down to a sampling depth H. This gives (kw)^H leaves instead of the (kn)^H leaves of the exact tree; the root still estimates V*(s,H) and Q*(s,a,H).]

• Replace each expectation with an average over w samples

• w will typically be much smaller than n

Page 10: Sparse Sampling [Kearns et al. 2002]

SparseSampleTree(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in state si and reward ri
      [V*(si,h-1), a*] = SparseSampleTree(si, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w      ;; estimate of Q*(s,a,h)
  V*(s,h) = max_a Q*(s,a,h)        ;; estimate of V*(s,h)
  a* = argmax_a Q*(s,a,h)
  Return [V*(s,h), a*]

• The Sparse Sampling algorithm computes the root value via depth-first expansion (a runnable Python version follows below)

• It returns the value estimate V*(s,h) of state s and the estimated optimal action a*
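A directly runnable Python version of this recursion, assuming access to a generative model: simulate(s, a) returns a sampled (next_state, reward) pair and actions(s) lists the available actions (both interfaces are assumptions; the slides only require some way to sample transitions).

# Sparse sampling against a generative model -- a sketch, not the authors' code.
def sparse_sample_tree(s, h, w, actions, simulate, beta=1.0):
    """Return (value_estimate, best_action) for state s at horizon h with sampling width w."""
    if h == 0:
        return 0.0, None                      # leaf values = 0
    best_value, best_action = float("-inf"), None
    for a in actions(s):
        q = 0.0
        for _ in range(w):                    # average over w sampled children
            s_next, r = simulate(s, a)
            v_next, _ = sparse_sample_tree(s_next, h - 1, w, actions, simulate, beta)
            q += r + beta * v_next            # beta is the discount factor
        q /= w                                # estimate of Q*(s, a, h)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action

Calling sparse_sample_tree(s, h=2, w=2, ...) corresponds to the depth-2, width-2 walkthrough on the next few slides.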

Page 11: SparseSample(h=2, w=2)

Alternate max & averaging.

[Figure: root state s with action a about to be expanded.]

• Max nodes: value equals the max of the values of the child expectation nodes

• Average nodes: value is the average of the values of the w sampled children

• Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

Page 12: SparseSample(h=2, w=2)

[Figure: the first sampled child of action a is evaluated by the recursive call SparseSample(h=1, w=2), which returns 10.]

Page 13: SparseSample(h=2, w=2)

[Figure: the second sampled child of action a is evaluated by another call to SparseSample(h=1, w=2), which returns 0.]

Page 14: SparseSample(h=2, w=2)

[Figure: the two samples for action a returned 10 and 0, so the average node for a takes the value 5.]

Page 15: SparseSample(h=2, w=2)

[Figure: action a's average is 5; action b's two samples, each evaluated by SparseSample(h=1, w=2), return 4 and 2, so b's average is 3.]

Select action 'a', since its value (5) is larger than b's (3).

Page 16: Sparse Sampling (Cont'd)

• For a given desired accuracy, how large should the sampling width and depth be?
  – Answered by Kearns, Mansour, and Ng (1999)

• Good news: the analysis gives values of w and H that achieve a policy arbitrarily close to optimal
  – The values are independent of the state-space size!
  – This was the first near-optimal general MDP planning algorithm whose runtime did not depend on the size of the state space

• Bad news: the theoretical values are typically still intractably large (also exponential in H)
  – Exponential in H is the best we can do in general
  – In practice: use a small H and a heuristic at the leaves

Page 17: Sparse Sampling (Cont'd)

• In practice: use a small H and evaluate the leaves with a heuristic
  – For example, if we have a policy, leaves can be evaluated by estimating the policy's value (i.e., the average reward across simulated runs of the policy); see the sketch below

[Figure: a depth-limited sparse sampling tree over actions a1 and a2, with policy simulations run from the leaf nodes.]
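A minimal sketch of that leaf evaluation, reusing the assumed simulate(s, a) generative model and adding a hypothetical base policy(s) that returns an action:

# Evaluate a leaf state by rolling out a base policy -- a sketch under the assumed simulator interface.
def rollout_value(s, policy, simulate, depth, num_runs=10, beta=1.0):
    """Estimate the value of s as the average (discounted) return of `policy` over num_runs rollouts."""
    total = 0.0
    for _ in range(num_runs):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(depth):                 # truncate each rollout at a fixed depth
            a = policy(state)
            state, r = simulate(state, a)
            ret += discount * r
            discount *= beta
        total += ret
    return total / num_runs

In sparse_sample_tree above, the h == 0 base case would return rollout_value(s, ...) instead of 0.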

Page 18: Uniform vs. Adaptive Tree Search

• Sparse sampling wastes time on bad parts of the tree
  – It devotes equal resources to each state encountered in the tree
  – We would like to focus on the most promising parts of the tree

• But how do we control exploration of new parts of the tree vs. exploiting promising parts?

• We need an adaptive bandit algorithm that explores more effectively

Page 19: What now?

• Adaptive Monte-Carlo Tree Search
  – UCB Bandit Algorithm
  – UCT Monte-Carlo Tree Search

Page 20: Bandits: Cumulative Regret Objective

[Figure: a single state s with k arms/actions a1, a2, ..., ak.]

• Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  – The optimal strategy (in expectation) is to pull the optimal arm n times
  – UniformBandit is a poor choice: it wastes time on bad arms
  – We must balance exploring machines to find good payoffs against exploiting current knowledge

Page 21: Cumulative Regret Objective

• Theoretical results are often stated in terms of the "expected cumulative regret" of an arm-pulling strategy.

• Protocol: at time step n the algorithm picks an arm based on what it has seen so far and receives reward r_n (the rewards are random variables).

• Expected cumulative regret: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n.
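Written out (in standard notation, since the slide's symbols did not survive extraction), with mu* the expected reward of the best arm and r_i the reward received at step i:

  E[Reg_n] = n * mu* - E[ r_1 + r_2 + ... + r_n ]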

Page 22: UCB Algorithm for Minimizing Cumulative Regret

• Q(a): average reward for trying arm a (in our single state s) so far

• n(a): number of pulls of arm a so far

• Action choice by UCB after n pulls:

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

• Assumes rewards are in [0,1]. We can always normalize if we know the max value.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235-256.
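A minimal Python sketch of this selection rule inside a pull loop (not library code; pull(a) is an assumed function returning a reward in [0, 1]):

import math

# UCB1 arm selection -- a sketch of the rule above, assuming rewards in [0, 1].
def ucb_run(arms, pull, total_pulls):
    """arms: list of arm identifiers; pull(a): assumed reward function. Returns per-arm stats."""
    counts = {a: 0 for a in arms}    # n(a)
    values = {a: 0.0 for a in arms}  # Q(a): running average reward
    for n in range(1, total_pulls + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            a = untried[0]           # pull every arm once before applying the formula
        else:
            a = max(arms, key=lambda a: values[a] + math.sqrt(2.0 * math.log(n) / counts[a]))
        r = pull(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental average update
    return values, counts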

Page 23: UCB: Bounded Sub-Optimality

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

• Value term: favors actions that looked good historically

• Exploration term: actions get an exploration bonus that grows with ln(n)

• The expected number of pulls of a sub-optimal arm a is bounded by

    (8 / Delta_a^2) * ln(n)

  where Delta_a is the sub-optimality (gap) of arm a

• UCB doesn't waste much time on sub-optimal arms, unlike uniform sampling!

Page 24: UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n).

• Is this good? Yes: the average per-step regret is O( log(n) / n ), which goes to zero as n grows.

Theorem: No algorithm can achieve a better expected regret (up to constant factors).

Page 25: What now?

• Adaptive Monte-Carlo Tree Search
  – UCB Bandit Algorithm
  – UCT Monte-Carlo Tree Search

Page 26: UCT Algorithm [Kocsis & Szepesvari, 2006]

• UCT is an instance of Monte-Carlo Tree Search
  – Applies the principle of UCB
  – Similar theoretical properties to sparse sampling
  – Much better anytime behavior than sparse sampling

• Famous for yielding a major advance in computer Go

• A growing number of success stories

Page 27: Monte-Carlo Tree Search

• Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy"

• During construction, each tree node s stores (a code sketch of this bookkeeping follows below):
  – state-visitation count n(s)
  – action counts n(s,a)
  – action values Q(s,a)

• Repeat until time is up:
  1. Execute the rollout policy starting from the root until the horizon
     (this generates a state-action-reward trajectory)
  2. Add the first node not in the current tree to the tree
  3. Update the statistics of each tree node s on the trajectory:
     · increment n(s) and n(s,a) for the selected action a
     · update Q(s,a) with the total reward observed after the node

What is the rollout policy?
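A minimal sketch of the per-node bookkeeping just described (the class and field names are mine, not from the slides); it is reused in the UCT sketch later on:

# Per-node statistics for Monte-Carlo Tree Search -- a sketch of the counters described above.
class TreeNode:
    def __init__(self, state):
        self.state = state
        self.n = 0            # n(s): number of times this node was visited
        self.n_a = {}         # n(s, a): number of times action a was taken from this node
        self.q = {}           # Q(s, a): average total reward observed after taking a here
        self.children = {}    # (action, next_state) -> TreeNode

    def update(self, a, total_reward):
        """Update n(s), n(s,a) and Q(s,a) with the total reward observed after this node."""
        self.n += 1
        self.n_a[a] = self.n_a.get(a, 0) + 1
        q_old = self.q.get(a, 0.0)
        self.q[a] = q_old + (total_reward - q_old) / self.n_a[a]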

Page 28: Rollout Policies

• Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy

• Rollout policies have two distinct phases
  – Tree policy: selects actions at nodes already in the tree (each action must be selected at least once)
  – Default policy: selects actions after leaving the tree

• Key idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction
  – Rather than uniformly exploring actions at each node

Page 29: Example MCTS

• For illustration purposes we will assume the MDP is:
  – Deterministic
  – Only non-zero rewards are at terminal/leaf nodes

• The algorithm is well-defined without these assumptions

Page 30: Iteration 1

• Initially the tree is a single leaf (the current world state)

• At a leaf node the tree policy selects a random action, then executes the default policy

• Assume all non-zero reward occurs at terminal nodes

[Figure: the default policy runs from the current world state to a terminal state with reward 1; a new tree node is added and the nodes along the path record value 1.]

Page 31: Iteration 2

• Each action at a node must be selected at least once

[Figure: the other root action is tried; the default policy reaches a terminal state with reward 0, so the new tree node records 0 while the first branch keeps its value 1.]

Page 32: Iteration 3

[Figure: the root's statistics now average to 1/2, with one branch at value 1 and the other at 0.]

Page 33: Iteration 3

• When all of a node's actions have been tried once, select actions according to the tree policy

[Figure: the tree policy now chooses between the root's branches (values 1 and 0, root average 1/2).]

Page 34: Iteration 3

[Figure: the tree policy descends into the branch with value 1; a random action is selected at that leaf and the default policy runs to a terminal state with reward 0, adding a new tree node with value 0.]

Page 35: Iteration 4

[Figure: statistics are updated along the trajectory: the visited branch's value drops from 1 to 1/2 and the root average becomes 1/3; the other root branch stays at 0.]

Page 36: Iteration 4

[Figure: the tree now holds a leaf with value 1, its parent branch at 1/2, the root at 1/3, and two nodes with value 0; the tree policy again decides where to descend.]

Page 37:

[Figure: after further iterations the values along the promising branch are 2/3 and 2/3, the other branches are 0, and the rewarding terminal leaf has value 1.]

What is an appropriate tree policy? Default policy?

Page 38: UCT Algorithm [Kocsis & Szepesvari, 2006]

• Basic UCT uses a random default policy
  – In practice a hand-coded or learned policy is often used instead

• The tree policy is based on UCB:
  – Q(s,a): average reward received in trajectories so far after taking action a in state s
  – n(s,a): number of times action a has been taken in s
  – n(s): number of times state s has been encountered

    pi_UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]

  where K is a theoretical constant that must be selected empirically in practice

(A full UCT sketch in Python follows below.)
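Putting the pieces together, here is a compact Python sketch of UCT in the spirit of these slides. It reuses the TreeNode class from the MCTS slide and assumes the same hypothetical model interface as earlier: actions(s), simulate(s, a) -> (next_state, reward), is_terminal(s), plus an optional default_policy(s). It is an illustration of the algorithm as described here, not the reference implementation of Kocsis & Szepesvari.

import math
import random

# A compact UCT sketch built on the TreeNode class above (assumed model interface; states assumed hashable).
def uct_search(root_state, actions, simulate, is_terminal,
               num_iterations=1000, horizon=50, K=1.0, default_policy=None):
    if default_policy is None:
        default_policy = lambda s: random.choice(actions(s))   # basic UCT: random default policy
    root = TreeNode(root_state)

    def tree_policy_action(node):
        """Try every action at least once, then maximize Q(s,a) + K*sqrt(ln n(s) / n(s,a))."""
        untried = [a for a in actions(node.state) if node.n_a.get(a, 0) == 0]
        if untried:
            return random.choice(untried)
        return max(actions(node.state),
                   key=lambda a: node.q[a] + K * math.sqrt(math.log(node.n) / node.n_a[a]))

    for _ in range(num_iterations):
        # 1. Execute the rollout policy from the root until the horizon (or a terminal state).
        node, state = root, root_state
        path, rewards = [], []        # (tree node, action) pairs inside the tree; rewards per step
        in_tree = True
        for _ in range(horizon):
            if is_terminal(state):
                break
            a = tree_policy_action(node) if in_tree else default_policy(state)
            next_state, r = simulate(state, a)
            rewards.append(r)
            if in_tree:
                path.append((node, a))
                child = node.children.get((a, next_state))
                if child is None:
                    # 2. Add the first node not in the current tree, then switch to the default policy.
                    child = TreeNode(next_state)
                    node.children[(a, next_state)] = child
                    in_tree = False
                node = child
            state = next_state

        # 3. Update the statistics of each tree node on the trajectory
        #    with the total reward observed after that node.
        for i, (tree_node, action) in enumerate(path):
            tree_node.update(action, sum(rewards[i:]))

    # Final action choice at the root: maximize Q(s,a) only (no exploration bonus).
    return max(root.q, key=root.q.get)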

Page 39:

[Figure: the UCT tree policy applied to the running example. At the root (value 1/3) the actions a1 and a2 are compared using Q(s,a) + K * sqrt( ln n(s) / n(s,a) ); the branch values are 1, 1/2, and 0.]

Page 40:

[Figure: the tree policy descends along the branch chosen by the UCB rule (values 1 and 1/2 on the chosen branch, 1/3 at the root, 0 elsewhere).]

Page 41:

[Figure: below the tree, the default policy finishes the rollout with reward 0, and a new tree node with value 0 is added.]

Page 42:

[Figure: the statistics along the trajectory are updated: the visited branch drops to 1/3, the root average becomes 1/4, and the new node shows 0/1.]

Page 43:

[Figure: the tree policy starts the next rollout; this time it reaches a terminal state with reward 1.]

Page 44:

[Figure: after the update the root average becomes 2/5 and the branch that received the new reward rises to 1/2; the remaining values shown are 1, 1/3, and 0/1.]

Page 45: UCT Recap

• To select an action at a state s
  – Build a tree using N iterations of Monte-Carlo tree search
    · Default policy is uniform random
    · Tree policy is based on the UCB rule
  – Select the action that maximizes Q(s,a)
    (note that this final action selection does not use the exploration term, just the Q-value estimate)

• The more simulations, the more accurate the result
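As a usage example, the uct_search sketch above can be exercised on a tiny made-up problem (everything here is illustrative and not from the slides):

import random

# Toy problem: two actions at the start state; action "a" leads to a state that usually pays 1.
def actions(s):
    return ["a", "b"]

def is_terminal(s):
    return s == "end"

def simulate(s, a):
    if s == "start":
        return ("good", 0.0) if a == "a" else ("bad", 0.0)
    if s == "good":
        return "end", (1.0 if random.random() < 0.9 else 0.0)
    return "end", (1.0 if random.random() < 0.1 else 0.0)

chosen = uct_search("start", actions, simulate, is_terminal,
                    num_iterations=500, horizon=5, K=1.0)
print(chosen)   # usually "a": the final choice maximizes Q(s, a) at the root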

Page 46: Some Successes

• Computer Go
• Klondike Solitaire (wins 40% of games)
• General Game Playing Competition
• Real-Time Strategy Games
• Combinatorial Optimization
• Crowd Sourcing

The list is growing. These applications usually extend UCT in some way.

Page 47: Some Improvements

• Use domain knowledge to handcraft a more intelligent default policy than random
  – E.g., don't choose obviously stupid actions
  – In Go a hand-coded default policy is used

• Learn a heuristic function to evaluate positions
  – Use the heuristic function to initialize leaf nodes (otherwise they are initialized to zero)

Page 48: Practical Issues (Selecting K)

• There is no fixed rule; experiment with different values in your domain

• Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect

• The best value of K may depend on the number of iterations N
  – Usually a single value of K will work well for values of N that are large enough
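One simple way to follow that advice (a sketch; evaluate_policy is a hypothetical function that runs the planner with a given K and reports its average return):

# Hypothetical grid search over the UCT exploration constant K.
def tune_K(evaluate_policy, typical_reward_scale):
    """Try K values around the expected reward magnitude and keep the best."""
    candidates = [typical_reward_scale * f for f in (0.1, 0.3, 1.0, 3.0, 10.0)]
    scores = {K: evaluate_policy(K) for K in candidates}   # e.g., average return over test episodes
    best_K = max(scores, key=scores.get)
    return best_K, scores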

Page 49: Practical Issues

[Figure: a root state s with actions a and b, each leading to many possible next states.]

• UCT can have trouble building deep trees when actions can result in a large number of possible next states

• Each time we try action 'a' we get a new state (a new leaf) and very rarely resample an existing leaf

Page 50: Practical Issues

[Figure: the same situation as the previous slide; every trial of an action keeps producing new leaves.]

Page 51: Practical Issues

• The search then degenerates into a depth-1 tree search

• Solution: use a "sparseness parameter" that controls how many new children of an action will be generated
  – Set the sparseness to something like 5 or 10

[Figure: under each action the number of distinct sampled children is capped at the sparseness limit.]
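A minimal sketch of that sparseness cap, written as a helper that the tree-descent step of the earlier UCT sketch could call instead of sampling directly. The cap value, the reuse-at-random rule, and the stored edge reward are illustrative choices, not prescribed by the slides.

import random

# Cap the number of distinct sampled children per (node, action) -- a sketch of the sparseness idea.
# Assumes the reward depends only on the sampled edge, so a stored reward can be reused.
def sampled_transition(node, a, simulate, sparseness=10):
    """Return (child_node, reward); once `sparseness` children exist for action a, reuse one at random."""
    existing = [child for (act, _), child in node.children.items() if act == a]
    if len(existing) >= sparseness:
        child = random.choice(existing)        # resample among existing children
        return child, child.entry_reward
    s_next, r = simulate(node.state, a)
    child = node.children.get((a, s_next))
    if child is None:
        child = TreeNode(s_next)
        child.entry_reward = r                 # remember the reward on this sampled edge
        node.children[(a, s_next)] = child
    return child, r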