Praktikum Autonome Systeme – Automated Planning
Prof. Dr. Claudia Linnhoff-Popien, Thomy Phan, Andreas Sedlmeier, Fabian Ritz
http://www.mobile.ifi.lmu.de
WiSe 2019/20

Transcript of "Praktikum Autonome Systeme – Automated Planning"

Page 1:

Prof. Dr. Claudia Linnhoff-Popien
Thomy Phan, Andreas Sedlmeier, Fabian Ritz
http://www.mobile.ifi.lmu.de

WiSe 2019/20

Praktikum Autonome Systeme

Automated Planning

Page 2:

Recap: Decision Making

Page 3:

[Diagram: Artificial Intelligence, branching into Learn, Think, and Act, covering Machine Learning, Pattern Recognition, Reinforcement Learning, Planning, Scheduling, Decision Making, Multi-Agent Systems, and Social Interactivity]

Page 4: (same Artificial Intelligence overview diagram as on Page 3)

Page 5:

Decision Making

• Goal: Autonomously select actions to solve a (complex) task
– time could be important (but not necessarily)
– maximize the expected reward for each state


Page 6:

Multi-Armed Bandits


• Multi-Armed Bandit: a situation where you have to learn how to make a good (long-term) choice
• Explore choices to gather information (= Exploration)
– Example: random choice
• Prefer promising choices (= Exploitation)
– Example: greedy choice (e.g., using argmax)
• A good Multi-Armed Bandit solution should always balance between Exploration and Exploitation (see the sketch below)

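As a minimal sketch of this balance, assuming an ε-greedy strategy (one of several possible bandit strategies; the class and method names are illustrative, not from the lecture):

```python
import random

class EpsilonGreedyBandit:
    """Hypothetical epsilon-greedy agent for a multi-armed bandit."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        # Exploration: with probability epsilon, try a random arm.
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        # Exploitation: otherwise pick the arm with the best mean (argmax).
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Update statistics based on feedback (incremental mean).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```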

Page 7:

Multi-Armed Bandits


[Diagram: Actor-Environment loop: Exploration/Exploitation (make a choice) → Execute chosen action → Feedback (observe possibly delayed reward) → Update statistics based on feedback]

Page 8:

Multi-Armed Bandits


Page 9:

Sequential Decision Making

Page 10:

Sequential Decision Making

• Goal: Autonomously select actions to solve a (complex) task
– time is important (actions might have long-term consequences)
– maximize the expected cumulative reward for each state


Page 11:

Sequential Decision Making Example

• Rooms: reach a goal as fast as possible

[Figure: Rooms grid world showing the agent, its available movement actions, and the goal (reward = +1)]

Page 12:

Markov Decision Processes

• A Markov Decision Process (MDP) is defined as $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$:
– $\mathcal{S}$ is a (finite) set of states
– $\mathcal{A}$ is a (finite) set of actions
– $\mathcal{P}(s_{t+1} \mid s_t, a_t) \in [0, 1]$ is the probability of reaching $s_{t+1} \in \mathcal{S}$ when executing $a_t \in \mathcal{A}$ in $s_t \in \mathcal{S}$
– $\mathcal{R}(s_t, a_t) \in \mathbb{R}$ is a reward function


Page 13:

Rooms as MDP


• Define Rooms as MDP $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ (see the sketch below):
– States $\mathcal{S}$: position of the agent
– Actions $\mathcal{A}$: move north/south/west/east
– Transitions $\mathcal{P}$: deterministic movement; no transition when moving against a wall
– Rewards $\mathcal{R}$: +1 if the goal is reached, 0 otherwise
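A minimal sketch of this MDP as code; the grid layout and all names are illustrative assumptions, not the lecture's implementation:

```python
# Illustrative Rooms layout: '#' walls, '.' free cells, 'G' the goal.
GRID = [
    "#########",
    "#...#...#",
    "#...#...#",
    "#.......#",
    "#...#...#",
    "#...#..G#",
    "#########",
]
ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

def step(state, action):
    """Deterministic transition P and reward R for Rooms."""
    row, col = state
    drow, dcol = ACTIONS[action]
    nrow, ncol = row + drow, col + dcol
    if GRID[nrow][ncol] == "#":           # moving against a wall: stay put
        nrow, ncol = row, col
    reward = 1.0 if GRID[nrow][ncol] == "G" else 0.0
    return (nrow, ncol), reward
```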

Page 14:

Markov Decision Processes

• MDPs formally describe environments for Sequential Decision Making
• All states $s_t \in \mathcal{S}$ are Markov such that
$\mathbb{P}(s_{t+1} \mid s_t) = \mathbb{P}(s_{t+1} \mid s_1, \ldots, s_t)$ (no history of past states required)
• Assumes full observability of the state
• States and actions may be discrete or continuous
• Many problems can be formulated as MDPs!
– E.g., multi-armed bandits are MDPs with a single state


Page 15:

Policies

• A policy $\pi: \mathcal{S} \to \mathcal{A}$ represents the behavioral strategy of an agent
– Policies may also be stochastic: $\pi(a_t \mid s_t) \in [0, 1]$
• Policy examples for Rooms (see the sketch below):
– $\pi_0$: maps each state $s_t \in \mathcal{S}$ to a random action $a_t \in \mathcal{A}$
– $\pi_1$: maps each state $s_t \in \mathcal{S}$ to action $a_t = MoveSouth \in \mathcal{A}$
– $\pi_2$: maps state $s_t \in \mathcal{S}$ to action $a_t = MoveSouth \in \mathcal{A}$ if $t$ is odd and selects a random action otherwise

1. How do we know which policy is better?
2. How can we improve a given policy?
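As a minimal sketch of these three policies (the function signatures are our own; since $\pi_2$ depends on the time step, we pass $t$ explicitly):

```python
import random

ACTION_LIST = ["north", "south", "west", "east"]

def pi_0(state, t):
    """pi_0: maps each state to a random action."""
    return random.choice(ACTION_LIST)

def pi_1(state, t):
    """pi_1: always moves south."""
    return "south"

def pi_2(state, t):
    """pi_2: moves south if t is odd, selects a random action otherwise."""
    return "south" if t % 2 == 1 else random.choice(ACTION_LIST)
```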

Page 16:

Returns

• The return of a state $s_t \in \mathcal{S}$ for a horizon $h$ given a policy $\pi$ is the cumulative (discounted) future reward ($h$ may be infinite!):

$G_t = \sum_{k=0}^{h-1} \gamma^k \mathcal{R}(s_{t+k}, \pi(s_{t+k})), \quad \gamma \in [0, 1]$


• Rooms example with $\gamma = 0.99$:
– The chosen path needs 18 steps to reach the goal
– Thus, the return from the starting point is $G_1 = r_1 + \gamma r_2 + \ldots + \gamma^{17} r_{18} = \gamma^{17} r_{18} = 0.99^{17} \approx 0.843$
• What would be the return $G_1$ if the goal isn't reached at all?
• What is the optimal value of $G_1$? (see the sketch below)
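A quick check of this arithmetic in code (a sketch; the reward sequence is the one from the example, with a reward only on the final step):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum over k of gamma^k * r_{t+k} for a given reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0] * 17 + [1.0]        # r_1 .. r_18 along the 18-step path
print(discounted_return(rewards))   # 0.99**17, approximately 0.843
```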

Page 17:

Value Functions

• The value of a state $s_t \in \mathcal{S}$ is the expected return of $s_t$ for a horizon $h \in \mathbb{N}$ given a policy $\pi$:

$V^{\pi}(s_t) = \mathbb{E}[G_t \mid s_t]$

• The action value of a state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$ is the expected return of executing $a_t$ in $s_t$ for a horizon $h \in \mathbb{N}$ given a policy $\pi$:

$Q^{\pi}(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$

• Rooms example:
– $V^{\pi}$ and/or $Q^{\pi}$ can be estimated by averaging over several returns $G_t$ observed by executing a (fixed) policy $\pi$ (see the sketch below)
• Value functions ($V^{\pi}$ and/or $Q^{\pi}$) can be used to evaluate policies $\pi$
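A sketch of this averaging idea, reusing step(), discounted_return(), and the policy signatures from the sketches above; the episode cutoff and parameter defaults are our own assumptions:

```python
def estimate_value(state, policy, n_episodes=1000, horizon=50, gamma=0.99):
    """Monte Carlo estimate of V^pi(state): average sampled returns G_t."""
    total = 0.0
    for _ in range(n_episodes):
        s, rewards = state, []
        for t in range(horizon):
            s, r = step(s, policy(s, t))
            rewards.append(r)
            if r == 1.0:            # goal reached: the episode ends
                break
        total += discounted_return(rewards, gamma)
    return total / n_episodes
```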

Page 18:

Remark: Return / Value Estimation

• Estimating the return $G_t$ or the value $Q^{\pi}(s_t, a_t)$ of a state-action pair $\langle s_t, a_t \rangle$ always has the following form:

$\mathcal{R}(s_t, a_t) + \gamma X$

• $X$ could be:
– The successor return $G_{t+1}$
  • the reward sequence must be known → Monte Carlo Planning / Learning
– The successor value $Q^{\pi}(s_{t+1}, a_{t+1})$
  • $\langle s_{t+1}, a_{t+1} \rangle$ and $Q^{\pi}$ must be known → Temporal-Difference Learning
– The expected successor value $\mathbb{E}[G_{t+1} \mid s_{t+1}, a_{t+1}]$
  • $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, $\langle s_{t+1}, a_{t+1} \rangle$, and $Q^{\pi}$ must be known → Dynamic Programming

Page 19:

Optimal Policies and Value Functions

• Goal: Find an optimal policy $\pi^*$ which maximizes the expected return $\mathbb{E}[G_t \mid s_t]$ for each state:

$\pi^* = \arg\max_{\pi} V^{\pi}(s_t), \quad \forall s_t \in \mathcal{S}$

• The optimal value functions are defined by:

$V^*(s_t) = V^{\pi^*}(s_t) = \max_{\pi} V^{\pi}(s_t)$
$Q^*(s_t, a_t) = Q^{\pi^*}(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t)$

• When $V^*$ or $Q^*$ is known, $\pi^*$ can be derived.

How to find an optimal policy or the optimal value function?

Page 20:

Automated Planning

Page 21: (same Artificial Intelligence overview diagram as on Page 3)

Page 22:

Automated Planning

• Goal: Find (near-)optimal policies $\pi^*$ to solve complex problems
• Use (heuristic) lookahead search on a given model $\hat{M} \approx M$ of the problem


Page 23:

Planning Approaches (Examples)

[Figure: example planning approaches: Tree Search, Evolutionary Computation, Dynamic Programming]

Page 24:

Dynamic Programming

Page 25:

Dynamic Programming

• Dynamic refers to the sequential / temporal component of a problem
• Programming refers to the optimization of a program
• We want to solve Markov Decision Processes (MDPs):
– MDPs are sequential decision making problems
– To find a solution, we need to optimize a program (the policy $\pi$)


Page 26:

Policy Iteration

• Dynamic Programming approach to find an optimal policy $\pi^*$
• Starts with a (random) guess $\pi_0$
• Consists of two alternating steps given $\pi_n$ (see the sketch below):
– Policy Evaluation: compute $V^{\pi_n}$ and/or $Q^{\pi_n}$
– Policy Improvement: maximize $V^{\pi_n}$ and/or $Q^{\pi_n}$
• Terminates when $\pi_{n+1} = \pi_n$ or when a time budget runs out
• Policy Iteration forms the basis for most Planning and Reinforcement Learning algorithms!
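A compact sketch of this loop for a generic finite MDP; the dict-based layout of $\mathcal{P}$ and $\mathcal{R}$ and the fixed number of evaluation sweeps are our own assumptions:

```python
def policy_iteration(S, A, P, R, gamma=0.99, eval_sweeps=100):
    """S: states, A: actions, P[(s, a)]: list of (s_next, prob), R[(s, a)]: reward."""
    pi = {s: A[0] for s in S}                  # initial guess pi_0
    while True:
        # Policy Evaluation: approximate V^pi_n with iterative sweeps
        V = {s: 0.0 for s in S}
        for _ in range(eval_sweeps):
            V = {s: R[(s, pi[s])]
                    + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])])
                 for s in S}
        # Policy Improvement: act greedily with respect to Q^pi_n
        new_pi = {s: max(A, key=lambda a: R[(s, a)]
                         + gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
                  for s in S}
        if new_pi == pi:                       # pi_{n+1} = pi_n: terminate
            return pi, V
        pi = new_pi
```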

Page 27:

Value Iteration

• Dynamic Programming approach to find the optimal value function $V^*$
• Starts with a (random) guess $V_0$
• Iteratively updates the value estimate $V_n(s_t)$ for each state $s_t \in \mathcal{S}$:

$V_{n+1}(s_t) = \max_{a_t \in \mathcal{A}} \{\mathcal{R}(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{P}(s_{t+1} \mid s_t, a_t) \, V_n(s_{t+1})\}$

• Terminates when $V_{n+1} = V_n$ or when a time budget runs out
• The optimal action-value function $Q^*$ is computed analogously
• $V^*$ and/or $Q^*$ can be used to derive an optimal policy $\pi^*$
• Do you see the link to Policy Iteration? The max over actions combines Policy Evaluation and Policy Improvement in a single update (see the sketch below)
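A minimal sketch over the same dict-based MDP layout as in the Policy Iteration sketch above; the convergence threshold theta is our own assumption:

```python
def value_iteration(S, A, P, R, gamma=0.99, theta=1e-8):
    V = {s: 0.0 for s in S}                    # initial guess V_0
    while True:
        delta = 0.0
        for s in S:
            v_new = max(R[(s, a)]
                        + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                        for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                      # V_{n+1} is (nearly) V_n: converged
            break
    # Derive an optimal policy greedily from V*
    pi = {s: max(A, key=lambda a: R[(s, a)]
                 + gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
          for s in S}
    return V, pi
```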

Page 28:

Value Iteration - Example

• Optimal "Value Map" in Rooms ($\gamma = 0.99$): for each state $s_t \in \mathcal{S}$,

$V_{n+1}(s_t) = \max_{a_t \in \mathcal{A}} \{\mathcal{R}(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{P}(s_{t+1} \mid s_t, a_t) \, V_n(s_{t+1})\}$

Remember the general form $\mathcal{R}(s_t, a_t) + \gamma X$; in this case:

$X = \sum_{s_{t+1} \in \mathcal{S}} \mathcal{P}(s_{t+1} \mid s_t, a_t) \, V_n(s_{t+1})$

Page 29:

Value Iteration - Example

• Optimal "Value Map" in Rooms ($\gamma = 0.99$): the same update as on Page 28, applied for each state $s_t \in \mathcal{S}$

Page 30:

Advantages and Disadvantages of DP

• Advantages:
– General approach (does not require explicit domain knowledge)
– Converges to the optimal solution
– Does not require exploration-exploitation (all states are visited anyway)
• Disadvantages:
– Computational costs
– Memory costs
– Requires the availability of an explicit model $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$

Page 31:

Intermediate Summary

• What we know so far:
– Markov Decision Processes (MDPs)
– Policies and Value Functions
– Optimally solve MDPs with Dynamic Programming
• What we don't know (yet):
– How to find solutions in a more scalable way?
– How to react to unexpected events?
– How to find solutions without a model?

Page 32:

Monte Carlo Planning

Page 33:

Global Planning and Local Planning

• Global Planning
– considers the entire state space $\mathcal{S}$ to approximate $\pi^*$
– produces for each state $s_t \in \mathcal{S}$ a mapping to actions $a_t \in \mathcal{A}$
– typically performed offline (before deploying the agent)
– Examples: Dynamic Programming (Policy and Value Iteration)
• Local Planning
– only considers the current state $s_t \in \mathcal{S}$ (and possible future states) to approximate $\pi^*(s_t)$
– recommends an action $a_t \in \mathcal{A}$ for the current state $s_t \in \mathcal{S}$
– can be performed online (interleaving planning and execution)
– Examples: Monte Carlo Tree Search

Page 34:

Global Planning vs. Local Planning

[Figure: Global Planning vs. Local Planning]

Page 35:

Monte Carlo Planning

• Dynamic Programming always assumes full knowledge of the underlying MDP $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$
– Most real-world applications have extremely large state spaces
– Especially $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is hard to pre-specify in practice!
• Monte Carlo Planning only requires a generative model as blackbox simulator $\hat{M} \approx M$ (see the sketch below)
– Given some state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$, the generative model provides a sample $s_{t+1} \in \mathcal{S}$ and $r_t = \mathcal{R}(s_t, a_t)$
– Can be used to approximate $V^*$ or $Q^*$ via statistical sampling
– Requires minimal domain knowledge ($\hat{M}$ can easily be replaced)

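A sketch of such a blackbox interface, reusing step() and ACTION_LIST from the earlier Rooms sketches; the optional slip probability is purely illustrative (the lecture's Rooms is deterministic):

```python
import random

def generative_model(state, action, slip=0.0):
    """Sample one successor: returns (s_next, reward) for action in state."""
    if random.random() < slip:    # optional stochasticity, off by default
        action = random.choice(ACTION_LIST)
    return step(state, action)
```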

Page 36:

Explicit Model vs. Generative Model

• A generative model can be easier to implement than explicit probability distributions!

[Figure: Real Environment vs. Environment Model]

Page 37:

Planning with Generative Model


Page 38:

Monte Carlo Rollouts (MCR)

• Goal: Given a state $s_t \in \mathcal{S}$ and a policy $\pi_{rollout}$, we want to find the action $a_t \in \mathcal{A}$ which maximizes $Q^{\pi_{rollout}}(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$
• Approach: Given a computation budget of $K$ simulations and a horizon $h$ (see the sketch below):
– Sample $K$ action sequences (= plans) of length $h$ from $\pi_{rollout}$
– Simulate all plans with the generative model $\hat{M}$ and compute the return $G_t$ for each plan
– Update the estimate of $Q^{\pi_{rollout}}(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$*
– Finally: Select the action $a_t \in \mathcal{A}$ with the highest $Q^{\pi_{rollout}}(s_t, a_t)$

*only estimate $Q^{\pi_{rollout}}(s_t, a_t)$ for the first action $a_t$ of each plan!

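A sketch of this procedure (names are ours), reusing generative_model() from above and scoring only the first action of each sampled plan:

```python
from collections import defaultdict

def monte_carlo_rollouts(state, rollout_policy, K=1000, h=20, gamma=0.99):
    q_sum = defaultdict(float)    # sum of returns per first action
    q_count = defaultdict(int)    # number of plans per first action
    for _ in range(K):
        s, first_action = state, None
        ret, discount = 0.0, 1.0
        for t in range(h):        # one sampled plan of length h
            a = rollout_policy(s, t)
            if first_action is None:
                first_action = a
            s, r = generative_model(s, a)
            ret += discount * r
            discount *= gamma
        q_sum[first_action] += ret
        q_count[first_action] += 1
    # Select the action with the highest estimated Q^pi_rollout(s_t, a_t)
    return max(q_sum, key=lambda a: q_sum[a] / q_count[a])
```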

Page 39:

Why do Monte Carlo Rollouts work?

• MCR estimates the value function $Q^{\pi_{rollout}}(s_t, a_t)$ of $\pi_{rollout}$ via sampling
• The final decision is a maximization over $Q^{\pi_{rollout}}(s_t, a_t)$
• MCR therefore always makes decisions of the same or better quality than $\pi_{rollout}$ alone
• Thus, the decision quality depends on $\pi_{rollout}$ and the simulation model

Page 40:

Monte Carlo Tree Search (MCTS)

• Current state-of-the-art algorithm for Monte Carlo Planning. Used for:
– board games like Go, Chess, and Shogi
– combinatorial optimization problems like the Rubik's Cube
• Approach: Incrementally construct and traverse a search tree, given a computation budget of $K$ simulations and a horizon $h$
– nodes represent states $s_t \in \mathcal{S}$ (and actions $a_t \in \mathcal{A}$)
– the search tree is used to "learn" $\hat{Q} \approx Q^*$ via blackbox simulation

Page 41:

Monte Carlo Tree Search Phases

• Selection
• Expansion
• Evaluation/Simulation
• Backup

[Diagram: search tree rooted at the current "real world" state $s_t$, with action nodes $a_t^1, a_t^2$ and successor state nodes $s_{t+1}^1, s_{t+1}^2$, followed by $a_{t+1}^1, a_{t+1}^2$ and $s_{t+2}^1, s_{t+2}^2$]

Page 42:

Monte Carlo Tree Search - Selection

• Selection
• Expansion
• Evaluation/Simulation
• Backup

[Diagram: selection follows a path from the root $s_t$ down the existing tree]

Example Selection Strategies (see the sketch below):
• Random
• Greedy
• $\epsilon$-Greedy
• UCB1

Exploration-Exploitation!!! → a Multi-Armed Bandit problem at every node
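Of these, UCB1 is the classic bandit strategy: mean value plus an exploration bonus. A sketch, where the constant c = sqrt(2) is a common default and the dict-based statistics are our own layout:

```python
import math

def select_ucb1(values, visits, c=math.sqrt(2)):
    """values[a]: sum of returns for action a, visits[a]: how often a was tried."""
    for a, n in visits.items():
        if n == 0:                # try every action at least once
            return a
    n_total = sum(visits.values())
    return max(visits, key=lambda a: values[a] / visits[a]
               + c * math.sqrt(math.log(n_total) / visits[a]))
```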

Page 43:

Monte Carlo Tree Search - Expansion

• Selection
• Expansion
• Evaluation/Simulation
• Backup

[Diagram: a new leaf node $s_{t+3}^1$ is added below the selected path]

Page 44:

Monte Carlo Tree Search - Evaluation/Simulation

• Selection
• Expansion
• Evaluation/Simulation
• Backup

Example Evaluation Strategies:
• Rollouts (e.g., random)
• Value function $V^{\pi}(s_t)$ (e.g., from Reinforcement Learning)

[Diagram: the newly expanded leaf $s_{t+3}^1$ is evaluated]

Page 45:

Monte Carlo Tree Search - Backup

• Selection
• Expansion
• Evaluation/Simulation
• Backup

[Diagram: the evaluation result is backed up along the selected path from $s_{t+3}^1$ to the root $s_t$]

Remember the general form $\mathcal{R}(s_t, a_t) + \gamma X$; in this case $X = G_{t+1}$ (the return from the next state $s_{t+1}$). A sketch combining all four phases follows below.
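A compact MCTS sketch tying the four phases together, reusing generative_model() and ACTION_LIST from the earlier sketches. The node layout and all names are illustrative simplifications (in particular, one child per action, which only matches a deterministic model), not the lecture's code:

```python
import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}                           # action -> child Node
        self.visits = {a: 0 for a in ACTION_LIST}
        self.values = {a: 0.0 for a in ACTION_LIST}  # sums of returns

def mcts(root_state, K=1000, h=20, gamma=0.99, c=math.sqrt(2)):
    root = Node(root_state)
    for _ in range(K):                               # budget of K simulations
        simulate(root, h, gamma, c)
    return max(ACTION_LIST,
               key=lambda a: root.values[a] / max(root.visits[a], 1))

def simulate(node, depth, gamma, c):
    if depth == 0:
        return 0.0
    # Selection: UCB1 over this node's actions (untried actions first)
    untried = [a for a in ACTION_LIST if node.visits[a] == 0]
    if untried:
        action = random.choice(untried)
    else:
        n_total = sum(node.visits.values())
        action = max(ACTION_LIST,
                     key=lambda a: node.values[a] / node.visits[a]
                     + c * math.sqrt(math.log(n_total) / node.visits[a]))
    next_state, reward = generative_model(node.state, action)
    if action not in node.children:
        # Expansion + Evaluation: add a leaf, estimate it with a random rollout
        node.children[action] = Node(next_state)
        ret = reward + gamma * rollout(next_state, depth - 1, gamma)
    else:
        ret = reward + gamma * simulate(node.children[action], depth - 1, gamma, c)
    # Backup: accumulate R(s_t, a_t) + gamma * G_{t+1} along the path
    node.visits[action] += 1
    node.values[action] += ret
    return ret

def rollout(state, depth, gamma):
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        state, r = generative_model(state, random.choice(ACTION_LIST))
        ret += discount * r
        discount *= gamma
    return ret
```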

Page 46:

Summary

• What we know so far:
– Markov Decision Processes (MDPs)
– Policies and Value Functions
– Optimally solve MDPs with Dynamic Programming
– Approximately solve MDPs with Monte Carlo Search
• What we don't know (yet):
– How to find solutions without a model?

Page 47:

Thank you!