Praktikum Autonome Systeme: Automated Planning (transcript)
Prof. Dr. Claudia Linnhoff-Popien, Thomy Phan, Andreas Sedlmeier, Fabian Ritz, http://www.mobile.ifi.lmu.de
WiSe 2019/20
Praktikum Autonome Systeme
Automated Planning
Recap: Decision Making
[Diagram: Artificial Intelligence (Think, Learn, Act) with subfields Machine Learning, Pattern Recognition, Reinforcement Learning, Planning, Scheduling, Decision Making, Multi-Agent Systems, Social Interactivity]
Prof. Dr. C. Linnhoff-Popien, Thomy Phan, Andreas Sedlmeier, Fabian Ritz - Praktikum Autonome Systeme
WiSe 2019/20, Automated Planning
Decision Making
• Goal: Autonomously select actions to solve a (complex) task
– time could be important (but not necessarily)
– maximize the expected reward for each state
Multi-Armed Bandits
• Multi-Armed Bandit: a situation where you have to learn how to make a good (long-term) choice
• Explore choices to gather information (= Exploration)
– Example: random choice
• Prefer promising choices (= Exploitation)
– Example: greedy choice (e.g., using argmax)
• A good Multi-Armed Bandit solution should always balance between Exploration and Exploitation
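The balance between exploration and exploitation can be sketched as an ε-greedy bandit. This is a minimal illustration; the Bernoulli arm probabilities, ε, and step count are assumptions, not taken from the slides:

```python
import random

# epsilon-greedy sketch: explore with probability epsilon, otherwise
# exploit the arm with the best running mean estimate.
def epsilon_greedy(means, epsilon=0.1):
    if random.random() < epsilon:                          # Exploration
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])  # Exploitation (argmax)

def update(means, counts, arm, reward):
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]      # incremental mean

random.seed(0)
true_p = [0.2, 0.8]                  # hypothetical Bernoulli arms
means, counts = [0.0, 0.0], [0, 0]
for _ in range(2000):
    arm = epsilon_greedy(means)
    update(means, counts, arm, float(random.random() < true_p[arm]))
```

After enough steps, the better arm is both estimated higher and pulled more often.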
Multi-Armed Bandits
[Diagram: Actor-Environment loop. Exploration/Exploitation: make a choice; execute the chosen action; Feedback: observe the (delayed) reward; update statistics based on the feedback]
Sequential Decision Making
Sequential Decision Making
• Goal: Autonomously select actions to solve a (complex) task
– time is important (actions might have long-term consequences)
– maximize the expected cumulative reward for each state
Sequential Decision Making Example
• Rooms: reach a goal as fast as possible
[Figure: Rooms gridworld with the agent, its available actions, and the goal (reward = +1)]
Markov Decision Processes
• A Markov Decision Process (MDP) is defined as M = ⟨S, A, P, R⟩:
– S is a (finite) set of states
– A is a (finite) set of actions
– P(s_{t+1} | s_t, a_t) ∈ [0, 1] is the probability of reaching s_{t+1} ∈ S when executing a_t ∈ A in s_t ∈ S
– R(s_t, a_t) ∈ ℝ is a reward function
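The tuple M = ⟨S, A, P, R⟩ can be written down directly as data. A minimal sketch with a hypothetical two-state MDP (the states, actions, and rewards are illustrative assumptions, not the Rooms domain):

```python
# MDP tuple M = <S, A, P, R> as plain Python data.
S = ["s0", "s1"]                       # states
A = ["stay", "go"]                     # actions
P = {                                  # P[s][a] -> {s': probability}
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {                                  # R[s][a] -> immediate reward
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.0, "go": 0.0},
}

# Sanity check: transition probabilities sum to 1 for every (s, a).
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```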
Rooms as MDP
• Define Rooms as an MDP M = ⟨S, A, P, R⟩:
– States S: position of the agent
– Actions A: move north/south/west/east
– Transitions P: deterministic movement; no transition when moving against a wall
– Rewards R: +1 if the goal is reached, 0 otherwise
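The Rooms dynamics described above can be sketched as a step function. The grid size and goal position are illustrative assumptions; the actual room layout is not specified here:

```python
# Deterministic Rooms-style step function (layout is an assumption).
WIDTH, HEIGHT = 5, 5
GOAL = (4, 4)
MOVES = {"north": (0, -1), "south": (0, 1), "west": (-1, 0), "east": (1, 0)}

def step(state, action):
    """Apply one action; moving against the outer wall leaves the state unchanged."""
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):  # wall: no transition
        nx, ny = x, y
    reward = 1.0 if (nx, ny) == GOAL else 0.0       # +1 only at the goal
    return (nx, ny), reward
```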
Markov Decision Processes
• MDPs formally describe environments for Sequential Decision Making
• All states s_t ∈ S are Markov such that
P(s_{t+1} | s_t) = P(s_{t+1} | s_1, …, s_t) (no history of past states required)
• Assumes full observability of the state
• States and actions may be discrete or continuous
• Many problems can be formulated as MDPs!
– E.g., multi-armed bandits are MDPs with a single state
Policies
• A policy π: S → A represents the behavioral strategy of an agent
– Policies may also be stochastic: π(a_t | s_t) ∈ [0, 1]
• Policy examples for Rooms:
– π_0: maps each state s_t ∈ S to a random action a_t ∈ A
– π_1: maps each state s_t ∈ S to the action a_t = move_south ∈ A
– π_2: maps state s_t ∈ S to the action a_t = move_south ∈ A if t is odd and selects a random action otherwise
1. How do we know which policy is better?
2. How can we improve a given policy?
Returns
• The return of a state s_t ∈ S for a horizon h given a policy π is the cumulative (discounted) future reward (h may be infinite!):
G_t = Σ_{k=0}^{h-1} γ^k · R(s_{t+k}, π(s_{t+k})), γ ∈ [0, 1]
• Rooms example (γ = 0.99):
– The chosen path needs 18 steps to reach the goal
– Thus, the return from the starting point is: G_1 = r_1 + γr_2 + … + γ^17 r_18 = γ^17 r_18 = 0.99^17 ≈ 0.843
• What would be the return G_1 if the goal isn't reached at all?
• What is the optimal value of G_1?
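The return from the Rooms example can be checked in a few lines. This is a sketch: the reward sequence of 17 zeros followed by +1 mirrors the 18-step path:

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k}.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0] * 17 + [1.0]      # goal reached on the 18th step
G1 = discounted_return(rewards)   # = 0.99**17, approximately 0.843
```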
Value Functions
• The value of a state s_t ∈ S is the expected return of s_t for a horizon h ∈ ℕ given a policy π:
V^π(s_t) = E[G_t | s_t]
• The action value of a state s_t ∈ S and action a_t ∈ A is the expected return of executing a_t in s_t for a horizon h ∈ ℕ given a policy π:
Q^π(s_t, a_t) = E[G_t | s_t, a_t]
• Rooms example:
– V^π and/or Q^π can be estimated by averaging over several returns G_t observed by executing a (fixed) policy π
• Value functions (V^π and/or Q^π) can be used to evaluate policies π
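Estimating V^π by averaging observed returns, as described above, can be sketched as follows. The episode simulator here is a hypothetical stand-in for executing a fixed policy in the environment:

```python
import random

# Monte Carlo estimate of V^pi(s): average the returns observed when
# running a fixed policy from s.
def mc_value_estimate(simulate_episode, state, n_episodes=1000):
    """simulate_episode(state) -> one sampled return G_t under the fixed policy."""
    returns = [simulate_episode(state) for _ in range(n_episodes)]
    return sum(returns) / len(returns)

random.seed(0)
# Hypothetical simulator: the return is 1 with probability 0.5, else 0,
# so the estimate should land close to 0.5.
est = mc_value_estimate(lambda s: float(random.random() < 0.5), "s0")
```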
Remark: Return / Value Estimation
• Estimating the return G_t or the value Q^π(s_t, a_t) of a state-action pair ⟨s_t, a_t⟩ always has the following form:
R(s_t, a_t) + γX
• X could be:
– the successor return G_{t+1} (Monte Carlo Planning / Learning): the reward sequence must be known
– the successor value Q^π(s_{t+1}, a_{t+1}) (Temporal-Difference Learning): ⟨s_{t+1}, a_{t+1}⟩ and Q^π must be known
– the expected successor value E[G_{t+1} | s_{t+1}, a_{t+1}] (Dynamic Programming): P(s_{t+1} | s_t, a_t), ⟨s_{t+1}, a_{t+1}⟩, and Q^π must be known
Optimal Policies and Value Functions
• Goal: Find an optimal policy π* which maximizes the expected return E[G_t | s_t] for each state:
π* = argmax_π V^π(s_t), ∀ s_t ∈ S
• The optimal value function is defined by:
V*(s_t) = V^{π*}(s_t) = max_π V^π(s_t)
Q*(s_t, a_t) = Q^{π*}(s_t, a_t) = max_π Q^π(s_t, a_t)
• When V* or Q* is known, π* can be derived.
How to find an optimal policy or the optimal value function?
Automated Planning
Automated Planning
• Goal: Find (near-)optimal policies π* to solve complex problems
• Use (heuristic) lookahead search on a given model M̂ ≈ M of the problem
Planning Approaches (Examples)
• Tree Search
• Evolutionary Computation
• Dynamic Programming
Dynamic Programming
Dynamic Programming
• Dynamic refers to the sequential / temporal component of a problem
• Programming refers to the optimization of a "program"
• We want to solve Markov Decision Processes (MDPs):
– MDPs are sequential decision making problems
– To find a solution, we need to optimize a program (the policy π)
Policy Iteration
• Dynamic Programming approach to find an optimal policy π*
• Starts with a (random) guess π_0
• Consists of two alternating steps given π_i:
– Policy Evaluation: compute V^{π_i} and/or Q^{π_i}
– Policy Improvement: maximize V^{π_i} and/or Q^{π_i} to obtain π_{i+1}
• Terminates when π_{i+1} = π_i or when a time budget runs out
• Policy Iteration forms the basis for most Planning and Reinforcement Learning algorithms!
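The two alternating steps can be sketched on a hypothetical two-state MDP (the MDP itself is an illustrative assumption, not the Rooms domain):

```python
# Policy Iteration sketch on a tiny two-state MDP (illustrative assumption).
S = ["s0", "s1"]
A = ["stay", "go"]
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 0.0}, "s1": {"stay": 1.0, "go": 0.0}}
GAMMA = 0.9

def evaluate(pi, n_sweeps=200):
    """Policy Evaluation: iterative sweeps of V(s) = R + gamma * E[V(s')]."""
    V = {s: 0.0 for s in S}
    for _ in range(n_sweeps):
        V = {s: R[s][pi[s]] + GAMMA * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
             for s in S}
    return V

def improve(V):
    """Policy Improvement: act greedily with respect to the current V."""
    return {s: max(A, key=lambda a: R[s][a] +
                   GAMMA * sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in S}

pi = {s: "stay" for s in S}           # initial guess pi_0
while True:                           # alternate evaluation and improvement
    V = evaluate(pi)
    new_pi = improve(V)
    if new_pi == pi:                  # terminate when the policy is stable
        break
    pi = new_pi
```

Here the stable policy keeps collecting the +1 reward in s1, so V(s1) approaches 1/(1-γ) = 10.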
Value Iteration
• Dynamic Programming approach to find the optimal value function V*
• Starts with a (random) guess V_0
• Iteratively updates the value estimate V_i(s_t) for each state s_t ∈ S:
V_{i+1}(s_t) = max_{a_t ∈ A} { R(s_t, a_t) + γ Σ_{s_{t+1} ∈ S} P(s_{t+1} | s_t, a_t) V_i(s_{t+1}) }
(the expectation over successors performs Policy Evaluation, the max over actions performs Policy Improvement)
• Terminates when V_{i+1} = V_i or when a time budget runs out
• The optimal action-value function Q* is computed analogously
• V* and/or Q* can be used to derive an optimal policy π*
• Do you see the link to Policy Iteration?
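The update rule above can be sketched on a hypothetical two-state MDP (an illustrative assumption, not the Rooms grid):

```python
# Value Iteration sketch implementing
# V_{i+1}(s) = max_a { R(s,a) + gamma * sum_{s'} P(s'|s,a) * V_i(s') }.
S = ["s0", "s1"]
A = ["stay", "go"]
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 0.0}, "s1": {"stay": 1.0, "go": 0.0}}
GAMMA = 0.9

V = {s: 0.0 for s in S}               # initial guess V_0
while True:
    V_new = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in A)
             for s in S}
    done = max(abs(V_new[s] - V[s]) for s in S) < 1e-8   # convergence check
    V = V_new
    if done:
        break

# Derive a greedy policy from the (near-)optimal value function.
pi = {s: max(A, key=lambda a: R[s][a] +
             GAMMA * sum(p * V[s2] for s2, p in P[s][a].items()))
      for s in S}
```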
Value Iteration - Example
• Optimal "Value Map" in Rooms (γ = 0.99): for each state s_t ∈ S
– V_{i+1}(s_t) = max_{a_t ∈ A} { R(s_t, a_t) + γ Σ_{s_{t+1} ∈ S} P(s_{t+1} | s_t, a_t) V_i(s_{t+1}) }
– Remember the estimation form R(s_t, a_t) + γX: in this case X = Σ_{s_{t+1} ∈ S} P(s_{t+1} | s_t, a_t) V_i(s_{t+1})
Advantages and Disadvantages of DP
• Advantages:
– General approach (does not require explicit domain knowledge)
– Converges to the optimal solution
– Does not require exploration-exploitation (all states are visited anyway)
• Disadvantages:
– Computational costs
– Memory costs
– Requires the availability of an explicit model M = ⟨S, A, P, R⟩
Intermediate Summary
• What we know so far:
– Markov Decision Processes (MDPs)
– Policies and Value Functions
– Optimally solve MDPs with Dynamic Programming
• What we don't know (yet):
– How to find solutions in a more scalable way?
– How to react to unexpected events?
– How to find solutions without a model?
Monte Carlo Planning
Global Planning and Local Planning
• Global Planning
– considers the entire state space S to approximate π*
– produces for each state s_t ∈ S a mapping to actions a_t ∈ A
– typically performed offline (before deploying the agent)
– Examples: Dynamic Programming (Policy and Value Iteration)
• Local Planning
– only considers the current state s_t ∈ S (and possible future states) to approximate π*(s_t)
– recommends an action a_t ∈ A for the current state s_t ∈ S
– can be performed online (interleaving planning and execution)
– Example: Monte Carlo Tree Search
Global Planning vs. Local Planning
Monte Carlo Planning
• Dynamic Programming always assumes full knowledge of the underlying MDP M = ⟨S, A, P, R⟩
– Most real-world applications have extremely large state spaces
– Especially P(s_{t+1} | s_t, a_t) is hard to pre-specify in practice!
• Monte Carlo Planning only requires a generative model as a blackbox simulator M̂ ≈ M
– Given some state s_t ∈ S and action a_t ∈ A, the generative model provides a sample s_{t+1} ∈ S and r_t = R(s_t, a_t)
– Can be used to approximate V* or Q* via statistical sampling
– Requires minimal domain knowledge (M̂ can be easily replaced)
Explicit Model vs. Generative Model
• A generative model can be easier to implement than explicit probability distributions!
[Figure: Real Environment vs. Environment Model]
Planning with Generative Model
Monte Carlo Rollouts (MCR)
• Goal: Given a state s_t ∈ S and a policy π_rollout, find the action a_t ∈ A which maximizes Q^{π_rollout}(s_t, a_t) = E[G_t | s_t, a_t]
• Approach: Given a computation budget of K simulations and a horizon h:
– Sample K action sequences (= plans) of length h from π_rollout
– Simulate all plans with the generative model M̂ and compute the return G_t for each plan
– Update the estimate of Q^{π_rollout}(s_t, a_t) = E[G_t | s_t, a_t]*
– Finally: select the action a_t ∈ A with the highest Q^{π_rollout}(s_t, a_t)
*only estimate Q^{π_rollout}(s_t, a_t) for the first action a_t of each plan!
Why do Monte Carlo Rollouts work?
• MCR estimates the value function Q^{π_rollout}(s_t, a_t) of π_rollout via sampling
• The final decision is a maximization of Q^{π_rollout}(s_t, a_t)
• MCR always makes decisions with the same or better quality than π_rollout
• Thus, the decision quality depends on π_rollout and the simulation model
Monte Carlo Tree Search (MCTS)
• Current state-of-the-art algorithm for Monte Carlo Planning. Used for:
– board games like Go, Chess, Shogi
– combinatorial optimization problems like the Rubik's Cube
• Approach: Incrementally construct and traverse a search tree, given a computation budget of K simulations and a horizon h
– nodes represent states s_t ∈ S (and actions a_t ∈ A)
– the search tree is used to "learn" π ≈ π* via blackbox simulation
Monte Carlo Tree Search Phases
• Selection
• Expansion
• Evaluation/Simulation
• Backup
[Diagram: search tree rooted at the current "real world" state s_t, with action edges a_t and successor states s_{t+1}, s_{t+2}]
Monte Carlo Tree Search - Selection
• Selection
• Expansion
• Evaluation/Simulation
• Backup
Example selection strategies: Random, Greedy, ε-Greedy, UCB1
Exploration-Exploitation! (Multi-Armed Bandits)
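Of these strategies, UCB1 is the classic choice for MCTS selection. A minimal sketch of the score and the selection step (the node-statistics layout is an assumption):

```python
import math

# UCB1: exploitation (mean value) plus an exploration bonus that shrinks
# with the child's visit count. c is the exploration constant.
def ucb1(mean_q, n_parent, n_child, c=math.sqrt(2)):
    if n_child == 0:
        return float("inf")          # unvisited children are tried first
    return mean_q + c * math.sqrt(math.log(n_parent) / n_child)

def select(children):
    """children: {action: (mean_q, visit_count)} -> action with highest UCB1."""
    n_parent = sum(n for _, n in children.values())
    return max(children, key=lambda a: ucb1(children[a][0], n_parent, children[a][1]))
```

A rarely visited child can win the selection even with a lower mean, which is exactly the exploration behavior the slide points to.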
Monte Carlo Tree Search - Expansion
• Selection
• Expansion
• Evaluation/Simulation
• Backup
[Diagram: a new leaf node s_{t+3} is expanded and added to the tree]
Monte Carlo Tree Search - Evaluation/Simulation
• Selection
• Expansion
• Evaluation/Simulation
• Backup
Example evaluation strategies:
– Rollouts (e.g., random)
– Value function V^π(s_t) (e.g., from Reinforcement Learning)
Monte Carlo Tree Search - Backup
• Selection
• Expansion
• Evaluation/Simulation
• Backup
Remember the estimation form R(s_t, a_t) + γX: in this case X = G_{t+1} (the return from the next state s_{t+1})
Summary
• What we know so far:
– Markov Decision Processes (MDPs)
– Policies and Value Functions
– Optimally solve MDPs with Dynamic Programming
– Approximately solve MDPs with Monte Carlo Search
• What we don't know (yet):
– How to find solutions without a model?
Thank you!