Planning under uncertainty: Markov decision processes
Christos Dimitrakakis
Chalmers
August 31, 2014
Contents

Subjective probability and utility: Subjective probability · Rewards and preferences
Bandit problems: Introduction · Bernoulli bandits
Markov decision processes and reinforcement learning: Markov processes · Markov decision processes · Value functions · Examples
Episodic problems: Policy evaluation · Backwards induction
Continuing, discounted problems: Markov chain theory for discounted problems · Infinite horizon MDP algorithms
Bayesian reinforcement learning: Reinforcement learning · Bounds on the utility · Properties of ABC
Objective Probability

Figure: The double slit experiment, with outcome x distributed as Pθ
What about everyday life?
Subjective probability
Making decisions requires making predictions.
Outcomes of decisions are uncertain.
How can we represent this uncertainty?
Subjective probability
Describe which events we think are more likely.
We quantify this with probability.
Why probability?
Quantifies uncertainty in a “natural” way.
A framework for drawing conclusions from data.
Computationally convenient for decision making.
Assumptions about our beliefs
Our beliefs must be consistent. This can be achieved if they satisfy some assumptions:
Assumption 1 (SP1)
It is always possible to say whether one event is more likely than the other.
Assumption 2 (SP2)
If we can split events A, B in such a way that each part of A is less likely than its counterpart in B, then A is less likely than B.
There are also a couple of additional technical assumptions.
Resulting properties of relative likelihoods
Theorem 1 (Transitivity)
If A, B, D are such that A ≾ B and B ≾ D, then A ≾ D.
Theorem 2 (Complement)
For any A,B: A ≾ B iff A∁ ≿ B∁.
Theorem 3 (Fundamental property of relative likelihoods)
If A ⊂ B then A ≾ B. Furthermore, ∅ ≾ A ≾ S for any event A.
Theorem 4
For a given likelihood relation between events, there exists a unique probability distribution P such that
P(A) ≥ P(B)⇔ A ≿ B
Similar results can be derived for conditional likelihoods and probabilities.
Rewards
We are going to receive a reward r from a set R of possible rewards.
We prefer some rewards to others.
Example 5 (Possible sets of rewards R)
R is a set of tickets to different musical events.
R is a set of financial commodities.
When we cannot select rewards directly
In most problems, we cannot just choose which reward to receive.
We can only specify a distribution on rewards.
Example 6 (Route selection)
Each reward r ∈ R is the time it takes to travel from A to B.
Route P1 is faster than P2 in heavy traffic and vice-versa.
Which route should be preferred, given a certain probability for heavy traffic?
In order to choose between random rewards, we use the concept of utility.
Utility
Definition 7 (Utility)
The utility is a function U : R → R, such that for all a, b ∈ R
a ≿∗ b iff U(a) ≥ U(b), (1.1)
The expected utility of a distribution P on R is:
EP(U) =
∫R
U(r) dP(r) (1.2)
Assumption 3 (The expected utility hypothesis)
The utility of P is equal to the expected utility of the reward under P. Consequently,
P ≿∗ Q iff EP(U) ≥ EQ(U). (1.3)
Example 8
r                      U(r)   P     Q
did not enter            0    1     0
paid 1 CU and lost      −1    0     0.99
paid 1 CU and won 10     9    0     0.01

Table: A simple gambling problem

            P     Q
E(U | ·)    0    −0.9

Table: Expected utility for the gambling problem
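As a sanity check, the two expected utilities follow directly from the definition EP(U) = ∑r U(r) P(r). A minimal sketch (the variable names are mine; the numbers are from the table above):

```python
# Rewards, their utilities U(r), and the two distributions P, Q from the table.
utilities = {"did not enter": 0, "paid 1 CU and lost": -1, "paid 1 CU and won 10": 9}
P = {"did not enter": 1.0, "paid 1 CU and lost": 0.0, "paid 1 CU and won 10": 0.0}
Q = {"did not enter": 0.0, "paid 1 CU and lost": 0.99, "paid 1 CU and won 10": 0.01}

def expected_utility(dist, U):
    """E_dist(U) = sum over rewards r of U(r) * dist(r)."""
    return sum(dist[r] * U[r] for r in dist)

print(expected_utility(P, utilities))  # 0.0
print(expected_utility(Q, utilities))  # ≈ -0.9
```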
The St. Petersburg Paradox
A simple game [Bernoulli, 1713]
A fair coin is tossed until a head is obtained.
If the first head is obtained on the n-th toss, our reward will be 2^n currency units.
How much are you willing to pay, to play this game once?
The probability to stop at round n is 2^{−n}.
Thus, the expected monetary gain of the game is
∑_{n=1}^∞ 2^n · 2^{−n} = ∞.
If your utility function were linear, you'd be willing to pay any amount to play.
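The divergence, and Bernoulli's classical resolution via a concave (logarithmic) utility, which goes beyond this slide, can be checked numerically by truncating the sum at N rounds:

```python
import math

# Truncate the St. Petersburg sum at N rounds.
# Linear utility: sum_{n<=N} 2^{-n} * 2^n = N, growing without bound.
# Log utility (Bernoulli's resolution, an assumption added here):
# sum_{n<=N} 2^{-n} * log(2^n), which approaches the finite value 2 ln 2.
def truncated_values(N):
    linear = sum(2**-n * 2**n for n in range(1, N + 1))
    log_u = sum(2**-n * math.log(2**n) for n in range(1, N + 1))
    return linear, log_u

for N in (10, 20, 40):
    linear, log_u = truncated_values(N)
    print(N, linear, round(log_u, 4))
# The linear column equals N; the log column converges to 2 ln 2 ≈ 1.3863.
```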
Summary
We can subjectively indicate which events we think are more likely.
Using relative likelihoods, we can define a subjective probability P for all events.
Similarly, we can subjectively indicate preferences for rewards.
We can determine a utility function for all rewards.
Hypothesis: we prefer the probability distribution (over rewards) with the highest expected utility.
Concave utility functions imply risk aversion (and convex ones, risk-taking).
Experimental design and Markov decision processes
The following problems
Shortest path problems.
Optimal stopping problems.
Reinforcement learning problems.
Experiment design (clinical trial) problems.
Advertising.
can all be formalised as Markov decision processes.
Applications
Robotics.
Economics.
Automatic control.
Resource allocation
Bandit problems
Bandit problems
Applications
Efficient optimisation.
Online advertising.
Clinical trials.
Robot scientist.
The stochastic n-armed bandit problem
Actions and rewards
A set of actions A = {1, . . . , n}.
Each action gives you a random reward with distribution P(rt | at = i).
The expected reward of the i-th arm is ρi ≜ E(rt | at = i).
Utility
The utility is the sum of the rewards obtained:
U ≜ ∑_t rt.
Policy
Definition 9 (Policies)
A policy π is an algorithm for taking actions given the observed history.
Pπ(at+1 | a1, r1, . . . , at , rt)
is the probability of the next action at+1.
Bernoulli bandits
Example 10 (Bernoulli bandits)
Consider n Bernoulli distributions with parameters ωi (i = 1, . . . , n), such that rt | at = i ∼ Bern(ωi). Then,
P(rt = 1 | at = i) = ωi,  P(rt = 0 | at = i) = 1 − ωi (2.1)
Then the expected reward for the i-th bandit is ρi ≜ E(rt | at = i) = ωi.
Exercise 1 (The optimal policy under perfect knowledge)
If we know ωi for all i , what is the best policy?
A At every step, play the bandit i with the greatest ωi .
B At every step, play the bandit i with probability increasing with ωi .
C There is no right answer. It depends on the horizon T .
D It is too complicated.
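A quick simulation under perfect knowledge: since ρi = ωi, always pulling argmax_i ωi maximises the expected reward per step for the utility U = ∑_t rt. A sketch (the ω values below are illustrative, not from the slides):

```python
import random

# Known Bernoulli arm parameters (illustrative values).
random.seed(0)
omega = [0.3, 0.5, 0.7]
T = 10_000

def play(arm):
    """Draw one Bernoulli reward from the chosen arm."""
    return 1 if random.random() < omega[arm] else 0

# Under perfect knowledge, play the arm with the largest omega_i.
best = max(range(len(omega)), key=lambda i: omega[i])
total = sum(play(best) for _ in range(T))
print(best, total / T)  # empirical mean reward close to max(omega) = 0.7
```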
The unknown reward case
Say you keep a running average of the reward obtained by each arm
ρt,i = Rt,i/nt,i
where nt,i is the number of times you played arm i and Rt,i the total reward received from i, so that whenever you play at = i:
Rt+1,i = Rt,i + rt , nt+1,i = nt,i + 1.
You could then choose to play the strategy
at = argmax_i ρt,i.
What should the initial values n0,i ,R0,i be?
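The running-average scheme above can be sketched as follows. This is a minimal illustration: the arm means ω and the initial values are mine to choose, and n0 = 0 would need a special rule for unplayed arms to avoid dividing by zero.

```python
import random

random.seed(1)
omega = [0.4, 0.6]          # illustrative arm means
n0, R0 = 1.0, 1.0           # initial counts/rewards; try 1.0 or 10.0 as in the plots
n = [n0] * len(omega)       # n_{t,i}: times arm i was played (plus prior)
R = [R0] * len(omega)       # R_{t,i}: total reward from arm i (plus prior)

total = 0
for t in range(1000):
    # Greedy choice: a_t = argmax_i rho_{t,i} = argmax_i R_i / n_i.
    a = max(range(len(omega)), key=lambda i: R[i] / n[i])
    r = 1 if random.random() < omega[a] else 0
    R[a] += r
    n[a] += 1
    total += r
print(total / 1000)  # running average; compare with rho_1 = 0.4, rho_2 = 0.6
```

Large optimistic initial values (e.g. n0 = R0 = 10, so every arm starts with estimate 1) force more early exploration; n0 = R0 = 0 leaves the estimate undefined for unplayed arms.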
The uniform policy
[Figure: the running average ∑_{k=1}^t rk/t over 1000 steps, compared with the arm means ρ1 and ρ2]
The greedy policy
[Figure: the running average ∑_{k=1}^t rk/t over 1000 steps, compared with ρ1 and ρ2]
For n0,i = R0,i = 0.
The greedy policy
[Figure: the running average ∑_{k=1}^t rk/t over 1000 steps, compared with ρ1 and ρ2]
For n0,i = R0,i = 1.
The greedy policy
[Figure: the running average ∑_{k=1}^t rk/t over 1000 steps, compared with ρ1 and ρ2]
For n0,i = R0,i = 10.
Markov processes
Markov process
(diagram: st−1 → st → st+1)
Definition 11 (Markov Process – or Markov Chain)
The sequence {st | t = 1, . . .} of random variables st : Ω → S is a Markov process if
P(st+1 | st , . . . , s1) = P(st+1 | st). (3.1)
st is the state of the Markov process at time t.
P(st+1 | st) is the transition kernel of the process.
The state of an algorithm
Observe that the R, n vectors of our greedy bandit algorithm form a Markov process. They also summarise our belief about which arm is the best.
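A minimal simulation of Definition 11: sampling a two-state chain from its transition kernel, where each step depends only on the current state (the kernel values are illustrative):

```python
import random

random.seed(0)
# Transition kernel P(s_{t+1} | s_t) as a row-stochastic matrix.
P = [[0.9, 0.1],   # from state 0
     [0.2, 0.8]]   # from state 1

def step(s):
    """Sample the next state given only the current state s."""
    return 0 if random.random() < P[s][0] else 1

s, counts = 0, [0, 0]
for t in range(100_000):
    s = step(s)
    counts[s] += 1
# Long-run state frequencies approach the stationary distribution,
# here [2/3, 1/3] (balance: pi_0 * 0.1 = pi_1 * 0.2).
print([c / 100_000 for c in counts])
```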
Reinforcement learning
The reinforcement learning problem.
Learning to act in an unknown environment, by interaction and reinforcement.
The environment has a changing state st .
The agent observes the state st (simplest case).
The agent takes action at .
It receives rewards rt .
The goal (informally)
Maximise the total reward ∑_t rt.
Types of environments
Markov decision processes (MDPs).
Partially observable MDPs (POMDPs).
(Partially observable) Markov games.
First, we deal with the case where the environment µ is known.
Markov decision processes
Markov decision processes (MDP).
At each time step t:
We observe the state st ∈ S.
We take an action at ∈ A.
We receive a reward rt ∈ R.
(diagram: st , at → st+1, rt)
Markov property of the reward and state distribution
Pµ(st+1 | st , at) (Transition distribution)
Pµ(rt | st , at) (Reward distribution)
The agent

The agent's policy π
P_π(a_t | s_t, . . . , s_1, a_{t−1}, . . . , a_1) (history-dependent policy)
P_π(a_t | s_t) (Markov policy)

Definition 12 (Utility)
Given a horizon T, the utility can be defined as

U_t ≜ ∑_{k=0}^{T−t} r_{t+k}. (3.2)

The agent wants to find a π maximising the expected total future reward

E^π_µ U_t = E^π_µ ∑_{k=0}^{T−t} r_{t+k}. (expected utility)
State value function

V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s) (3.3)

The optimal policy π*

π*(µ) : V^{π*(µ)}_{t,µ}(s) ≥ V^π_{t,µ}(s) ∀π, t, s (3.4)

dominates all other policies π everywhere in S. The optimal value function V*

V*_{t,µ}(s) ≜ V^{π*(µ)}_{t,µ}(s), (3.5)

is the value function of the optimal policy π*.
Deterministic shortest-path problems

(figure: grid maze with goal state X)

Properties
T → ∞.
r_t = −1 unless s_t = X, in which case r_t = 0.
P_µ(s_{t+1} = X | s_t = X) = 1.
A = {North, South, East, West}.
Transitions are deterministic and walls block movement.
(figure: the same maze, with each cell labelled by its distance to the goal, e.g. 14, 13, 12, . . . , 2, 1, 0)

Properties
γ = 1, T → ∞.
r_t = −1 unless s_t = X, in which case r_t = 0.
The length of the shortest path from s equals the negative value of the optimal policy.
Also called the cost-to-go.
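The equivalence between shortest-path length and negative optimal value can be checked directly: with r_t = −1 per step and γ = 1, V*(s) = −(distance from s to X). A minimal sketch on a hypothetical 3×4 maze, computing the distances with breadth-first search from the goal:

```python
from collections import deque

def shortest_path_lengths(grid, goal):
    """BFS distances on a 4-connected grid; '#' cells are walls."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    q = deque([goal])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != '#' and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                q.append((nr, nc))
    return dist

# Hypothetical maze; X marks the goal at cell (2, 3).
maze = ["..#.",
        "..#.",
        "...X"]
d = shortest_path_lengths(maze, goal=(2, 3))
# The optimal value of each state s is then V*(s) = -d[s].
```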
Stochastic shortest path problem with a pit

(figure: grid maze with pit O and goal X)

Properties
T → ∞.
r_t = −1 everywhere, but r_t = 0 at X and r_t = −100 at O, and there the problem ends.
P_µ(s_{t+1} = X | s_t = X) = 1.
A = {North, South, East, West}.
Each move goes in a random direction with probability ω. Walls block movement.
Figure: Pit maze solutions for two values of ω: (a) ω = 0.1, (b) ω = 0.5; (c) the value scale, from −120 to 0.
Exercise 2
Why should we only take the shortcut in (a)?
Why does the agent commit suicide at the bottom?
Episodic problems: policy evaluation and backwards induction
How to evaluate a policy

V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s) (4.1)
            = ∑_{k=0}^{T−t} E^π_µ(r_{t+k} | s_t = s) (4.2)
            = E^π_µ(r_t | s_t = s) + E^π_µ(U_{t+1} | s_t = s) (4.3)
            = E^π_µ(r_t | s_t = s) + ∑_{i∈S} V^π_{µ,t+1}(i) P^π_µ(s_{t+1} = i | s_t = s). (4.4)

This derivation directly gives a number of policy evaluation algorithms.
Monte-Carlo policy evaluation

for s ∈ S do
  for k = 1, . . . , K do
    Execute policy π from s and record the total reward:
      R_k(s) = ∑_{t=1}^T r_{t,k}.
  end for
  Calculate the estimate:
    v_1(s) = (1/K) ∑_{k=1}^K R_k(s).
end for
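The loop above translates almost line for line into code. A minimal sketch, assuming access to a simulator `step(s, a, rng)` returning the next state and reward; the three-state toy chain is illustrative:

```python
import random

def mc_evaluate(step, policy, states, T, K, rng=None):
    """Monte-Carlo policy evaluation: average K total rewards per start state.
    step(s, a, rng) -> (next_state, reward) is the environment simulator."""
    rng = rng or random.Random(0)
    v = {}
    for s0 in states:
        total = 0.0
        for _ in range(K):
            s, ret = s0, 0.0
            for _ in range(T):
                s, r = step(s, policy(s), rng)
                ret += r
            total += ret
        v[s0] = total / K
    return v

# Toy deterministic chain 0 -> 1 -> 2 (absorbing); reward -1 until state 2.
def step(s, a, rng):
    return min(s + 1, 2), (0.0 if s == 2 else -1.0)

v = mc_evaluate(step, policy=lambda s: 0, states=[0, 1, 2], T=5, K=3)
```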
Backwards induction policy evaluation

for state s ∈ S, t = T, . . . , 1 do
  Update the values:
    v_t(s) = E^π_µ(r_t | s_t = s) + ∑_{j∈S} P^π_µ(s_{t+1} = j | s_t = s) v_{t+1}(j). (4.5)
end for

(figure: backup diagram over s_t, a_t, r_t, s_{t+1}, with rewards 1 and 0 and transition probabilities 0.5/0.5, 0.7/0.3, 0.4/0.6; the next-step values 0.7 and 1.4 combine to v_t(s) = 0.5 · 0.7 + 0.5 · 1.4 = 1.05)
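Equation (4.5) gives a direct dynamic-programming implementation. A sketch with an illustrative two-state chain (the state space, rewards, and kernel are not from the slides); values beyond the horizon are zero:

```python
def backwards_induction_eval(S, T, reward, P):
    """Finite-horizon policy evaluation, eq. (4.5):
    reward[s] = E^pi[r_t | s_t = s], P[s][j] = P^pi(s_{t+1} = j | s_t = s)."""
    v = [[0.0] * len(S) for _ in range(T + 2)]  # v[t][s], v[T+1] = 0
    for t in range(T, 0, -1):
        for s in S:
            v[t][s] = reward[s] + sum(P[s][j] * v[t + 1][j] for j in S)
    return v

# Illustrative two-state chain: state 0 pays 1 and may move to state 1.
S = [0, 1]
reward = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
v = backwards_induction_eval(S, T=3, reward=reward, P=P)
```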
![Page 71: Markov decision processes Christos Dimitrakakis · Planning under uncertainty Markov decision processes Christos Dimitrakakis Chalmers August 31, 2014. . . . . .](https://reader031.fdocuments.us/reader031/viewer/2022041021/5ed099d8e2e54a54101262ae/html5/thumbnails/71.jpg)
. . . . . .
Backwards induction policy optimization
for State s ∈ S , t = T , . . . , 1 doUpdate values
vt(s) = maxa
Eµ(rt | st = s, at = a)+∑j∈S
Pµ(st+1 = j | st = s, at = a)vt+1(j),
(4.6)end for
..
st
.
at
.
rt
.
st+1
.?.
0.7
.
1.4
.
1
.
0
.
1
.
0
.?
.
?
.0.7
.
0.3
.
0.4
. 0.6
![Page 72: Markov decision processes Christos Dimitrakakis · Planning under uncertainty Markov decision processes Christos Dimitrakakis Chalmers August 31, 2014. . . . . .](https://reader031.fdocuments.us/reader031/viewer/2022041021/5ed099d8e2e54a54101262ae/html5/thumbnails/72.jpg)
. . . . . .
Backwards induction policy optimization

for state s ∈ S, t = T, . . . , 1 do
  Update the values:
    v_t(s) = max_a [ E_µ(r_t | s_t = s, a_t = a) + ∑_{j∈S} P_µ(s_{t+1} = j | s_t = s, a_t = a) v_{t+1}(j) ]. (4.6)
end for

(figure: the same backup diagram with a max over actions; the maximising action is taken with probability 1, giving v_t(s) = 1.4)
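Adding the max over actions of eq. (4.6) turns evaluation into optimization and also yields the greedy, time-dependent policy. The two-action MDP below is illustrative, not from the slides:

```python
def backwards_induction_opt(S, A, T, reward, P):
    """Finite-horizon optimal values and greedy policy, eq. (4.6):
    reward[s][a] = E[r_t | s_t = s, a_t = a], P[s][a][j] = P(s_{t+1} = j | s, a)."""
    v = [[0.0] * len(S) for _ in range(T + 2)]
    policy = [[0] * len(S) for _ in range(T + 2)]
    for t in range(T, 0, -1):
        for s in S:
            # Action values at (t, s), then take the maximising action.
            q = [reward[s][a] + sum(P[s][a][j] * v[t + 1][j] for j in S)
                 for a in A]
            policy[t][s] = max(A, key=lambda a: q[a])
            v[t][s] = q[policy[t][s]]
    return v, policy

# Two states, two actions: action 1 moves to the rewarding state 0.
S, A = [0, 1], [0, 1]
reward = [[1.0, 1.0], [0.0, 0.0]]
P = [[[1.0, 0.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
v, policy = backwards_induction_opt(S, A, T=3, reward=reward, P=P)
```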
Continuing, discounted problems: Markov chain theory for discounted problems; infinite-horizon MDP algorithms
Discounted total reward

U_t = lim_{T→∞} ∑_{k=t}^T γ^k r_k,  γ ∈ (0, 1)

Definition 13
A policy π is stationary if π(a_t | s_t) does not depend on t.

Remark 1
We can use the Markov chain kernel P_{µ,π} to write the expected utility vector as

v_π = ∑_{t=0}^∞ γ^t P^t_{µ,π} r. (5.1)
Theorem 14
For any stationary policy π, v_π is the unique solution of the fixed-point equation

v = r + γ P_{µ,π} v. (5.2)

In addition, the solution is

v_π = (I − γ P_{µ,π})^{−1} r. (5.3)

Example 15
This is similar to the geometric series ∑_{t=0}^∞ α^t = 1/(1 − α).
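Theorem 14 can be checked numerically: because γ < 1, iterating v ← r + γ P v is a contraction and converges to the unique fixed point v_π = (I − γP)^{−1} r. The kernel P and reward vector r below are illustrative:

```python
def policy_value(P, r, gamma, iters=2000):
    """Iterate the fixed point v <- r + gamma * P v (Theorem 14, eq. 5.2)."""
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + gamma * sum(P[s][j] * v[j] for j in range(n))
             for s in range(n)]
    return v

# Illustrative two-state transition kernel and reward vector.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9
v = policy_value(P, r, gamma)
# v now satisfies v = r + gamma * P v to numerical precision.
```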
Backwards induction for discounted infinite-horizon problems

We can also apply backwards induction to the infinite-horizon case.
The resulting policy is stationary.
So the memory required does not grow with T.

Value iteration
for n = 1, 2, . . . and s ∈ S do
  v_n(s) = max_a [ r(s, a) + γ ∑_{s′∈S} P_µ(s′ | s, a) v_{n−1}(s′) ]
end for
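The update above can be run until the values stop changing. A minimal sketch with an illustrative two-state, two-action MDP (action 1 moves towards the rewarding state 0, so the optimal values are 1/(1 − γ) and γ/(1 − γ)):

```python
def value_iteration(S, A, P, r, gamma, tol=1e-8):
    """Value iteration: v_n(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) v_{n-1}(s') ]."""
    v = {s: 0.0 for s in S}
    while True:
        v_new = {s: max(r[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in S)
                        for a in A)
                 for s in S}
        if max(abs(v_new[s] - v[s]) for s in S) < tol:
            return v_new
        v = v_new

# Illustrative MDP: state 0 pays reward 1; action 1 always leads to state 0.
S, A = [0, 1], [0, 1]
r = {0: {0: 1.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}
P = {0: {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}
v = value_iteration(S, A, P, r, gamma=0.9)
```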
Policy iteration

Input: µ, S. Initialise v_0.
for n = 1, 2, . . . do
  π_{n+1} = argmax_π r + γ P_π v_n  // policy improvement
  v_{n+1} = V^{π_{n+1}}_µ  // policy evaluation
  break if π_{n+1} = π_n
end for
Return π_n, v_n.
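A sketch of the same loop in code, reusing the illustrative two-state MDP from above; evaluation is done by iterating the fixed point of eq. (5.2) rather than by matrix inversion:

```python
def policy_iteration(S, A, P, r, gamma):
    """Policy iteration: alternate policy evaluation and greedy improvement."""
    pi = {s: A[0] for s in S}
    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v.
        v = {s: 0.0 for s in S}
        for _ in range(1000):
            v = {s: r[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * v[s2]
                                              for s2 in S)
                 for s in S}
        # Greedy policy improvement.
        pi_new = {s: max(A, key=lambda a: r[s][a] + gamma *
                         sum(P[s][a][s2] * v[s2] for s2 in S))
                  for s in S}
        if pi_new == pi:      # stop when the policy no longer changes
            return pi, v
        pi = pi_new

# Illustrative MDP: state 0 pays reward 1; action 1 always leads to state 0.
S, A = [0, 1], [0, 1]
r = {0: {0: 1.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}
P = {0: {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}
pi, v = policy_iteration(S, A, P, r, gamma=0.9)
```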
Summary

Markov decision processes model controllable dynamical systems.
Optimal policies maximise expected utility and can be found with:
Backwards induction / value iteration.
Policy iteration.
The MDP state can be seen as:
The state of a dynamic controllable process.
The internal state of an agent.
Bayesian reinforcement learning: reinforcement learning; bounds on the utility; properties of ABC
The reinforcement learning problem

Learning to act in an unknown world, by interaction and reinforcement.

World µ; policy π; at time t:
µ generates an observation x_t ∈ X.
We take an action a_t ∈ A using π.
µ gives us a reward r_t ∈ R.

Definition 16 (Expected utility)

E^π_µ U_t = E^π_µ ∑_{k=t}^T r_k

When µ is known, we can calculate max_π E^π_µ U. However, knowing µ is contrary to the problem definition.
When µ is not known

Bayesian idea: use a subjective belief ξ(µ) on the model space M.
Start with an initial belief ξ(µ).
The probability of observing the history h under policy π and model µ is P^π_µ(h).
We can use this to adjust our belief via Bayes' theorem:

ξ(µ | h, π) ∝ P^π_µ(h) ξ(µ)

We can thus conclude which µ is more likely.

The subjective expected utility

U*_ξ ≜ max_π E^π_ξ U = max_π ∑_µ (E^π_µ U) ξ(µ).

This integrates planning and learning, and the exploration-exploitation trade-off.
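The belief update and the mixture over models can be sketched for a finite M. The two coin-flip "models" and the illustrative values standing in for E^π_µ U are hypothetical:

```python
def posterior(prior, likelihoods):
    """Bayes' theorem over a finite model class M:
    xi(mu | h) is proportional to P_mu(h) * xi(mu)."""
    weighted = {mu: likelihoods[mu] * p for mu, p in prior.items()}
    z = sum(weighted.values())
    return {mu: w / z for mu, w in weighted.items()}

# Two hypothetical coin models; the history h is 3 heads in a row.
prior = {"fair": 0.5, "biased": 0.5}
lik = {"fair": 0.5 ** 3, "biased": 0.8 ** 3}
post = posterior(prior, lik)

# The subjective expected utility weights each model's value by xi(mu | h).
values = {"fair": 1.0, "biased": 2.0}  # illustrative stand-ins for E^pi_mu U
u_pi = sum(post[mu] * values[mu] for mu in post)
```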
Bounds on the ξ-optimal utility U*_ξ ≜ max_π E^π_ξ U

(figure: the expected utility E U plotted against the belief ξ; the lines U*_{µ1} ("no trap") and U*_{µ2} ("trap") bound the utilities of fixed policies π_1 and π_2; at a given belief ξ_1, the ξ-optimal policy π*_{ξ1} attains U*_{ξ1}, and U*_ξ traces the upper envelope over policies)
ABC (Approximate Bayesian Computation) RL¹

How to deal with an arbitrary model space M
The models µ ∈ M may be non-probabilistic simulators.
We may not know how to choose the simulator parameters.

Overview of the approach
Place a prior on the simulator parameters.
Observe some data h on the real system.
Approximate the posterior by statistics on simulated data.
Calculate a near-optimal policy for the posterior.

Results
We prove soundness under general properties of the statistics.
In practice, it can require much less data than a general model.

¹Dimitrakakis, Tziortiotis. ABC Reinforcement Learning. ICML 2013.
Cover tree Bayesian reinforcement learning

The model idea
Cover the space using a cover tree.
Fit a linear model for each set.
The tree defines a distribution on piecewise-linear models.

Algorithm overview
Build the tree online.
Do Bayesian inference on the tree.
Sample a model from the tree.
Get a policy for the sampled model.

(figures: the cover tree sets c_0, . . . , c_4; the graphical model over s_t, a_t, c_t, θ_t, s_{t+1}; and a scatter plot of 10⁴ sampled transitions (s_t, s_{t+1}))
![Page 106: Markov decision processes Christos Dimitrakakis · Planning under uncertainty Markov decision processes Christos Dimitrakakis Chalmers August 31, 2014. . . . . .](https://reader031.fdocuments.us/reader031/viewer/2022041021/5ed099d8e2e54a54101262ae/html5/thumbnails/106.jpg)
. . . . . .
A comparison
ABC RL
Any simulator can be used ⇒ enables detailed prior knowledge
Our theoretical results prove soundness of ABC.
Downside: Computationally intensive.
Cover Tree Bayesian RL
Very general model.
Inference in logarithmic time due to the tree structure.
Downside: Hard to insert domain-specific prior knowledge.
Future work
Advanced algorithms (e.g. tree or gradient methods) for policy optimisation.
Unknown MDPs can be handled in a Bayesian framework. This defines a belief-augmented MDP with
A state for the MDP. A state for the agent’s belief.
The Bayes-optimal utility is convex, enabling approximations.
A big problem is specifying the “right” prior.
Questions?
ABC (Approximate Bayesian Computation)
When there is no probabilistic model (Pµ is not available): ABC!
A prior ξ on a class of simulators M. History h ∈ H from policy π.
Statistic f : H → (W, ∥ · ∥). Threshold ϵ > 0.
ABC-RL using Thompson sampling
do
    µ ∼ ξ, h′ ∼ Pπµ            // sample a model and history
until ∥f(h′) − f(h)∥ ≤ ϵ       // until the statistics are close
µ(k) = µ                       // approximate posterior sample µ(k) ∼ ξϵ(· | ht)
π(k) ≈ argmaxπ Eπµ(k) Ut       // approximate optimal policy for sample
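The rejection loop above can be sketched for a toy setting. Everything in this sketch is an illustrative assumption rather than the slides' setup: models are Bernoulli reward simulators with unknown mean µ, the prior ξ is uniform on [0, 1], the statistic f is the cumulative reward, and the final planning step (π(k) ≈ argmax) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(mu, T=50):
    # toy simulator standing in for P^pi_mu: T Bernoulli(mu) rewards
    return (rng.random(T) < mu).astype(float)

def f(h):
    return h.sum()                       # statistic: cumulative reward

def abc_posterior_sample(h_obs, eps=2.0):
    # the do/until loop from the slide: accept mu when statistics are close
    while True:
        mu = rng.random()                # mu ~ xi (uniform prior on [0, 1])
        h = simulate(mu)                 # h' ~ P^pi_mu
        if abs(f(h) - f(h_obs)) <= eps:
            return mu                    # approximate sample from xi_eps(. | h)

h_obs = simulate(0.7)                    # "observed" history from mu = 0.7
samples = [abc_posterior_sample(h_obs) for _ in range(200)]
print(round(float(np.mean(samples)), 2))  # concentrates near the true mu = 0.7
# (the planning step pi(k) = argmax E U for the sampled model is omitted)
```

The key point of the slide survives even in this toy form: no likelihood Pµ(h) is ever evaluated, only forward simulation and a distance between statistics.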
Example 17 (Cumulative features)
Feature function ϕ : X → Rk. f(h) ≜ ∑t ϕ(xt)
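The cumulative-feature statistic is straightforward to implement; the quadratic feature map ϕ below is a hypothetical choice for illustration.

```python
import numpy as np

def phi(x):
    # feature function phi : X -> R^k with k = 2 (a hypothetical choice)
    return np.array([x, x ** 2])

def f(h):
    # f(h) = sum_t phi(x_t), summing features over the states in the history
    return sum(phi(x) for x in h)

print(f([1.0, 2.0, 3.0]))                # component sums: 6 and 14
```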
Example 17 (Utility)
f(h) ≜ ∑t rt
The approximate posterior ξϵ(· | h)
Corollary 17
If f is a sufficient statistic and ϵ = 0, then ξ(· | h) = ξϵ(· | h).
Assumption 4 (A1. Lipschitz log-probabilities)
For the policy π, ∃L > 0 s.t. ∀h, h′ ∈ H and ∀µ ∈ M:
|ln [Pπµ(h) / Pπµ(h′)]| ≤ L ∥f(h) − f(h′)∥
Theorem 18 (The approximate posterior ξϵ(· | h) is close to ξ(· | h))
If A1 holds then ∀ϵ > 0:
D (ξ(· | h) ∥ ξϵ(· | h)) ≤ 2Lϵ + ln |Ahϵ|,   (6.1)
where Ahϵ ≜ {z ∈ H | ∥f(z) − f(h)∥ ≤ ϵ}.
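A numerical sanity check of Corollary 17 in a tiny discrete setting (an illustrative assumption, not taken from the slides): for Bernoulli simulators the number of successes is a sufficient statistic, so with ϵ = 0 the ABC posterior coincides with the exact posterior.

```python
from math import comb

models = [0.3, 0.7]                 # candidate simulators, uniform prior xi
T, k = 10, 7                        # observed history: 7 successes in 10 trials

# Exact posterior: proportional to the likelihood of the observed sequence.
exact = [mu ** k * (1 - mu) ** (T - k) for mu in models]
Z = sum(exact)
exact = [p / Z for p in exact]

# ABC posterior with eps = 0: proportional to P(f(h') = k | mu), a binomial
# probability; the comb(T, k) factor is common to all models and cancels
# on normalisation, so the two posteriors agree.
abc = [comb(T, k) * mu ** k * (1 - mu) ** (T - k) for mu in models]
Z = sum(abc)
abc = [p / Z for p in abc]

print(exact, abc)                   # identical up to rounding
```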