Bayesian Sparse Sampling for On-line Reward Optimization
Presented in the Value of Information Seminar at NIPS 2005.
Based on the ICML 2005 paper by Dale Schuurmans et al.
Background Perspective

- Be Bayesian about reinforcement learning
- Ideal representation of uncertainty for action selection
- Computational barriers
Exploration vs. Exploitation

- Bayes decision theory: value of information measured by ultimate return in reward
- Choose actions to maximize expected value
- Exploration/exploitation tradeoff implicitly handled as a side effect
Bayesian Approach

- Conceptually clean but computationally disastrous
versus
- Conceptually disastrous but computationally clean
Overview

- Efficient lookahead search for Bayesian RL
- Sparser sparse sampling
- Controllable computational cost
- Higher quality action selection than current methods

Spectrum of action-selection strategies:
- Greedy
- Epsilon-greedy
- Boltzmann (Luce 1959)
- Thompson sampling (Thompson 1933)
- Bayes optimal (Hee 1978)
- Interval estimation (Lai 1987; Kaelbling 1994)
- Myopic value of perfect info. (Dearden, Friedman, Andre 1999)
- Standard sparse sampling (Kearns, Mansour, Ng 2001)
- Péret & Garcia (Péret & Garcia 2004)
Sequential Decision Making
How to make an optimal decision? General case, fixed point equations ("planning"):

V(s) = sup_a Q(s,a)
Q(s,a) = E_{r,s'|s,a}[ r + V(s') ]

and one level deeper, V(s') = max_{a'} Q(s',a') with Q(s',a') = E_{r',s''|s',a'}[ r' + V(s'') ].

[Figure: lookahead tree rooted at state s, branching over actions a, rewards r, states s', actions a', rewards r', and states s''; Q and V values are backed up from the leaves by expectation and maximization.]

Requires the model P(r,s'|s,a).
This is the finite-horizon, finite-action, finite-reward case.
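As a concrete illustration (not from the slides), the fixed-point equations V(s) = sup_a Q(s,a) and Q(s,a) = E_{r,s'|s,a}[r + V(s')] can be solved by backward induction when the model P(r,s'|s,a) is known. The two-state, two-action model below is hypothetical:

```python
# Backward induction for a finite-horizon MDP (hypothetical 2-state, 2-action model).
# V(s) = max_a Q(s,a);  Q(s,a) = E_{r,s'|s,a}[ r + V(s') ]
states, actions = [0, 1], [0, 1]

# model[s][a] = list of (prob, reward, next_state): a made-up P(r,s'|s,a)
model = {
    0: {0: [(1.0, 0.0, 0)], 1: [(0.5, 1.0, 1), (0.5, 0.0, 0)]},
    1: {0: [(1.0, 2.0, 1)], 1: [(1.0, 0.5, 0)]},
}

def plan(horizon):
    V = {s: 0.0 for s in states}          # value at the horizon is zero
    for _ in range(horizon):              # back up one stage at a time
        Q = {(s, a): sum(p * (r + V[s2]) for p, r, s2 in model[s][a])
             for s in states for a in actions}
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
    return V, Q

V, Q = plan(horizon=3)
print(V)
```

Note that with the model in hand this is pure planning; the next slides drop that assumption.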
Reinforcement Learning
[Figure: the same lookahead tree as in the planning case.]

Do not have the model P(r,s'|s,a).
Reinforcement Learning

[Figure: the same lookahead tree again.]

Without the model P(r,s'|s,a), cannot compute E_{r,s'|s,a}[ r + V(s') ].
Bayesian Reinforcement Learning
[Figure: meta-level transition chain: from meta-level state (s, b), decision a yields outcome (r, s', b') under the meta-level model P(r, s', b' | s, b, a); decision a' then yields (r', s'', b'') under P(r', s'', b'' | s', b', a').]

- Meta-level MDP: the meta-level state includes the belief state b = P(θ)
- Prior P(θ) on the model P(r, s' | s, a, θ)
- Choose actions to maximize long-term reward
- Have a model for meta-level transitions!
  - based on posterior update and expectations over base-level MDPs
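For the Bernoulli bandits used later in the experiments, the belief state is conjugate: one Beta(α, β) pair per arm, updated in closed form after each reward. A minimal sketch (function names are mine, not the paper's):

```python
# Belief state for one Bernoulli arm with a Beta prior: b = (alpha, beta).
# Observing reward r in {0, 1} updates the posterior in closed form.
def update(b, r):
    alpha, beta = b
    return (alpha + r, beta + (1 - r))

def mean(b):                       # E[theta | b]: the "mean MDP" estimate
    alpha, beta = b
    return alpha / (alpha + beta)

b = (1, 1)                         # uniform prior
for r in [1, 1, 0]:                # three observed rewards
    b = update(b, r)
print(b, mean(b))                  # (3, 2) 0.6
```

This conjugacy is what makes the meta-level transition model tractable to sample from.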
Bayesian RL Decision Making
[Figure: one-step meta-level lookahead: V(s b) over actions a and rewards r, down to values Q(s b, a) and V(s' b').]

How to make an optimal decision? Solve the planning problem in the meta-level MDP: the optimal Q, V values give Bayes optimal action selection.

Problem: the meta-level MDP is much larger than the base-level MDP, so this is impractical.
Bayesian RL Decision Making
[Figure: one-step lookahead trees, in the meta-level MDP over states (s, b) and in a base-level MDP over states s.]

Current approximation strategies:

- Greedy approach: consider the current belief state b → mean base-level MDP model → point estimate for Q, V → choose the greedy action.
  - But doesn't consider uncertainty
- Thompson approach: current b → draw a base-level MDP model → point estimate for Q, V → choose each action proportional to the probability that it is the max-Q action.
  - ☺ Exploration is based on uncertainty
  - But still myopic
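The Thompson approach can be sketched for Bernoulli arms with Beta beliefs: drawing one θ per arm from its posterior and taking the argmax selects each action with exactly the probability that it is optimal under the current belief. Names below are hypothetical:

```python
# Thompson action selection for Bernoulli bandits (hypothetical sketch):
# draw one theta per arm from its Beta posterior, then pick the argmax.
import random

def thompson_action(beliefs, rng=random):
    samples = [rng.betavariate(a, b) for a, b in beliefs]
    return max(range(len(samples)), key=samples.__getitem__)

beliefs = [(1, 1), (10, 2), (2, 10)]   # arm 1 has the highest posterior mean
print(thompson_action(beliefs))
```

Arm 1 is chosen most often here, but arms 0 and 2 still get sampled in proportion to their posterior chance of being best, which is where the exploration comes from.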
Their Approach

- Try to better approximate Bayes optimal action selection by performing lookahead
- Adapt "sparse sampling" (Kearns, Mansour, Ng)
- Make some practical improvements
Sparse Sampling (Kearns, Mansour, Ng 2001)
Bayesian Sparse Sampling: Observation 1

- Action value estimates are not equally important: need better Q value estimates for some actions but not all
- Preferentially expand the tree under actions that might be optimal (biased tree growth)
- Use Thompson sampling to select which actions to expand
Bayesian Sparse Sampling: Observation 2

- Correct leaf value estimates to the same depth
- Use the mean-MDP Q-value multiplied by the remaining depth: a leaf reached at depth t is assigned value (N − t) · Q̂, where N is the effective horizon

[Figure: tree with leaves at depths t = 1, 2, 3 and effective horizon N = 3.]
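A one-line sketch of this correction, under the assumption that Q̂ is a per-step value estimate taken from the mean MDP (the function name is mine):

```python
# Leaf correction sketch: a leaf reached at depth t, with effective horizon N,
# is credited with (N - t) * q_hat for the remaining unexpanded steps.
def leaf_value(q_hat, t, N):
    return (N - t) * q_hat

print(leaf_value(0.6, t=1, N=3))   # 1.2
```

Without this correction, shallow leaves would look artificially poor compared to deep ones simply because fewer reward terms were accumulated along their paths.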
Bayesian Sparse Sampling: Tree growing procedure

- Descend the sparse tree from the root: Thompson sample actions and sample outcomes, until a new node is added
- Repeat until the tree size limit is reached
- Control computation by controlling tree size

At a node (s, b), Thompson sampling means: 1. sample the prior/posterior for a model; 2. solve its action values; 3. select the optimal action. Then execute the chosen action and observe the reward.
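A minimal sketch of the growing loop for the Bernoulli-bandit case, considerably simplified relative to the paper (all class and function names are hypothetical): each node stores a belief, descent Thompson-samples an arm and draws an outcome from the posterior predictive, and the descent restarts whenever a new leaf is added, until the size limit is hit.

```python
# Simplified tree-growing loop for Bernoulli bandits (hypothetical names).
# Each node holds a belief state: one (alpha, beta) pair per arm.
import random

class Node:
    def __init__(self, beliefs, depth):
        self.beliefs = beliefs            # tuple of (alpha, beta) per arm
        self.depth = depth
        self.children = {}                # (arm, reward) -> Node

def thompson_arm(beliefs, rng):
    samples = [rng.betavariate(a, b) for a, b in beliefs]
    return max(range(len(samples)), key=samples.__getitem__)

def grow_tree(prior, horizon, size_limit, rng):
    root = Node(tuple(prior), depth=0)
    size = 1
    while size < size_limit:
        node = root
        while node.depth < horizon:       # descend until a new node is added
            arm = thompson_arm(node.beliefs, rng)
            a, b = node.beliefs[arm]
            r = 1 if rng.random() < a / (a + b) else 0   # posterior predictive
            key = (arm, r)
            if key not in node.children:  # add one new leaf, restart descent
                bel = list(node.beliefs)
                bel[arm] = (a + r, b + 1 - r)
                node.children[key] = Node(tuple(bel), node.depth + 1)
                size += 1
                break
            node = node.children[key]
    return root, size

rng = random.Random(0)
root, size = grow_tree(prior=[(1, 1)] * 3, horizon=3, size_limit=20, rng=rng)
print(size)
```

The size limit is the single knob controlling computation: growth stops at exactly `size_limit` nodes regardless of the branching factor, which is the "controllable computational cost" the overview promised.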
Simple experiments

- 5 Bernoulli bandits a_1, ..., a_5 with Beta priors
- Sampled a model from the prior and ran the action selection strategies
- Repeated 3000 times; report average accumulated reward per step
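This protocol might be sketched as follows for the Thompson strategy alone, with fewer repeats than the 3000 used in the slides and with hypothetical names throughout:

```python
# Sketch of the evaluation protocol: sample true Bernoulli parameters from a
# Beta(1,1) prior, run Thompson sampling for `horizon` steps, and average the
# per-step reward over repeated trials (300 here for speed; 3000 in the slides).
import random

def run_trial(n_arms, horizon, rng):
    theta = [rng.random() for _ in range(n_arms)]       # true model ~ prior
    beliefs = [(1, 1)] * n_arms
    total = 0.0
    for _ in range(horizon):
        samples = [rng.betavariate(a, b) for a, b in beliefs]
        arm = max(range(n_arms), key=samples.__getitem__)
        r = 1 if rng.random() < theta[arm] else 0       # observe reward
        a, b = beliefs[arm]
        beliefs[arm] = (a + r, b + 1 - r)               # posterior update
        total += r
    return total / horizon

rng = random.Random(1)
avg = sum(run_trial(5, 20, rng) for _ in range(300)) / 300
print(round(avg, 3))
```

Sampling the true model from the prior is what makes the comparison Bayesian: the reported averages estimate expected reward under the prior rather than performance on one fixed problem.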
Five Bernoulli Bandits

[Figure: average reward per step (0.45 to 0.75) vs. horizon (5 to 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]
Five Gaussian Bandits

[Figure: average reward per step (0 to 0.9) vs. horizon (5 to 20) for the same strategies.]