NIPS 2007 Workshop
Welcome!
Hierarchical organization of behavior
• Thank you for coming
• Apologies to the skiers…
• Why we will be strict about timing
• Why we want the workshop to be interactive
RL: Decision making
Goal: maximize reward (minimize punishment)
Rewards/punishments may be delayed
Outcomes may depend on the sequence of actions
Together these create the credit assignment problem
RL in a nutshell: formalization
Components of an RL task: states - actions - transitions - rewards - policy - long-term values
Policy: p(S,a)
State values: V(S)
State-action values: Q(S,a)
[Figure: decision tree. S1 branches left (L) to S2 and right (R) to S3; the four leaf actions yield rewards 4, 0, 2, 2]
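To make the formalization concrete, here is a minimal sketch in Python of the toy task in the figure (the dictionary encoding and names are my own, not from the slides): three states, two actions, deterministic transitions, and the four leaf rewards.

```python
# Hypothetical encoding of the slide's toy task: from S1, action L leads to S2
# and R leads to S3; the actions taken there yield the terminal rewards 4, 0, 2, 2.
ACTIONS = ("L", "R")

T = {("S1", "L"): "S2", ("S1", "R"): "S3"}    # transition model T(S, a) -> S'

R = {("S1", "L"): 0, ("S1", "R"): 0,          # reward model r(S, a)
     ("S2", "L"): 4, ("S2", "R"): 0,
     ("S3", "L"): 2, ("S3", "R"): 2}
```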
RL in a nutshell: forward search
Model-based RL
Model = T(ransitions) and R(ewards)
Learn the model through experience (a cognitive map)
Choosing actions is hard: values must be computed by searching forward through the tree
Goal-directed behavior; cortical
[Figure: the same decision tree, with edge labels L/R and leaf values = 4, = 0, = 2, = 2]
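A sketch of why "choosing actions is hard" for the model-based agent: nothing is cached, so long-term values are computed on demand by searching forward through the learned T and R (reusing the hypothetical encoding above).

```python
def value(state, T, R, actions=("L", "R")):
    """Forward search: best long-term value attainable from `state`."""
    best = float("-inf")
    for a in actions:
        q = R.get((state, a), 0)        # immediate reward r(S, a)
        nxt = T.get((state, a))         # None once we fall off the tree
        if nxt is not None:
            q += value(nxt, T, R, actions)   # recurse into the subtree
        best = max(best, q)
    return best

# value("S1", T, R) == 4: choose L to reach S2, then L again for the reward of 4.
```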
RL in a nutshell: cached values
Model-free RL
Trick #1: Long-term values are recursive
Q(S,a) = r(S,a) + V(S_next)
Equivalently: Q(S,a) = r(S,a) + max_a' Q(S',a')
Temporal difference (TD) learning: start with an initial (wrong) Q(S,a), then update from prediction errors:
PE = r(S,a) + max_a' Q(S',a') - Q(S,a)
Q(S,a)_new = Q(S,a)_old + PE
[Figure: the same decision tree with states S1, S2, S3 and rewards 4, 0, 2, 2]
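The slide's TD rule, written out as code (a sketch: the slide's update has no explicit learning rate, so `alpha` here is an added, conventional ingredient).

```python
Q = {}   # cached state-action values; missing entries start at 0 (i.e., wrong)

def td_update(S, a, r, S_next, next_actions, alpha=0.1):
    """One TD step: PE = r(S,a) + max_a' Q(S',a') - Q(S,a)."""
    v_next = max((Q.get((S_next, a2), 0.0) for a2 in next_actions), default=0.0)
    pe = r + v_next - Q.get((S, a), 0.0)            # prediction error
    Q[(S, a)] = Q.get((S, a), 0.0) + alpha * pe     # nudge the cached value
    return pe
```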
RL in a nutshell: cached values
Model-free RL
Temporal difference learning
Trick #2: Can learn values without a model
Choosing actions is easy (but lots of practice is needed to learn)
Habitual behavior; basal ganglia
Cached values for the toy task:

| State | Q(S,L) | Q(S,R) |
|-------|--------|--------|
| S1    | 4      | 2      |
| S2    | 4      | 0      |
| S3    | 2      | 2      |
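And why "choosing actions is easy" here: habitual choice is a table lookup over the cached values, with no forward search (sketch, using the hypothetical Q dictionary from above).

```python
def choose_action(S, Q, actions=("L", "R")):
    """Greedy habitual choice: pick the action with the largest cached Q."""
    return max(actions, key=lambda a: Q.get((S, a), 0.0))

# With the converged table on this slide: choose_action("S1", Q) -> "L" (4 beats 2).
```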
RL in real-world tasks…
Model-based vs. model-free learning and control
[Figure: the decision tree (model-based) and the cached Q-value table (model-free) from the previous slides, side by side]
Scaling problem!
Hierarchical RL: What is it?
Real-world behavior is hierarchical
Making coffee:
1. pour coffee
2. add sugar
3. add milk
4. stir
Taking a shower:
1. set water temp
2. get wet
3. shampoo
4. soap
5. turn off water
6. dry off
The "set water temp" step expands into its own routine (see the sketch below): too cold → add hot; too hot → add cold; wait 5 sec after a change; just right → success.
This buys simplified control, disambiguation, encapsulation.
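For illustration, the "set water temp" subroutine from the flowchart as code (a sketch only; `feel_temp`, `add_hot`, and `add_cold` are placeholder functions, not anything from the slides).

```python
import time

def set_water_temp(feel_temp, add_hot, add_cold):
    """Run the temperature subroutine until the 'just right' subgoal is reached."""
    while True:
        t = feel_temp()
        if t == "just right":
            return "success"
        elif t == "too cold":
            add_hot()
        else:                    # too hot
            add_cold()
        time.sleep(5)            # wait 5 sec for the change to take effect
```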
HRL: (in)formal framework
Options - skills - macros - temporally abstract actions (Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)
An option has an initiation set, a policy, and termination conditions.
Termination condition = (sub)goal state
Option policy learning: via pseudo-reward (model-based or model-free)
Example option: set water temperature
- Initiation set: S1, S2, S8, …
- Policy (action probabilities per state): S1: 0.8, 0.1, 0.1; S2: 0.1, 0.1, 0.8; S3: 0, 1, 0
- Termination conditions: S1 (0.1), S2 (0.1), S3 (0.9)
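One way to spell out the option triple from this slide as a data structure (a sketch; the class and field names are mine, the numbers are the slide's).

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set    # states in which the option can be invoked
    policy: dict           # state -> probabilities over (primitive) actions
    termination: dict      # state -> probability that the option terminates

set_water_temperature = Option(
    initiation_set={"S1", "S2", "S8"},        # the slide's list continues ("…")
    policy={"S1": (0.8, 0.1, 0.1),
            "S2": (0.1, 0.1, 0.8),
            "S3": (0.0, 1.0, 0.0)},
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```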
HRL: a toy example
[Figure: gridworld with rooms. S: start; G: goal. Options: going to the doors. Actions: primitive moves + the 2 door options]
Advantages of HRL
1. Faster learning (mitigates the scaling problem)
2. Transfer of knowledge from previous tasks (generalization, shaping)
RL is no longer "tabula rasa"
Disadvantages (or: the cost) of HRL
1. Need the "right" options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure
No free lunches…