• POMDPs: 5
• Reward Shaping: 4
• Intrinsic RL: 4
• Function Approximation: 3
• https://www.youtube.com/watch?v=ek0FrCaogcs
Evaluation Metrics

• Asymptotic improvement
• Jumpstart improvement
• Speed improvement
– Total reward
– Slope of line
– Time to threshold
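These metrics can all be read off a learning curve. Below is a minimal sketch, with made-up curve data and an arbitrary threshold (none of the numbers come from the talk), of how jumpstart, asymptotic improvement, and time to threshold might be computed:

```python
def jumpstart(curve_transfer, curve_scratch):
    """Initial performance gain: difference at the first measurement."""
    return curve_transfer[0] - curve_scratch[0]

def asymptotic_improvement(curve_transfer, curve_scratch):
    """Final performance gain: difference at the last measurement."""
    return curve_transfer[-1] - curve_scratch[-1]

def time_to_threshold(curve, threshold):
    """Index of the first measurement at or above threshold (None if never reached)."""
    for t, reward in enumerate(curve):
        if reward >= threshold:
            return t
    return None

scratch  = [2, 4, 7, 10, 12, 13, 13]   # hypothetical curve: learning from scratch
transfer = [6, 8, 10, 12, 13, 14, 14]  # hypothetical curve: learning after transfer

print(jumpstart(transfer, scratch))                                     # 4
print(asymptotic_improvement(transfer, scratch))                        # 1
print(time_to_threshold(scratch, 12), time_to_threshold(transfer, 12))  # 4 3
```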
Time to Threshold

[Figure: learning curves over time for the target task without transfer ("Task 2 from scratch"), the target task with transfer ("Task 2 with transfer"), and source plus target with transfer ("Task 1 + Task 2 with transfer").]

Two distinct scenarios:

1. Target Time Metric: successful if target-task learning time is reduced
   – “Sunk cost” of source-task training is ignored
   – Source task(s) independently useful
   – Goal: effectively utilize past knowledge
2. Total Time Metric: successful if total (source + target) time is reduced
   – Only care about the target task
   – Source task(s) not independently useful
   – Goal: minimize total training

A toy arithmetic check of the two criteria appears below.
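All times here are invented, purely to show that the two metrics can disagree:

```python
source_time  = 2.0   # hours spent learning the source task (invented)
target_time  = 5.0   # hours to threshold in the target task, with transfer (invented)
scratch_time = 6.0   # hours to threshold learning the target from scratch (invented)

# Target Time Metric: the source-task "sunk cost" is ignored.
print(target_time < scratch_time)                 # True: transfer succeeds
# Total Time Metric: source time counts against the transfer.
print(source_time + target_time < scratch_time)   # False: transfer fails
```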
Keepaway [Stone, Sutton, and Kuhlmann 2005]

[Figure: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2.]

3 vs. 2:
• Goal: maintain possession of the ball
• 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• The keeper with the ball may hold it or pass to either teammate
• Both takers move towards the player with the ball

4 vs. 3:
• 7 agents, 4 actions, 19 state variables
Learning Keepaway

• Sarsa update
  – CMAC, RBF, and neural network function approximation all successful
• Qπ(s,a): predicted number of steps the episode will last
  – Reward = +1 for every timestep
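For concreteness, here is a minimal sketch of that Sarsa update with the +1-per-timestep reward; the q object and its get/update methods stand in for a generic function approximator (CMAC, RBF, or neural network) and are my own illustrative interface, not code from the talk:

```python
ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 1.0   # undiscounted episodic task

def sarsa_update(q, s, a, s_next, a_next, done):
    """One on-policy TD(0) step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    reward = 1.0  # +1 for every timestep the keepers hold the ball
    target = reward if done else reward + GAMMA * q.get(s_next, a_next)
    td_error = target - q.get(s, a)
    q.update(s, a, ALPHA * td_error)  # move Q(s,a) toward the TD target
```

With this reward, Q(s,a) converges toward the expected number of remaining timesteps, matching the interpretation above.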
ρ's Effect on CMACs

[Figure: corresponding CMAC tilings for 3 vs. 2 and 4 vs. 3.]

• For each weight in the 4 vs. 3 function approximator:
  o Use the inter-task mapping to find the corresponding 3 vs. 2 weight
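A sketch of what that weight copying could look like, assuming both approximators are plain dictionaries keyed by (tile, action) and that mapping (an invented helper) implements the inter-task mapping:

```python
def transfer_cmac_weights(weights_3v2, tiles_4v3, mapping):
    """Initialize each 4 vs. 3 weight from its corresponding 3 vs. 2 weight."""
    weights_4v3 = {}
    for tile, action in tiles_4v3:
        source_key = (mapping.tile(tile), mapping.action(action))
        # Weights with no 3 vs. 2 counterpart start at zero.
        weights_4v3[(tile, action)] = weights_3v2.get(source_key, 0.0)
    return weights_4v3
```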
Keepaway Hand-coded χA

• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2

Actions in 4 vs. 3 have “similar” actions in 3 vs. 2.
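Written out as a plain Python dict (the identifiers are my own shorthand for the subscripted action names):

```python
chi_A = {
    "Hold_4v3":  "Hold_3v2",
    "Pass1_4v3": "Pass1_3v2",
    "Pass2_4v3": "Pass2_3v2",
    "Pass3_4v3": "Pass2_3v2",  # no third teammate in 3 vs. 2, so reuse Pass2
}
```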
Value Function Transfer

ρ(QS(SS, AS)) = QT(ST, AT)

• The action-value function is transferred
• ρ is task-dependent: it relies on the inter-task mappings
• QS: SS×AS→ℜ is not defined on ST and AT, so ρ must bridge the two tasks to yield QT: ST×AT→ℜ

[Figure: the standard agent–environment loop in the source task (StateS, ActionS, RewardS) and in the target task (StateT, ActionT, RewardT), with ρ carrying QS over to QT.]
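One simple way to realize ρ, assuming the inter-task mappings χx and χA are available as functions, is to define the initial target action-value function by mapping each target state-action pair back into the source task. This is a sketch of the idea, not necessarily the exact implementation from the talk:

```python
def make_q_target(q_source, chi_x, chi_A):
    """Build Q_T(s, a) = Q_S(chi_x(s), chi_A(a)) from the source value function."""
    def q_target(s_target, a_target):
        return q_source(chi_x(s_target), chi_A(a_target))
    return q_target
```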
Value Function Transfer: Time to Threshold in 4 vs. 3

[Figure: simulator hours (0–35) to reach the 4 vs. 3 threshold, plotted against the number of 3 vs. 2 training episodes (0, 10, 50, 100, 250, 500, 1000, 3000, 6000). Each bar shows avg. 3 vs. 2 time plus avg. 4 vs. 3 time, i.e. the total time, against the no-transfer target-task time.]
• For a similar target task, the transferred knowledge can significantly improve performance.
• But how do we define a “similar task” more specifically?
  – Same state-action space
  – Similar objectives
Effects of Task Similarity

Source identical to Target (transfer trivial) ←→ Source unrelated to Target (transfer impossible)

• Is transfer beneficial for a given pair of tasks?
  o Can we avoid negative transfer?
  o Can we reduce the total time metric?
Example Transfer Domains

• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]
All of these tasks are drawn from the same domain:
o Task: an MDP
o Domain: a setting for semantically similar tasks

What about cross-domain transfer?
o The source task could be much simpler
o It would show that source and target can be less similar
Source Task: Ringworld

[Figure: the Ringworld task, and 3 vs. 2 Keepaway with keepers K1, K2, K3 and takers T1, T2.]

Ringworld (goal: avoid being tagged)
• 2 agents, 3 actions, 7 state variables
• Fully observable, discrete state space (Q-table with ~8,100 (s, a) pairs)
• Stochastic actions
• The opponent moves directly towards the player
• The player may stay or run towards a pre-defined location

3 vs. 2 Keepaway (goal: maintain possession of the ball)
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space
• Stochastic actions
Rule Transfer Overview

1. Learn a policy (π : S → A) in the source task
   – TD, policy search, model-based, etc.
2. Learn a decision list, Dsource, summarizing π
3. Translate Dsource → Dtarget (so it applies to the target task)
   – State variables and actions can differ between the two tasks
4. Use Dtarget to learn a policy in the target task

This allows different learning methods and function approximators in the source and target tasks. The full pipeline, also sketched below:

Learn π ⇒ Learn Dsource ⇒ Translate Dsource → Dtarget ⇒ Use Dtarget
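In pseudocode-style Python the pipeline might look like the following; every function named here is a placeholder for a component described on the next slides, not a real API:

```python
def rule_transfer(source_task, target_task, chi_x, chi_A):
    pi_source = learn_policy(source_task)              # 1. TD, policy search, ...
    d_source  = learn_decision_list(pi_source)         # 2. e.g. RIPPER on (s, a) pairs
    d_target  = translate(d_source, chi_x, chi_A)      # 3. remap variables and actions
    return learn_policy(target_task, advice=d_target)  # 4. bonus / extra action / variable
```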
Rule Transfer Details: Learn π

[Figure: the agent–environment loop (State, Action, Reward) in the source task.]

• In this work we use Sarsa
  o Q : S × A → Return
• Other learning methods are possible
Rule Transfer Details: Learn Dsource

• Use the learned policy to record (State, Action) pairs
• Use JRip (RIPPER in Weka) to learn a decision list from them, e.g.:

IF s1 < 4 and s2 > 5 → a1
ELSEIF s1 < 3 → a2
ELSEIF s3 > 7 → a1
…
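Such a decision list can be represented as an ordered list of (condition, action) rules; the sketch below mimics the rules above (in the talk the rules themselves are induced by JRip/RIPPER, not hand-written):

```python
# Ordered rules: the first condition that matches determines the action.
rules = [
    (lambda s: s["s1"] < 4 and s["s2"] > 5, "a1"),
    (lambda s: s["s1"] < 3,                 "a2"),
    (lambda s: s["s3"] > 7,                 "a1"),
]

def decision_list(s, default="a1"):
    for condition, action in rules:
        if condition(s):
            return action
    return default  # fall-through action (my own addition)

print(decision_list({"s1": 2, "s2": 1, "s3": 0}))  # 'a2' (second rule fires)
```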
Rule Transfer Details: Translate Dsource → Dtarget

• Inter-task mappings:
• χx: starget → ssource
  – Given a state variable in the target task (some x from s = x1, x2, …, xn)
  – Return the corresponding state variable in the source task
• χA: atarget → asource
  – Similar, but for actions

Each rule is translated into rule′ via χx and χA, as sketched below.
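Since χx and χA map target to source, translating a source rule into the target task uses their inverses. A sketch with illustrative stand-in dicts (the entries anticipate the Ringworld-to-Keepaway example on the next slide):

```python
# Inverse mappings: source-task names to target-task names.
inv_chi_x = {"dist(Player, Opponent)": "dist(K1,T1)"}
inv_chi_A = {"Stay": "Hold Ball"}

def translate_rule(variable, threshold, action):
    """Rewrite one 'IF variable > threshold -> action' rule for the target task."""
    return inv_chi_x[variable], threshold, inv_chi_A[action]

print(translate_rule("dist(Player, Opponent)", 4, "Stay"))
# ('dist(K1,T1)', 4, 'Hold Ball')
```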
Rule Transfer Details: Translate Dsource → Dtarget (Ringworld → 3 vs. 2 Keepaway)

• χA correspondences: Stay ↔ Hold Ball; RunNear ↔ Pass to K2; RunFar ↔ Pass to K3
• χx correspondences: dist(Player, Opponent) ↔ dist(K1,T1); …

The source rule
IF dist(Player, Opponent) > 4 → Stay
translates to the target rule
IF dist(K1,T1) > 4 → Hold Ball
Rule Transfer Details: Use Dtarget

• Many possible ways to use Dtarget:
  o Value Bonus (shaping)
  o Extra Action (initially force the agent to select it)
  o Extra Variable (initially force the agent to select)
• Assuming a TD learner in the target task
  o Should generalize to other learning methods

Without transfer, the agent evaluates its 3 actions in state s = s1, s2:
Q(s1, s2, a1) = 5; Q(s1, s2, a2) = 3; Q(s1, s2, a3) = 4, and suppose Dtarget(s) = a2.

• Value Bonus: the suggested action receives a bonus during selection (shaping): Q(s1, s2, a2) + 8
• Extra Action: a fourth action meaning “take the action Dtarget suggests” is added: Q(s1, s2, a4) = 7 (take action a2)
• Extra Variable: the state is augmented with an extra variable s3 = Dtarget(s); here Q(s1, s2, a2, a1) = 5; Q(s1, s2, a2, a2) = 9; Q(s1, s2, a2, a3) = 4

A minimal sketch of the Value Bonus variant follows.
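The sketch below implements Value Bonus selection: during greedy action choice, the action recommended by Dtarget receives a fixed bonus (+8, as in the example above), while the learned Q-values themselves are left untouched. The bonus size and the dict-based Q representation are illustrative choices:

```python
BONUS = 8.0

def select_action(q_values, suggested_action):
    """Greedy selection over Q-values, plus a bonus on Dtarget's suggestion."""
    best_action, best_value = None, float("-inf")
    for action, value in q_values.items():
        if action == suggested_action:
            value += BONUS  # shaping bonus for the advised action
        if value > best_value:
            best_action, best_value = action, value
    return best_action

q = {"a1": 5.0, "a2": 3.0, "a3": 4.0}
print(select_action(q, "a2"))  # 'a2' (3 + 8 = 11 beats 5)
```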
Comparison of Rule Transfer Methods

[Figure: target-task learning curves comparing Without Transfer, Only Follow Rules, Value Bonus, Extra Action, and Extra Variable; the transferred rules come from 5 hours of source-task training.]
Inter-domain Transfer: Averaged Results

[Figure: episode duration (simulator seconds) versus training time (simulator hours), averaged over trials.]

Ringworld: 20,000 episodes (~1 minute of wall clock time)

Success: four types of transfer improvement!
Future Work

• Theoretical guarantees / bounds
• Avoiding negative transfer
• Curriculum learning
• Autonomously selecting inter-task mappings
• Leveraging supervised learning techniques
• Simulation to physical robots
• Humans?