![Page 1: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/1.jpg)
Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments
Peter Stone
Department of Computer Science The University of Texas at Austin
Joint work with Matthew Hausknecht
1
![Page 2: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/2.jpg)
Hausknecht and Stone, UT Austin
Intelligent decision making is at the heart of AI.
Motivation
2
![Page 3: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/3.jpg)
Outline
1. Background
2. Recurrent Q-Learning for partially observable MDPs
3. Deep Multiagent RL in Half-Field-Offense
4. Future Work
3
![Page 4: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/4.jpg)
Markov Decision Process
Action $a_t$
State $s_t$
Reward $r_t$
Markov Property ensures $s_{t+1}$ depends only on $s_t$ (and the chosen action $a_t$)
Learning an optimal policy $\pi^*$ requires no memory
4
![Page 5: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/5.jpg)
Partially Observable MDP (POMDP)
Action $a_t$
Observation $o_t$
Reward $r_t$
Observations provide noisy or incomplete information
Memory may help to learn a better policy
5
![Page 6: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/6.jpg)
Reinforcement Learning
Reinforcement Learning provides a general framework for sequential decision making.
Objective: Learn a policy that maximizes the discounted sum of future rewards.
A deterministic policy π is a mapping from states/observations to actions.
For each encountered state/observation, what is the best action to perform?
6
![Page 7: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/7.jpg)
Q-Value Function
Estimates the expected return from a given state-action pair:
$$Q^\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid s, a \right]$$
Answers the question: “How good is action a from state s?”
Optimal Q-Value function yields an optimal policy.
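As a small illustration of the quantity inside the expectation, the sketch below computes the discounted return for one finite reward sequence. The function name and the sample rewards are hypothetical; the Q-value itself is the expectation of this quantity over many trajectories, which this sketch does not estimate.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite sequence."""
    total = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# One sampled trajectory with three rewards (illustrative values only).
sample_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

Averaging such returns over many rollouts that start with the same state and action would give a Monte Carlo estimate of the Q-value.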
7
![Page 8: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/8.jpg)
Deep Neural Network
Parametric model with stacked layers of representation.
Powerful, general purpose function approximator.
Parameters optimized via backpropagation.
[Figure: network with parameters θ mapping Input to Output]
8
![Page 9: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/9.jpg)
Outline
1. Background
2. Recurrent Q-Learning for partially observable MDPs
3. Deep Multiagent RL in Half-Field-Offense
4. Future Work
9
![Page 10: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/10.jpg)
Atari Environment
Action $a_t$
Observation $o_t$
Reward $r_t$
Resolution 160x210x3
18 discrete actions
Reward is change in game score
10
![Page 11: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/11.jpg)
Atari: MDP or POMDP?
Depends on the number of game screens used in the state representation.
Many games are partially observable (PO) with a single frame.
11
![Page 12: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/12.jpg)
Deep Q-Network (DQN)
Neural network estimates Q-Values Q(s,a) for all 18 actions:
$$Q(s|\theta) = (Q_{s,a_1}, \ldots, Q_{s,a_n})$$
Learns via temporal difference:
$$y_i = r_t + \gamma \max\left(Q(s_{t+1}|\theta)\right)$$
$$L_i(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\left[ \left( y_i - Q(s_t|\theta_i) \right)^2 \right]$$
Accepts the last 4 screens as input.
[Figure: Convolution 1 → Convolution 2 → Convolution 3 → Fully Connected → Fully Connected → Q-Values]
12
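The temporal-difference target and squared error on this slide can be sketched for a single transition as follows. This is a minimal sketch in plain Python, not the DQN training loop: the function names are hypothetical, Q-values are given as plain lists rather than produced by a network, and the error is taken on the Q-value of the action that was actually executed, which is an assumption about how the vector-valued loss is applied.

```python
def td_target(reward, next_q_values, gamma):
    """y = r_t + gamma * max over actions of Q(s_{t+1})."""
    return reward + gamma * max(next_q_values)


def td_squared_error(q_values, action, target):
    """(y - Q(s_t)[a_t])^2 for the action taken (assumed reading of the loss)."""
    return (target - q_values[action]) ** 2


# Illustrative numbers only: three actions instead of 18.
y = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0], gamma=0.9)
loss = td_squared_error(q_values=[1.0, 2.0, 0.0], action=1, target=y)
```

In the full method the expectation is approximated by averaging this error over transitions sampled from the replay memory D.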
![Page 13: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/13.jpg)
Flickering Atari
How well does DQN perform on POMDPs?
Induce partial observability by stochastically obscuring the game screen
$$o_t = \begin{cases} s_t & \text{with } p = \frac{1}{2} \\ \langle 0, \ldots, 0 \rangle & \text{otherwise} \end{cases}$$
Game state must be inferred from past observations
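The flickering observation can be sketched as a wrapper that either passes the screen through or blanks it. The function name, the flat-list screen representation, and the explicit random generator argument are assumptions made for illustration.

```python
import random


def flicker(screen, rng, p_keep=0.5):
    """Return the true screen with probability p_keep, else an all-zero screen."""
    if rng.random() < p_keep:
        return list(screen)
    return [0] * len(screen)


# A seeded generator keeps the example reproducible.
rng = random.Random(0)
observations = [flicker([1, 2, 3], rng) for _ in range(1000)]
```

Because roughly half of the observations are blank, a policy that sees only the current observation cannot recover the game state on those steps.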
13
![Page 14: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/14.jpg)
DQN Pong
[Figure panels: True Game Screen, Observed Game Screen]
14
![Page 15: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/15.jpg)
DQN Flickering Pong
[Figure panels: True Game Screen, Observed Game Screen]
15
![Page 16: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/16.jpg)
Deep Recurrent Q-Network
Uses a Long Short-Term Memory (LSTM) to selectively remember past game screens.
Architecture identical to DQN except:
1. Replaces FC layer with LSTM
2. Single frame as input each timestep
Trained end-to-end using backpropagation through time (BPTT) over the last 10 timesteps.
[Figure: Convolution 1 → Convolution 2 → Convolution 3 → LSTM → Fully Connected → Q-Values]
16
![Page 17: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/17.jpg)
DRQN Flickering Pong
[Figure panels: True Game Screen, Observed Game Screen]
17
![Page 18: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/18.jpg)
18
![Page 19: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/19.jpg)
LSTM infers velocity
19
![Page 20: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/20.jpg)
DRQN Frostbite
20
![Page 21: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/21.jpg)
21
![Page 22: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/22.jpg)
Extensions
DRQN has been extended in several ways:
• Addressable Memory: Control of Memory, Active Perception, and Action in Minecraft; Oh et al. in ICML ’16
• Continuous Action Space: Memory Based Control with Recurrent Neural Networks; Heess et al., 2016
[Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht et al, 2015; ArXiv]
22
![Page 23: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/23.jpg)
Outline
1. Background
2. Recurrent Q-Learning for partially observable MDPs
3. Deep Multiagent RL in Half-Field-Offense
4. Future Work
23
![Page 24: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/24.jpg)
Half Field Offense
Cooperative multiagent soccer domain built on the libraries used by the RoboCup competition
Objective: Learn a goal scoring policy for the offense agents
Features continuous actions, partial observability, and opportunities for multiagent coordination
24
![Page 25: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/25.jpg)
Half Field Offense
25
![Page 26: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/26.jpg)
26
![Page 27: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/27.jpg)
27
![Page 28: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/28.jpg)
State Action Spaces
58 continuous state features encoding distances and angles to points of interest
Parameterized-Continuous Action Space:
• Dash(direction, power)
• Turn(direction)
• Tackle(direction)
• Kick(direction, power)
Choose one discrete action + parameters every timestep
28
![Page 29: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/29.jpg)
Exploration is Hard
29
![Page 30: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/30.jpg)
Reward Signal
$$r_t = -\Delta d(\text{Agent}, \text{Ball}) + I_{\text{kick}} - 3\,\Delta d(\text{Ball}, \text{Goal}) + 5\,I_{\text{Goal}}$$
Go to Ball Kick to Goal
30
With only goal-scoring reward, agent never learns to approach the ball or dribble.
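A minimal sketch of this shaped reward, reading the two distance terms as the change in distance between consecutive timesteps (an assumption about the garbled symbol on the slide). The function and argument names are hypothetical, and the distances would come from the environment's state features.

```python
def shaped_reward(d_agent_ball_prev, d_agent_ball,
                  d_ball_goal_prev, d_ball_goal,
                  kicked, scored):
    """Reward for approaching the ball, kicking, moving the ball to goal, scoring."""
    # Change in distance since the previous step (negative when getting closer).
    delta_agent_ball = d_agent_ball - d_agent_ball_prev
    delta_ball_goal = d_ball_goal - d_ball_goal_prev
    kick_bonus = 1.0 if kicked else 0.0   # indicator I_kick
    goal_bonus = 5.0 if scored else 0.0   # 5 * indicator I_Goal
    return -delta_agent_ball + kick_bonus - 3.0 * delta_ball_goal + goal_bonus


# Agent moves 2 units closer to the ball; the ball itself does not move.
step_reward = shaped_reward(10.0, 8.0, 20.0, 20.0, kicked=False, scored=False)
```

Under this reading, every step that closes the gap to the ball or moves the ball toward the goal is rewarded, so the agent gets a learning signal long before it scores.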
![Page 31: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/31.jpg)
Deep Deterministic Policy Gradients
Model-free Deep Actor Critic architecture [Lillicrap ’15]
Actor learns a policy π, Critic learns to estimate Q-values
Actor outputs all 6 possible parameters.
$a_t$ = the highest-valued of the 4 actions + its associated parameter(s)
[Figure: Actor network: State → 1024 ReLU → 512 ReLU → 256 ReLU → 128 ReLU → 4 Actions + 6 Parameters. Critic network: State, 4 Actions, 6 Parameters → 1024 ReLU → 512 ReLU → 256 ReLU → 128 ReLU → Q-Value.]
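Turning the actor's 4 action outputs and 6 parameter outputs into one executable action can be sketched as below. The ordering of the 6 parameters (Dash direction and power, Turn direction, Tackle direction, Kick direction and power) is an assumption that follows the order in which the actions are listed; the names `ACTIONS`, `PARAM_SLICES`, and `select_action` are hypothetical.

```python
ACTIONS = ["Dash", "Turn", "Tackle", "Kick"]

# Assumed layout of the 6 parameter outputs: (start, end) index per action.
PARAM_SLICES = {"Dash": (0, 2), "Turn": (2, 3), "Tackle": (3, 4), "Kick": (4, 6)}


def select_action(action_scores, params):
    """Pick the highest-scoring discrete action and its own parameter(s)."""
    best = max(range(len(ACTIONS)), key=lambda i: action_scores[i])
    name = ACTIONS[best]
    start, end = PARAM_SLICES[name]
    return name, list(params[start:end])


# Illustrative actor outputs: Turn has the highest score.
chosen = select_action([0.1, 0.7, -0.2, 0.3], [10.0, 50.0, -45.0, 90.0, 0.0, 100.0])
```

Only the parameters that belong to the chosen action are used; the remaining outputs are ignored for that timestep.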
31
![Page 32: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/32.jpg)
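Training
Critic trained using temporal difference:
$$L = \left\lVert Q(s_t, \mu(s_t)|\theta^Q) - y \right\rVert_2^2$$
$$y = r_t + \gamma\, Q(s_{t+1}, \mu(s_{t+1})|\theta^Q)$$
Actor trained via Critic gradients:
$$\nabla_{\theta^\mu}\mu(s) = \nabla_a Q(s, a|\theta^Q)\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)$$
[Figure: State → Actor ($\theta^\mu$) → 4 Actions, 6 Parameters → Critic ($\theta^Q$) → Q-Value; the gradient $\nabla_a Q(s,a)$ flows back from the Critic to the Actor, giving $\nabla_{\theta^\mu}$.]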
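The actor update can be illustrated with a one-parameter toy problem. Everything here is an assumption made for illustration: a scalar state and action, a linear actor μ(s|θ) = θ·s, and a fixed hand-written critic that prefers a = 2s (in DDPG the critic is itself a learned network). The sketch only shows how the critic's action gradient is chained into the actor's parameter gradient.

```python
def critic_q(s, a):
    # Toy critic (assumption): highest value when a == 2 * s.
    return -(a - 2.0 * s) ** 2


def critic_grad_a(s, a):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (a - 2.0 * s)


def actor(theta, s):
    # Toy linear actor mu(s | theta) = theta * s.
    return theta * s


def actor_update(theta, s, lr):
    # Chain rule: dQ/dtheta = dQ/da * dmu/dtheta, and dmu/dtheta = s here.
    grad_theta = critic_grad_a(s, actor(theta, s)) * s
    return theta + lr * grad_theta  # gradient ascent on Q


theta = 0.0
for step in range(200):
    s = [0.5, 1.0, 1.5][step % 3]  # cycle through a few sample states
    theta = actor_update(theta, s, lr=0.1)
```

After these updates `theta` approaches 2, the value at which the toy critic is maximized for every state.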
32
![Page 33: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/33.jpg)
Bounded Action Space
HFO’s continuous parameters are bounded
Dash(direction, power) Turn(direction) Tackle(direction) Kick(direction, power)
Direction in [-180,180], Power in [0, 100]
Exceeding these ranges results in no action
If DDPG is unaware of the bounds, it will invariably exceed them
33
![Page 34: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/34.jpg)
We examine 3 approaches for bounding the DDPG’s action space:
1. Squash Gradients
2. Zero Gradients
3. Invert Gradients
Bounded DDPG
34
![Page 35: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/35.jpg)
Squashing Gradients
1. Use Tanh non-linearity to bound parameter output
2. Rescale into desired range
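The two steps can be sketched for a single parameter as follows. The function name and the linear rescaling formula are assumptions; the ranges in the example are the direction and power bounds given for HFO.

```python
import math


def squash(raw, p_min, p_max):
    """Bound a raw network output with tanh, then rescale into [p_min, p_max]."""
    unit = math.tanh(raw)  # step 1: output now lies in [-1, 1]
    return p_min + (unit + 1.0) * 0.5 * (p_max - p_min)  # step 2: rescale


direction = squash(0.0, -180.0, 180.0)  # midpoint of the direction range
power = squash(0.0, 0.0, 100.0)         # midpoint of the power range
```

Note that tanh flattens for large inputs, so raw outputs far from zero all map to values very close to a bound.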
35
![Page 36: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/36.jpg)
Squashing Gradients
36
![Page 37: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/37.jpg)
Zeroing Gradients
Each continuous parameter has a range: $[p_{\min}, p_{\max}]$
Let $p$ denote the current value of the parameter, and $\nabla_p$ the suggested gradient.
Then:
$$\nabla_p = \begin{cases} \nabla_p & \text{if } p_{\min} < p < p_{\max} \\ 0 & \text{otherwise} \end{cases}$$
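A direct sketch of the zeroing rule for one parameter (the function name is hypothetical):

```python
def zero_gradient(grad, p, p_min, p_max):
    """Keep the suggested gradient only while p is strictly inside its range."""
    if p_min < p < p_max:
        return grad
    return 0.0


inside = zero_gradient(0.3, 50.0, 0.0, 100.0)     # unchanged
at_bound = zero_gradient(0.3, 100.0, 0.0, 100.0)  # zeroed at the bound
```

Once a parameter sits on or beyond a bound, this rule zeroes every gradient for it, including gradients that would move it back into range.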
37
![Page 38: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/38.jpg)
Zeroing Gradients
38
![Page 39: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/39.jpg)
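Inverting Gradients
For each parameter:
$$\nabla_p = \nabla_p \cdot \begin{cases} (p_{\max} - p)/(p_{\max} - p_{\min}) & \text{if } \nabla_p \text{ suggests increasing } p \\ (p - p_{\min})/(p_{\max} - p_{\min}) & \text{otherwise} \end{cases}$$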
Allows parameters to approach the bounds of the ranges without exceeding them
Parameters don’t get “stuck” or saturate
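The inverting rule can be sketched for one parameter as below. The function name is hypothetical, and treating a positive gradient as "suggests increasing p" assumes a gradient-ascent sign convention.

```python
def invert_gradient(grad, p, p_min, p_max):
    """Scale the gradient by the remaining headroom in the direction it points."""
    span = p_max - p_min
    if grad > 0:  # assumed convention: positive gradient increases p
        return grad * (p_max - p) / span
    return grad * (p - p_min) / span


near_top = invert_gradient(1.0, 75.0, 0.0, 100.0)   # scaled down to 0.25
past_top = invert_gradient(1.0, 110.0, 0.0, 100.0)  # sign flips to negative
```

The scale factor shrinks to zero as the parameter nears the bound it is moving toward and becomes negative beyond it, so a parameter that has overshot is pushed back into range.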
39
![Page 40: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/40.jpg)
Inverting Gradients
40
![Page 41: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/41.jpg)
Results
41
![Page 42: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/42.jpg)
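Results

| Agent | Scoring Percent | Avg. Steps to Goal |
| --- | --- | --- |
| DDPG1 | 1.0 | 108.0 |
| DDPG2 | .99 | 107.1 |
| DDPG3 | .98 | 104.8 |
| DDPG4 | .96 | 112.3 |
| Helios’ Champion | .96 | 72.0 |
| DDPG5 | .94 | 119.1 |
| DDPG6 | .84 | 113.2 |
| SARSA | .81 | 70.7 |
| DDPG7 | .80 | 118.2 |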
42
[Deep Reinforcement Learning in Parameterized Action Space, Hausknecht and Stone, in ICLR ‘16]
![Page 43: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/43.jpg)
Deep Multiagent RL
Can multiple Deep RL agents cooperate to achieve a shared goal?
Examine several baseline architectures:
Decentralized: Independent agents
Centralized: Single controller for multiple agents
Parameter Sharing: Layers shared between agents
43
![Page 44: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/44.jpg)
Centralized
Both agents are controlled by a single DDPG
State & Action spaces are concatenated
Learning is more challenging for this reason
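[Figure: centralized architecture. One Actor takes the States of both agents and outputs 4 Actions + 6 Parameters for each agent; one Critic takes the States and the Actor outputs and produces a single Q-Value. Both networks use 1024, 512, 256, and 128-unit ReLU layers.]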
44
![Page 45: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/45.jpg)
45
![Page 46: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/46.jpg)
Parameter Sharing
Weights are shared between layers of the two Actor networks, and separately between layers of the two Critic networks.
Reduces total number of parameters!
Encourages both agents to participate even though 2v0 is solvable by a single agent.
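[Figure: parameter-sharing architecture. Actors: the 1024 and 512-unit ReLU layers are shared, followed by separate 256 and 128-unit ReLU layers per agent, each producing 4 Actions + 6 Parameters. Critics: the 1024 and 512-unit ReLU layers are shared, followed by separate 256 and 128-unit ReLU layers per agent, each producing a Q-Value.]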
46
![Page 47: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/47.jpg)
47
![Page 48: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/48.jpg)
![Page 49: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/49.jpg)
49
![Page 50: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/50.jpg)
Related Work
• Multiagent Cooperation and Competition with Deep Reinforcement Learning; Tampuu et al., 2015
• Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al., 2016
• Learning to Communicate with Deep Multi-Agent Reinforcement Learning; Foerster et al., 2016
50
![Page 51: Deep Multiagent Reinforcement Learning for Partially ...pstone/Courses/394Rfall16/... · Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://reader033.fdocuments.us/reader033/viewer/2022042220/5ec69b80edbea83c5a4165b4/html5/thumbnails/51.jpg)
Thanks!
[Figure: the DDPG Actor-Critic architecture (State → 4 Actions + 6 Parameters → Q-Value) and the Deep Recurrent Q-Network architecture (Convolution 1 → Convolution 2 → Convolution 3 → LSTM → Fully Connected → Q-Values)]
51