Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)
Yoonho Lee
Department of Computer Science and Engineering, Pohang University of Science and Technology
October 11, 2016
Outline
Reinforcement Learning
  Definition of RL
  Mathematical formulations
Algorithms
  Neural Fitted Q Iteration
  Deep Q Network
  Double Deep Q Network
  Prioritized Experience Replay
  Dueling Network
Definition of RL: general setting of RL
Definition of RL: Atari setting
Mathematical formulations: objective of RL
Definition: The return G_t is the cumulative discounted reward from time t:
G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...
Definition: A policy π is a function that selects actions given states:
π(s) = a
▶ The goal of RL is to find a policy π that maximizes the expected return E[G_0]
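As a quick worked example (not from the original slides): with γ = 0.9 and rewards r_1 = 1, r_2 = 0, r_3 = 2 followed by termination,
G_0 = 1 + 0.9 · 0 + 0.9^2 · 2 = 1 + 0 + 1.62 = 2.62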
Mathematical formulations: Q-value
G_t = Σ_{i=0}^{∞} γ^i r_{t+i+1}
Definition: The action-value (Q-value) function Q^π(s, a) is the expectation of G_t when taking action a, and then following policy π afterwards:
Q^π(s, a) = E_π[G_t | S_t = s, A_t = a, A_{t+i} = π(S_{t+i}) ∀ i ∈ ℕ]
▶ "How good is action a in state s?"
Mathematical formulations: optimal Q-value
Definition: The optimal Q-value function Q*(s, a) is the maximum Q-value over all policies:
Q*(s, a) = max_π Q^π(s, a)
Theorem: There exists a policy π* such that Q^{π*}(s, a) = Q*(s, a) for all s, a.
▶ Thus, it suffices to find Q*
Mathematical formulations: Q-learning
Bellman Optimality Equation
Q*(s, a) satisfies the following equation:
Q*(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a')
Q-Learning
Let a be ε-greedy w.r.t. Q, and a' be optimal w.r.t. Q. Q converges to Q* if we iteratively apply the following update:
Q(s, a) ← α (R(s, a) + γ Q(s', a')) + (1 − α) Q(s, a)
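A minimal tabular sketch of this update rule (the table size, transition, and hyperparameters below are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """Behaviour policy: a is ε-greedy w.r.t. the current Q."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q(s, a) ← α (R(s, a) + γ Q(s', a')) + (1 − α) Q(s, a), with a' greedy w.r.t. Q."""
    a_next = int(np.argmax(Q[s_next]))
    Q[s, a] = alpha * (r + gamma * Q[s_next, a_next]) + (1 - alpha) * Q[s, a]

# Example: a table for 5 states and 2 actions, updated on a single transition
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=epsilon_greedy_action(Q, 0), r=1.0, s_next=1)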
Mathematical formulations: other approaches to RL
Value-Based RL
▶ Estimate Q*(s, a)
▶ Deep Q Network
Policy-Based RL
▶ Search directly for the optimal policy π*
▶ DDPG, TRPO, ...
Model-Based RL
▶ Use the (learned or given) transition model of the environment
▶ Tree Search, DYNA, ...
Neural Fitted Q Iteration
▶ Supervised learning on (s, a, r, s')
Neural Fitted Q Iteration: input
Neural Fitted Q Iteration: target equation
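The input and target details of these two slides did not survive extraction. As a stand-in, here is a rough sketch of NFQ-style fitted iteration under the standard assumption that the regression target is y = r + γ max_{a'} Q(s', a'); the dataset layout, regressor choice, and hyperparameters are all illustrative:

import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(transitions, n_actions, gamma=0.99, n_iterations=10):
    """transitions: list of (state_vector, action_index, reward, next_state_vector)."""
    S = np.array([t[0] for t in transitions])
    A = np.array([t[1] for t in transitions])
    R = np.array([t[2] for t in transitions])
    S2 = np.array([t[3] for t in transitions])

    def featurize(states, actions):
        # Network input: state features concatenated with a one-hot action encoding
        return np.concatenate([states, np.eye(n_actions)[actions]], axis=1)

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    targets = R.copy()  # first iteration: regress onto immediate rewards
    for _ in range(n_iterations):
        model.fit(featurize(S, A), targets)  # supervised learning on the fixed batch
        # Rebuild targets with the newly fitted Q: y = r + γ max_a' Q(s', a')
        q_next = np.stack([model.predict(featurize(S2, np.full(len(S2), a)))
                           for a in range(n_actions)], axis=1)
        targets = R + gamma * q_next.max(axis=1)
    return model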
Neural Fitted Q Iteration
Shortcomings
▶ Exploration is independent of experience
▶ Exploitation does not occur at all
▶ Policy evaluation does not occur at all
▶ Even in the exact (table lookup) case, not guaranteed to converge to Q*
Deep Q Network: action policy
▶ Choose an ε-greedy policy w.r.t. Q
Deep Q Network: network freezing
▶ Keep a frozen copy of the network, refreshed every C steps, for stability
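A schematic sketch of the freezing scheme (the refresh period C, parameter shapes, and step count are placeholders, not values from the paper):

import numpy as np

rng = np.random.default_rng(0)
C = 10_000                                      # refresh the frozen copy every C steps
theta = {"W": rng.normal(size=(32, 4))}         # online parameters (placeholder shape)
theta_minus = {k: v.copy() for k, v in theta.items()}  # frozen copy used for targets

for step in range(1, 50_001):
    # ... sample a minibatch from replay memory D and update theta by gradient descent,
    #     using targets r + γ max_a' Q(s', a'; θ⁻) computed with theta_minus ...
    if step % C == 0:
        theta_minus = {k: v.copy() for k, v in theta.items()}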
Deep Q Network
Deep Q Network: performance
Deep Q Network: overoptimism
▶ What happens when we overestimate Q?
Deep Q Network
Problems
▶ Overestimation of the Q function at any s spills over to actions that lead to s → Double DQN
▶ Sampling transitions uniformly from D is inefficient → Prioritized Experience Replay
Double Deep Q Network: target
We can write the DQN target as:
Y^DQN_t = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t^−); θ_t^−)
Double DQN's target is:
Y^DDQN_t = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t^−)
This has the effect of decoupling action selection and action evaluation
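A side-by-side sketch of the two targets, given the vectors of Q-values at S_{t+1} from the online network (θ) and the frozen network (θ⁻); the function and argument names are illustrative:

import numpy as np

def dqn_target(r, q_next_frozen, gamma=0.99):
    # Selection and evaluation both use the frozen parameters θ⁻
    a_star = int(np.argmax(q_next_frozen))
    return r + gamma * q_next_frozen[a_star]

def ddqn_target(r, q_next_online, q_next_frozen, gamma=0.99):
    # Selection uses the online parameters θ, evaluation uses θ⁻
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_frozen[a_star]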
Double Deep Q Network: performance
▶ Much more stable, with very little change in code
Prioritized Experience Replay
▶ Update 'more surprising' experiences more frequently
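A rough sketch of the proportional prioritization variant (the α, β exponents and the ε offset are illustrative defaults, not values from the slides):

import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6,
                       rng=np.random.default_rng(0)):
    """Sample replay indices with probability proportional to |TD error|^α."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights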
Dueling Network
▶ The scalar approximates V, and the vector approximates A
Dueling Network: forward pass equation
The exact forward pass equation is:
Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a'} A(s, a'; θ, α))
The following module was found to be more stable without losing much performance:
Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_{a'} A(s, a'; θ, α))
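A minimal numpy sketch of the two aggregation modules, combining the scalar value-stream output with the advantage-stream vector for one state (function names are illustrative):

import numpy as np

def dueling_q_max(value, advantages):
    # Exact form: subtract the max advantage, so the greedy action gets Q = V
    return value + (advantages - advantages.max())

def dueling_q_mean(value, advantages):
    # More stable form: subtract the mean advantage instead
    return value + (advantages - advantages.mean())

# Example: V(s) = 1.0, A(s, ·) = [0.5, −0.2, 0.1]
q = dueling_q_mean(1.0, np.array([0.5, -0.2, 0.1]))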
Dueling Network: attention
▶ The advantage stream only pays attention when the choice of current action is crucial
Dueling Network: performance
▶ Achieves state of the art in the Atari domain among DQN algorithms
Dueling Network
Summary
▶ Since this is an improvement only in network architecture, methods that improve DQN (e.g. Double DQN) are all applicable here as well
▶ Solves the problem of V and A typically being of different scale
▶ Updates Q values more frequently than a single-stream DQN, where only a single Q value is updated for each observation
▶ Implicitly splits the credit assignment problem into a recursive binary problem of "now or later"
Dueling Network
Shortcomings
▶ Only works for finite action sets (|A| < ∞)
▶ Still not able to solve tasks involving long-term planning
▶ Better than DQN, but sample complexity is still high
▶ ε-greedy exploration is essentially random guessing