Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf ·...

29
Reinforcement Learning for NLP Advanced Machine Learning for NLP Jordan Boyd-Graber REINFORCEMENT OVERVIEW, POLICY GRADIENT Adapted from slides by David Silver, Pieter Abbeel, and John Schulman Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 1 of 1

Transcript of Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf ·...

Page 1: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Reinforcement Learning for NLP

Advanced Machine Learning for NLPJordan Boyd-GraberREINFORCEMENT OVERVIEW, POLICY GRADIENT

Adapted from slides by David Silver, Pieter Abbeel, and John Schulman

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 1 of 1

Page 2: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

• I used to say that RL wasn’t used in NLP . . .

• Now it’s all over the place

• Part of much of ML hype

• But what is reinforcement learning?

◦ RL is a general-purpose framework for decision-making◦ RL is for an agent with the capacity to act◦ Each action influences the agent’s future state◦ Success is measured by a scalar reward signal◦ Goal: select actions to maximise future reward

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 2 of 1

Page 3: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

• I used to say that RL wasn’t used in NLP . . .

• Now it’s all over the place

• Part of much of ML hype

• But what is reinforcement learning?

◦ RL is a general-purpose framework for decision-making◦ RL is for an agent with the capacity to act◦ Each action influences the agent’s future state◦ Success is measured by a scalar reward signal◦ Goal: select actions to maximise future reward

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 2 of 1

Page 4: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

• At each step t the agent:

◦ Executes action at

◦ Receives observation ot

◦ Receives scalar reward rt

• The environment:

◦ Receives action at

◦ Emits observation ot+1

◦ Emits scalar reward rt+1

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 3 of 1

Page 5: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Example

QA MTState Words Seen Foreign Words Seen

Reward Answer Accuracy Translation QualityActions Answer / Wait Translate / Wait

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 4 of 1

Page 6: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

State

• Experience is a sequence of observations, actions, rewards

o1, r1, a1, . . . , at 1, ot , rt (1)

• The state is a summary of experience

st = f (o1, r1, a1, . . . , at 1, ot , rt ) (2)

• In a fully observed environment

st = f (ot ) (3)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 5 of 1

Page 7: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

What makes an RL agent?

• Policy: agent’s behaviour function

• Value function: how good is each state and/or action

• Model: agent’s representation of the environment

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 6 of 1

Page 8: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Policy

• A policy is the agent’s behavior

◦ It is a map from state to action:◦ Deterministic policy: a =π(s )◦ Stochastic policy: π(a | s ) = p (a | s )

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 7 of 1

Page 9: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Value Function

• A value function is a prediction of future reward: “How much reward will Iget from action a in state s?”

• Q -value function gives expected total reward

◦ from state s and action a◦ under policy π◦ with discount factor γ (future rewards mean less than immediate)

Qπ(s , a ) =E�

rt+1+γrt+2+γ2rt+3+ . . . | s , a

(4)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 8 of 1

Page 10: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

A Value Function is Great!

• An optimal value function is the maximum achievable value

Q ∗(s , a ) =maxπ

Qπ(s , a ) =Qπ∗ (s , a ) (5)

• If you know the value function, you can derive policy

π∗ = arg maxa

Q (s , a ) (6)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 9 of 1

Page 11: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Approaches to RL

Value-based RL

• Estimate the optimal value function Q (s , a )• This is the maximum value achievable under any policy

Policy-based RL

• Search directly for the optimal policy π∗

• This is the policy achieving maximum future reward

Model-based RL

• Build a model of the environment

• Plan (e.g. by lookahead) using model

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 10 of 1

Page 12: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Deep Q Learning

• Optimal Q -values should obey equation

Q ∗(s , a ) =Es ′�

r +γQ (s ′, a ′) | s , a�

(7)

• Treat as regression problem

• Minimize:�

r +γmaxa Q (s ′, a ′, ~w )−Q (s , a , ~w )�2

• Converges to Q using table lookup representation

• But diverges using neural networks due to:

◦ Correlations between samples◦ Non-stationary targets

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 11 of 1

Page 13: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Deep RL in Atari

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 12 of 1

Page 14: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

DQN in Atari

• End-to-end learning of values Q (s , a ) from pixels s

• Input state s is stack of raw pixels from last four frames

• Output is Q (s , a ) for 18 joystick/button positions

• Reward is change in score for that step

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 13 of 1

Page 15: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Atari Results

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 14 of 1

Page 16: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Policy-Based RL

• Advantages:

◦ Better convergence properties◦ Effective in high-dimensional or continuous action spaces◦ Can learn stochastic policies

• Disadvantages:

◦ Typically converge to a local rather than global optimum◦ Evaluating a policy is typically inefficient and high variance

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 15 of 1

Page 17: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Optimal Policies Sometimes Stochastic

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 16 of 1

Page 18: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Optimal Policies Sometimes Stochastic

(Cannot distinguish gray states)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 16 of 1

Page 19: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Optimal Policies Sometimes Stochastic

Deterministic

(Cannot distinguish gray states)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 16 of 1

Page 20: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Optimal Policies Sometimes Stochastic

Deterministic

(Cannot distinguish gray states)Value-based RL learns near deterministic policy!

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 16 of 1

Page 21: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Optimal Policies Sometimes Stochastic

Stochastic

(Cannot distinguish gray states, so flip a coin!)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 16 of 1

Page 22: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

Let τ be state-action s0, u0, . . . , sH , uH . Utility of policy π parametrized byθ is

U (θ ) =Eπθ ,U

H∑

t

R (st , ut );πθ

=∑

t a u

P (τ;θ )R (τ). (8)

Our goal is to find θ :

maxθ

U (θ ) =maxθ

t

p (τ;θ )R (τ) (9)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 17 of 1

Page 23: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

t

p (τ;θ )R (τ) (10)

Taking the gradient wrt θ :

(11)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 18 of 1

Page 24: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

t

p (τ;θ )R (τ) (10)

Taking the gradient wrt θ :

∇θU (θ ) =∑

τ

R (τ)P (τ;θ )P (τ;θ )

∇θP (τ;θ ) (11)

(12)

Move differentiation inside sum (ignore R (τ) and then add in term thatcancels out

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 18 of 1

Page 25: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

t

p (τ;θ )R (τ) (10)

Taking the gradient wrt θ :

∇θU (θ ) =∑

τ

R (τ)P (τ;θ )P (τ;θ )

∇θP (τ;θ ) (11)

=∑

τ

P (τ;θ )∇θP (τ;θ )

P (τ;θ )R (τ) (12)

(13)

Move derivative over probability

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 18 of 1

Page 26: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

t

p (τ;θ )R (τ) (10)

Taking the gradient wrt θ :

∇θU (θ ) =∑

τ

R (τ)P (τ;θ )P (τ;θ )

∇θP (τ;θ ) (11)

=∑

τ

P (τ;θ )∇θP (τ;θ )

P (τ;θ )R (τ) (12)

=∑

τ

P (τ;θ )∇θ�

log P (τ;θ )�

R (τ) (13)

Assume softmax form

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 18 of 1

Page 27: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Likelihood Ratio Policy Gradient

t

p (τ;θ )R (τ) (10)

Taking the gradient wrt θ :

=∑

τ

P (τ;θ )∇θ�

log P (τ;θ )�

R (τ) (11)

Approximate with empirical estimate for m sample paths from π

∇θU (θ )≈1

m

m∑

1

∇θ log P (r i ;θ )R (τi ) (12)

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 18 of 1

Page 28: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Policy Gradient Intuition

• Increase probability of paths with positive R

• Decrease probability of paths with negagive R

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 19 of 1

Page 29: Reinforcement Learning for NLP - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CSCI_7000/11a.pdf · Advanced Machine Learning for NLP jBoyd-Graber Reinforcement Learning for NLP 10 of

Extensions

• Consider baseline b (e.g., path averaging)

∇θU (θ )≈1

m

m∑

1

∇θ log P (r i ;θ )(R (τi )− b (τ)) (13)

• Combine with value estimation (critic)

◦ Critic: Updates action-value function parameters◦ Actor: Updates policy parameters in direction suggested by critic

Advanced Machine Learning for NLP | Boyd-Graber Reinforcement Learning for NLP | 20 of 1