Quasi-Recurrent Neural Networks

Transcript of "Quasi-Recurrent Neural Networks" by James Bradbury and Stephen Merity, Salesforce Research (NVIDIA GPU Technology Conference, on-demand.gputechconf.com)

Page 1: Quasi-Recurrent Neural Networks

Quasi-Recurrent Neural Networks

James Bradbury and Stephen Merity, Salesforce Research

A simple, fast GPU-accelerated architecture for deep learning on sequence data

Page 2: Quasi-Recurrent Neural Networks

Deep Learning for Sequences

• Text

• Speech

• Time Series

• Logfiles/anomaly detection

• Finance

Page 3: Quasi-Recurrent Neural Networks

Deep Learning for Sequences

• Text (e.g. 10s of millions of sentence pairs)

• Speech (e.g. 10,000 hours of audio)

• Time Series

• Logfiles/anomaly detection (e.g. multiple years of server logs)

• Finance (e.g. decades of high-resolution prices)

Page 4: Quasi-Recurrent Neural Networks

Deep Learning for Sequences

(Same list as the previous slide, with the punchline: training on data at this scale is SLOW.)

Page 5: Quasi-Recurrent Neural Networks

Conventional Neural Sequence Models

• Recurrent neural networks or LSTMs

*Figure from Chris Olah, used with permission

Page 6: Quasi-Recurrent Neural Networks

Conventional Neural Sequence Models

[Figure: layer-stack comparison of LSTM, CNN, and QRNN architectures. The LSTM stack alternates LSTM/Linear and Linear layers; the CNN stack alternates Convolution and Max-Pool layers; the QRNN stack alternates Convolution and fo-Pool layers.]

• Recurrent neural networks or LSTMs

• Problem: hard to effectively utilize the full parallelism of the GPU

Page 7: Quasi-Recurrent Neural Networks

LSTM in detail

z_t = tanh(W_z x_t + V_z h_{t-1} + b_z)

i_t = sigmoid(W_i x_t + V_i h_{t-1} + b_i)

f_t = sigmoid(W_f x_t + V_f h_{t-1} + b_f)

o_t = sigmoid(W_o x_t + V_o h_{t-1} + b_o)

c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}

h_t = o_t ⊙ tanh(c_t)

(⊙ denotes element-wise multiplication.)
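For concreteness, here is a minimal NumPy sketch of one LSTM timestep implementing the equations above (an illustration, not any particular library's implementation; the stacked-parameter layout is an assumption made for the example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    """One LSTM timestep following the equations above.
    W: (input_dim, 4H), V: (H, 4H), b: (4H,) hold the stacked z/i/f/o parameters."""
    gates = x_t @ W + h_prev @ V + b            # two matrix multiplies per timestep
    z, i, f, o = np.split(gates, 4, axis=-1)    # everything below is element-wise
    z = np.tanh(z)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_t = i * z + f * c_prev
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Note that h_{t-1} enters the matrix multiplies, so each step depends on the previous one; this serial dependency is what the next slides are about.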

Page 8: Quasi-Recurrent Neural Networks

LSTM in detail

(Same LSTM equations as the previous slide.)

Page 9: Quasi-Recurrent Neural Networks

Optimizing the LSTM

• Individual components of the LSTM equations are fast on GPU

• Matrix multiply, sigmoid, etc.

• But the overall cell computation isn’t a natural GPU workload

• E.g. lots of small BLAS (matrix multiply) calls

• Throughput limited by CUDA kernel/cuBLAS launch overhead

Page 10: Quasi-Recurrent Neural Networks

Optimizing the LSTM

• There are some things we can do:

• Batching matrix multiplications (W_z x_t can be batched ahead of time for all t; see the sketch after this list)

• Fusing element-wise operations (everything shown in green on the slide, i.e. the gate nonlinearities and state updates, can be a single kernel)

• Persistent kernels:

• Technically challenging, limits hidden size, difficult to generalize

• Introduced for Pascal GPUs in cuDNN 6
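As a rough illustration of the first point, here is a NumPy sketch (not the authors' code; all dimensions are made up) of precomputing the input-side matrix multiplies for every timestep in one large GEMM:

```python
import numpy as np

rng = np.random.default_rng(0)
T, B, D, H = 100, 64, 256, 256            # timesteps, batch, input dim, hidden dim
X = rng.standard_normal((T, B, D))        # the whole input sequence
W = rng.standard_normal((D, 4 * H))       # stacked W_z, W_i, W_f, W_o
b = np.zeros(4 * H)

# One large matrix multiply instead of T small ones:
WX = X.reshape(T * B, D) @ W + b          # (T*B, 4H)
WX = WX.reshape(T, B, 4 * H)              # WX[t] is now precomputed for every t
```

What remains per step is the small recurrent product V h_{t-1} plus the element-wise gate and state updates, which are the part that can be fused into a single kernel.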

Page 11: Quasi-Recurrent Neural Networks

Performance vs. Flexibility

• Optimized LSTM implementations like cuDNN's are inherently inflexible

• New papers add modifications to LSTM architecture every few months

• Variational dropout (Gal and Ghahramani, December 2015)

• Recurrent batch normalization (Cooijmans et al., March 2016)

• Recurrent dropout (Semeniuta et al., March 2016)

• Zoneout (Krueger et al., June 2016)

• Multiplicative LSTM (Krause et al., September 2016)

Page 12: Quasi-Recurrent Neural Networks

Example: regularization

• We'll discuss recurrent regularization as an example of needed flexibility

• Why is recurrent regularization required? Standard dropout doesn't work on an RNN's hidden state

Page 13: Quasi-Recurrent Neural Networks

(Same slide as Page 12.)

Page 14: Quasi-Recurrent Neural Networks

Variational dropout

• Variational dropout (Gal and Ghahramani, 2015) "locks" the dropout mask: the same mask is reused at every timestep

• Prevents excessive loss of information in the hidden state, works incredibly well, and is only a dozen lines of code, assuming you can modify your LSTM (see the sketch below)
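A minimal sketch of the locked mask, assuming the usual inverted-dropout scaling; the lstm_step call is a hypothetical stand-in for whatever cell is being used:

```python
import numpy as np

rng = np.random.default_rng(0)

def locked_dropout_mask(shape, p, rng):
    """Sample one Bernoulli mask per sequence and reuse it at every timestep
    (the "locked"/variational mask), with inverted-dropout scaling."""
    return (rng.random(shape) > p) / (1.0 - p)

B, H, T, p = 64, 256, 100, 0.5
mask = locked_dropout_mask((B, H), p, rng)    # sampled once, outside the time loop
h = np.zeros((B, H))
for t in range(T):
    h_dropped = h * mask                      # same mask at every timestep
    # h, c = lstm_step(x[t], h_dropped, c, W, V, b)   # hypothetical cell call
```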

Page 15: Quasi-Recurrent Neural Networks

Zoneout

• Zoneout (Krueger et al. 2016) stochastically forces some of the recurrent units in h to maintain their previous values

• Intuitively, imagine this as a faulty update mechanism, roughly h_t = h_{t-1} + m ⊙ δ, where δ (delta) is the update and m is the dropout mask

• Again, a minor modification (only a dozen lines) if you have the flexibility; sketched below
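A correspondingly small sketch of zoneout on a hidden state (an assumed formulation matching the description above, not the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout(h_prev, h_new, p, rng):
    """Each hidden unit keeps its previous value with probability p
    instead of taking the proposed update."""
    keep_old = rng.random(h_prev.shape) < p       # 1 = "zone out" this unit
    return np.where(keep_old, h_prev, h_new)

B, H = 64, 256
h_prev = np.zeros((B, H))
h_new = np.tanh(rng.standard_normal((B, H)))      # stand-in for the RNN's proposed h_t
h_t = zoneout(h_prev, h_new, p=0.1, rng=rng)
```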

Page 16: Quasi-Recurrent Neural Networks

Example: recurrent dropout

• Recurrent dropout is now standard for achieving state-of-the-art results on many different tasks, and is likely a good idea for most tasks regardless

• Example: in word-level language modeling (lower perplexity is better), recurrent dropout allows small models to compete with a far larger LSTM

• Good news: NVIDIA engineers move quickly with an ever-evolving roadmap, but research scientists can't wait for flexibility to be given to them

Page 17: Quasi-Recurrent Neural Networks

Back to the Fundamental Problem

[Figure: the same LSTM / CNN / QRNN layer-stack comparison as on Page 6.]

• It's hard to effectively utilize the GPU with lots of small BLAS calls

• We want an architecture that’s inherently more parallel.

Page 18: Quasi-Recurrent Neural Networks

How do ConvNets Solve This?

[Figure: the same LSTM / CNN / QRNN layer-stack comparison as on Page 6.]

• In the world of images and video, there’s a fully parallel approach

Page 19: Quasi-Recurrent Neural Networks

How do ConvNets Solve This?

[Figure: the same LSTM / CNN / QRNN layer-stack comparison as on Page 6.]

• In the world of images and video, there’s a fully parallel approach

• But sequence data is usually much more sensitive to ordering!

Page 20: Quasi-Recurrent Neural Networks

Solution: Quasi-Recurrent Architecture

[Figure: the same LSTM / CNN / QRNN layer-stack comparison as on Page 6.]

• Use a convolution, but replace the pooling component with something that’s sensitive to order

Page 21: Quasi-Recurrent Neural Networks

QRNN in detail

• Start with 1D convolution

• parallel across timesteps

• produces all values, including gates + candidate updates

• All that needs to be computed recurrently is a simple element-wise pooling function inspired by the LSTM

• Can be fused across time without having to alternate with BLAS operations

z_t = tanh(W_z ∗ X + b_z)

[ i_t = sigmoid(W_i ∗ X + b_i) ]

f_t = sigmoid(W_f ∗ X + b_f)

[ o_t = sigmoid(W_o ∗ X + b_o) ]

(∗ denotes the 1D convolution over the sequence X; ⊙ denotes element-wise multiplication.)

f-pooling:
h_t = (1 - f_t) ⊙ z_t + f_t ⊙ h_{t-1}

fo-pooling:
c_t = (1 - f_t) ⊙ z_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)

ifo-pooling:
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)
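A minimal NumPy sketch of fo-pooling (illustrative only, not the released implementation): Z, F, O come from convolutions that run in parallel over the whole sequence, and only this element-wise loop is sequential:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fo_pool(Z, F, O):
    """Z, F, O: (T, B, H) gate activations from the 1D convolutions."""
    T, B, H = Z.shape
    c = np.zeros((B, H))
    h = np.empty((T, B, H))
    for t in range(T):                        # the only sequential part
        c = F[t] * c + (1.0 - F[t]) * Z[t]    # element-wise, no matrix multiply
        h[t] = O[t] * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, B, H = 50, 32, 128
Z = np.tanh(rng.standard_normal((T, B, H)))
F = sigmoid(rng.standard_normal((T, B, H)))
O = sigmoid(rng.standard_normal((T, B, H)))
H_out = fo_pool(Z, F, O)
```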

Page 22: Quasi-Recurrent Neural Networks

QRNN implementation

• Efficient 1D convolution is built into most deep learning frameworks

• Automatically parallel across time

• Pooling component is implemented in 40 total lines of CUDA C

• Fused across time into one GPU kernel with a simple for loop (see the per-channel sketch below)
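One way to see why the pooling fuses into a single small kernel: the recurrence is purely element-wise, so every (batch, channel) pair evolves independently and a thread can run the whole time loop for its pair locally. A scalar Python sketch of that per-thread work (hypothetical, for illustration only):

```python
import math

def fo_pool_one_channel(z, f, o):
    """z, f, o: length-T sequences for a single (batch, channel) pair."""
    c, h = 0.0, []
    for t in range(len(z)):
        c = f[t] * c + (1.0 - f[t]) * z[t]   # no interaction with other channels
        h.append(o[t] * math.tanh(c))
    return h
```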

Page 23: Quasi-Recurrent Neural Networks

QRNN Speed

(speed advantage over cuDNN 5.1 LSTM)

Page 24: Quasi-Recurrent Neural Networks

QRNN Speed

(language modeling using Chainer deep learning framework and cuDNN 5.1)

Page 25: Quasi-Recurrent Neural Networks

QRNN Results

Model              Dev Perplexity   Test Perplexity
LSTM               85.7             82.0
QRNN               82.9             79.9
QRNN w/ zoneout    82.1             78.3

Perplexity on the Penn Treebank language modeling dataset (lower is better). All models have 2 layers of 640 hidden units.

Model   Time/Epoch (s)   Test Accuracy (%)
LSTM    480              90.9
QRNN    150              91.4

Accuracy on the IMDb binary sentiment classification dataset. All models have 4 densely-connected layers of 256 hidden units.

Model   Time/Epoch (hr)   BLEU (TED.tst2014)
LSTM    4.2               16.5
QRNN    1.0               19.4

Performance on IWSLT German-English character-level machine translation. All models have 4 layers of 320 hidden units; the QRNN's 1st layer has filter size 6.

• Does this dramatically more minimalist architecture reduce model performance and accuracy?

• Apparently not!

• Some ideas as to why:

• No recurrent weight matrix means no vanishing or exploding gradient problem

• Elimination of recurrent degrees of freedom is a form of regularization

Page 26: Quasi-Recurrent Neural Networks

Caveats?

• Must stack more than one QRNN layer

• Probably not a good choice for tasks requiring complex hidden state interaction

• Controlling a reinforcement learning agent

• The speed advantage isn't present at inference time in sequence generation tasks

• Each generated sequence element depends on the previous one

Page 27: Quasi-Recurrent Neural Networks

Modularity means Flexibility

• The QRNN separates trainable components (the convolution) from fixed components (the recurrent pooling function)

• The two components can be swapped out separately

• Different filter size for the convolution

• Different pooling functions

• Inserting additional ops between the components makes it easy to implement (see the sketch after this list):

• Dropout with both time-locked (variational) and unlocked masks

• Zoneout (dropout on the forget gate!)
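For example, a sketch of the last point, zoneout via dropout on the forget gate (illustrative; the exact masking and scaling in a real implementation may differ): randomly forcing channels of f to 1 makes those channels carry c_{t-1} forward unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout_forget_gate(F, p, rng):
    """F: (T, B, H) sigmoid forget-gate activations; p: zoneout probability.
    Channels forced to f = 1 satisfy c_t = c_{t-1} under fo-pooling."""
    keep_prev = rng.random(F.shape) < p
    return np.where(keep_prev, 1.0, F)

T, B, H = 50, 32, 128
F = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, B, H))))
F_zoned = zoneout_forget_gate(F, p=0.1, rng=rng)
# F_zoned can then be fed to the same fo-pooling loop sketched earlier.
```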

Page 28: Quasi-Recurrent Neural Networks

QRNN interpretability

• "Interpretability" is always a minefield in the context of deep learning models

• But in the QRNN, the recurrent state can only interact element-wise

• So individual channels are more likely to be separable and have distinct meanings

Page 29: Quasi-Recurrent Neural Networks

Applications

• Baidu's Silicon Valley AI Lab found QRNNs to give a substantial improvement in accuracy and regularization for a component of their speech generation pipeline

• NVIDIA engineers have applied QRNNs to logfile anomaly detection

• A high-frequency trading firm we probably can’t name uses QRNNs

Page 30: Quasi-Recurrent Neural Networks

History

• Similar ideas have been floating around for a long time

• Echo State Networks (Jaeger 2001 through Jaeger 2007)

• Fix RNN recurrent weights randomly, train only non-recurrent weights

• Our work was inspired by several papers whose models can be seen as special cases of QRNNs:

• PixelCNNs (van den Oord et al., January 2016)

• Strongly Typed RNNs (Balduzzi and Ghifary, February 2016)

• Query-Reduction Networks (Seo et al., June 2016)

Page 31: Quasi-Recurrent Neural Networks

QRNNs

• Drop-in replacement for LSTM or GRU as a deep learning model for sequences

• Fast even compared to highly tuned custom LSTM kernels

• Fully fused implementation in 40 lines of CUDA

• Easy to train and regularize

• Accuracy equal to or better than LSTM for every task we tried

Page 32: Quasi-Recurrent Neural Networks

Read the paper at https://openreview.net/pdf?id=H1zJ-v5xl

Check out our blogpost with code at https://metamind.io/research/

Any questions?

P.S. we’re hiring published research scientists + research engineers with CUDA experience—apply at https://metamind.io/careers