Quasi-Recurrent Neural Networks
James Bradbury and Stephen Merity, Salesforce Research
A simple, fast GPU-accelerated architecture for deep learning on sequence data
Deep Learning for Sequences
• Text: e.g. tens of millions of sentence pairs
• Speech: e.g. 10,000 hours of audio
• Time Series
• Logfiles/anomaly detection: e.g. multiple years of server logs
• Finance: e.g. decades of high-resolution prices
Salesforce Einstein
Conventional Neural Sequence Models
• Recurrent neural networks or LSTMs
*Figure from Chris Olah, used with permission
[Figure: layer structure of an LSTM (stacked LSTM/Linear layers), a CNN (stacked Convolution/Max-Pool layers), and a QRNN (stacked Convolution/fo-Pool layers)]
• Problem: hard to effectively utilize the full parallelism of the GPU
LSTM in detail
z_t = tanh(W_z x_t + V_z h_{t-1} + b_z)
i_t = sigmoid(W_i x_t + V_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + V_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + V_o h_{t-1} + b_o)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)
(⊙ denotes element-wise multiplication)
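These equations can be sketched for a single timestep in NumPy (illustrative parameter names and shapes, not the cuDNN implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    # W, V, b map each gate name ("z", "i", "f", "o") to its parameters.
    z = np.tanh(W["z"] @ x_t + V["z"] @ h_prev + b["z"])  # candidate update
    i = sigmoid(W["i"] @ x_t + V["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + V["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x_t + V["o"] @ h_prev + b["o"])  # output gate
    c = i * z + f * c_prev      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c
```

Note that a naive execution runs eight small matrix-vector products per timestep, which is exactly the GPU-utilization problem described in the following slides.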
Optimizing the LSTM
• Individual components of the LSTM equations are fast on the GPU
• Matrix multiply, sigmoid, etc.
• But the overall cell computation isn’t a natural GPU workload
• E.g. lots of small BLAS (matrix multiply) calls
• Throughput limited by CUDA kernel/cuBLAS launch overhead
Optimizing the LSTM
• There are some things we can do:
• Batching matrix multiplications (W_z x_t can be computed ahead of time for all t)
• Fusing element-wise operations (everything in green can be a single kernel)
• Persistent kernels:
• Technically challenging, limits hidden size, difficult to generalize
• Introduced for Pascal GPUs in cuDNN 6
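The batching trick is easy to sketch: the input projections have no recurrent dependency, so the four weight matrices can be stacked and applied to the whole sequence in one large matrix multiply (illustrative NumPy; the function name and shapes are our assumptions, not cuDNN's code):

```python
import numpy as np

def batched_input_projections(X, W_stacked, b_stacked):
    """Compute every gate's input projection for all timesteps at once.

    X: (T, n) input vectors for all T timesteps.
    W_stacked: (4d, n) -- W_z, W_i, W_f, W_o stacked row-wise.
    Returns (T, 4d): one big BLAS call instead of 4T small ones.
    """
    return X @ W_stacked.T + b_stacked
```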
Performance vs. Flexibility
• Optimized LSTM implementations like cuDNN's are inherently inflexible
• New papers add modifications to LSTM architecture every few months
• Variational dropout (Gal and Ghahramani, December 2015)
• Recurrent batch normalization (Cooijmans et al., March 2016)
• Recurrent dropout (Semeniuta et al., March 2016)
• Zoneout (Krueger et al., June 2016)
• Multiplicative LSTM (Krause et al., September 2016)
Example: regularization
• We'll discuss recurrent regularization as an example of needed flexibility
• Why is recurrent regularization required? Standard dropout doesn't work on an RNN's hidden state
Variational dropout
• Variational dropout (Gal and Ghahramani, 2015) "locks" the dropout mask
• Prevents excessive loss on the hidden state, works incredibly well, and is only a dozen lines of code, assuming you can modify your LSTM
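A minimal sketch of the locked mask (framework-agnostic NumPy; the helper name is ours): sample one Bernoulli mask per sequence and reuse it at every timestep, instead of resampling each step as standard dropout does.

```python
import numpy as np

def locked_dropout_mask(shape, p, rng):
    """One dropout mask per sequence, reused across all timesteps."""
    keep = 1.0 - p
    return rng.binomial(1, keep, size=shape) / keep  # inverted-dropout scaling

# Usage sketch: sample once, then apply inside the unrolled recurrence.
#   mask = locked_dropout_mask((batch, hidden), p=0.5, rng)
#   for t in range(T):
#       h = rnn_step(x[t], h * mask)   # same mask at every t
```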
Zoneout
• Zoneout (Krueger et al., 2016) stochastically forces some of the recurrent units in h to maintain their previous values
• Intuitively, imagine this as a faulty update mechanism, where δ (delta) is the update and m is the dropout mask
• Again, a minor modification (only a dozen lines) if you have the flexibility
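That update rule can be sketched as follows (illustrative NumPy; `zoneout_update` is a hypothetical helper): each unit keeps its previous value with probability p, and at test time the two are mixed in expectation.

```python
import numpy as np

def zoneout_update(h_prev, h_new, p, rng, training=True):
    """With probability p, each unit 'zones out' and keeps its previous value."""
    if not training:
        return p * h_prev + (1.0 - p) * h_new  # expected update at test time
    m = rng.binomial(1, p, size=h_prev.shape)  # 1 = keep previous value
    return m * h_prev + (1 - m) * h_new
```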
Example: recurrent dropout
• Recurrent dropout is now standard for achieving state-of-the-art results on many different tasks, and likely a good idea for most tasks regardless
• Example: in word-level language modeling (lower perplexity is better), recurrent dropout allows small models to compete with a far larger LSTM
• Good news: NVIDIA engineers move quickly with an ever-evolving roadmap, but research scientists can't wait for flexibility to be given to them
Back to the Fundamental Problem
[Figure: LSTM vs. CNN vs. QRNN layer structure, as before]
• It's hard to effectively utilize the GPU with lots of small BLAS calls
• We want an architecture that’s inherently more parallel.
How do ConvNets Solve This?
• In the world of images and video, there’s a fully parallel approach
• But sequence data is usually much more sensitive to ordering!
Solution: Quasi-Recurrent Architecture
• Use a convolution, but replace the pooling component with something that’s sensitive to order
QRNN in detail
• Start with a 1D convolution
• parallel across timesteps
• produces all values, including gates + candidate updates
• All that needs to be computed recurrently is a simple element-wise pooling function inspired by the LSTM
• Can be fused across time without having to alternate with BLAS operations
z_t = tanh(W_z ∗ X + b_z)
[i_t = sigmoid(W_i ∗ X + b_i)]
f_t = sigmoid(W_f ∗ X + b_f)
[o_t = sigmoid(W_o ∗ X + b_o)]
(∗ denotes convolution along the time dimension; the bracketed gates are only needed by some pooling variants)

f-pooling:
h_t = (1 − f_t) ⊙ z_t + f_t ⊙ h_{t-1}

fo-pooling:
c_t = (1 − f_t) ⊙ z_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)

ifo-pooling:
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)
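A hypothetical NumPy sketch of one QRNN layer with fo-pooling and a width-2 filter (names and shapes are ours; a real implementation would use a framework convolution plus a fused CUDA kernel for the pooling loop):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fo_pool(Z, F, O):
    """fo-pooling: the only sequential part of the QRNN.

    Z, F, O: (T, d) convolution outputs (candidates, forget and output gates).
    The loop body is purely element-wise -- no matrix multiplies inside it.
    """
    T, d = Z.shape
    c = np.zeros(d)
    H = np.empty((T, d))
    for t in range(T):
        c = (1.0 - F[t]) * Z[t] + F[t] * c
        H[t] = O[t] * np.tanh(c)
    return H

def qrnn_layer(X, Wz, Wf, Wo, bz, bf, bo):
    """X: (T, n). A width-2 causal 'convolution' over (x_{t-1}, x_t),
    computed for all timesteps in parallel, then pooled."""
    Xpad = np.vstack([np.zeros((1, X.shape[1])), X])  # causal zero padding
    Xwin = np.hstack([Xpad[:-1], Xpad[1:]])           # (T, 2n) input windows
    Z = np.tanh(Xwin @ Wz + bz)
    F = sigmoid(Xwin @ Wf + bf)
    O = sigmoid(Xwin @ Wo + bo)
    return fo_pool(Z, F, O)
```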
QRNN implementation
• Efficient 1D convolution is built into most deep learning frameworks
• Automatically parallel across time
• Pooling component is implemented in 40 total lines of CUDA C
• Fused across time into one GPU kernel with a simple for loop
QRNN Speed
[Chart: speed advantage over the cuDNN 5.1 LSTM]
QRNN Speed
[Chart: language modeling speed using the Chainer deep learning framework and cuDNN 5.1]
QRNN Results
Model            Dev Perplexity   Test Perplexity
LSTM             85.7             82.0
QRNN             82.9             79.9
QRNN w/ zoneout  82.1             78.3
Perplexity on the Penn Treebank language modeling dataset (lower is better). All models have 2 layers of 640 hidden units.

Model   Time/Epoch (s)   Test Accuracy (%)
LSTM    480              90.9
QRNN    150              91.4
Accuracy on the IMDb binary sentiment classification dataset. All models have 4 densely-connected layers of 256 hidden units.

Model   Time/Epoch (hr)   BLEU (TED.tst2014)
LSTM    4.2               16.5
QRNN    1.0               19.4
Performance on IWSLT German-English character-level machine translation. All models have 4 layers of 320 hidden units; the QRNN's 1st layer has filter size 6.
• Does this dramatically more minimalist architecture reduce model performance and accuracy?
• Apparently not!
• Some ideas as to why:
• No recurrent weight matrix means no vanishing or exploding gradient problem
• Elimination of recurrent degrees of freedom is a form of regularization
Caveats?
• Must stack more than one QRNN layer
• Probably not a good choice for tasks requiring complex hidden state interaction
• Controlling a reinforcement learning agent
• Performance increase isn’t present at inference time in sequence generation tasks
• Each generated sequence element depends on the previous one
Modularity means Flexibility
• The QRNN separates trainable components (the convolution) from fixed components (the recurrent pooling function)
• The two components can be swapped out separately
• Different filter size for the convolution
• Different pooling functions
• Inserting additional ops between the components makes it easy to implement
• Dropout with both time-locked (variational) and unlocked masks
• Zoneout (dropout on the forget gate!)
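The forget-gate form of zoneout takes only a couple of lines once the pooling function is exposed (illustrative NumPy; the helper name is ours): randomly forcing a forget gate to 1 makes that channel carry its previous cell state through unchanged.

```python
import numpy as np

def zoneout_forget_gates(F, p, rng):
    """Zoneout for the QRNN: with probability p, force a forget gate to 1
    so that channel keeps its previous cell state for this timestep."""
    drop = rng.binomial(1, p, size=F.shape)  # 1 = zone this channel out
    return np.where(drop == 1, 1.0, F)
```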
QRNN interpretability
• "Interpretability" is always a minefield in the context of deep learning models
• But in the QRNN, the recurrent state can only interact element-wise
• So individual channels are more likely to be separable and have distinct meanings
Applications
• Baidu's Silicon Valley AI Lab found QRNNs to give a substantial improvement in accuracy and regularization for a component of their speech generation pipeline
• NVIDIA engineers have applied QRNNs to logfile anomaly detection
• A high-frequency trading firm we probably can’t name uses QRNNs
History
• Similar ideas have been floating around for a long time
• Echo State Networks (Jaeger 2001 through Jaeger 2007)
• Fix RNN recurrent weights randomly, train only non-recurrent weights
• Our work was inspired by several papers whose models can be seen as special cases of QRNNs:
• PixelCNNs (van den Oord et al., January 2016)
• Strongly Typed RNNs (Balduzzi and Ghifary, February 2016)
• Query-Reduction Networks (Seo et al., June 2016)
QRNNs
• Drop-in replacement for LSTM or GRU as a deep learning model for sequences
• Fast even compared to highly tuned custom LSTM kernels
• Fully fused implementation in 40 lines of CUDA
• Easy to train and regularize
• Accuracy equal to or better than LSTM for every task we tried
Read the paper at https://openreview.net/pdf?id=H1zJ-v5xl
Check out our blogpost with code at https://metamind.io/research/
Any questions?
P.S. we’re hiring published research scientists + research engineers with CUDA experience—apply at https://metamind.io/careers