
Quasi-Recurrent Neural Networks

James Bradbury and Stephen Merity, Salesforce Research

A simple, fast GPU-accelerated architecture for deep learning on sequence data

Deep Learning for Sequences

• Text (e.g. 10s of millions of sentence pairs)
• Speech (e.g. 10,000 hours of audio)
• Time Series
• Logfiles/anomaly detection (e.g. multiple years of server logs)
• Finance (e.g. decades of high-resolution prices)
• Training on data at this scale is SLOW

Conventional Neural Sequence Models

• Recurrent neural networks or LSTMs

*Figure from Chris Olah, used with permission

Conventional Neural Sequence Models

[Diagram: three layer stacks compared side by side. LSTM: alternating LSTM/Linear and Linear layers. CNN: alternating Convolution and Max-Pool layers. QRNN: alternating Convolution and fo-Pool layers.]

• Recurrent neural networks or LSTMs
• Problem: hard to effectively utilize the full parallelism of the GPU

LSTM in detail

$$z_t = \tanh(W_z x_t + V_z h_{t-1} + b_z)$$
$$i_t = \mathrm{sigmoid}(W_i x_t + V_i h_{t-1} + b_i)$$
$$f_t = \mathrm{sigmoid}(W_f x_t + V_f h_{t-1} + b_f)$$
$$o_t = \mathrm{sigmoid}(W_o x_t + V_o h_{t-1} + b_o)$$
$$c_t = i_t \odot z_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$
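
The sequential bottleneck is easiest to see in code. Below is a minimal PyTorch sketch of these equations (illustrative only, not the cuDNN implementation; the weight and gate names simply follow the slide): the loop body at step t cannot begin until h and c from step t-1 are available.

```python
import torch

def lstm_forward(x, h0, c0, Wz, Wi, Wf, Wo, Vz, Vi, Vf, Vo, bz, bi, bf, bo):
    """Minimal LSTM sketch. x: (T, batch, input); W*: (hidden, input); V*: (hidden, hidden)."""
    h, c = h0, c0
    outputs = []
    for t in range(x.shape[0]):  # strictly sequential: step t needs h and c from step t-1
        xt = x[t]
        z = torch.tanh(xt @ Wz.T + h @ Vz.T + bz)
        i = torch.sigmoid(xt @ Wi.T + h @ Vi.T + bi)
        f = torch.sigmoid(xt @ Wf.T + h @ Vf.T + bf)
        o = torch.sigmoid(xt @ Wo.T + h @ Vo.T + bo)
        c = i * z + f * c        # element-wise
        h = o * torch.tanh(c)    # element-wise
        outputs.append(h)
    return torch.stack(outputs), (h, c)
```

Each iteration mixes several small matrix multiplies with element-wise operations, which is exactly the launch-overhead-bound pattern the next slides discuss.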


Optimizing the LSTM

• Individual components of the LSTM equations are fast on the GPU
  • Matrix multiply, sigmoid, etc.
• But the overall cell computation isn’t a natural GPU workload
  • E.g. lots of small BLAS (matrix multiply) calls
  • Throughput limited by CUDA kernel/cuBLAS launch overhead

Optimizing the LSTM

• There are some things we can do:
  • Batching matrix multiplications (W_z x_t can be batched ahead of time for all t; see the sketch below)
  • Fusing element-wise operations (the element-wise gate operations, shown in green on the original slide, can be a single kernel)
  • Persistent kernels:
    • Technically challenging, limits hidden size, difficult to generalize
    • Introduced for Pascal GPUs in cuDNN 6
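
As a rough illustration of the batching trick (a PyTorch sketch; the stacked-weight layout and names here are assumptions for the example): the input projections W_z x_t, W_i x_t, W_f x_t, W_o x_t do not depend on h_{t-1}, so they can be computed for every timestep with one large matrix multiply before the sequential loop begins.

```python
import torch

def precompute_input_projections(x, W_stacked, b_stacked):
    """x: (T, batch, input); W_stacked: (4*hidden, input) with W_z, W_i, W_f, W_o stacked row-wise."""
    T, batch, _ = x.shape
    # One large GEMM over all T*batch rows instead of T small per-step GEMMs.
    proj = x.reshape(T * batch, -1) @ W_stacked.T + b_stacked   # (T*batch, 4*hidden)
    return proj.reshape(T, batch, -1)  # split into the z, i, f, o pieces inside the loop
```

Only the V h_{t-1} products and the element-wise gate math then remain inside the per-timestep loop; the element-wise part is what kernel fusion collapses into a single launch per step.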

Performance vs. Flexibility

• Optimized LSTM implementations like cuDNN’s are inherently inflexible
• New papers add modifications to the LSTM architecture every few months:
  • Variational dropout (Gal and Ghahramani, December 2015)
  • Recurrent batch normalization (Cooijmans et al., March 2016)
  • Recurrent dropout (Semeniuta et al., March 2016)
  • Zoneout (Krueger et al., June 2016)
  • Multiplicative LSTM (Krause et al., September 2016)

Example: regularization

• We’ll discuss recurrent regularization as an example of needed flexibility
• Why is recurrent regularization required? Standard dropout doesn’t work on an RNN’s hidden state

Variational dropout

• Variational dropout (Gal and Ghahramani, 2015) “locks” the dropout mask across timesteps
• Prevents excessive loss on the hidden state, works incredibly well, and is only a dozen lines of code, assuming you can modify your LSTM (see the sketch below)
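
A minimal sketch of the idea (PyTorch, illustrative only, not Gal and Ghahramani’s code): sample one dropout mask per sequence and reuse it at every timestep instead of resampling each step.

```python
import torch

def locked_dropout_mask(batch, hidden, p):
    """Sample a single mask to be reused at every timestep of a sequence ("variational" dropout)."""
    keep = 1.0 - p
    return torch.bernoulli(torch.full((batch, hidden), keep)) / keep

# Inside the recurrent loop, the same mask is applied at every step:
#   h = o * torch.tanh(c)
#   h = h * mask   # identical mask for t = 1 .. T
```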

Zoneout

• Zoneout (Krueger et al. 2016) stochastically forces some of the recurrent units in h to maintain their previous values
• Intuitively, imagine this as a faulty update mechanism, where δ (delta) is the update and m is the dropout mask
• Again, a minor modification (only a dozen lines) if you have the flexibility (see the sketch below)
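
A sketch of zoneout during training (PyTorch, illustrative; p is the zoneout probability): with probability p, each unit keeps its previous value instead of taking its new one.

```python
import torch

def zoneout(h_prev, h_new, p, training=True):
    """Randomly keep some units at their previous value (sampled per timestep, per unit)."""
    if not training:
        return p * h_prev + (1.0 - p) * h_new   # expected update at test time
    m = torch.bernoulli(torch.full_like(h_prev, p))  # m = 1 means "zone out" (keep old value)
    return m * h_prev + (1.0 - m) * h_new
    # equivalently: h_prev + (1 - m) * delta, with delta = h_new - h_prev ("faulty update")
```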

Example: recurrent dropout

• Recurrent dropout is now standard in achieving state-of-the-art results on many different tasks, and is likely a good idea for most tasks regardless
• Example: in word-level language modeling (lower perplexity is better), recurrent dropout allows small models to compete with a far larger LSTM
• Good news: NVIDIA engineers move quickly with an ever-evolving roadmap, but research scientists can’t wait for flexibility to be given to them

Back to the Fundamental Problem

[Diagram: LSTM vs. CNN vs. QRNN layer stacks, as before.]

• It’s hard to effectively utilize the GPU with lots of small BLAS calls
• We want an architecture that’s inherently more parallel.

How do ConvNets Solve This?

[Diagram: LSTM vs. CNN vs. QRNN layer stacks, as before.]

• In the world of images and video, there’s a fully parallel approach
• But sequence data is usually much more sensitive to ordering!

Solution: Quasi-Recurrent Architecture

[Diagram: LSTM vs. CNN vs. QRNN layer stacks, as before.]

• Use a convolution, but replace the pooling component with something that’s sensitive to order

QRNN in detail

• Start with a 1D convolution
  • parallel across timesteps
  • produces all values, including gates + candidate updates
• All that needs to be computed recurrently is a simple element-wise pooling function inspired by the LSTM
• Can be fused across time without having to alternate with BLAS operations (see the pooling sketch after the equations)

$$z_t = \tanh(W_z \ast X + b_z)$$
$$[\, i_t = \mathrm{sigmoid}(W_i \ast X + b_i) \,]$$
$$f_t = \mathrm{sigmoid}(W_f \ast X + b_f)$$
$$[\, o_t = \mathrm{sigmoid}(W_o \ast X + b_o) \,]$$

f-pooling:
$$h_t = (1 - f_t) \odot z_t + f_t \odot h_{t-1}$$

fo-pooling:
$$c_t = (1 - f_t) \odot z_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

ifo-pooling:
$$c_t = i_t \odot z_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

(Here $\ast$ is the 1D convolution over the input sequence $X$; the bracketed gates are optional: $o_t$ is used by fo- and ifo-pooling, and $i_t$ only by ifo-pooling.)
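
To make the pooling step concrete, here is a minimal fo-pooling sketch in PyTorch (illustrative; the authors’ version is a fused CUDA kernel). Z, F and O are assumed to have already been produced for all timesteps by the convolution, so the only sequential work is element-wise:

```python
import torch

def fo_pool(Z, F, O, c0):
    """fo-pooling. Z, F, O: (T, batch, hidden), precomputed in parallel; c0: (batch, hidden)."""
    c = c0
    hs = []
    for t in range(Z.shape[0]):              # sequential, but purely element-wise: no BLAS calls
        c = (1.0 - F[t]) * Z[t] + F[t] * c   # c_t = (1 - f_t) ⊙ z_t + f_t ⊙ c_{t-1}
        hs.append(O[t] * torch.tanh(c))      # h_t = o_t ⊙ tanh(c_t)
    return torch.stack(hs), c
```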

QRNN implementation

• Efficient 1D convolution is built into most deep learning frameworks
• Automatically parallel across time (see the sketch below)
• Pooling component is implemented in 40 total lines of CUDA C
• Fused across time into one GPU kernel with a simple for loop
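
As a sketch of the convolution half (PyTorch, illustrative; the shapes, names, and filter width k are assumptions for the example): left-padding a 1D convolution by k - 1 keeps it causal, so the gates at time t depend only on inputs up to t, and the whole sequence is processed in one parallel call.

```python
import torch
import torch.nn.functional as F_nn

def qrnn_gates(x, W, b, k):
    """x: (batch, channels, T); W: (3*hidden, channels, k). Returns Z, F, O, each (batch, hidden, T)."""
    x = F_nn.pad(x, (k - 1, 0))   # pad on the left only, so position t never sees the future
    y = F_nn.conv1d(x, W, b)      # one call, parallel across all timesteps
    z, f, o = y.chunk(3, dim=1)
    return torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
```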

QRNN Speed

[Chart: speed advantage over cuDNN 5.1 LSTM]

QRNN Speed

[Chart: language modeling using the Chainer deep learning framework and cuDNN 5.1]

QRNN Results

Model             Dev Perplexity   Test Perplexity
LSTM              85.7             82.0
QRNN              82.9             79.9
QRNN w/ zoneout   82.1             78.3

Perplexity on the Penn Treebank language modeling dataset (lower is better). All models have 2 layers of 640 hidden units.

Model   Time/Epoch (s)   Test Accuracy (%)
LSTM    480              90.9
QRNN    150              91.4

Accuracy on the IMDb binary sentiment classification dataset. All models have 4 densely-connected layers of 256 hidden units.

Model   Time/Epoch (hr)   BLEU (TED.tst2014)
LSTM    4.2               16.5
QRNN    1.0               19.4

Performance on IWSLT German-English character-level machine translation. All models have 4 layers of 320 hidden units; the QRNN’s 1st layer has filter size 6.

• Does this dramatically more minimalist architecture reduce model performance and accuracy?
• Apparently not!
• Some ideas as to why:
  • No recurrent weight matrix means no vanishing or exploding gradient problem
  • Elimination of recurrent degrees of freedom is a form of regularization

Caveats?

• Must stack more than one QRNN layer
• Probably not a good choice for tasks requiring complex hidden state interaction
  • Controlling a reinforcement learning agent
• Performance increase isn’t present at inference time in sequence generation tasks
  • Each generated sequence element depends on the previous one

Modularity means Flexibility

• The QRNN separates trainable components (the convolution) from fixed components (the recurrent pooling function)
• The two components can be swapped out separately
  • Different filter size for the convolution
  • Different pooling functions
• Inserting additional ops between the components makes it easy to implement:
  • Dropout with both time-locked (variational) and unlocked masks
  • Zoneout (dropout on the forget gate! see the sketch below)
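
For example, here is one way such an insertion could look (an illustrative sketch, not the authors’ implementation): because the forget gate is produced up front by the convolution, zoneout can be written as dropout applied to 1 - f before pooling runs, here with an explicit mask and no rescaling so f stays in [0, 1].

```python
import torch

def zoneout_forget_gate(F, p, training=True):
    """Drop out (1 - f): wherever the mask hits, f becomes 1 and the pooled state is carried over."""
    if not training:
        return 1.0 - (1.0 - F) * (1.0 - p)               # expected value at test time
    mask = torch.bernoulli(torch.full_like(F, 1.0 - p))  # zero with probability p, no rescaling
    return 1.0 - (1.0 - F) * mask
```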

QRNN interpretability

• “Interpretability” is always a minefield in the context of deep learning models
• But in the QRNN, the recurrent state can only interact element-wise
• So individual channels are more likely to be separable and have distinct meanings

Applications

• Baidu’s Silicon Valley AI Lab found QRNNs to give a substantial improvement in accuracy and regularization for a component of their speech generation pipeline
• NVIDIA engineers have applied QRNNs to logfile anomaly detection
• A high-frequency trading firm we probably can’t name uses QRNNs

History

• Similar ideas have been floating around for a long time
  • Echo State Networks (Jaeger 2001 through Jaeger 2007)
  • Fix RNN recurrent weights randomly, train only non-recurrent weights
• Our work was inspired by several papers whose models can be seen as special cases of QRNNs:
  • PixelCNNs (van den Oord et al., January 2016)
  • Strongly Typed RNNs (Balduzzi and Ghifary, February 2016)
  • Query-Reduction Networks (Seo et al., June 2016)

QRNNs

• Drop-in replacement for LSTM or GRU as a deep learning model for sequences
• Fast even compared to highly tuned custom LSTM kernels
• Fully fused implementation in 40 lines of CUDA
• Easy to train and regularize
• Accuracy equal to or better than LSTM for every task we tried

Read the paper at https://openreview.net/pdf?id=H1zJ-v5xl

Check out our blogpost with code at https://metamind.io/research/

Any questions?

P.S. we’re hiring published research scientists + research engineers with CUDA experience—apply at https://metamind.io/careers