
Machine Translation with LSTMs

Ilya Sutskever
Oriol Vinyals
Quoc Le

Google Inc.

Deep Neural Networks

1. Can perform an astonishingly wide range of computations

2. Can be learned from data

[Diagram: Venn overlap of powerful models and learnable models; the intersection is deep neural networks]

Powerful models are necessary

● A weak model will never get good performance
● Examples of weak models:
○ Single layer logistic regression
○ Linear SVM
○ CRFs
○ Small neural nets
○ Small conv nets

● A neural network needs to be large and deep to be powerful


A trainable model is necessary

● What’s the use of a powerful model if we can’t train it?

● That’s why supervised backpropagation is so important

● 10-layer neural nets easily trainable with backprop



Why are deep nets powerful?

● A single neuron can implement boolean logic, and thus general computation and computers

[Diagram: single neurons implementing OR (weights +1, +1; bias -0.5), AND (weights +1, +1; bias -1.5), and NOT (weight -1; bias +0.5)]


● A mid-sized 2-hidden-layer neural network can sort N N-bit numbers
○ Intuitively, sorting requires log N parallel steps
○ It’s amazing, try it at home with backpropagation! (see the sketch below)


→ Neurons are more economical than boolean logic
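The “try it at home” exercise can be set up in a few lines. Below is a minimal, illustrative PyTorch sketch (my own rendering, not the talk’s code); the sizes, bit encoding, and training length are all assumptions:

```python
# Hypothetical setup: teach a 2-hidden-layer MLP to sort N B-bit numbers.
# Input: the N numbers as N*B bits; target: the same bits after sorting.
import torch
import torch.nn as nn

N, B, H = 4, 8, 512  # 4 numbers, 8 bits each, hidden width (all assumed)

def batch(size):
    nums = torch.randint(0, 2 ** B, (size, N))
    sorted_nums, _ = nums.sort(dim=1)
    to_bits = lambda x: ((x.unsqueeze(-1) >> torch.arange(B)) & 1).float().flatten(1)
    return to_bits(nums), to_bits(sorted_nums)

net = nn.Sequential(
    nn.Linear(N * B, H), nn.Tanh(),
    nn.Linear(H, H), nn.Tanh(),
    nn.Linear(H, N * B),  # one logit per output bit
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(20001):
    x, y = batch(128)
    loss = loss_fn(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 4000 == 0:
        acc = ((net(x) > 0).float() == y).float().mean().item()
        print(f"step {step}: loss {loss.item():.4f}, bit accuracy {acc:.3f}")
```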

“The Deep Learning Hypothesis”

● Human perception is fast
○ Neurons fire at most 100 times a second
○ Humans solve perception in 0.1 seconds

→ our neurons fire 10 times, at most

● 10-layer neural networks can be trained well in practice

Anything a human can do in 0.1 seconds, a big 10-layer neural network can do, too!


How to solve any problem?

● Use a lot of good AND labelled training data
● Use a big deep neural network

● → Success is the only possible outcome, literally
○ Otherwise the neural network is too small


The deep learning hypothesis is true!

● Big deep nets get the best results ever on:
○ Speech recognition
○ Object recognition

● Deep learning really works!
○ But there are other problems, too, such as MT

Deep nets can’t solve all problems

● Inputs and outputs must be of fixed dimensionality
○ Great for images: input is a big image of a fixed size, output is a 1-of-N encoding of category

● Bad news for machine translation and speech recognition

[Diagram: a feedforward network with unit-specific connections mapping a fixed-size input to a fixed-size output]

Goal: a general sequence to sequence neural network

● The hope: a generic method that can be successfully applied to any sequence-to-sequence problem, and achieve excellent results
○ MT, Q&A, ASR, squiggle recognition, etc.

● Manage expectations: we don’t beat the state of the art, but we are close to a strong MT baseline system on a large publicly available dataset!

Recurrent Neural Networks (RNNs)

● RNNs can work with sequences

[Diagram: an RNN unrolled in time over six timesteps (t=1 to t=6): at each step, inp → hid → out, with the hidden state passed forward to the next step]

Key idea: each timestep is a different layer with the same weights

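To make “each timestep is a different layer with the same weights” concrete, here is a minimal forward pass in NumPy (an illustrative sketch with made-up shapes, not the talk’s code):

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_ho, h0):
    """Unrolled RNN: the SAME weights are applied at every timestep ("layer")."""
    h, outputs = h0, []
    for x in inputs:                      # one layer per timestep
        h = np.tanh(W_xh @ x + W_hh @ h)  # hidden state carries the past forward
        outputs.append(W_ho @ h)
    return outputs, h

# Toy run: 6 timesteps of 3-dim input, 5 hidden units, 2-dim output.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(6)]
W_xh, W_hh, W_ho = (rng.standard_normal(s) * 0.1 for s in [(5, 3), (5, 5), (2, 5)])
outs, h_final = rnn_forward(xs, W_xh, W_hh, W_ho, np.zeros(5))
```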

Recurrent Neural Networks (RNNs)

● Neural networks that can process sequences well
○ Very expressive models

● Use backpropagation
○ Fun fact: recurrent neural networks were trained in the original backpropagation paper in 1986

● Sadly, RNNs are hard to train with backpropagation
○ Unstable
○ Has trouble learning “long-term dependencies”
○ Vanishing gradient problems (Hochreiter 1991; Bengio et al., 1994), illustrated below

● There are ways to learn RNNs but they are hard to use
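A quick numerical illustration of the vanishing/exploding gradient problem (my own demo, not from the talk): backpropagation through T timesteps multiplies T Jacobians together, so the gradient shrinks or blows up exponentially in T.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((50, 50)) * 0.05  # recurrent weights; scale is illustrative
grad = np.eye(50)
for t in range(100):          # backprop through 100 timesteps
    grad = grad @ W           # (tanh' factors would shrink it even further)
    if t % 20 == 19:
        print(f"after {t + 1} steps, gradient norm ~ {np.linalg.norm(grad):.2e}")
# At this scale the norm collapses toward 0 (vanishing gradients);
# with a larger scale, e.g. 0.5, it explodes instead.
```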

Long Short-Term Memory (LSTM)

● An RNN architecture that is good at long-term dependencies

[Diagram: the LSTM cell: memory M and hidden state H, gates I1, I2, F, and O, multiplicative (×) interactions, and an additive (+) update to the memory]

The heart of the LSTM

● Addition has nice gradients
○ All terms in a sum contribute equally

● LSTM is good at noticing long-range correlations
○ Because of the nice gradients of addition

● Main advantage (over Hessian-Free optimization): requires little tuning
○ Hugely important in new applications

RNNs overwrite the hidden state

LSTMs add to the hidden state
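A minimal sketch of one LSTM step in NumPy (the standard formulation; the talk’s exact variant may differ), showing the additive update at the heart of the cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM timestep; W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g   # ADD into the memory cell: the "nice gradients" path
    h = o * np.tanh(c)  # expose a gated view of the memory
    return h, c
```

Because `c` is updated by addition rather than overwritten, gradients can flow back through many timesteps along the `f * c` path without being repeatedly squashed.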

How to use an LSTM to map sequences to sequences?

● Normal formulations of LSTMs and RNNs have issues:
○ Length of input sequence = length of output sequence
○ Not good for either ASR or MT

● Every strategy for mapping sequences to sequences has an HMM-like component
○ Normal ASR approaches have a big complicated transducer
○ Connectionist Temporal Classification (CTC) assumes monotonic alignments

● But we want something simpler and more generic
○ Should be applicable to any sequence-to-sequence problem
■ Including MT, where words can be reordered in many ways

Main idea

● Neural nets are excellent at learning very complicated functions

● “Coerce” a neural network to read one sequence and produce another

● Learning should take care of the rest

● All neural networks are equivalent anyway
○ So the important thing is to provide the neural network with all the information
○ And to make it trainable and big

Main idea

[Diagram: the LSTM reads the input sequence “A B C D”, a delimiter “__”, and the target words produced so far “X Y Z”, and is trained to predict the target sequence “X Y Z Q” one symbol at a time]

That’s it!

● The LSTM needs to read the entire input sequence, and then produce the target sequence “from memory”

● The input sequence is stored by a single LSTM hidden state

● Surely the LSTM’s state could store only a handful of words and nothing else?
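Here is a compact, illustrative PyTorch rendering of this read-then-produce scheme (my own sketch; the dimensions and names are assumptions, and the talk’s actual model is the much larger 4-layer LSTM described later):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Read the entire input; its summary is the encoder's final (h, c) state.
        _, state = self.encoder(self.src_emb(src))
        # Produce the target "from memory", starting from that state alone.
        dec, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec)  # next-word logits at every target position
```

Training pairs would be (source sentence, delimiter-prefixed target), with ordinary cross-entropy loss on the shifted target words; no alignment machinery is needed.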

Step 1: can the LSTM reconstruct its input?

● Can this scheme learn the identity function?

● Answer: yes, and very easily; it does so effortlessly, reaching a test perplexity of 1.03 (see the note below)

[Diagram: the LSTM reads “A B C D __ A B C” and predicts the target sequence “A B C D”, i.e. it reconstructs its input]
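For reference, this is the standard definition of perplexity (not spelled out on the slide): the exponentiated average negative log-likelihood per predicted symbol, so a value of 1.03 means the model is almost certain of each next symbol.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p\left(y_t \mid y_{<t}, x\right) \right)
```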

Step 2: small dataset experiments: EuroParl

● French to English
○ Low-entropy parliament language
○ 20M words in total
○ Small vocabulary
○ Sentence length no longer than 25

● Although: 25 words is not that small

● Early results were encouraging

Digression: decoding

● Formally, given an input sentence, the LSTM defines a distribution over output sentences

● Therefore, we should produce the sentence with the highest probability

● But there are exponentially many sentences; how do we find it?

● Search problem: use simple greedy beam search
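Concretely, this is the standard chain-rule factorization (implicit in the slide): the LSTM scores a translation one word at a time, and decoding searches for the highest-scoring sentence.

```latex
p(y \mid x) = \prod_{t=1}^{T} p\left(y_t \mid y_1, \dots, y_{t-1}, x\right),
\qquad
\hat{y} = \arg\max_{y}\, p(y \mid x)
```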

Decoding in a nutshell

● Proceed left to right
● Maintain N partial translations
● Expand each translation with possible next words
● Discard all but the top N new partial translations (see the code sketch below)

[Diagram: beam search with N = 2. The partial hypotheses “I” and “My” are expanded into “I decided”, “My decision”, “I thought”, “I tried”, “My thinking”, and “My direction”; pruning keeps the top 2 new partial hypotheses, “I decided” and “My decision”, and the expand-and-sort cycle repeats]
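A minimal beam-search sketch in Python (illustrative only: `next_log_probs` stands in for the LSTM’s next-word distribution, and `EOS` for its end-of-sentence symbol):

```python
import heapq

def beam_search(next_log_probs, EOS, beam=2, max_len=50):
    """next_log_probs(prefix) -> list of (word, log_prob) continuations."""
    hyps = [(0.0, [])]                        # (log-probability, partial translation)
    for _ in range(max_len):
        expanded = []
        for score, prefix in hyps:
            if prefix and prefix[-1] == EOS:  # finished hypotheses pass through
                expanded.append((score, prefix))
                continue
            for word, lp in next_log_probs(prefix):
                expanded.append((score + lp, prefix + [word]))
        hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])  # prune to top N
        if all(p and p[-1] == EOS for _, p in hyps):
            break
    return max(hyps, key=lambda h: h[0])      # best complete translation found
```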

Why does simple beam-search work?

● The LSTM is trained to predict the next word given previous words

● If next step prediction is good, truth should be among the most likely next words
○ Empirically, the small beam seems to work fairly well (so we think!)

● But: we have decoding failures
○ The net produces zero-length sentences
○ Fixable with a heuristic

Model for big experiments

● 160K input words
● 80K output words
● 4 layers of 1000D LSTM
● Different LSTMs for input and output language
● 384M parameters

The model

[Diagram: the model reads “A B C D __ A B C” and predicts “A B C D”. Annotations: 80K softmax by 1000 dims (this is very big!); 1000 LSTM cells, 2000 dims per timestep; 2000 × 4 = 8K dims per sentence; 160K vocab in the input language]
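A back-of-the-envelope check that these numbers add up (my own arithmetic, with the assumption that the decoder also embeds its 80K output words as inputs):

```python
# Each LSTM layer: 4 gates, each mapping [input; hidden] (plus bias) to 1000 units.
per_layer = 4 * (1000 * 1000 + 1000 * 1000 + 1000)   # ~8M
recurrent = 2 * 4 * per_layer                         # 4 layers x 2 LSTMs: ~64M
embeddings = 160_000 * 1000 + 80_000 * 1000           # input-side word vectors: 240M
softmax = 80_000 * 1000                               # 1000 dims -> 80K words: 80M
print((recurrent + embeddings + softmax) / 1e6)       # ~384M, matching the slide
```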

Parallelization

● Parallelization is important
● More parallelization is better; ongoing work
● 8 GPUs
● More details in the upcoming paper

Results on a big dataset

● Corpus: WMT’14 English → French
● 680M words
● About 50K test words
● An average of 6 models gets a BLEU score of 30.73
● The strong SMT baseline gets 33.3
● State of the art is 35.8

● When we rescore the baseline’s n-best lists using an average of 6 models, we get 36.36
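For concreteness, n-best rescoring looks roughly like this (an illustrative sketch; `lstm_log_prob` stands in for the seq2seq model’s score, and the interpolation weight is an assumption):

```python
def rescore(nbest, lstm_log_prob, alpha=0.5):
    """nbest: list of (translation, smt_score) pairs from the baseline system.
    Re-rank by interpolating the SMT score with the LSTM's log-probability."""
    return max(nbest,
               key=lambda cand: alpha * cand[1] + (1 - alpha) * lstm_log_prob(cand[0]))
```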

Our system suffers on rare words

It also suffers on long sentences

Break long sentences into pieces

Representations


Examples

● Due to a technicality, the following examples were generated from a model that did not converge
○ Actual translations are better

Examples

● FR: Les avionneurs se querellent au sujet de la largeur des sièges alors que de grosses commandes sont en jeu

● GT: Aircraft manufacturers are quarreling about the seat width as large orders are at stake

● LSTM: Aircraft manufacturers are concerned about the width of seats while large orders are at stake

● TRUTH: Jet makers feud over seat width with big orders at stake

Example

● FR: La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers , ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu de mois-ci .

● GT: The dispute rages between the major aircraft manufacturers about the width of seats in the economy class on long-haul flights, paving the way for a bitter confrontation at the Dubai Airshow which takes place month.

● LSTM: The dispute is raging between large aircraft manufacturers on the size of the tourist seats on the long-haul flights , leading to a bitter confrontation at the Dubai Airshow in the month of October .

● TRUTH: A row has flared up between leading plane makers over the width of tourist-class seats on long-distance flights , setting the tone for a bitter confrontation at this month 's Dubai Airshow

Example

● FR: Le conflit porte sur la largeur des sièges proposés sur les vols long-courriers aux passagers de la classe économique – qui ne sont pas toujours les plus courtisés par les compagnies aériennes , mais auxquels l' espace alloué est essentiel pour augmenter les gains d' efficacité dans les derniers appareils présentés par Airbus SAS et Boeing Co .

● GT: The conflict is about the width of the proposed long-haul passengers in economy class seats - which are not always the most sought by airlines, but that &apos; allocated space is essential to increase the gains &apos; efficiency in the latest devices made by Airbus SAS and Boeing Co.

● LSTM: The dispute is about the size of the seats on the long-haul passenger airlines , which are not always the same as the airlines that are in the air , but the most important thing is to increase the efficiency of the aircraft in the latest Airbus and Boeing SAS aircraft

● TRUTH: The dispute focuses on the width of seats provided on long-haul flights for economy passengers - not always the ones most courted by airlines , but whose allocated space holds the key to efficiency claims for the latest jets offered by Airbus SAS and Boeing Co .

Example (out of vocabulary word)

● FR: Aujourd' hui , Airbus en appelle directement au public avant le salon aéronautique de Dubaï , où le 777X devrait prendre le pas sur ses concurrents avec plus de 100 commandes .

● GT: Today, Airbus is appealing directly to the public before the Dubai Airshow, where the 777X should take precedence over its competitors with over 100 commands.

● LSTM: Today , Airbus is calling directly to the public before the Dubai Airshow , where it is expected to take over its competitors with more than 100 orders

● TRUTH: Now , Airbus is appealing directly to the public ahead of the Dubai Airshow , where the 777X is expected to dominate with more than 100 orders .

Example

● FR: « Nous sommes heureux que la FAA reconnaisse qu' une expérience passager agréable n' est pas incompatible avec la sécurité » , a déclaré Roger Dow , Président et directeur de l' U.S. Travel Association .

● GT: "We are pleased that the FAA recognizes that a pleasant passenger experience is not incompatible with safety," said Roger Dow, president and director of the US Travel Association

● LSTM: " We are pleased that the FAA recognizes that a pleasant passenger experience is not incompatible with security , " said Roger Dow , President and CEO of the U.S. Travel Association .

● TRUTH: " We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .

Example (meh)

● FR: Un groupe de l' industrie du voyage s' est félicité des changements , les appelant des arrangements pleins de bon sens pour des voyageurs équipés de technologies .

● GT: A group of travel industry welcomed the changes, calling for full arrangements of sense for travelers equipped technologies.

● LSTM: A travel industry group welcomed the changes , calling for common arrangements for travellers with technology .

● TRUTH: A travel industry group welcomed the changes , calling them common-sense accommodations for a traveling public now bristling with technology .

Example (nonsensical translation)

● FR: Toutefois , le comité a recommandé que la FAA autorise les pilotes à demander aux passagers d' éteindre leurs appareils pendant les atterrissages aux instruments dans des conditions de faible visibilité .

● GT: However, the committee recommended that the FAA allows pilots to ask passengers to turn off their devices during instrument landings in low visibility conditions.

● LSTM: However , the Committee recommended that the FAA allow pilots to ask passengers to stop their aircraft from landing in instruments under low visibility conditions .

● TRUTH: However , the committee recommended the FAA allow pilots to order passengers to shut off devices during instrument landings in low visibility .

Example

● FR: En 2011 , les dirigeants de la société ont chargé un avion de ligne de lecteurs de livres électroniques Kindle et l' ont fait voler pour voir s' il y avait des problèmes , mais il n' y en a eu aucun .

● GT: In 2011, company officials have charged an airliner of Kindle e-book readers and they flew to see if there were problems, but there were none.

● LSTM: In 2011 , the leaders of the company set up an electronic book readers ' book , and flew it to see if there were problems , but there were none .

● TRUTH: In 2011 , company officials loaded an airliner full of their Kindle e-readers and flew it around to test for problems but found none .

Example (inversion of meaning)

● FR: Amazon.com fait partie de ceux qui font pression pour un assouplissement des restrictions concernant l' utilisation des appareils électroniques par les passagers .

● GT: Amazon.com is among those pushing for a relaxation of restrictions on the use of electronic devices by passengers.

● LSTM: Amazon.com is one of those who lobby for restrictions on the use of electronic devices by passengers .

● TRUTH: Among those pressing for a relaxation of restrictions on passengers ' use of the devices has been Amazon.com.

Example

● FR: Les compagnies aériennes permettent à leurs passagers d' utiliser le Wi-Fi aux altitudes de croisière depuis plusieurs années .

● GT: Airlines allow their passengers to use Wi-Fi at cruising altitudes for several years.

● LSTM: The airlines allow their passengers to use the Wi-Fi at cruising altitudes for several years .

● TRUTH: Airlines have been offering Wi-Fi use at cruising altitudes to passengers for several years .

Example (forgetting a name)

● FR: « Une fois la nouvelle politique mise en œuvre en toute sécurité – et nous allons travailler en étroite collaboration avec les transporteurs pour cela – ce sera profitable à tout le monde » , a expliqué Glading dans une déclaration .

● GT: "Once the new policy implemented safely - and we will work closely with carriers for this - it will benefit everyone," said Glading in a statement.

● LSTM: " Once the new security policy is implemented - and we will work closely with the carriers in this way - it will be beneficial for everyone , " he said in a statement .

● TRUTH: " Once the new policy is safely implemented - and we 're going to work closely with the carrier to do that - it will be a win-win , " Glading said in a statement

Example

● FR: « En fonction du plan , nous pourrions approuver une utilisation élargie des appareils électroniques très prochainement » , a indiqué la FAA dans une déclaration .

● GT: "Depending on the plan, we could approve wider use of electronic devices in the near future," said the FAA in a statement.

● LSTM: " Depending on the plan , we could approve an extended use of electronic devices in the near future , " said FAA in a statement .

● TRUTH: " Depending on the condition of the plan , we could approve expanded use of electronic devices very soon , " the FAA said in a statement .

Example

● FR: Les compagnies aériennes devront prouver à la FAA que leurs avions respectent les nouvelles directives et qu' elles ont mis à jour les manuels de formation des membres d' équipage et les règles concernant le rangement des appareils pour être en conformité .

● GT: Airlines will have to prove to the FAA that the planes meet the new guidelines and they have updated training manuals for crew members and the rules for storage devices for compliance.

● LSTM: The airlines will have to prove to the FAA that their aircraft comply with the new guidelines and have updated the training manuals for crew members and the rules for the storage of aircraft in order to comply .

● TRUTH: Airlines will have to show the FAA how their airplanes meet the new guidelines and that they 've updating their flight crew training manuals and rules for stowing devices to reflect the new guidelines .

Example

● FR: FAA : Les passagers aériens peuvent désormais utiliser des gadgets à bord des avions ( mais pas passer un appel avec leur téléphone portable )

● GT: FAA: Air passengers can now use gadgets on planes (but not make a call on their cell phone)

● LSTM: FAA : Air passengers can now use gadgets on board ( but do not use a phone call ) .

● TRUTH: FAA : Air passengers can now use gadgets on planes ( but not make cell phone calls )

Remember, this model hasn’t converged

Results will be better by the end of the week

Weaknesses of the model

● Large vocabulary: we have a vector for each word in the vocabulary
○ So large vocabularies become expensive and require a lot of training

● Long sentences: the LSTM’s state has limited capacity

● Solutions:
○ Train on chunks, translate chunks in order (sketched below)
○ Rare-word problem: use a huge vocab and train the LSTM on a huge amount of data
■ And the rare word will become a frequent word
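A tiny sketch of the chunking idea (my own illustration; `translate` stands in for the trained model’s decoder, and the chunk limit mirrors the training-time sentence cap):

```python
def translate_long(sentence, translate, max_words=25):
    """Split a long sentence into chunks no longer than the training limit,
    translate each chunk, and rejoin the results in order."""
    words = sentence.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    return " ".join(translate(" ".join(chunk)) for chunk in chunks)
```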

Conclusion

● We showed that regular LSTMs can translate short sentences pretty well

● On short sentences and small vocabularies, our BLEU score is worse than the state of the art, but not by that much

● Our method applies to any sequence-to-sequence problem

● We will succeed.

In closing ...

● Deep learning theory is confirmed yet again
● MT will probably be solved soon
● Can now map sequences to sequences; no need to limit ourselves to vectors

● “If your deep net doesn’t work, train a bigger deeper net”

THE END!