
  • Neural Machine Translation: A Machine Learning Perspective

    Tie-Yan Liu

    Principal Researcher, Microsoft Research

    IEEE Fellow, ACM Distinguished Member

  • Neural Machine Translation

  • Neural Machine Translation

    Encoder: from input word sequence to intermediate context

    Decoder: from intermediate context to a distribution over output word sequences

    Various choices for implementing the encoder or decoder: FNN, CNN, RNN (a minimal sketch follows below)
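
    As a rough illustration of the encoder-decoder abstraction above, here is a minimal NumPy sketch with a vanilla-RNN encoder and a greedy decoder; the vocabulary, dimensions, shared weights, and random initialization are purely illustrative assumptions, not the models discussed in this talk.

    ```python
    # Minimal sketch of the encoder-decoder abstraction with a vanilla RNN
    # (toy sizes and random weights; not the setup used in the talk).
    import numpy as np

    rng = np.random.default_rng(0)
    V, H = 20, 8                      # toy vocabulary size and hidden size
    E = rng.normal(0, 0.1, (V, H))    # word embeddings
    W_enc = rng.normal(0, 0.1, (H, H))
    U_enc = rng.normal(0, 0.1, (H, H))
    W_out = rng.normal(0, 0.1, (H, V))

    def encode(src_ids):
        """Map the input word sequence to an intermediate context vector."""
        h = np.zeros(H)
        for i in src_ids:
            h = np.tanh(E[i] @ W_enc + h @ U_enc)
        return h                      # the "intermediate context"

    def decode(context, max_len=5):
        """Map the context to a distribution over output words, step by step."""
        h, out = context, []
        for _ in range(max_len):
            logits = h @ W_out
            p = np.exp(logits - logits.max()); p /= p.sum()  # softmax
            y = int(p.argmax())       # greedy choice from the distribution
            out.append(y)
            h = np.tanh(E[y] @ W_enc + h @ U_enc)  # feed the prediction back in
        return out

    print(decode(encode([3, 7, 1])))
    ```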

  • Neural Machine Translation

    • Example: RNN-based implementation

    • Attention mechanism: using a personalized context vector $c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$, where $\alpha_{tj}$ is the importance of $x_j$ to $y_t$

    (Bengio, ICLR 2015)
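
    A toy NumPy sketch of that context-vector computation follows; the dot-product scoring function and the sizes are illustrative assumptions (the cited model learns the alignment scores with a small network).

    ```python
    # Toy computation of the attention context c_t = sum_j alpha_tj * h_j,
    # where alpha_tj scores how important source state h_j is at output step t.
    import numpy as np

    rng = np.random.default_rng(1)
    T_x, H = 6, 8
    h = rng.normal(size=(T_x, H))      # encoder states h_1 .. h_Tx
    s_t = rng.normal(size=H)           # decoder state at output step t

    scores = h @ s_t                   # one relevance score per source position
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()               # alpha_tj sums to 1 over j
    c_t = alpha @ h                    # personalized context vector for step t

    print(alpha.round(3), c_t.shape)
    ```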

  • Fast Development of NMT - GNMT

    • RNN as encoder/decoder
      • Stacked LSTM-RNN (8 layers for encoder and decoder respectively)
      • Each layer is trained on a separate GPU for speed-up
    • Standard attention model
    • Residual connections for better gradient flow (a minimal sketch follows below)
    • Significant improvement over shallow models
      • 39.92 vs. 31.3 (Bengio, ICLR 2015) BLEU on En-Fr
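
    A minimal sketch of the residual idea mentioned above: each layer's output is added to its input, so deep stacks keep a direct gradient path. The toy dense layers below stand in for GNMT's LSTM layers and are an illustrative assumption.

    ```python
    # Sketch of residual connections in a deep stack: x = x + layer(x).
    import numpy as np

    rng = np.random.default_rng(2)
    H, n_layers = 8, 8
    weights = [rng.normal(0, 0.1, (H, H)) for _ in range(n_layers)]

    def stack(x, residual=True):
        for W in weights:
            out = np.tanh(x @ W)
            x = x + out if residual else out   # residual connection
        return x

    x = rng.normal(size=H)
    print(stack(x, residual=True)[:3], stack(x, residual=False)[:3])
    ```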

  • Fast Development of NMT – ConvS2S

    • CNN as encoder/decoder
      • Convolutional block structure
      • Gated linear unit + residual connection (a minimal sketch follows below)
      • 15 layers for encoder and decoder respectively
    • Multi-step attention
      • Separate attention mechanism for each decoding layer
    • Comparable to (slightly better than) RNN-based NMT models
      • 40.46 vs. 39.92 (GNMT) BLEU on En-Fr
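
    Below is a rough NumPy sketch of a gated-linear-unit block with a residual connection, the building block named above; the causal 1-D convolution, sizes, and random weights are illustrative assumptions rather than the actual ConvS2S implementation.

    ```python
    # Sketch of a gated linear unit (GLU) block: half of the conv output
    # gates the other half, and a residual connection is added on top.
    import numpy as np

    def glu_block(x, W, b):
        """x: (T, H) inputs; W: (k, H, 2H) conv filters; b: (2H,) bias."""
        k, H = W.shape[0], x.shape[1]
        pad = np.vstack([np.zeros((k - 1, H)), x])       # causal padding
        out = np.empty((x.shape[0], 2 * H))
        for t in range(x.shape[0]):
            window = pad[t:t + k]                        # (k, H) input window
            out[t] = np.einsum('kh,kho->o', window, W) + b
        a, g = out[:, :H], out[:, H:]
        return x + a * (1 / (1 + np.exp(-g)))            # GLU gate + residual

    rng = np.random.default_rng(3)
    x = rng.normal(size=(5, 4))
    W = rng.normal(0, 0.1, (3, 4, 8))
    b = np.zeros(8)
    print(glu_block(x, W, b).shape)   # (5, 4)
    ```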

  • Fast Development of NMT - Transformer

    • FNN as encoder/decoder
      • 6 layers (each with two sub-layers) for encoder and decoder respectively
    • Relying entirely on attention (including multi-head self-attention) to draw global dependencies between input and output (a minimal sketch follows below)
    • Comparable to (slightly better than) RNN-based and CNN-based NMT models
      • 41.0 vs. 40.46 (ConvS2S) vs. 39.92 (GNMT) BLEU on En-Fr
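
    The core operation behind that claim is (multi-head) self-attention; a single-head, scaled dot-product sketch in NumPy is given below, with toy sizes and random weights as illustrative assumptions.

    ```python
    # Scaled dot-product self-attention: every position attends to every other,
    # so dependencies are drawn globally without recurrence or convolution.
    import numpy as np

    rng = np.random.default_rng(4)
    T, d = 5, 8
    X = rng.normal(size=(T, d))                  # one sentence of T positions
    Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                # (T, T) pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ V                            # each position mixes all others
    print(out.shape)                             # (5, 8)
    ```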

  • Fast Development of NMT - Summary

    | Algorithm | Framework | Architecture | #layers (encoder-decoder) | En->Fr (36M pairs) BLEU | En->Fr training cost | En->De (4.5M pairs) BLEU | En->De training cost |
    |---|---|---|---|---|---|---|---|
    | Bengio, ICLR 2015 | Theano (open source) | GRU-RNN | 1-1 | 31.3 | - | - | - |
    | GNMT | TensorFlow (no code) | LSTM-RNN | 8-8 | 39.92 | 96 K80, 6 days | 24.6 | - |
    | Transformer | TensorFlow (open source) | FNN + attention | 12-12 | 41.0 | 8 P100, 4.5 days | 28.4 | 8 P100, 3.5 days |
    | ConvS2S | Torch (open source) | CNN | 15-15 | 40.46 | 8 M40, 37 days | 25.16 | 1 M40, 18.5 days |

  • What’s Done?

    • These works verified the strong representation power of deep neural networks:
      • No matter FNN, CNN, or RNN, all can be used to fit bilingual training data and achieve good translation performance when sufficiently large training data are given.
    • However, this is not surprising at all
      • It was already indicated by the universal approximation theorem, decades ago.

  • What’s Missing?

    • Many unique challenges of machine translation have not been addressed
      • Relying on a huge amount of bilingual training data
      • Relying on myopic beam search during inference
      • Using likelihood maximization for both training and inference, which differs from the true evaluation measure (BLEU)
      • …

  • Leveraging Reinforcement Learning to Tackle these Challenges

    Dual learning
    • Leveraging the symmetric structure of machine translation to enable effective learning from monolingual data through reinforcement learning.

    Predictive inference
    • Using end-to-end BLEU as a delayed reward to train value networks
    • Using value networks to guide forward-looking search along the decoding tree (a minimal sketch follows below)
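
    As a sketch of that last idea (the slide does not detail the value network, so `value_net` below is a hypothetical placeholder), partial hypotheses in the decoding tree could be ranked by mixing their log-probability so far with a learned estimate of the eventual BLEU:

    ```python
    import math

    def value_net(partial_hypothesis):
        """Hypothetical stand-in: a real value network would be trained with
        end-to-end BLEU as a delayed reward and predict the expected final BLEU."""
        return 0.0

    def hypothesis_score(log_prob_so_far, partial_hypothesis, beta=0.5):
        """Mix the myopic likelihood with the forward-looking value estimate."""
        return beta * log_prob_so_far + (1 - beta) * value_net(partial_hypothesis)

    print(hypothesis_score(math.log(0.2), ["the", "cat"]))
    ```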

  • Dual Learning for NMT (NIPS 2016, IJCAI 2017, ICML 2017)

  • Traditional Solutions to Insufficient Training Data

    Label propagation, transductive learning, multi-task learning, transfer learning

  • A New View: The Beauty of Symmetry

    • Symmetry is almost everywhere in our world, and also in machine translation!

    Hello! <-> 你好！ ("Hello!" in Chinese)

  • Dual Learning

    • A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning process, especially when lacking labeled training data.

  • Dual Learning for Machine Translation

    The closed loop: an English sentence $x$ is translated by the primal task $f: x \to y$ (En->Ch translation) into a Chinese sentence $y = f(x)$; the dual task $g: y \to x$ (Ch->En translation) then maps it back to a new English sentence $x' = g(y)$. In the diagram, each translation model is an agent and the rest of the loop is its environment.

    Feedback signals during the loop:
    • $R(x, x'; f, g)$: BLEU of $x'$ given $x$;
    • $L(y; f)$, $L(x'; g)$: likelihood and syntactic correctness of $y$ and $x'$;
    • $R(x, y; f)$, $R(y, x'; g)$: dictionary-based translation correspondence, etc.

    Policy gradient is used to improve both primal and dual models according to the feedback signals (a toy sketch of the loop follows below).

    (NIPS 2016)
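
    A toy, runnable sketch of this loop is given below. The word-for-word "translators", the reconstruction reward, and the way the reward reduces sampling noise are all illustrative assumptions; only the overall shape (closed En->Ch->En loop on monolingual data, with a shared feedback signal improving both agents) follows the slide.

    ```python
    # Structural sketch of the dual-learning loop on monolingual English data.
    import random

    EN2CH = {"hello": "你好", "world": "世界"}
    CH2EN = {v: k for k, v in EN2CH.items()}

    class ToyTranslator:
        """Word-for-word translator that sometimes samples a wrong word."""
        def __init__(self, table):
            self.table = table
            self.noise = 0.3                      # exploration / error rate
        def sample(self, words):
            vocab = list(self.table.values())
            return [self.table[w] if random.random() > self.noise
                    else random.choice(vocab) for w in words]
        def reinforce(self, reward, lr=0.1):
            # Stand-in for a policy-gradient update: good rewards reduce noise.
            self.noise = max(0.0, self.noise - lr * reward)

    def reconstruction_reward(x, x_back):
        """Crude stand-in for BLEU(x, x'): fraction of words reconstructed."""
        return sum(a == b for a, b in zip(x, x_back)) / len(x)

    primal, dual = ToyTranslator(EN2CH), ToyTranslator(CH2EN)
    for _ in range(100):                          # monolingual English sentences
        x = ["hello", "world"]
        y = primal.sample(x)                      # primal task f: x -> y
        x_back = dual.sample(y)                   # dual task g: y -> x'
        r = reconstruction_reward(x, x_back)      # feedback from the closed loop
        primal.reinforce(r)                       # both agents improve from r
        dual.reinforce(r)
    print(primal.noise, dual.noise)
    ```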

  • Experimental Setting

    • Baseline:
      • State-of-the-art NMT model, trained using 100% of the bilingual data
      • "Neural Machine Translation by Jointly Learning to Align and Translate", by Bengio's group (ICLR 2015)
    • Our algorithm:
      • Step 1: Initialization
        • Start from a weak NMT model learned from only 10% of the training data
      • Step 2: Dual learning
        • Use the policy gradient algorithm to update the dual models based on monolingual data

    [Bar chart: BLEU score, French->English, comparing NMT with 10% bilingual data (↓ 5.0), dual learning with 10% bilingual data (↑ 0.3), and NMT with 100% bilingual data.]

    Starting from initial models obtained from only 10% bilingual data, dual learning can achieve similar accuracy as the NMT model learned from 100% bilingual data!

  • Probabilistic Nature

    • The primal-dual structure implies strong probabilistic connections between the two tasks.

    • This can also be used to improve supervised learning, and perhaps even inference
      • Structural regularizer to enhance supervised learning
      • Additional criterion to improve inference

    𝑃 π‘₯, 𝑦 = 𝑃 π‘₯ 𝑃 𝑦 π‘₯; 𝑓 = 𝑃 𝑦 𝑃 π‘₯ 𝑦; 𝑔

    Primal View Dual View

  • β€œDual” Supervised Learning

    The same loop on labeled data: labeled data $x$ is mapped by the primal task $f: x \to y$ to a predicted label $y = f(x)$, and the dual task $g: y \to x$ maps it back to reconstructed data $x' = g(y)$.

    Feedback signal during the loop:
    • $R(x; f, g) = |P(x)\,P(y \mid x; f) - P(y)\,P(x \mid y; g)|$: the gap between the joint probability $P(x, y)$ obtained in the two directions

    Training objectives (a toy numeric sketch follows below):
    • $\max \log P(y \mid x; f)$
    • $\max \log P(x \mid y; g)$
    • $\min |P(x)\,P(y \mid x; f) - P(y)\,P(x \mid y; g)|$

    (ICML 2017)
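
    A small numeric sketch of this duality signal follows; the probabilities and the weight lambda are made-up toy numbers, and the marginals $P(x)$, $P(y)$ would in practice be estimated, e.g., by language models.

    ```python
    # Toy illustration of the duality gap used as a training penalty.
    import numpy as np

    p_x, p_y = 0.02, 0.03            # marginals P(x), P(y) (toy numbers)
    p_y_given_x = 0.40               # primal model f
    p_x_given_y = 0.25               # dual model g

    # Gap between the two factorizations of the joint probability P(x, y).
    duality_gap = abs(p_x * p_y_given_x - p_y * p_x_given_y)

    # Hypothetical combined objectives: maximize likelihood, minimize the gap.
    lam = 0.1                        # weight of the duality penalty (assumption)
    primal_loss = -np.log(p_y_given_x) + lam * duality_gap
    dual_loss = -np.log(p_x_given_y) + lam * duality_gap
    print(duality_gap, primal_loss, dual_loss)
    ```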

  • Experimental Results

    Theoretical analysis:
    • Dual supervised learning generalizes better than standard supervised learning, with the two models constrained to the product space satisfying probabilistic duality: $P(x)\,P(y \mid x; f) = P(y)\,P(x \mid y; g)$

    BLEU improvement of dual learning over the NMT baseline:

    | En->Fr | Fr->En | En->De | De->En |
    |---|---|---|---|
    | +2.1 | +0.9 | +1.4 | +0.1 |

  • β€œDual” Inference

    The same loop at test time: test data $x$ is mapped by the primal task $f: x \to y$ to a predicted label $y = f(x)$, and the dual task $g: y \to x$ produces reconstructed data $x' = g(y)$.

    By Bayes' rule, $P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$, so the dual model also provides a score for each candidate output.

    • Standard inference: choose the $y$ that maximizes $P(y \mid x; f)$
    • Dual inference: choose the $y$ that maximizes both $P(y \mid x; f)$ and $P(x \mid y; g)\,P(y)/P(x)$ (a toy reranking sketch follows below)
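
    A minimal sketch of dual inference as candidate reranking closes the section; the candidate scores, log P(x), and the mixing weight below are made-up toy numbers.

    ```python
    # Rerank candidates by combining the primal score P(y|x; f) with the
    # dual view P(x|y; g) * P(y) / P(x) obtained via Bayes' rule.
    # (candidate, log P(y|x; f), log P(x|y; g), log P(y)) for one source x
    candidates = [
        ("translation A", -1.2, -3.5, -8.0),
        ("translation B", -1.4, -2.0, -7.5),
        ("translation C", -2.0, -2.2, -7.8),
    ]
    log_p_x = -7.9                       # log P(x); constant across candidates
    alpha = 0.5                          # weight between primal and dual views

    def dual_score(lp_y_given_x, lp_x_given_y, lp_y):
        dual_view = lp_x_given_y + lp_y - log_p_x     # log P(y|x) via Bayes
        return alpha * lp_y_given_x + (1 - alpha) * dual_view

    best = max(candidates, key=lambda c: dual_score(*c[1:]))
    print(best[0])
    ```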