Deep Learning and Natural Language Processing: A Review and Outlook
Hang Li, Bytedance AI Lab


Page 1:

Deep Learning and Natural Language Processing: A Review and Outlook

Hang Li
Bytedance AI Lab

Page 2:

Two Questions

• Why does Deep Learning work very well for Natural Language Processing?

• What will be next beyond Deep Learning for Natural Language Processing?

• This talk tries to answer these two questions

Page 3:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 4:

Damasio’s Hypothesis

Having a mind means that an organism forms neural representations which can become images, be manipulated in a process called thought.

Antonio Damasio

Page 5:

Embodied Simulation Hypothesis

• Question: "Does a gorilla have a nose?"
• To answer this question, one evokes the image of a gorilla in consciousness
• Concepts are stored in memory as associated visual, auditory, and motor images
• Language understanding is simulation on the basis of images of related concepts

Page 6:

Thinking and Neural Processing

[Diagram: Brain/Mind — neural representations in sub-consciousness give rise to images in consciousness]

Thinking is neural processing in sub-consciousness that generates images in consciousness.

Page 7:

Symbols for Humans

• Words and symbols are based on topographically organized representations and are images as well

• Mathematicians and physicists describe their thinking as dominated by images

Page 8:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 9:

Advantages
• Function Approximation
• Sample Efficiency
• Generalization

Disadvantages
• Robustness
• "Appropriateness"

Page 10:

Function Approximation

• Universal function approximation theorem:
  – For any continuous function $F: [0,1]^n \to \mathbb{R}$ and any $\varepsilon > 0$, there exists

    $f(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(\boldsymbol{w}_i \cdot \boldsymbol{x} + b_i) = \sum_{i=1}^{N} \alpha_i \, \sigma\Big(\sum_{j=1}^{n} w_{ij} x_j + b_i\Big)$

    such that $|F(x) - f(x)| < \varepsilon$ holds for all $x \in [0,1]^n$

Cybenko 1989
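As a rough illustration of the theorem (a sketch of my own, not part of the original slides), the snippet below fits a single-hidden-layer sigmoid network of the form $f(x) = \sum_i \alpha_i \sigma(w_i x + b_i)$ to a continuous target on [0, 1]; the target function, network width, and library are illustrative choices only.

```python
# A minimal sketch: one hidden layer of sigmoid units approximating a continuous F on [0, 1].
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(0.0, 1.0, 400).reshape(-1, 1)
F = np.sin(2 * np.pi * x).ravel()            # the continuous target F(x)

# The hidden layer plays the role of sum_i alpha_i * sigma(w_i * x + b_i).
f = MLPRegressor(hidden_layer_sizes=(200,), activation="logistic",
                 solver="lbfgs", max_iter=5000, random_state=0)
f.fit(x, F)

print("max |F(x) - f(x)| on the grid:", np.max(np.abs(F - f.predict(x))))
```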

Page 11:

Sample Efficiency

• Theorem:
  – There exist Boolean functions computable by a polynomial-size logic-gate circuit of depth 𝑘 that require exponential size when restricted to depth 𝑘 − 1

• Deep networks have better sample efficiency than shallow networks

Hastad 1986

Page 12:

Generalization

• Deep neural networks exhibit remarkable generalization ability
• Findings
  – Deep neural networks easily fit random labels
  – Explicit regularizers like dropout and weight decay may not be essential for generalization
  – SGD may act as an implicit regularizer

Zhang et al. 2017
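The first finding is easy to reproduce in miniature. A minimal sketch (my own, using a small over-parameterized MLP on synthetic data, not the image-classification setup of Zhang et al.):

```python
# An over-parameterized network can fit labels that carry no information about the inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 20)                        # random inputs
y = rng.randint(0, 2, size=500)               # random labels, independent of X

net = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy on random labels:", net.score(X, y))   # typically close to 1.0
```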

Page 13:

Generalization

• Theorem:
  – Two-layer over-parameterized ReLU neural network for multi-class classification
  – Stochastic gradient descent (SGD) from random initialization
  – If the data is from mixtures of well-separated distributions
  – Then SGD learns a network with small generalization error

Li & Liang 2018

Page 14:

No Free Lunch Theorem

• Theorem
  – Let ℱ be the set of all possible functions 𝑦 = 𝑓(𝑥)
  – Given any distribution 𝒟 on (𝑥, 𝑦) and training data set 𝒮
  – For any learner 𝐿, $\frac{1}{|\mathcal{F}|} \sum_{f \in \mathcal{F}} \mathrm{Acc}(L) = \frac{1}{2}$ holds, where $\mathrm{Acc}$ is generalization accuracy
• Corollary
  – For any two learners $L_1$ and $L_2$: if there exists a function such that $\mathrm{Acc}(L_1) > \mathrm{Acc}(L_2)$, then there exists a function such that $\mathrm{Acc}(L_2) > \mathrm{Acc}(L_1)$

Wolpert & Macready 1997

Page 15:

Robustness

• Adversarial robustness:

  $\min_{\theta} \; \mathbb{E}_{x} \, \max_{\|x' - x\| \le \varepsilon} L(\theta, x')$

• Theorem:
  – If the data distribution of binary classification is two Gaussians
  – Then the sample complexity of robust generalization is significantly larger than that of standard generalization

• More training data is needed for robust classification

Schmidt et al. 2018
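To make the inner maximization concrete, here is a hedged sketch (my own illustration, not from the slides or from Schmidt et al.) of a one-step, FGSM-style approximation of $\max_{\|x'-x\| \le \varepsilon} L(\theta, x')$ for a toy logistic-regression model:

```python
import numpy as np

def loss(theta, x, y):
    """Logistic loss for one example with label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturb(theta, x, y, eps=0.1):
    """One gradient-sign step approximating argmax_{|x'-x|_inf <= eps} L(theta, x')."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    grad_x = (p - y) * theta                  # dL/dx for the logistic loss
    return x + eps * np.sign(grad_x)

theta = np.array([1.0, -2.0, 0.5])            # a toy "trained" model
x, y = np.array([0.3, 0.1, -0.2]), 1

x_adv = fgsm_perturb(theta, x, y)
print("clean loss:", loss(theta, x, y), "adversarial loss:", loss(theta, x_adv, y))
```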

Page 16:

Bengio’s Comment

What can we conclude about the failure of our current systems? I would say the strongest thing I see is that they are learning in a way that exploits superficial clues that help to do the task they are asked to do. But often these are not the clues humans would consider to be the most important.

Yoshua Bengio

Page 17:

Appropriateness

• The learned representation might not be appropriate, due to
  – Data bias
  – Model bias
  – Training bias
• Deep networks may exhibit pathological behavior

[Image: "Male or female?"]

Page 18:

Interpretability

• Should neural networks have interpretability?
• It depends on the application, e.g., health care, finance
• We are not consciously aware of how our own minds process information

Page 19:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 20:

Natural Language Processing Problems

• Classification: 𝑥 → 𝑐
• Matching: (𝑥, 𝑦) → ℛ
• Sequence-to-Sequence: 𝑥 → 𝑦
• Structured Prediction: 𝑥 → [𝑥]
• Sequential Decision Process: 𝜋: 𝑠 → 𝑎

Li 2017
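Read as function signatures, the five problem types look roughly as follows (a hypothetical sketch of my own; the type names are placeholders, not an API from Li 2017):

```python
from typing import Callable, List

Text = List[str]                       # a sentence as a sequence of tokens
Label, State, Action = str, dict, str  # placeholder types, for illustration only

Classification       = Callable[[Text], Label]        # x -> c
Matching             = Callable[[Text, Text], float]  # (x, y) -> R
SequenceToSequence   = Callable[[Text], Text]         # x -> y
StructuredPrediction = Callable[[Text], List[Label]]  # x -> a structure over x
Policy               = Callable[[State], Action]      # pi: s -> a
```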

Page 21:

Natural Language Problems

• Classification
  – Text classification
  – Sentiment analysis
• Matching
  – Search
  – Question answering
  – Single-turn dialogue (retrieval)
• Sequence to Sequence
  – Machine translation
  – Summarization
  – Single-turn dialogue (generation)
• Structured Prediction
  – Sequential labeling
  – Semantic parsing
• Sequential Decision Process
  – Multi-turn dialogue

Page 22:

Deep Learning for Natural Language Processing

[Diagram: a neural network maps input 𝑥 to output 𝑦]

$\max_{\theta} P_{\theta}(y \mid x) \qquad y = f(x)$

A game of mimicking human behaviors using neural processing tools

Page 23:

Neural Processing Techniques

• Models
  – Feedforward Neural Network
  – Convolutional Neural Network
  – Recurrent Neural Network
  – Sequence-to-Sequence Model
  – Attention
  – …
• Input: word embedding
• Output: softmax function
• Loss function: cross entropy
• Learning algorithm: stochastic gradient descent
• Regularization, e.g., dropout
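A minimal sketch (my own, with made-up sizes) that wires these ingredients together for text classification: word embeddings as input, a softmax output realized through the cross-entropy loss, SGD as the learning algorithm, and dropout as regularization:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # input: word embedding
        self.dropout = nn.Dropout(p=0.5)                       # regularization: dropout
        self.linear = nn.Linear(embed_dim, num_classes)        # logits; softmax sits inside the loss

    def forward(self, token_ids):
        x = self.embedding(token_ids).mean(dim=1)   # average the word embeddings of a sentence
        return self.linear(self.dropout(x))

model = TextClassifier()
loss_fn = nn.CrossEntropyLoss()                     # loss: cross entropy (with implicit softmax)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning: stochastic gradient descent

tokens = torch.randint(0, 10000, (32, 20))          # a fake batch: 32 sentences, 20 tokens each
labels = torch.randint(0, 2, (32,))
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()
```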

Page 24:

McAllester’s Argument

Progress toward AI is coming from advances in general purpose (differentiable) programming language features, including residual connections, gating, attention, GAN, VAE.

David McAllester

Page 25:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 26:

Attention Is All You Need?

• Attention (including self-attention) is a powerful mechanism, used in many models including the Transformer
• Attention = 'soft' association mechanism, cf. associative memory

[Diagram: a 𝑞𝑢𝑒𝑟𝑦 is matched against stored pairs (𝑘𝑒𝑦₁, 𝑣𝑎𝑙𝑢𝑒₁), (𝑘𝑒𝑦₂, 𝑣𝑎𝑙𝑢𝑒₂), …, (𝑘𝑒𝑦ₙ, 𝑣𝑎𝑙𝑢𝑒ₙ)]

$\mathit{value} = \sum_{i=1}^{n} \pi(\mathit{query}, \mathit{key}_i) \cdot \mathit{value}_i$
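A few lines of NumPy make the 'soft association' reading concrete (a sketch of my own; the compatibility function π is chosen here as a softmax over dot products, as in scaled dot-product attention):

```python
import numpy as np

def attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v) -> weighted sum of the values."""
    scores = keys @ query                      # compatibility of the query with each key
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # pi(query, key_i), a soft association
    return weights @ values                    # sum_i pi(query, key_i) * value_i

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
query = keys[2] + 0.1 * rng.normal(size=8)     # a query close to the third key
print(attention(query, keys, values))          # output is dominated by values[2]
```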

Page 27:

Sequence-to-Sequence Model: Transformer

• Encoder + decoder
• Multi-head attention
• Multi-layer encoder and decoder
• Three types of attention
• Parallel processing
• Position embedding
• BERT: Transformer encoder

Figure 1: The Transformer - model architecture.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
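A rough sketch of one encoder layer as described above (my own paraphrase, not the reference implementation): self-attention and a position-wise feed-forward network, each wrapped as LayerNorm(x + Sublayer(x)), with d_model = 512:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)        # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)                 # LayerNorm(x + Sublayer(x))
        x = self.norm2(x + self.ffn(x))              # same pattern for the feed-forward sub-layer
        return x

layer = EncoderLayer()
print(layer(torch.randn(2, 10, 512)).shape)          # (batch, sequence length, d_model)
```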

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.


Page 28:

Transformer Builds Hierarchical Sentence Representation with Attention

Page 29:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 30:

New Opportunities in the Future

• Language Generation
• Multimodal Processing
• Prior Representation Learning
• Neural symbolic processing?

Page 31:

Prior Representation Learning

• BERT: prior language representation learning
• Enhanced state-of-the-art models in many tasks

[Diagram: the prediction for input 𝑥 also uses a representation ℎ(𝑥′) learned beforehand from other text 𝑥′]

𝑦 = 𝑓(𝑥, ℎ(𝑥′))
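A hedged sketch of 𝑦 = 𝑓(𝑥, ℎ(𝑥′)): the representation ℎ is learned beforehand on unlabeled text 𝑥′ and then reused, frozen, inside a task model. The encoder below is a hypothetical stand-in for a real pre-trained model such as BERT:

```python
import torch
import torch.nn as nn

class StandInPretrainedEncoder(nn.Module):
    """Hypothetical placeholder for h(.): maps token ids to a 768-d sentence vector."""
    def __init__(self, vocab_size=10000, dim=768):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)    # mean of token embeddings

    def forward(self, token_ids):
        return self.emb(token_ids)

class PriorRepresentationModel(nn.Module):
    def __init__(self, encoder, dim=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                    # keep the prior representation fixed
        self.head = nn.Linear(dim, num_classes)        # task-specific layer trained on labeled data

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))      # y = f(x, h(x'))

model = PriorRepresentationModel(StandInPretrainedEncoder())
tokens = torch.randint(0, 10000, (8, 16))              # 8 sentences, 16 tokens each
print(model(tokens).shape)                             # (8, num_classes)
```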

Page 32:

Neural Symbolic Processing

[Diagram: the prediction for input 𝑥 also uses an encoding 𝑔(𝑠) of structured knowledge 𝑠]

𝑦 = 𝑓(𝑥, 𝑔(𝑠))

• Incorporate knowledge (structured symbols) into neural networks

• Still very challenging
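One hedged way to realize 𝑦 = 𝑓(𝑥, 𝑔(𝑠)) (my own illustration, not a method from the talk): encode knowledge-graph triples 𝑠 with simple entity and relation embeddings and concatenate 𝑔(𝑠) with the text representation of 𝑥:

```python
import torch
import torch.nn as nn

class NeuralSymbolicClassifier(nn.Module):
    def __init__(self, vocab_size=10000, num_entities=500, num_relations=20,
                 dim=64, num_classes=2):
        super().__init__()
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)      # crude text representation of x
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.relation_emb = nn.Embedding(num_relations, dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def encode_knowledge(self, triples):
        """g(s): encode (head entity, relation, tail entity) triples into one knowledge vector."""
        h, r, t = triples[..., 0], triples[..., 1], triples[..., 2]
        return (self.entity_emb(h) + self.relation_emb(r) + self.entity_emb(t)).mean(dim=1)

    def forward(self, token_ids, triples):
        x = self.text_encoder(token_ids)
        return self.head(torch.cat([x, self.encode_knowledge(triples)], dim=-1))  # y = f(x, g(s))

model = NeuralSymbolicClassifier()
tokens = torch.randint(0, 10000, (4, 12))                          # 4 texts, 12 tokens each
heads = torch.randint(0, 500, (4, 3))                              # 3 knowledge triples per text
relations = torch.randint(0, 20, (4, 3))
tails = torch.randint(0, 500, (4, 3))
triples = torch.stack([heads, relations, tails], dim=-1)           # shape (4, 3, 3)
print(model(tokens, triples).shape)                                # (4, num_classes)
```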

Page 33:

Hinton's Comment

• Combining neural processing and symbolic processing is just like combining electric cars and gasoline cars (paraphrase)
• Personal communication, NeurIPS 2018

Geoffrey Hinton

Page 34:

Interpretability

[Diagram: structured symbols 𝑠 are extracted from a learned neural mapping from 𝑥 to 𝑦]

𝑦 = 𝑓(𝑥)

• Extract knowledge (structured symbols) from neural networks

• Another neural symbolic processing problem
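A hedged sketch of one such extraction (my own illustration, not a method from the talk): distill a trained network into a shallow decision tree, whose branches can be read as explicit symbolic rules:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)          # a toy ground-truth concept

net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0).fit(X, y)

# Distill the network into a small tree: the tree approximates f(x) with explicit rules.
tree = DecisionTreeClassifier(max_depth=3).fit(X, net.predict(X))
print(export_text(tree, feature_names=["x0", "x1", "x2", "x3"]))
```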

Page 35:

Outline

• Human Language Processing
• Deep Learning
• Deep Learning for Natural Language Processing
• Attention: Soft Association Mechanism
• Future of Natural Language Processing
• Summary

Page 36:

Summary

• Human language processing is neural processing
• Advantages of DL: function approximation, sample efficiency, generalization
• Disadvantages of DL: robustness and appropriateness
• DL for NLP is a game of mimicking human behaviors using neural processing tools
• Attention: soft association mechanism
• Future directions include generation, multimodality, and prior representation learning
• Neural symbolic processing is a challenging yet important problem

Page 37:

References

• Hang Li. Intelligence and Computing (智能与计算). Communications of the CCF (计算机学会通讯), No. 1, 2019.
• Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding Deep Learning Requires Rethinking Generalization, 2017.
• Yuanzhi Li and Yingyu Liang. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, 2018.
• Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Mądry. Adversarially Robust Generalization Requires More Data, 2018.
• Hang Li. Deep Learning for Natural Language Processing. National Science Review, Perspective, 2017.
• David McAllester. Universality in Deep Learning and Models of Computation. Second Workshop on Symbolic-Neural Learning, Nagoya, 2018.

Page 38:

Thank you!

[email protected]