Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
Introduction to Machine Learning
What is a Language Model
o A statistical language model is a probability distribution over sequences of words.
o Given such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence.
o Main problem: data sparsity.
[Slide from https://en.wikipedia.org/wiki/Language_model.]
Unigram model: Bag of words
o General probability distribution: P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) ... P(w_m | w_1, ..., w_{m-1})
o Unigram model assumption: P(w_1, ..., w_m) ≈ P(w_1) P(w_2) ... P(w_m)
o Essentially, a bag-of-words model
o Estimation of unigram parameters: count word frequencies in the document (see the counting sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
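A minimal Python sketch of the counting estimate, on a tiny made-up corpus (not from the slides): the unigram probability of a word is its relative frequency, and a sentence's probability is the product of its word probabilities.

# Unigram (bag-of-words) estimation by counting; the toy corpus is hypothetical.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)                      # word frequencies
total = sum(counts.values())                  # total number of tokens
unigram_prob = {w: c / total for w, c in counts.items()}

def sentence_prob(words):
    # multiply per-word probabilities, ignoring word order
    p = 1.0
    for w in words:
        p *= unigram_prob.get(w, 0.0)         # unseen words get probability 0: the sparsity problem
    return p

print(sentence_prob("the cat sat".split()))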
n-gram model
o n-gram assumption: each word depends only on the previous n-1 words, P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
o Estimation of n-gram parameters: P(w_i | w_{i-n+1}, ..., w_{i-1}) = count(w_{i-n+1}, ..., w_i) / count(w_{i-n+1}, ..., w_{i-1}) (see the bigram sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
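A minimal Python sketch of the counting estimate for n = 2 (bigrams), again on a made-up toy corpus; this is not from the slides.

# Bigram estimation: P(w_i | w_{i-1}) ≈ count(w_{i-1}, w_i) / count(w_{i-1}).
from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
history_counts = Counter(tokens[:-1])              # counts of the conditioning (previous) word

def bigram_prob(prev, word):
    if history_counts[prev] == 0:
        return 0.0                                 # unseen history: data sparsity again, worse as n grows
    return bigram_counts[(prev, word)] / history_counts[prev]

print(bigram_prob("the", "cat"))                   # 2/3 in this toy corpus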
Word2Vec
o Word2Vec: learns distributed representations of words
o Continuous bag-of-words (CBOW)
  o Predicts the current word from a window of surrounding context words
o Continuous skip-gram
  o Uses the current word to predict the surrounding window of context words (see the training-pair sketch below)
  o Slower, but does a better job for infrequent words
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
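A minimal Python sketch, on a hypothetical five-word sentence, of how the two variants carve training examples out of the same window (not from the slides):

# CBOW predicts the center word from its context; skip-gram predicts each context word from the center.
sentence = "the quick brown fox jumps".split()
window = 2

cbow_examples, skipgram_examples = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_examples.append((context, center))            # (context words) -> center word
    for c in context:
        skipgram_examples.append((center, c))          # center word -> one context word

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]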
Skip-gram Word2Vec
o All words in the vocabulary
o Parameters of the skip-gram word2vec model:
  o A word (center) embedding for each word
  o A context embedding for each word
o Assumption: the probability of a context word given the center word is a softmax over their embeddings (see the formula below)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
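The standard skip-gram form of that assumption, written with my own (but conventional) notation v_w for the word embedding and u_c for the context embedding:

p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}

Training maximizes \sum_{(w,c) \in D} \log p(c \mid w) over all observed (center, context) pairs D; in practice the normalizing sum over the vocabulary V is approximated, e.g. with negative sampling or a hierarchical softmax.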
Distributed Representations of Words
o The trained word-embedding parameters of the skip-gram word2vec model are the distributed representations
o Semantics are reflected in distances and directions of the embedding space (see the analogy sketch below)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
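A minimal Python sketch of the "semantics in embedding space" idea via the classic king - man + woman analogy; the 3-dimensional vectors are made up for illustration (real word2vec embeddings have hundreds of dimensions), so this is a toy, not from the slides.

import numpy as np

emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.1]),
    "woman": np.array([0.2, 0.2, 0.6]),
}

def nearest(vec, exclude):
    # cosine similarity against every stored embedding, skipping the query words
    return max((w for w in emb if w not in exclude),
               key=lambda w: np.dot(vec, emb[w]) / (np.linalg.norm(vec) * np.linalg.norm(emb[w])))

query = emb["king"] - emb["man"] + emb["woman"]            # analogy by vector arithmetic
print(nearest(query, exclude={"king", "man", "woman"}))    # -> "queen"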
Word Embeddings in Transfer Learning
o Transfer learning:
  o Labeled data are limited
  o Unlabeled text corpora are enormous
  o Pretrained word embeddings can therefore be transferred to other supervised tasks (see the sketch below), e.g., POS tagging, NER, QA, MT, sentiment classification
[Slide from Matt Gormley.]
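A minimal PyTorch sketch of the transfer step; the framework, the averaging classifier, and the random stand-in matrix are my choices for illustration, not part of the lecture.

import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10000, 300, 2
pretrained = torch.randn(vocab_size, emb_dim)      # stand-in for word2vec/GloVe vectors loaded from disk

class SentimentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # freeze=True keeps the transferred embeddings fixed; set False to fine-tune them
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        vectors = self.embedding(token_ids)        # (batch, seq_len, emb_dim)
        pooled = vectors.mean(dim=1)               # average the word vectors (bag of embeddings)
        return self.fc(pooled)

logits = SentimentClassifier()(torch.randint(0, vocab_size, (4, 12)))
print(logits.shape)                                # torch.Size([4, 2])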
SOTA Language Models: ELMo
o Embeddings from Language Models (ELMo)
o Fits the full conditional probability in the forward direction: p(t_k | t_1, ..., t_{k-1})
o Fits the full conditional probabilities in both directions using LSTMs: forward p(t_k | t_1, ..., t_{k-1}) and backward p(t_k | t_{k+1}, ..., t_N) (see the joint objective below)
[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]
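The joint objective from the cited ELMo paper is, roughly,

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s)
                   + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big),

where the token-embedding parameters \Theta_x and the softmax parameters \Theta_s are shared between the two directions, while each direction has its own LSTM parameters.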
SOTA Language Models: OpenAI GPT & BERT
o Uses a transformer rather than an LSTM to model language
o OpenAI GPT: single direction (left-to-right)
o BERT: bi-directional (the masking sketch below contrasts the two)
[Slide from https://arxiv.org/abs/1810.04805.]
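One way to see the difference is in the attention masks; a minimal numpy sketch (my illustration, not from the slides) for a five-token input:

import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len)))  # GPT-style: token i may attend only to positions <= i
full_mask = np.ones((seq_len, seq_len))             # BERT-style: every token attends to every position

print(causal_mask)
print(full_mask)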
SOTA Language Models: BERT
o Additional language-modeling task: predict whether two sentences come from the same paragraph (next-sentence prediction; a toy construction of such sentence pairs follows below).
[Slide from https://arxiv.org/abs/1810.04805.]
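A minimal Python sketch of how such sentence-pair examples can be assembled (a simplification of BERT's actual data pipeline, with a made-up three-sentence corpus):

import random

sentences = ["the cat sat on the mat .", "it fell asleep quickly .", "stocks rose on friday ."]

def make_pair(i):
    # half the time keep the true next sentence (label 1), half the time substitute a random one (label 0)
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], 1
    other = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
    return sentences[i], other, 0

a, b, label = make_pair(0)
print("[CLS] " + a + " [SEP] " + b + " [SEP]", "label:", label)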
SOTA Language Models: BERT
o Instead of only extracting embeddings or hidden-layer outputs as fixed features, BERT can be fine-tuned end-to-end on a specific supervised learning task (a fine-tuning sketch follows below).
[Slide from https://arxiv.org/abs/1810.04805.]
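A minimal fine-tuning sketch using the Hugging Face transformers library; the library, model name, and hyperparameters are my choices for illustration and are not part of the lecture.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a delightful film", "a tedious mess"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # a classification head on [CLS] is trained jointly with BERT
outputs.loss.backward()                   # gradients flow into all BERT layers: fine-tuning, not feature extraction
optimizer.step()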
The Transformer and Attention Mechanism
o An encoder-decoder structure
o Our focus: the encoder and the attention mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
The Transformer and Attention Mechanism
o Self-attention
  o Ignores the positions of words; assigns weights globally across the whole sentence
  o Can be parallelized, in contrast to an LSTM
o E.g., the attention weights related to the word "it_" (figure in the source below)
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Self-attention Mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
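The figures on this slide (from the Illustrated Transformer) step through the query/key/value computation; below is a minimal numpy sketch of the same scaled dot-product self-attention for a single head, with made-up dimensions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))               # one embedding per input token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # queries, keys, values for every token
scores = Q @ K.T / np.sqrt(d_k)                       # attention scores between all pairs of tokens
weights = softmax(scores)                             # each row sums to 1: how much token i attends to token j
output = weights @ V                                  # weighted sum of values; every row computed in parallel

print(weights.round(2))

Every pair of positions interacts directly, and all rows are produced by one matrix product, which is what makes self-attention parallelizable, unlike an LSTM's sequential recurrence.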
Self-attention Mechanism
o More: https://jalammar.github.io/illustrated-transformer/
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Take home message
o Language models suffer from data sparsity
o Word2vec portrays language probability using distributed word-embedding parameters
o ELMo, OpenAI GPT, and BERT model language using deep neural networks
o Pre-trained language models or their parameters can be transferred to supervised learning problems in NLP
o Self-attention has the advantage over an LSTM that it can be parallelized and considers interactions across the whole sentence
References
o Wikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
o TensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Matthew E. Peters et al. Deep contextualized word representations: https://arxiv.org/pdf/1802.05365.pdf
o Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
o Jay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/