A Mathematical Exploration of Language Models

A Mathematical Exploration of Language Models. Nikunj Saunshi, Princeton University. Center of Mathematical Sciences and Applications, Harvard University, 10th February 2021

Transcript of A Mathematical Exploration of Language Models

Page 1: A Mathematical Exploration of Language Models

A Mathematical Exploration of Language Models

Nikunj Saunshi, Princeton University

Center of Mathematical Sciences and Applications, Harvard University, 10th February 2021

Page 2: A Mathematical Exploration of Language Models

Language Models

Language Model

Context: $s$ = "I went to the café and ordered a"

Distribution: $p_{\cdot|s}$

[Figure: bar chart of predicted next-word probabilities (0.3, 0.2, 0.05, ..., 0.0001), e.g. $p_{\cdot|s}(\text{"latte"})$, $p_{\cdot|s}(\text{"bagel"})$, ..., $p_{\cdot|s}(\text{"dolphin"})$]

Next word prediction: For context $s$, predict what word $w$ would follow it.

Cross-entropy objective: Assign high $p_{\cdot|s}(w)$ to observed $(s, w)$ pairs, i.e. minimize $\mathbb{E}_{(s,w)}[-\log p_{\cdot|s}(w)]$.

Unlabeled data: Generate $(s, w)$ pairs using sentences.
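To make the objective concrete, here is a minimal illustrative sketch (not from the talk; the `toy_lm` function, its vocabulary, and all probabilities are made up) that scores a toy language model by the cross-entropy $\mathbb{E}_{(s,w)}[-\log p_{\cdot|s}(w)]$ on a few $(s, w)$ pairs generated from sentences:

```python
# Minimal sketch (illustrative only): cross-entropy of a toy language model
# on (context, next word) pairs harvested from unlabeled sentences.
import math

def toy_lm(context):
    """Hypothetical LM: returns a probability distribution over a tiny vocabulary."""
    vocab = ["latte", "bagel", "dolphin", "coffee"]
    if context.endswith("ordered a"):
        probs = [0.5, 0.3, 0.01, 0.19]
    else:
        probs = [0.25, 0.25, 0.25, 0.25]
    return dict(zip(vocab, probs))

# (s, w) pairs generated by splitting sentences before their last word.
pairs = [("I went to the cafe and ordered a", "latte"),
         ("I went to the cafe and ordered a", "bagel")]

# Cross-entropy objective: average of -log p_{.|s}(w) over observed pairs.
xent = sum(-math.log(toy_lm(s)[w]) for s, w in pairs) / len(pairs)
print(f"cross-entropy: {xent:.3f}")
```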

Page 3: A Mathematical Exploration of Language Models

Success of Language Models

Language Model

Distribution: $p_{\cdot|s}$; Context: $s$

Architecture: Transformer, Parameters: 175 B

Architecture: Transformer, Parameters: 1542 M

Architecture: RNN, Parameters: 24 M

Train using Cross-entropy

Downstream tasks:

• Text Generation: "It was a bright sunny day in ..."
• Question Answering: "The capital of Spain is __"
• Machine Translation: "I bought coffee" → "J'ai acheté du café"
• Sentence Classification: "Science" vs "Politics"

Page 4: A Mathematical Exploration of Language Models

Main Question: Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?

Rest of the talk

More general framework of "solving Task A helps with Task B"

Our results for Language Models, based on the recent paper A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

Saunshi, Malladi, Arora, To Appear in ICLR 2021

Page 5: A Mathematical Exploration of Language Models

Solving Task A helps with Task B

Page 6: A Mathematical Exploration of Language Models

Solving Task A helps with Task B

• Humans can use the "experience" and "skills" acquired from Task A for a new Task B efficiently

Language Modeling: Task A: Next word prediction; Task B: Downstream NLP task

Ride Bicycle → Ride Motorcycle

Get a Math degree → Do well in law school later

Do basic chores → Excel at Karate (The Karate Kid)

Page 7: A Mathematical Exploration of Language Models

• Adapted in Machine Learning

• More data efficient than supervised learning
• Requires fewer labeled samples than solving Task B from scratch using supervised learning

Initialize a model and fine-tune using labeled data

Extract features and learn a classifier using labeled data

Stage 1: Pretrain Model on Task A

Stage 2: Use Model on Task B

Other innovative ways of using pretrained model

Solving Task A helps with Task B. Language Modeling: Task A: Next word prediction; Task B: Downstream NLP task

Page 8: A Mathematical Exploration of Language Models

• Transfer learning
  • Task A: Large supervised learning problem (ImageNet)
  • Task B: Object detection, disease detection using X-ray images

• Meta-learning
  • Task A: Many small tasks related to Task B
  • Task B: Related tasks (classify characters from a new language)

• Self-supervised learning (e.g. language modeling)
  • Task A: Constructed using unlabeled data
  • Task B: Downstream tasks of interest

Requires some labeled data in Task A

Requires only unlabeled data in Task A

Solving Task A helps with Task B

"This is the single most important problem to solve in AI today" - Yann LeCun

https://www.wsj.com/articles/facebook-ai-chief-pushes-the-technologys-limits-11597334361

Page 9: A Mathematical Exploration of Language Models

Self-Supervised Learning

Motivated by the following observations:
• Humans learn by observing/interacting with the world, without explicit supervision
• Supervised learning with labels is successful, but human annotations can be expensive
• Unlabeled data is available in abundance and is cheap to obtain

• Many practical algorithms following this principle do well on standard benchmarks, sometimes beating even supervised learning!

Principle: Use unlabeled data to generate labels and construct supervised learning tasks

Page 10: A Mathematical Exploration of Language Models

Self-Supervised Learning

Examples in practice:
• Images
  • Predict the color of an image from its b/w version
  • Reconstruct part of an image from the rest of it
  • Predict the rotation applied to an image

• Text
  • Make representations of consecutive sentences in Wikipedia close
  • Next word prediction
  • Fill in multiple blanks in a sentence

Task A: Constructed from unlabeled data; Task B: Downstream task of interest

Just need raw images

Just need a large text corpus

Page 11: A Mathematical Exploration of Language Models

Theory for Self-Supervised Learning

• We have very little mathematical understanding of this important problem.

• Theory can potentially help:
  • Formalize notions of "skill learning" from tasks
  • Ground existing intuitions in math
  • Give new insights that can improve or help design practical algorithms

• Existing theoretical frameworks fail to capture this setting:
  • Task A and Task B are very different
  • Task A is agnostic to Task B

• We try to gain some understanding of one such method: language modeling

Page 12: A Mathematical Exploration of Language Models

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks. Saunshi, Malladi, Arora. To appear in ICLR 2021.

Page 13: A Mathematical Exploration of Language Models

Theory for Language Models. Task A: Next word prediction; Task B: Downstream NLP task

Language Model

Distribution over words: $p_{\cdot|s}$

Context: $s$ = "I went to the café and ordered a"

Why should solving the next word prediction task help solve seemingly unrelated downstream tasks with very little labeled data?

Stage 1: Pretrain Language Model on Next Word Prediction

Stage 2: Use Language Model for Downstream Task

Page 14: A Mathematical Exploration of Language Models

Theoretical setting

Representation Learning Perspective

Role of task & objective

Sentence classification

✓ Extract features from LM, learn linear classifiers: effective, data-efficient, can do math

✘ Finetuning: Hard to quantify its benefit using current deep learning theory

✓ Why next word prediction (w/ cross-entropy objective) intrinsically helps

✘ Inductive biases of architecture/algorithm: current tools are insufficient

✓ First-cut analysis. Already gives interesting insights

✘ Other NLP tasks (question answering, etc.)

Language Model

Distribution: $p_{\cdot|s}$; Context: $s$

What aspects of pretraining help?

What are downstream tasks?

How to use a pretrained model?

Page 15: A Mathematical Exploration of Language Models

Theoretical setting

Why can language models that do well on cross-entropy objective learn features that are useful for linear classification tasks?

Language Model

Distribution: $p_{\cdot|s}$; Context: $s$

Extract $d$-dim features $f(s)$

๐‘“

โ€œIt was an utter waste of time.โ€

โ€œI would recommend this movie.โ€

โ€œNegativeโ€

โ€œPositiveโ€

Page 16: A Mathematical Exploration of Language Models

Result overview

Key idea: Classification tasks can be rephrased as sentence completion problems, thus making next word prediction a meaningful pretraining task

Formalization: Show that an LM that is $\epsilon$-optimal in cross-entropy learns features that linearly solve such tasks up to $\mathcal{O}(\sqrt{\epsilon})$

Verification: Experimentally verify theoretical insights (also design a new objective function)

Why can language models that do well on cross-entropy objective learn features that are useful for linear classification tasks?

Page 17: A Mathematical Exploration of Language Models

Outline

• Language modeling
  • Cross-entropy and Softmax-parametrized LMs

• Downstream tasks
  • Sentence completion reformulation

• Formal guarantees
  • $\epsilon$-optimal LM $\Rightarrow$ $\mathcal{O}(\sqrt{\epsilon})$-good on task

• Extensions, discussions and future work

Page 18: A Mathematical Exploration of Language Models

Outline

• Language modeling
  • Cross-entropy and Softmax-parametrized LMs

• Downstream tasks
  • Sentence completion reformulation

• Formal guarantees
  • $\epsilon$-optimal LM $\Rightarrow$ $\mathcal{O}(\sqrt{\epsilon})$-good on task

• Extensions, discussions and future work

Page 19: A Mathematical Exploration of Language Models

Language Modeling: Cross-entropy

Language Model

Predicted dist. $p_{\cdot|s} \in \mathbb{R}^V$

Context: $s$ = "I went to the café and ordered a"

True dist. $p^*_{\cdot|s} \in \mathbb{R}^V$

[Figure: bar charts of the predicted probabilities (0.3, 0.2, 0.05, ..., 0.0001) next to the true probabilities (0.35, 0.18, 0.047, ..., 0.00005)]

$\ell_{\text{xent}}(p_{\cdot|s}) = \mathbb{E}_{(s,w)}[-\log p_{\cdot|s}(w)]$

Optimal solution: the minimizer of $\ell_{\text{xent}}(p_{\cdot|s})$ is $p_{\cdot|s} = p^*_{\cdot|s}$

Proof: Can rewrite as $\ell_{\text{xent}}(p_{\cdot|s}) = \mathbb{E}_s[KL(p^*_{\cdot|s}, p_{\cdot|s})] + C$

What does the best language model (the minimizer of cross-entropy) learn?
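For completeness, here is the short derivation behind this rewrite, in the notation above (the constant $C = \mathbb{E}_s[H(p^*_{\cdot|s})]$ is the expected entropy of the true distribution and does not depend on the model):

```latex
\begin{aligned}
\ell_{\text{xent}}(p_{\cdot|s})
  &= \mathbb{E}_{(s,w)}\big[-\log p_{\cdot|s}(w)\big]
   = \mathbb{E}_{s}\,\mathbb{E}_{w \sim p^*_{\cdot|s}}\big[-\log p_{\cdot|s}(w)\big] \\
  &= \mathbb{E}_{s}\Big[\sum_{w} p^*_{\cdot|s}(w)\,\log\frac{p^*_{\cdot|s}(w)}{p_{\cdot|s}(w)}
     \;-\; \sum_{w} p^*_{\cdot|s}(w)\,\log p^*_{\cdot|s}(w)\Big] \\
  &= \mathbb{E}_{s}\big[KL\!\left(p^*_{\cdot|s},\, p_{\cdot|s}\right)\big] + C .
\end{aligned}
```

Since the KL divergence is nonnegative and zero exactly when the two distributions coincide, the minimizer is $p_{\cdot|s} = p^*_{\cdot|s}$, as stated on the slide.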

Page 20: A Mathematical Exploration of Language Models

Language Modeling: Softmax

True dist. $p^*_{\cdot|s} \in \mathbb{R}^V$ [Figure: bar chart of the true probabilities (0.35, 0.18, 0.047, ..., 0.00005)]

$\min_{f,\,\Phi}\ \ell_{\text{xent}}(p_{f(s)})$

Optimal solution: For fixed $\Phi$, the $f^*$ that minimizes $\ell_{\text{xent}}$ satisfies $\Phi p_{f^*(s)} = \Phi p^*_{\cdot|s}$

Can we still learn $p_{f(s)} = p^*_{\cdot|s}$ exactly when $d < V$?

Language Model

Context: $s$

[Figure: predicted probabilities (0.3, 0.2, 0.05, ..., 0.0001) over the vocabulary]

Features $f(s) \in \mathbb{R}^d$, softmax on $\Phi^\top f(s)$, with word embeddings $\Phi \in \mathbb{R}^{d \times V}$

Softmax dist. $p_{f(s)} \in \mathbb{R}^V$

Proof: Use the first-order condition (gradient = 0): $\nabla_{f(s)} KL(p^*_{\cdot|s}, p_{f(s)}) = \Phi p_{f(s)} - \Phi p^*_{\cdot|s}$

Only guaranteed to learn $p^*_{\cdot|s}$ on the $d$-dimensional subspace spanned by $\Phi$

[Figure: example feature vector $f(s) = (0.1, -0.52, 1.23, 0.04)^\top$ and word embedding matrix $\Phi = \begin{pmatrix} -0.3 & \cdots & 0.2 \\ \vdots & \ddots & \vdots \\ 0.8 & \cdots & -0.1 \end{pmatrix}$]
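A minimal numpy sketch (toy sizes, not from the talk) of the softmax parametrization $p_{f(s)} = \mathrm{softmax}(\Phi^\top f(s))$, illustrating the point above: when $d < V$, two different distributions can have the same $d$-dimensional projection $\Phi p$, so matching $\Phi p^*_{\cdot|s}$ does not force matching $p^*_{\cdot|s}$ itself.

```python
# Minimal numpy sketch (toy sizes): softmax-parametrized LM p_{f(s)} = softmax(Phi^T f(s)),
# and two distinct distributions with the same projection Phi p when d < V.
import numpy as np

rng = np.random.default_rng(0)
d, V = 3, 6                               # feature dim << vocabulary size
Phi = rng.normal(size=(d, V))             # word embeddings Phi in R^{d x V}
f_s = rng.normal(size=d)                  # features f(s) for some context s

p_fs = np.exp(Phi.T @ f_s)
p_fs /= p_fs.sum()                        # softmax distribution p_{f(s)} in R^V

# A direction in the null space of Phi whose entries sum to zero:
A = np.vstack([Phi, np.ones(V)])          # orthogonal to rows of Phi and to the all-ones vector
_, _, Vt = np.linalg.svd(A)
delta = Vt[-1]                            # A @ delta is (numerically) zero
p_other = p_fs + 1e-3 * delta             # still a valid distribution for a small enough step

print(np.allclose(Phi @ p_fs, Phi @ p_other))   # True: same d-dimensional projection
print(np.allclose(p_fs, p_other))               # False: different distributions over words
```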

Page 21: A Mathematical Exploration of Language Models

LMs trained with cross-entropy aim to learn $p^*_{\cdot|s}$

Softmax LMs with word embeddings $\Phi$ can only be guaranteed to learn $\Phi p^*_{\cdot|s}$

Page 22: A Mathematical Exploration of Language Models

Outline

• Language modeling
  • Cross-entropy and Softmax-parametrized LMs

• Downstream tasks
  • Sentence completion reformulation

• Formal guarantees
  • $\epsilon$-optimal LM $\Rightarrow$ $\mathcal{O}(\sqrt{\epsilon})$-good on task

• Extensions, discussions and future work

Page 23: A Mathematical Exploration of Language Models

Classification task → Sentence completion

• Binary classification task $\mathcal{T}$. E.g. {("I would recommend this movie.", +1), ..., ("It was an utter waste of time.", -1)}

• Language models aim to learn $p^*_{\cdot|s}$ (or its projection onto a subspace). Can $p^*_{\cdot|s}$ even help solve $\mathcal{T}$?

I would recommend this movie. ___

Is $p^*_{\cdot|s}(\text{":)"}) - p^*_{\cdot|s}(\text{":("}) > 0$?

This is an inner product of the weight vector $(+1, \ldots, -1, \ldots, 0)^\top$ with the vector of probabilities $\big(p^*_{\cdot|s}(\text{":)"}), \ldots, p^*_{\cdot|s}(\text{":("}), \ldots, p^*_{\cdot|s}(\text{"The"})\big)$, i.e.

$v^\top p^*_{\cdot|s} > 0$: a linear classifier over $p^*_{\cdot|s}$

Page 24: A Mathematical Exploration of Language Models

Classification task → Sentence completion

I would recommend this movie. ___

[Figure: the true next-word distribution $p^*_{\cdot|s}$ over completions such as ":)", ":(", "The", ...]

Page 25: A Mathematical Exploration of Language Models

Classification task → Sentence completion

I would recommend this movie. This movie was ___

[Figure: with the prompt, the relevant completions include "rock", "good", "brilliant", ... and "garbage", "boring", "hello"]

Prompt: "This movie was" appended to "I would recommend this movie. ___"

Linear classifier with weights $v = (0, 2, 4, \ldots, -3, -2, 0)^\top$ over these completions:

$v^\top p^*_{\cdot|s} > 0$

Allows for a larger set of words that are grammatically correct completions

Extendable to other classification tasks (e.g., topic classification)
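To make the reformulation concrete, here is a tiny illustrative sketch (all completion words, weights, and probabilities below are invented; a real $p_{\cdot|s}$ would come from a language model): after appending the prompt, the predicted label is just the sign of $v^\top p$ over a handful of plausible completions.

```python
# Toy sketch of the sentence-completion reformulation: classify by the sign of
# v^T p over a handful of grammatically plausible completions. All numbers are
# invented for illustration.
completion_words = ["good", "brilliant", "rock", "garbage", "boring", "hello"]
v = {"good": 2.0, "brilliant": 4.0, "rock": 1.0,
     "garbage": -3.0, "boring": -2.0, "hello": 0.0}   # linear classifier weights

def classify(p_next_word):
    """p_next_word: dict word -> probability after the prompt 'This movie was ___'."""
    score = sum(v[w] * p_next_word.get(w, 0.0) for w in completion_words)
    return "Positive" if score > 0 else "Negative"

# Hypothetical next-word distributions for two reviews with the prompt appended.
p_positive_review = {"good": 0.30, "brilliant": 0.10, "boring": 0.02, "hello": 0.001}
p_negative_review = {"boring": 0.25, "garbage": 0.08, "good": 0.03}

print(classify(p_positive_review))   # Positive
print(classify(p_negative_review))   # Negative
```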

Page 26: A Mathematical Exploration of Language Models

Experimental verification

• Verify the sentence completion intuition ($p^*_{\cdot|s}$ can solve a task)
• Task: SST, a movie review sentiment classification task
• Learn a linear classifier on a subset of words of $p_{f(s)}$ from a pretrained LM*

With prompt: "This movie is". (*Used GPT-2, 117M parameters)

๐‘˜ ๐‘&(#)(๐‘˜ words)

๐‘&(#)(~ 20 words)

๐‘“ ๐‘ (768 dim)

๐‘“)*)+ ๐‘ (768 dim)

Bag-of-words

SST 2 76.4 78.2 87.6 58.1 80.7

SST* 2 79.4 83.5 89.5 56.7 -

๐‘*(#) J

๐‘*(#) L

๐‘!(#) โ€๐‘”๐‘œ๐‘œ๐‘‘โ€๐‘!(#) โ€๐‘”๐‘Ÿ๐‘’๐‘Ž๐‘กโ€

โ€ฆ๐‘!(#) โ€๐‘๐‘œ๐‘Ÿ๐‘–๐‘›๐‘”โ€๐‘!(#) โ€๐‘๐‘Ž๐‘‘โ€

Features from LM

Features from random init LM

non-LM baseline
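An experiment in this spirit can be run with off-the-shelf tools. The sketch below is not the authors' code: it assumes the HuggingFace transformers and scikit-learn packages, uses GPT-2 (117M parameters, as in the slide), a placeholder word list, and a two-sentence stand-in for SST.

```python
# Minimal sketch (not the authors' code): probe GPT-2 next-word probabilities
# on a few sentiment-laden completion words, then fit a linear classifier.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # 117M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical small word subset; the talk uses ~20 such words.
words = [" good", " great", " bad", " boring"]
word_ids = [tokenizer.encode(w)[0] for w in words]

def completion_features(sentence, prompt=" This movie is"):
    """Return p_{f(s)} restricted to `words` for the context s = sentence + prompt."""
    ids = tokenizer(sentence + prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]                # next-word logits
    probs = torch.softmax(logits, dim=-1)
    return probs[word_ids].numpy()

# Tiny illustrative "dataset"; the real experiment uses SST.
X = [completion_features(s) for s in ["I would recommend this movie.",
                                      "It was an utter waste of time."]]
y = [1, 0]
clf = LogisticRegression().fit(X, y)                     # linear classifier on features
```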

Page 27: A Mathematical Exploration of Language Models

Classification tasks can be rephrased as sentence completion problems

This is the same as solving the task using a linear classifier on $p^*_{\cdot|s}$, i.e. $v^\top p^*_{\cdot|s} > 0$

Page 28: A Mathematical Exploration of Language Models

Outline

• Language modeling
  • Cross-entropy and Softmax-parametrized LMs

• Downstream tasks
  • Sentence completion reformulation

• Formal guarantees
  • $\epsilon$-optimal LM $\Rightarrow$ $\mathcal{O}(\sqrt{\epsilon})$-good on task

• Extensions, discussions and future work

Page 29: A Mathematical Exploration of Language Models

Natural task

$\tau$-natural task $\mathcal{T}$: $\min_v \ell_{\mathcal{T}}(p^*_{\cdot|s}, v) \le \tau$

$\tau$ captures how "natural" (amenable to sentence completion reformulation) the classification task is

Sentence completion reformulation ⇒ can solve using $v^\top p^*_{\cdot|s} > 0$

For any $D$-dim feature map $g$ and classifier $v$: $\ell_{\mathcal{T}}(g(s), v) = \mathbb{E}_{(s,y)}[\text{logistic-loss}(v^\top g(s), y)]$

๐‘”

โ€œIt was an utter waste of time.โ€

โ€œI would recommend this movie.โ€

โ€œNegativeโ€

โ€œPositiveโ€

Page 30: A Mathematical Exploration of Language Models

Main Result

Naturalness of task (sentence completion view)

$\ell_{\mathcal{T}}(\Phi p_{f(s)}) \le \tau + \mathcal{O}(\sqrt{\epsilon})$

Logistic regression loss of the $d$-dimensional features $\Phi p_{f(s)}$

1. $f$ is an LM that is $\epsilon$-optimal in cross-entropy (does well on next word prediction)

2. $\mathcal{T}$ is a $\tau$-natural task (fits the sentence completion view)

3. Word embeddings $\Phi$ are nice (assign similar embeddings to synonyms)

Why can language models that do well on cross-entropy objective learn features that are useful for linear classification tasks?

Loss due to suboptimality of LM: $\epsilon = \ell_{\text{xent}}(\{p_{f(s)}\}) - \ell_{\text{xent}}(\{p^*_{\cdot|s}\})$

Page 31: A Mathematical Exploration of Language Models

Main Result: closer look

Why can language models that do well on cross-entropy objective learn features that are useful for linear classification tasks?

Guarantees for an LM $f$ that is $\epsilon$-optimal in cross-entropy

Use the output probabilities $\Phi p_{f(s)}$ as $d$-dimensional features

Upper bound on the logistic regression loss for natural classification tasks

$\ell_{\mathcal{T}}(\Phi p_{f(s)}) \le \tau + \mathcal{O}(\sqrt{\epsilon})$

Page 32: A Mathematical Exploration of Language Models

Conditional mean features: $\Phi p_{f(s)} = \sum_w p_{f(s)}(w)\,\phi_w$, a weighted average of word embeddings, and a new way to extract $d$-dimensional features from an LM.

Task | k | $p_{f(s)}$ ($k$ words) | $p_{f(s)}$ (~20 words) | $\Phi p_{f(s)}$ (768 dim) | $f(s)$ (768 dim)
SST | 2 | 76.4 | 78.2 | 82.6 | 87.6
SST* | 2 | 79.4 | 83.5 | 87.0 | 89.5
AG News | 4 | 68.4 | 78.3 | 84.5 | 90.7
AG News* | 4 | 71.4 | 83.0 | 88.0 | 91.1

[Figure: plot of $\ell_{\mathcal{T}}(\Phi p_{f(s)})$ against $\ell_{\text{xent}}(p_{f(s)})$, observing the $\epsilon$ dependence in practice]

$\ell_{\mathcal{T}}(\Phi p_{f(s)}) \le \tau + \mathcal{O}(\sqrt{\epsilon})$

Page 33: A Mathematical Exploration of Language Models

Main take-aways

• Classification tasks → Sentence completion → Solve using $v^\top p^*_{\cdot|s} > 0$

• An $\epsilon$-optimal language model will do $\mathcal{O}(\sqrt{\epsilon})$ well on such tasks

• Softmax models can hope to learn $\Phi p^*_{\cdot|s}$
  • Good to assign similar embeddings to synonyms

• Conditional mean features $\Phi p_{f(s)}$
  • Mathematically motivated way to extract $d$-dimensional features from LMs

Page 34: A Mathematical Exploration of Language Models

More in paper

• Connection between $f(s)$ and $\Phi p_{f(s)}$

• Use insights to design new objective, alternative to cross-entropy

• Detailed bounds capture other intuitions

Page 35: A Mathematical Exploration of Language Models

Future work

• Understand why $f(s)$ does better than $\Phi p_{f(s)}$ in practice

• Bidirectional and masked language models (BERT and variants)
  • Theory applies when there is just one masked token

• Diverse set of NLP tasks
  • Does the sentence completion view extend? Other insights?

• Role of finetuning, inductive biases
  • Needs more empirical exploration

• Self-supervised learning

Page 36: A Mathematical Exploration of Language Models

Thank you!

• Happy to take questions

• Feel free to email: [email protected]

• ArXiv: https://arxiv.org/abs/2010.03648

[Recap figures: the plot of $\ell_{\mathcal{T}}(\Phi p_{f(s)})$ against $\ell_{\text{xent}}(p_{f(s)})$, the bound $\ell_{\mathcal{T}}(\Phi p_{f(s)}) \le \tau + \mathcal{O}(\sqrt{\epsilon})$, and the sentence completion example "I would recommend this movie. ___" with $p^*_{\cdot|s}(\text{":)"}) - p^*_{\cdot|s}(\text{":("}) > 0$]