Natural Language Processing
Info 159/259, Lecture 7: Language models 2 (Sept 13, 2018)
David Bamman, UC Berkeley
“My favorite food is”
Language Model
• Vocabulary 𝒱 is a finite set of discrete symbols (e.g., words, characters); V = | 𝒱 |
• 𝒱+ is the infinite set of sequences of symbols from 𝒱; each sequence ends with STOP
• x ∈ 𝒱+
Language modeling is the task of estimating P(w)
Language Model
• arms man and I sing
• Arms, and the man I sing [Dryden]
• I sing of arms and a man [arma virumque cano]
• When we have choices to make about different ways to say something, a language model gives us a formal way of operationalizing their fluency (in terms of how likely they are to exist in the language)
bigram model (first-order Markov):

$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \cdot P(\text{STOP} \mid w_n)$$

trigram model (second-order Markov):

$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \cdot P(\text{STOP} \mid w_{n-1}, w_n)$$
Markov assumption
Classification
𝓧 = set of all documents 𝒴 = {english, mandarin, greek, …}
A mapping h from input data x (drawn from instance space 𝓧) to a label (or labels) y from some enumerable output space 𝒴
x = a single document y = ancient greek
Classification
𝒴 = {the, of, a, dog, iphone, …}
A mapping h from input data x (drawn from instance space 𝓧) to a label (or labels) y from some enumerable output space 𝒴
x = (context) y = word
Logistic regression
Y = {0, 1}output space
$$P(y = 1 \mid x, \beta) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$$
Feature Value
the 0
and 0
bravest 0
love 0
loved 0
genius 0
not 0
fruit 1
BIAS 1
x = feature vector
Feature β
the 0.01
and 0.03
bravest 1.4
love 3.1
loved 1.2
genius 0.5
not -3.0
fruit -0.8
BIAS -0.1
β = coefficients
$$P(Y = 1 \mid X = x; \beta) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)} = \frac{\exp(x^\top \beta)}{\exp(x^\top 0) + \exp(x^\top \beta)}$$
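Plugging the feature values and coefficients from the tables above into the logistic function, a minimal sketch (the toy feature set is the one shown on the slide):

```python
import math

# Feature values and coefficients from the tables above
x = {"the": 0, "and": 0, "bravest": 0, "love": 0, "loved": 0,
     "genius": 0, "not": 0, "fruit": 1, "BIAS": 1}
beta = {"the": 0.01, "and": 0.03, "bravest": 1.4, "love": 3.1,
        "loved": 1.2, "genius": 0.5, "not": -3.0, "fruit": -0.8,
        "BIAS": -0.1}

score = sum(x[f] * beta[f] for f in x)       # x . beta = -0.8 + -0.1 = -0.9
p = math.exp(score) / (1 + math.exp(score))  # logistic function
```

Only the active features (fruit and BIAS) contribute to the dot product; the rest are zeroed out.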
Logistic regression
output space
$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^\top \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x^\top \beta_{y'})}$$

Y = {1, . . . , K}
Feature Value
the 0
and 0
bravest 0
love 0
loved 0
genius 0
not 0
fruit 1
BIAS 1
x = feature vector
Feature β1 β2 β3 β4 β5
the 1.33 -0.80 -0.54 0.87 0
and 1.21 -1.73 -1.57 -0.13 0
bravest 0.96 -0.05 0.24 0.81 0
love 1.49 0.53 1.01 0.64 0
loved -0.52 -0.02 2.21 -2.53 0
genius 0.98 0.77 1.53 -0.95 0
not -0.96 2.14 -0.71 0.43 0
fruit 0.59 -0.76 0.93 0.03 0
BIAS -1.92 -0.70 0.94 -0.63 0
β = coefficients
Language Model

• We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space: 𝒴 = 𝒱
Unigram LM
Feature βthe βof βa βdog βiphone
BIAS -1.92 -0.70 0.94 -0.63 0
• A unigram language model here would have just one feature: a bias term.
Bigram LM
β = coefficients:

Feature       βthe    βof     βa      βdog    βiphone
wi-1=the      -0.78   -0.80   -0.54   0.87    0
wi-1=and      1.21    -1.73   -1.57   -0.13   0
wi-1=bravest  0.96    -0.05   0.24    0.81    0
wi-1=love     1.49    0.53    1.01    0.64    0
wi-1=loved    -0.52   -0.02   2.21    -2.53   0
wi-1=genius   0.98    0.77    1.53    -0.95   0
wi-1=not      -0.96   2.14    -0.71   0.43    0
wi-1=fruit    0.59    -0.76   0.93    0.03    0
BIAS          -1.92   -0.70   0.94    -0.63   0

x = feature vector:

Feature       Value
wi-1=the      1
wi-1=and      0
wi-1=bravest  0
wi-1=love     0
wi-1=loved    0
wi-1=genius   0
wi-1=not      0
wi-1=fruit    0
BIAS          1
P(wi = dog | wi−1 = the)
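A sketch of computing this probability with the toy coefficients above: only the active features (wi-1=the and BIAS) contribute to each class's score, and the softmax normalizes over the five-word vocabulary:

```python
import math

# Coefficients from the table above: rows are context features,
# columns are output words. Only the two active features are listed.
beta = {
    "wi-1=the": {"the": -0.78, "of": -0.80, "a": -0.54, "dog": 0.87, "iphone": 0},
    "BIAS":     {"the": -1.92, "of": -0.70, "a": 0.94,  "dog": -0.63, "iphone": 0},
}
active = ["wi-1=the", "BIAS"]   # the features with value 1 when w_{i-1} = "the"
vocab = ["the", "of", "a", "dog", "iphone"]

# Score each candidate word, then softmax-normalize
scores = {y: sum(beta[f][y] for f in active) for y in vocab}
Z = sum(math.exp(s) for s in scores.values())
p_dog = math.exp(scores["dog"]) / Z   # P(w_i = dog | w_{i-1} = the)
```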
Trigram LM
Feature                  Value
wi-2=the ^ wi-1=the      0
wi-2=and ^ wi-1=the      1
wi-2=bravest ^ wi-1=the  0
wi-2=love ^ wi-1=the     0
wi-2=loved ^ wi-1=the    0
wi-2=genius ^ wi-1=the   0
wi-2=not ^ wi-1=the      0
wi-2=fruit ^ wi-1=the    0
BIAS                     1

P(wi = dog | wi−2 = and, wi−1 = the)
Smoothing
second-order features:

Feature                  Value
wi-2=the ^ wi-1=the      0
wi-2=and ^ wi-1=the      1
wi-2=bravest ^ wi-1=the  0
wi-2=love ^ wi-1=the     0

first-order features:

Feature                  Value
wi-1=the                 1
wi-1=and                 0
wi-1=bravest             0
wi-1=love                0

BIAS                     1

P(wi = dog | wi−2 = and, wi−1 = the)
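A minimal sketch of a featurizer that emits both feature orders at once; the function name and feature-string format are illustrative assumptions:

```python
# Hypothetical featurizer combining second-order (trigram) and
# first-order (bigram) backoff features, as on the slide above.
def featurize(w_prev2, w_prev1):
    return {
        "wi-2=%s ^ wi-1=%s" % (w_prev2, w_prev1): 1,  # second-order feature
        "wi-1=%s" % w_prev1: 1,                        # first-order feature
        "BIAS": 1,
    }

feats = featurize("and", "the")
```

Because the first-order feature fires even when the exact trigram was never seen in training, the model can back off gracefully, which is the smoothing effect this slide describes.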
L2 regularization
• We can do this by changing the function we’re trying to optimize, adding a penalty for values of β that are large in magnitude
• This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.
• η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
$$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} - \underbrace{\eta \sum_{j=1}^{F} \beta_j^2}_{\text{but we want this to be small}}$$
L1 regularization
• L1 regularization encourages coefficients to be exactly 0.
• η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
$$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} - \underbrace{\eta \sum_{j=1}^{F} |\beta_j|}_{\text{but we want this to be small}}$$
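As a sketch (not from the slides), both regularized objectives can be written directly; the toy data and η value are assumptions:

```python
import numpy as np

def objective(beta, X, y, eta, penalty="l2"):
    # Binary logistic log-likelihood:
    #   sum_i [ y_i * (x_i . beta) - log(1 + exp(x_i . beta)) ]
    scores = X @ beta
    ll = np.sum(y * scores - np.log1p(np.exp(scores)))
    # Penalty for coefficients far from 0: eta * sum(beta^2) for L2,
    # eta * sum(|beta|) for L1
    reg = eta * (np.sum(beta ** 2) if penalty == "l2"
                 else np.sum(np.abs(beta)))
    return ll - reg   # high likelihood, small penalty

# Toy data (an assumption, just to exercise the function)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
val = objective(np.zeros(2), X, y, eta=0.1)   # at beta = 0, this is -N*log(2)
```

Maximizing this objective with gradient ascent trades likelihood against the penalty, with η (tuned on development data) setting the exchange rate.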
Richer representations
• Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on.
• We can reason about any observations from the entire history and not just the local context.
“JACKSONVILLE, Fla. — Stressed and exhausted families across the Southeast were assessing the damage from Hurricane Irma on Tuesday, even as flooding from the storm continued to plague some areas, like Jacksonville, and the worst of its wallop was being revealed in others, like the Florida Keys.
Officials in Florida, Georgia and South Carolina tried to prepare residents for the hardships of recovery from the ______________”
https://www.nytimes.com/2017/09/12/us/irma-jacksonville-florida.html
“Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald ________”
http://www.foxnews.com/politics/2017/09/13/clinton-laments-how-benghazi-tragedy-hurt-her-politically.html
feature classes example
ngrams (wi-1, wi-2:wi-1, wi-3:wi-1) wi-2=“to”, wi-1=“donald”
gappy ngrams w1=“hillary” and wi-1=“donald”
spelling, capitalization wi-1 is capitalized and wi is capitalized
class/gazetteer membership wi-1 in list of names and wi in list of names
“Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald ________”
Classes
Goldberg 2017
bigram count
black car 100
blue car 37
red car 0
bigram count
<COLOR> car 137
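A minimal sketch of collapsing bigram counts into class-based counts as in the tables above; the COLORS set and helper name are illustrative assumptions:

```python
from collections import Counter

# Hypothetical word class (a gazetteer-style list)
COLORS = {"black", "blue", "red", "green"}

def apply_classes(bigram):
    # Replace a color word in first position with the <COLOR> class symbol
    w1, w2 = bigram
    return (("<COLOR>" if w1 in COLORS else w1), w2)

# Bigram counts from the table above
counts = Counter({("black", "car"): 100, ("blue", "car"): 37, ("red", "car"): 0})

class_counts = Counter()
for bg, c in counts.items():
    class_counts[apply_classes(bg)] += c
```

Pooling counts across class members gives "red car" a nonzero class-level count even though the surface bigram was never observed.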
Tradeoffs

• Richer representations = more parameters, higher likelihood of overfitting
• Much slower to train than estimating the parameters of a generative model
$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^\top \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x^\top \beta_{y'})}$$
Neural LM
x = [v(w1); . . . ; v(wk)]
Simple feed-forward multilayer perceptron (e.g., one hidden layer)
input x = vector concatenation of a conditioning context of fixed size k
Bengio et al. 2003
x = [v(w1); . . . ; v(wk)]

[Figure: each context word’s one-hot encoding is mapped to a low-dimensional distributed representation v(w1) … v(w4); here w1 = tried, w2 = to, w3 = prepare, w4 = residents]
Bengio et al. 2003
x = [v(w1); . . . ; v(wk)]

$$h = g(xW^1 + b^1), \quad W^1 \in \mathbb{R}^{kD \times H},\; b^1 \in \mathbb{R}^H$$

$$y = \text{softmax}(hW^2 + b^2), \quad W^2 \in \mathbb{R}^{H \times V},\; b^2 \in \mathbb{R}^V$$
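A sketch of the feed-forward forward pass under the definitions above; the toy dimensions and random parameters are illustrative assumptions, not Bengio et al.'s settings:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, k, H = 10, 4, 3, 8   # toy sizes: vocab, embedding dim, context, hidden

E  = rng.normal(size=(V, D))                   # embedding table v(.)
W1 = rng.normal(size=(k * D, H)); b1 = np.zeros(H)
W2 = rng.normal(size=(H, V));     b2 = np.zeros(V)

def forward(context_ids):
    x = np.concatenate([E[w] for w in context_ids])  # x = [v(w1); ...; v(wk)]
    h = np.tanh(x @ W1 + b1)                          # h = g(xW1 + b1)
    scores = h @ W2 + b2
    exp_s = np.exp(scores - scores.max())             # numerically stable softmax
    return exp_s / exp_s.sum()                        # y = softmax(hW2 + b2)

y = forward([1, 5, 7])   # distribution over the next word
```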
Bengio et al. 2003
Softmax
$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^\top \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x^\top \beta_{y'})}$$
tried to prepare residents for the hardships of recovery from the
Neural LM
conditioning context
y
Recurrent neural network
• RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.
Recurrent neural network
Goldberg 2017
• Each time step has two inputs:
• xi (the observation at time step i); one-hot vector, feature vector or distributed representation.
• si-1 (the output of the previous state); base case: s0 = 0 vector
Recurrent neural network
Recurrent neural network

$$s_i = R(x_i, s_{i-1})$$
$$y_i = O(s_i)$$

• R computes the new state as a function of the current input and the previous state
• O computes the output as a function of the current state
Continuous bag of words
$$s_i = R(x_i, s_{i-1}) = s_{i-1} + x_i$$
$$y_i = O(s_i) = s_i$$
Each state is the sum of all previous inputs
Output is the identity
“Simple” RNN
$$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$$
$$y_i = O(s_i) = s_i$$

$$W^s \in \mathbb{R}^{H \times H}, \quad W^x \in \mathbb{R}^{D \times H}, \quad b \in \mathbb{R}^H$$

Different weight matrices transform the previous state and the current input before combining; g = tanh or ReLU
Elman 1990, Mikolov 2012
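A minimal numpy sketch of the Elman recurrence above; the dimensions, weight scaling, and random inputs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                          # toy input and hidden dimensions
Ws = rng.normal(size=(H, H)) * 0.1   # transforms the previous state
Wx = rng.normal(size=(D, H)) * 0.1   # transforms the current input
b  = np.zeros(H)

def step(x, s_prev):
    # s_i = g(s_{i-1} Ws + x_i Wx + b), with g = tanh
    return np.tanh(s_prev @ Ws + x @ Wx + b)

s = np.zeros(H)                       # base case: s0 = the zero vector
for x in rng.normal(size=(5, D)):     # five input vectors in sequence
    s = step(x, s)                    # each state feeds the next step
```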
RNN LM
Elman 1990, Mikolov 2012
• The state si is an H-dimensional real vector; we can transform it into a probability distribution over the vocabulary by passing it through a linear transformation followed by a softmax:

$$y_i = O(s_i) = \text{softmax}(s_i W^o + b^o)$$
Training RNNs
• We have five sets of parameters to learn: $W^s, W^x, W^o, b, b^o$
• Given this definition of an RNN:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$$
$$y_i = O(s_i) = \text{softmax}(s_i W^o + b^o)$$
Training RNNs

we tried to prepare residents

• At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position)
• Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update
$$\frac{\partial L(\hat{y}_1)}{\partial W^s}, \; \frac{\partial L(\hat{y}_2)}{\partial W^s}, \; \frac{\partial L(\hat{y}_3)}{\partial W^s}, \; \frac{\partial L(\hat{y}_4)}{\partial W^s}, \; \frac{\partial L(\hat{y}_5)}{\partial W^s}$$
• As we sample, the words we generate form the new context we condition on
Generation

context1    context2    generated word
START START The
START The dog
The dog walked
dog walked in
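A sketch of the sampling loop behind the table above, simplified to a bigram (single-word) context for brevity; the probability tables are made up:

```python
import random

random.seed(0)
# Toy conditional distributions P(w_i | w_{i-1}); the values are made up.
probs = {
    "START":  {"The": 0.6, "A": 0.4},
    "The":    {"dog": 0.7, "cat": 0.3},
    "A":      {"dog": 0.5, "cat": 0.5},
    "dog":    {"walked": 0.6, "STOP": 0.4},
    "cat":    {"walked": 0.5, "STOP": 0.5},
    "walked": {"in": 0.5, "STOP": 0.5},
    "in":     {"STOP": 1.0},
}

def sample_sentence():
    context, out = "START", []
    while True:
        words, weights = zip(*probs[context].items())
        w = random.choices(words, weights=weights)[0]
        if w == "STOP":
            return out
        out.append(w)
        context = w      # the generated word becomes the new context

sent = sample_sentence()
```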
Conditioned generation

• In a basic RNN, the input at each timestep is a representation of the word at that position:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$$

• But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.) by concatenating it to the input:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + [x_i; c] W^x + b)$$
Conditioned generation

• What information could you condition on in order to predict the next word?
feature classes
location
addressee
we tried to prepare residents
[Figure: the hidden-state vector at each time step of “we tried to prepare residents”]
Each state i encodes information seen until time i and its structure is optimized to predict the next word
Character LM

• Vocabulary 𝒱 is a finite set of discrete characters
• When the output space is small, you’re putting a lot of the burden on the structure of the model
• The model must encode long-range dependencies (suffixes depend on prefixes, word boundaries, etc.)
Character LM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Drawbacks

• Very expensive to train (especially for a large vocabulary, though tricks exist; cf. hierarchical softmax)
• Backpropagation through long histories leads to vanishing gradients (cf. LSTMs in a few weeks)
• But RNN LMs consistently have some of the strongest performance in perplexity evaluations
Mikolov 2010
Melis, Dyer and Blunsom 2017
Distributed representations
• Some of the greatest power in neural network approaches is in the representation of words (and contexts) as low-dimensional vectors
• We’ll talk much more about that on Tuesday.
[Figure: a one-hot vector mapped to a low-dimensional dense vector, e.g., (0.7, 1.3, −4.5)]
You should feel comfortable:
• Calculating the probability of a sentence given a trained model
• Estimating an n-gram (e.g., trigram) language model
• Evaluating perplexity on held-out data
• Sampling a sentence from a trained model
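For the perplexity point, a minimal sketch (the per-token probabilities here are made up, standing in for a trained model's predictions on held-out text):

```python
import math

def perplexity(token_probs):
    # 2^( -(1/N) * sum_i log2 p_i ), the standard per-token perplexity
    N = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / N)

pp = perplexity([0.25, 0.5, 0.125, 0.25])   # average -log2 p = 2, so pp = 4
```

Lower perplexity on held-out data means the model assigned higher probability to the text it had not seen.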
Tools
• SRILM: http://www.speech.sri.com/projects/srilm/
• KenLM: https://kheafield.com/code/kenlm/
• Berkeley LM: https://code.google.com/archive/p/berkeleylm/