CS388: Natural Language Processing Lecture 15: Attention
Greg Durrett
This Lecture
‣ Graham Neubig (CMU) talk this Friday at 11am in 6.302: "Towards Open-domain Generation of Programs from Natural Language"
‣ Mini 2 graded by this weekend
‣ Project 2 out by the end of today; due *Friday* November 2
Recall: Seq2seq Model
‣ Generate next word conditioned on previous word as well as hidden state
[Figure: encoder reads "the movie was great"; decoder starts from <s> with hidden state h̄]
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given current one)
$P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})$
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h})$
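As a minimal NumPy sketch of this output layer (the sizes and random parameters are illustrative, not from the lecture): a matrix-vector product followed by a softmax over the entire vocabulary.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

# Illustrative sizes: W is |vocab| x |hidden state|
vocab_size, hidden_size = 10000, 256
W = np.random.randn(vocab_size, hidden_size) * 0.01
h_bar = np.random.randn(hidden_size)  # current decoder hidden state

p_next = softmax(W @ h_bar)  # distribution over the entire vocabulary
assert p_next.shape == (vocab_size,)
```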
Recall: Seq2seq Training
‣ Objective: maximize $\sum_{(x,y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \dots, y_{i-1}^*)$
[Figure: encoder reads "the movie was great"; decoder is fed the gold prefix "<s> le film était bon" while predicting "le ... était ... [STOP]"]
‣ One loss term for each target-sentence word; feed the correct word regardless of the model's prediction
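A sketch of this objective in code, assuming a hypothetical `decoder_step` function that runs one step of the decoder (both the function and `BOS_ID` are illustrative names, not from the lecture):

```python
import numpy as np

BOS_ID = 0  # assumed start-of-sentence token id (illustrative)

def sequence_log_likelihood(decoder_step, h_bar, gold_ids):
    """Log-likelihood of one (x, y) pair under this objective.

    decoder_step(prev_word_id, h_bar) -> (probs over vocab, new h_bar)
    is a hypothetical single-step decoder; gold_ids holds y*_1 .. y*_n.
    """
    total = 0.0
    prev_word = BOS_ID
    for y_star in gold_ids:
        probs, h_bar = decoder_step(prev_word, h_bar)
        total += np.log(probs[y_star])  # one loss term per target word
        prev_word = y_star              # feed the gold word, not the model's prediction
    return total
```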
Recall: Semantic Parsing as Translation
Jia and Liang (2015)
‣ Write down a linearized form of the semantic parse, train seq2seq models to directly translate into this representation
‣ Might not produce well-formed logical forms, might require lots of data
"what states border Texas" → lambda x ( state ( x ) and border ( x , e89 ) )
‣ No need to have an explicit grammar, simplifies algorithms
Regex Prediction
‣ Can use for other semantic parsing-like tasks
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60% accuracy on pretty simple regexes
Locascio et al. (2016)
SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
‣ Three seq2seq models
‣ How to capture column names + constants?
‣ Pointer mechanisms
Zhong et al. (2017)
This Lecture
‣ Attention
‣ Copying
‣ Transformers
Attention
Problems with Seq2seq Models
‣ Need some notion of input coverage or what input words we've translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → "A boy plays in the snow boy plays boy plays"
‣ Often a byproduct of training these models poorly
Problems with Seq2seq Models
‣ Unknown words: no matter how much data you have, you'll need some mechanism to copy a word like Pont-de-Buis from the source to the target
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size representation doesn't scale; 2) LSTMs still have a hard time remembering for really long periods of time
‣ RNNsearch: introduces attention mechanism to give a "variable-sized" representation
Bahdanau et al. (2014)
Aligned Inputs
[Figure: source "the movie was great" aligned word-by-word with target "<s> le film était bon", producing "le film était bon [STOP]"]
‣ Suppose we knew the source and target would be word-by-word translated
‣ Can look at the corresponding input word when translating: this could scale!
‣ Much less burden on the hidden state
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs based on current decoder state
[Figure: decoder at "<s> le" computing a distribution over "the movie was great"]
‣ Use that in output layer
Attention
[Figure: encoder states $h_1, h_2, h_3, h_4$ over "the movie was great"; decoder state $\bar{h}_1$ at <s> yields context $c_1$ and output "le"]
‣ For each decoder state, compute weighted sum of input states
‣ Unnormalized scalar weight: $e_{ij} = f(\bar{h}_i, h_j)$
‣ Normalized scalar weight: $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
‣ Weighted sum of input hidden states (vector): $c_i = \sum_j \alpha_{ij} h_j$
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])$
‣ No attention: $P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h}_i)$
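A minimal sketch of one decoder step of attention, assuming `f` is some scoring function returning a scalar (variants of `f` are discussed next):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(h_bar_i, H, f):
    """One decoder step of attention.

    h_bar_i: decoder state, shape (d,)
    H: encoder states h_1..h_n stacked as rows, shape (n, d)
    Returns the context vector c_i = sum_j alpha_ij * h_j.
    """
    e = np.array([f(h_bar_i, h_j) for h_j in H])  # unnormalized scores e_ij
    alpha = softmax(e)                            # normalized weights alpha_ij
    return alpha @ H                              # weighted sum of encoder states

# The output layer then sees [c_i; h_bar_i] instead of h_bar_i alone:
# P(y_i | ...) = softmax(W @ np.concatenate([c_i, h_bar_i]))
```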
Attention
$e_{ij} = f(\bar{h}_i, h_j) \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})} \quad c_i = \sum_j \alpha_{ij} h_j$
‣ Note that this all uses outputs of hidden layers
‣ Bahdanau et al. (2014), additive: $f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])$
‣ Luong et al. (2015), dot product: $f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j$
‣ Luong et al. (2015), bilinear: $f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j$
Luong et al. (2015)
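The three scoring functions as NumPy sketches. Dimensions and the reduction vector `v` are illustrative assumptions: the slide's additive form stops at the tanh, while Bahdanau et al. apply a learned vector to reduce it to a scalar, which we need here so all three return scalars.

```python
import numpy as np

d = 4  # illustrative hidden size
W_add = np.random.randn(d, 2 * d) * 0.1  # parameters for additive scoring
W_bil = np.random.randn(d, d) * 0.1      # parameters for bilinear scoring
v = np.random.randn(d)                   # reduction vector (assumed; the slide omits it)

# Bahdanau et al. (2014), additive:
def f_additive(h_bar_i, h_j):
    return v @ np.tanh(W_add @ np.concatenate([h_bar_i, h_j]))

# Luong et al. (2015), dot product: no parameters at all
def f_dot(h_bar_i, h_j):
    return h_bar_i @ h_j

# Luong et al. (2015), bilinear: a learned interaction matrix
def f_bilinear(h_bar_i, h_j):
    return h_bar_i @ W_bil @ h_j
```

Any of these can be dropped in as the `f` argument of the `attend` sketch above.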
What can attention do?
‣ Learning to copy: how might this work?
[Example: input 0 3 2 1 → output 0 3 2 1]
‣ LSTM can learn to count with the right weight matrix
‣ This is effectively position-based addressing
Luong et al. (2015)
What can attention do?
‣ Learning to subsample tokens
[Example: input 0 3 2 1 → output 3 1]
‣ Need to count (for ordering) and also determine which tokens are in/out
‣ Content-based addressing
Luong et al. (2015)
Attention
‣ Decoder hidden states are now mostly responsible for selecting what to attend to
‣ Doesn't take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations
‣ Encoder hidden states capture contextual source word identity
Batching Attention
Luong et al. (2015)
‣ token outputs: batch size x sentence length x dimension
‣ sentence outputs: batch size x hidden size
‣ decoder hidden state: batch size x dimension
‣ attention scores $e_{ij} = f(\bar{h}_i, h_j)$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$: batch size x sentence length
‣ context $c_i = \sum_j \alpha_{ij} h_j$: batch size x hidden size
‣ Make sure tensors are the right size!
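A batched dot-product attention sketch with the shapes above (all sizes are illustrative):

```python
import numpy as np

batch, src_len, d = 32, 20, 256  # illustrative sizes

H = np.random.randn(batch, src_len, d)  # token outputs: batch x sentence length x dim
h_bar = np.random.randn(batch, d)       # decoder hidden state: batch x dim

# Dot-product scores e_ij for every batch element at once:
e = np.einsum('bd,bjd->bj', h_bar, H)             # batch x sentence length
alpha = np.exp(e - e.max(axis=1, keepdims=True))  # softmax over source positions
alpha /= alpha.sum(axis=1, keepdims=True)         # attention scores: batch x sentence length
c = np.einsum('bj,bjd->bd', alpha, H)             # context c: batch x hidden size

assert e.shape == (batch, src_len) and c.shape == (batch, d)
```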
Alternatives
‣ When do we compute attention? Can compute before or after RNN cell
‣ Luong et al. (2015): after RNN cell
‣ Bahdanau et al. (2015): before RNN cell; this one is a little more convoluted and less standard
[Figure: the two decoder variants, computing $c_1$ after vs. before the RNN cell update]
Results
‣ Machine translation: BLEU score of 14.0 on English-German -> 16.8 with attention, 19.0 with smarter attention (we'll come back to this later)
‣ Summarization/headline generation: bigram recall from 11% -> 15%
‣ Semantic parsing: ~30% accuracy -> 70+% accuracy on Geoquery
Luong et al. (2015), Chopra et al. (2016), Jia and Liang (2016)
Copying Input/Pointers
Unknown Words
Jean et al. (2015), Luong et al. (2015)
‣ Want to be able to copy named entities like Pont-de-Buis
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])$   ($c_i$ from attention, $\bar{h}_i$ from RNN hidden state)
‣ Still can only generate from the vocabulary
Copying
[Figure: output distribution spans the "normal" vocab (the, a, ..., zebra) together with words from the input (Pont-de-Buis, ecotax)]
‣ Vocabulary contains "normal" vocab as well as words in input. Normalizes over both of these:
$P(y_i = w \mid x, y_1, \dots, y_{i-1}) \propto \begin{cases} \exp(W_w[c_i; \bar{h}_i]) & \text{if } w \text{ in vocab} \\ \exp(h_j^\top V \bar{h}_i) & \text{if } w = x_j \end{cases}$
‣ Copy scores: bilinear function of input representation + output hidden state
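A sketch of this normalization, concatenating generation scores and copy scores into a single softmax. Function and variable names are illustrative; a full implementation would also merge probability mass when an input word is itself in the vocabulary.

```python
import numpy as np

def copy_distribution(W, V, c_i, h_bar_i, H, vocab_size):
    """One softmax over vocab words AND input tokens (sketch).

    W: (vocab_size, 2d) output matrix; V: (d, d) bilinear copy matrix
    c_i, h_bar_i: context and decoder state, shape (d,); H: encoder states (n, d)
    Returns probs of length vocab_size + n; the last n entries mean "copy x_j".
    """
    gen_scores = W @ np.concatenate([c_i, h_bar_i])  # W_w [c_i; h_bar_i] per vocab word
    copy_scores = H @ V @ h_bar_i                    # h_j^T V h_bar_i per input position
    scores = np.concatenate([gen_scores, copy_scores])
    e = np.exp(scores - scores.max())
    return e / e.sum()                               # normalize over both at once
```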
Pointer Networks
Vinyals et al. (2015)
‣ Only point to the input, don't have any notion of vocabulary
‣ Used for tasks including summarization and sentence ordering
Results
Jia and Liang (2016)
‣ In many settings, attention can roughly do the same things as copying
‣ For semantic parsing, copying tokens from the input (texas) can be very useful
Transformers
Self-Attention
Vaswani et al. (2017)
‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector
‣ CNNs did something similar with filters
‣ Attention can give us a third way to do this
Self-Attention
Vaswani et al. (2017)
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters $W_k$ and $V_k$ to get different attention values + transform vectors
$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)$   (scalar)
$x_i' = \sum_{j=1}^{n} \alpha_{i,j} x_j$   (vector = sum of scalar * vector)
With multiple heads:
$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)$
$x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$
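A sketch of the multi-head equations in NumPy (sizes, and the identity matrices in the usage line, are purely illustrative):

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_heads(X, Ws, Vs):
    """Multi-head self-attention as in the equations above.

    X: (n, d) word vectors; Ws, Vs: lists of (d, d) per-head parameters.
    Head k: alpha_k,i,j = softmax_j(x_i^T W_k x_j); x'_k,i = sum_j alpha * V_k x_j.
    """
    outputs = []
    for W_k, V_k in zip(Ws, Vs):
        scores = X @ W_k @ X.T               # (n, n): entry (i, j) = x_i^T W_k x_j
        alpha = softmax_rows(scores)         # normalize over j for each query i
        outputs.append(alpha @ (X @ V_k.T))  # row i = sum_j alpha_k,i,j V_k x_j
    return outputs                           # one (n, d) matrix per head

n, d, heads = 5, 8, 2
X = np.random.randn(n, d)
# Identity parameters just to exercise the shapes; real heads learn W_k, V_k:
outs = self_attention_heads(X, [np.eye(d)] * heads, [np.eye(d)] * heads)
```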
Deep Transformers
‣ Supervised: transformer can replace LSTM; will revisit this when we discuss MT
‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words
‣ Devlin et al., October 11, 2018: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)
Takeaways
‣ Attention is very helpful for seq2seq models
‣ Used for tasks including summarization and sentence ordering
‣ Explicitly copying input can be beneficial as well
‣ Transformers are strong models we'll come back to later
Where are we going
‣ We've now talked about most of the important core tools for NLP
‣ Rest of the class: more focused on applications
‣ Information extraction, then MT, then a grab bag of things