CS388: Natural Language Processing Lecture 15: Attention
Greg Durrett
This Lecture
‣ Graham Neubig (CMU) talk this Friday at 11am in 6.302: "Towards Open-domain Generation of Programs from Natural Language"
‣ Mini 2 graded by this weekend
‣ Project 2 out by the end of today; due *Friday* November 2
Recall: Seq2seq Model
‣ Generate next word conditioned on previous word as well as hidden state
[Figure: encoder reads "the movie was great"; decoder starts from <s> with hidden state h̄]
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given current one)
$P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})$
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h})$
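As a minimal NumPy sketch of this output layer (the sizes and random parameters are illustrative, not from the lecture): a matrix-vector product followed by a softmax over the entire vocabulary.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

# Illustrative sizes: W is |vocab| x |hidden state|
vocab_size, hidden_size = 10000, 256
W = np.random.randn(vocab_size, hidden_size) * 0.01
h_bar = np.random.randn(hidden_size)  # current decoder hidden state

p_next = softmax(W @ h_bar)  # distribution over the entire vocabulary
assert p_next.shape == (vocab_size,)
```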
Recall: Seq2seq Training
‣ Objective: maximize $\sum_{(x,y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \dots, y_{i-1}^*)$
[Figure: encoder reads "the movie was great"; decoder is fed the gold prefix "<s> le film était bon" while predicting "le ... était ... [STOP]"]
‣ One loss term for each target-sentence word; feed the correct word regardless of the model's prediction
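A sketch of this objective in code, assuming a hypothetical `decoder_step` function that runs one step of the decoder (both the function and `BOS_ID` are illustrative names, not from the lecture):

```python
import numpy as np

BOS_ID = 0  # assumed start-of-sentence token id (illustrative)

def sequence_log_likelihood(decoder_step, h_bar, gold_ids):
    """Log-likelihood of one (x, y) pair under this objective.

    decoder_step(prev_word_id, h_bar) -> (probs over vocab, new h_bar)
    is a hypothetical single-step decoder; gold_ids holds y*_1 .. y*_n.
    """
    total = 0.0
    prev_word = BOS_ID
    for y_star in gold_ids:
        probs, h_bar = decoder_step(prev_word, h_bar)
        total += np.log(probs[y_star])  # one loss term per target word
        prev_word = y_star              # feed the gold word, not the model's prediction
    return total
```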
Recall: Semantic Parsing as Translation
Jia and Liang (2015)
‣ Write down a linearized form of the semantic parse, train seq2seq models to directly translate into this representation
‣ Might not produce well-formed logical forms, might require lots of data
"what states border Texas" → lambda x ( state ( x ) and border ( x , e89 ) )
‣ No need to have an explicit grammar, simplifies algorithms
Regex Prediction
‣ Can use for other semantic parsing-like tasks
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60% accuracy on pretty simple regexes
Locascio et al. (2016)
SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
‣ Three seq2seq models
‣ How to capture column names + constants?
‣ Pointer mechanisms
Zhong et al. (2017)
This Lecture
‣ Attention
‣ Copying
‣ Transformers
Attention
Problems with Seq2seq Models
‣ Need some notion of input coverage or what input words we've translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → "A boy plays in the snow boy plays boy plays"
‣ Often a byproduct of training these models poorly
Problems with Seq2seq Models
‣ Unknown words: no matter how much data you have, you'll need some mechanism to copy a word like Pont-de-Buis from the source to the target
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size representation doesn't scale; 2) LSTMs still have a hard time remembering for really long periods of time
‣ RNNsearch: introduces attention mechanism to give a "variable-sized" representation
Bahdanau et al. (2014)
Aligned Inputs
[Figure: source "the movie was great" aligned word-by-word with target "<s> le film était bon", producing "le film était bon [STOP]"]
‣ Suppose we knew the source and target would be word-by-word translated
‣ Can look at the corresponding input word when translating: this could scale!
‣ Much less burden on the hidden state
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs based on current decoder state
[Figure: decoder at "<s> le" computing a distribution over "the movie was great"]
‣ Use that in output layer
Attention
[Figure: encoder states $h_1, h_2, h_3, h_4$ over "the movie was great"; decoder state $\bar{h}_1$ at <s> yields context $c_1$ and output "le"]
‣ For each decoder state, compute weighted sum of input states
‣ Unnormalized scalar weight: $e_{ij} = f(\bar{h}_i, h_j)$
‣ Normalized scalar weight: $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$
‣ Weighted sum of input hidden states (vector): $c_i = \sum_j \alpha_{ij} h_j$
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])$
‣ No attention: $P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h}_i)$
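A minimal sketch of one decoder step of attention, assuming `f` is some scoring function returning a scalar (variants of `f` are discussed next):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(h_bar_i, H, f):
    """One decoder step of attention.

    h_bar_i: decoder state, shape (d,)
    H: encoder states h_1..h_n stacked as rows, shape (n, d)
    Returns the context vector c_i = sum_j alpha_ij * h_j.
    """
    e = np.array([f(h_bar_i, h_j) for h_j in H])  # unnormalized scores e_ij
    alpha = softmax(e)                            # normalized weights alpha_ij
    return alpha @ H                              # weighted sum of encoder states

# The output layer then sees [c_i; h_bar_i] instead of h_bar_i alone:
# P(y_i | ...) = softmax(W @ np.concatenate([c_i, h_bar_i]))
```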
Attention
$e_{ij} = f(\bar{h}_i, h_j) \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})} \quad c_i = \sum_j \alpha_{ij} h_j$
‣ Note that this all uses outputs of hidden layers
‣ Bahdanau et al. (2014), additive: $f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])$
‣ Luong et al. (2015), dot product: $f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j$
‣ Luong et al. (2015), bilinear: $f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j$
Luong et al. (2015)
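The three scoring functions as NumPy sketches. Dimensions and the reduction vector `v` are illustrative assumptions: the slide's additive form stops at the tanh, while Bahdanau et al. apply a learned vector to reduce it to a scalar, which we need here so all three return scalars.

```python
import numpy as np

d = 4  # illustrative hidden size
W_add = np.random.randn(d, 2 * d) * 0.1  # parameters for additive scoring
W_bil = np.random.randn(d, d) * 0.1      # parameters for bilinear scoring
v = np.random.randn(d)                   # reduction vector (assumed; the slide omits it)

# Bahdanau et al. (2014), additive:
def f_additive(h_bar_i, h_j):
    return v @ np.tanh(W_add @ np.concatenate([h_bar_i, h_j]))

# Luong et al. (2015), dot product: no parameters at all
def f_dot(h_bar_i, h_j):
    return h_bar_i @ h_j

# Luong et al. (2015), bilinear: a learned interaction matrix
def f_bilinear(h_bar_i, h_j):
    return h_bar_i @ W_bil @ h_j
```

Any of these can be dropped in as the `f` argument of the `attend` sketch above.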
What can attention do?
‣ Learning to copy: how might this work?
[Example: input 0 3 2 1 → output 0 3 2 1]
‣ LSTM can learn to count with the right weight matrix
‣ This is effectively position-based addressing
Luong et al. (2015)
What can attention do?
‣ Learning to subsample tokens
[Example: input 0 3 2 1 → output 3 1]
‣ Need to count (for ordering) and also determine which tokens are in/out
‣ Content-based addressing
Luong et al. (2015)
Attention
‣ Decoder hidden states are now mostly responsible for selecting what to attend to
‣ Doesn't take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations
‣ Encoder hidden states capture contextual source word identity
Batching Attention
Luong et al. (2015)
‣ token outputs: batch size x sentence length x dimension
‣ sentence outputs: batch size x hidden size
‣ decoder hidden state: batch size x dimension
‣ attention scores $e_{ij} = f(\bar{h}_i, h_j)$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$: batch size x sentence length
‣ context $c_i = \sum_j \alpha_{ij} h_j$: batch size x hidden size
‣ Make sure tensors are the right size!
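A batched dot-product attention sketch with the shapes above (all sizes are illustrative):

```python
import numpy as np

batch, src_len, d = 32, 20, 256  # illustrative sizes

H = np.random.randn(batch, src_len, d)  # token outputs: batch x sentence length x dim
h_bar = np.random.randn(batch, d)       # decoder hidden state: batch x dim

# Dot-product scores e_ij for every batch element at once:
e = np.einsum('bd,bjd->bj', h_bar, H)             # batch x sentence length
alpha = np.exp(e - e.max(axis=1, keepdims=True))  # softmax over source positions
alpha /= alpha.sum(axis=1, keepdims=True)         # attention scores: batch x sentence length
c = np.einsum('bj,bjd->bd', alpha, H)             # context c: batch x hidden size

assert e.shape == (batch, src_len) and c.shape == (batch, d)
```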
Alternatives
‣ When do we compute attention? Can compute before or after RNN cell
‣ Luong et al. (2015): after RNN cell
‣ Bahdanau et al. (2015): before RNN cell; this one is a little more convoluted and less standard
[Figure: the two decoder variants, computing $c_1$ after vs. before the RNN cell update]
Results
‣ Machine translation: BLEU score of 14.0 on English-German -> 16.8 with attention, 19.0 with smarter attention (we'll come back to this later)
‣ Summarization/headline generation: bigram recall from 11% -> 15%
‣ Semantic parsing: ~30% accuracy -> 70+% accuracy on Geoquery
Luong et al. (2015), Chopra et al. (2016), Jia and Liang (2016)
Copying Input/Pointers
Unknown Words
Jean et al. (2015), Luong et al. (2015)
‣ Want to be able to copy named entities like Pont-de-Buis
$P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])$   ($c_i$ from attention, $\bar{h}_i$ from RNN hidden state)
‣ Still can only generate from the vocabulary
Copying
[Figure: output distribution spans the "normal" vocab (the, a, ..., zebra) together with words from the input (Pont-de-Buis, ecotax)]
‣ Vocabulary contains "normal" vocab as well as words in input. Normalizes over both of these:
$P(y_i = w \mid x, y_1, \dots, y_{i-1}) \propto \begin{cases} \exp(W_w[c_i; \bar{h}_i]) & \text{if } w \text{ in vocab} \\ \exp(h_j^\top V \bar{h}_i) & \text{if } w = x_j \end{cases}$
‣ Copy scores: bilinear function of input representation + output hidden state
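A sketch of this normalization, concatenating generation scores and copy scores into a single softmax. Function and variable names are illustrative; a full implementation would also merge probability mass when an input word is itself in the vocabulary.

```python
import numpy as np

def copy_distribution(W, V, c_i, h_bar_i, H, vocab_size):
    """One softmax over vocab words AND input tokens (sketch).

    W: (vocab_size, 2d) output matrix; V: (d, d) bilinear copy matrix
    c_i, h_bar_i: context and decoder state, shape (d,); H: encoder states (n, d)
    Returns probs of length vocab_size + n; the last n entries mean "copy x_j".
    """
    gen_scores = W @ np.concatenate([c_i, h_bar_i])  # W_w [c_i; h_bar_i] per vocab word
    copy_scores = H @ V @ h_bar_i                    # h_j^T V h_bar_i per input position
    scores = np.concatenate([gen_scores, copy_scores])
    e = np.exp(scores - scores.max())
    return e / e.sum()                               # normalize over both at once
```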
Pointer Networks
Vinyals et al. (2015)
‣ Only point to the input, don't have any notion of vocabulary
‣ Used for tasks including summarization and sentence ordering
Results
Jia and Liang (2016)
‣ In many settings, attention can roughly do the same things as copying
‣ For semantic parsing, copying tokens from the input (texas) can be very useful
Transformers
Self-Attention
Vaswani et al. (2017)
‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector
‣ CNNs did something similar with filters
‣ Attention can give us a third way to do this
Self-Attention
Vaswani et al. (2017)
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters $W_k$ and $V_k$ to get different attention values + transform vectors
$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)$   (scalar)
$x_i' = \sum_{j=1}^{n} \alpha_{i,j} x_j$   (vector = sum of scalar * vector)
With multiple heads:
$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)$
$x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$
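A sketch of the multi-head equations in NumPy (sizes, and the identity matrices in the usage line, are purely illustrative):

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_heads(X, Ws, Vs):
    """Multi-head self-attention as in the equations above.

    X: (n, d) word vectors; Ws, Vs: lists of (d, d) per-head parameters.
    Head k: alpha_k,i,j = softmax_j(x_i^T W_k x_j); x'_k,i = sum_j alpha * V_k x_j.
    """
    outputs = []
    for W_k, V_k in zip(Ws, Vs):
        scores = X @ W_k @ X.T               # (n, n): entry (i, j) = x_i^T W_k x_j
        alpha = softmax_rows(scores)         # normalize over j for each query i
        outputs.append(alpha @ (X @ V_k.T))  # row i = sum_j alpha_k,i,j V_k x_j
    return outputs                           # one (n, d) matrix per head

n, d, heads = 5, 8, 2
X = np.random.randn(n, d)
# Identity parameters just to exercise the shapes; real heads learn W_k, V_k:
outs = self_attention_heads(X, [np.eye(d)] * heads, [np.eye(d)] * heads)
```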
Deep Transformers
‣ Supervised: transformer can replace LSTM; will revisit this when we discuss MT
‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words
‣ Devlin et al., October 11, 2018: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)
Takeaways
‣ Attention is very helpful for seq2seq models
‣ Used for tasks including summarization and sentence ordering
‣ Explicitly copying input can be beneficial as well
‣ Transformers are strong models we'll come back to later
Where are we going
‣ We've now talked about most of the important core tools for NLP
‣ Rest of the class: more focused on applications
‣ Information extraction, then MT, then a grab bag of things