Lecture 8: RNNs
Alan Ritter (many slides from Greg Durrett)
Administrivia
‣ Homework 3 due next week on Monday
‣ Guest lecture next week on Wednesday
‣ Midterm is next week on Friday
‣ Will cover everything up to this Friday
‣ Practice questions: https://docs.google.com/document/d/1eidU29ni8ZeTlBcrTyQ67duurxU74vk7I16fUok_X8s/edit?usp=sharing
‣ Reading: RNNs
‣ Goldberg 10, 11
‣ Jurafsky and Martin, Chapter 9
Recall: Training Tips
‣ Parameter initialization is critical to get good gradients; some useful heuristics exist (e.g., Xavier initializer)
‣ Dropout is an effective regularizer
‣ Think about your optimizer: Adam or tuned SGD work well
Recall: Word Vectors
[Figure: 2D word-vector space containing good, enjoyable, great, bad, dog, is]
Recall: Continuous Bag-of-Words
‣ Predict word from context: the dog bit the man
‣ Embed the context words (the, dog), sum them into a vector of size d, multiply by W, and take a softmax to get P(w | w−1, w+1)
‣ Matrix factorization approaches useful for learning vectors from really large data
Mikolov et al. (2013)
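‣ A minimal sketch of the CBOW architecture above, assuming PyTorch (module and variable names are made up for illustration):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict a word from the sum of its context-word embeddings."""
    def __init__(self, vocab_size, d):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)  # context embeddings, size d
        self.W = nn.Linear(d, vocab_size)         # "multiply by W"

    def forward(self, context_ids):               # (batch, context_size)
        summed = self.embed(context_ids).sum(dim=1)        # sum, size d
        return torch.log_softmax(self.W(summed), dim=-1)   # log P(w | context)
```

For "the dog bit the man" with target word "bit", context_ids would hold the indices of the surrounding words (e.g., the and dog).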
Using Word Embeddings
‣ Approach 1: learn embeddings directly from data in your neural model, no pretraining
‣ Often works pretty well
‣ Approach 2: pretrain using GloVe, keep fixed
‣ Faster because no need to update these parameters
‣ Need to make sure the GloVe vocabulary contains all the words you need
‣ Approach 3: initialize using GloVe, fine-tune
‣ Not as commonly used anymore
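‣ A sketch of Approaches 2 and 3, assuming PyTorch (the GloVe matrix here is a random stand-in; in practice you would load real vectors aligned to your vocabulary):

```python
import torch
import torch.nn as nn

# Stand-in for a (vocab_size, d) matrix of pretrained GloVe vectors,
# built so that row i corresponds to word i in your vocabulary.
glove_matrix = torch.randn(10000, 300)

# Approach 2: keep fixed (no gradient updates, so training is faster)
embed_fixed = nn.Embedding.from_pretrained(glove_matrix, freeze=True)

# Approach 3: initialize with GloVe, then fine-tune during training
embed_tuned = nn.Embedding.from_pretrained(glove_matrix, freeze=False)
```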
Compositional Semantics
‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing? RNNs?
This Lecture
‣ Recurrent neural networks
‣ Vanishing gradient problem
‣ LSTMs / GRUs
‣ Applications / visualizations
RNN Basics
RNN Motivation
‣ Feedforward NNs can't handle variable-length input: each position in the feature vector has fixed semantics
    the movie was great        that was great!
‣ These don't look related (great is in two different orthogonal subspaces)
‣ Instead, we need to:
1) Process each word in a uniform way
2) …while still exploiting the context that that token occurs in
RNN Abstraction
‣ Cell that takes some input x, has some hidden state h, and updates that hidden state and produces output y (all vector-valued)
[Diagram: previous h (and previous c) flow into the cell along with input x; the cell emits next h (and next c) and output y]
RNN Uses
‣ Transducer: make some prediction for each element in a sequence
    the movie was great → DT NN VBD JJ (output y = score for each tag, then softmax)
‣ Acceptor/encoder: encode a sequence into a fixed-sized vector and use that for some purpose
    the movie was great → predict sentiment (matmul + softmax), translate, paraphrase/compress
Elman Networks
[Diagram: input x_t and previous hidden state h_{t−1} feed into the cell, producing h_t and output y_t]
‣ Updates hidden state based on input and current hidden state:
    h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ Computes output from hidden state:
    y_t = tanh(U h_t + b_y)
‣ Long history! (invented in the late 1980s)
Elman (1990)
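‣ A minimal NumPy sketch of these two equations (sizes and parameter values are made up):

```python
import numpy as np

def elman_step(x_t, h_prev, W, V, U, b_h, b_y):
    """One step of an Elman RNN: update hidden state, then compute output."""
    h_t = np.tanh(W @ x_t + V @ h_prev + b_h)  # h_t = tanh(W x_t + V h_{t-1} + b_h)
    y_t = np.tanh(U @ h_t + b_y)               # y_t = tanh(U h_t + b_y)
    return h_t, y_t

# Run over a sequence, carrying h forward (hypothetical sizes)
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3
W = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a length-5 sequence of input vectors
    h, y = elman_step(x, h, W, V, U, b_h, b_y)
```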
Training Elman Networks
    the movie was great → predict sentiment
‣ "Backpropagation through time": build the network as one big computation graph, some parameters are shared
‣ RNN potentially needs to learn how to "remember" information for a long time!
    it was my favorite movie of 2016, though it wasn't without problems → +
‣ "Correct" parameter update is to do a better job of remembering the sentiment of favorite
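‣ A sketch of backpropagation through time, assuming PyTorch (sizes are made up): the same cell parameters appear at every timestep of the unrolled graph, so their gradients accumulate contributions from the whole sequence.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=50, hidden_size=64)   # shared across timesteps
clf = nn.Linear(64, 2)                             # sentiment head

xs = torch.randn(4, 1, 50)        # "the movie was great" as 4 word vectors
h = torch.zeros(1, 64)
for x in xs:                      # unrolling builds one big computation graph
    h = cell(x, h)
loss = nn.functional.cross_entropy(clf(h), torch.tensor([1]))
loss.backward()                   # backpropagation through time
```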
Vanishing Gradient
[Figure: gradient flowing backward through the unrolled RNN shrinks at each step: gradient ← smaller gradient ← tiny gradient]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ Gradient diminishes going through tanh; if not in [−2, 2], gradient is almost 0
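‣ A small numeric illustration of this effect (not from the slides; all values are made up): backpropagate a gradient through 20 tanh steps of an Elman-style update and watch its norm shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
V = rng.normal(scale=0.5, size=(d, d))    # recurrent weight matrix
h, hs = np.zeros(d), []
for _ in range(20):                       # forward pass
    h = np.tanh(V @ h + rng.normal(size=d))
    hs.append(h)

grad = np.ones(d)                         # gradient at the last timestep
for h in reversed(hs):                    # backward pass through time
    grad = V.T @ ((1 - h**2) * grad)      # tanh'(z) = 1 - tanh(z)^2
    print(np.linalg.norm(grad))           # typically decays toward 0
```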
LSTMs / GRUs
Gated Connections
‣ Designed to fix the "vanishing gradient" problem using gates
    gated: h_t = h_{t−1} ⊙ f + func(x_t)        Elman: h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ Vector-valued "forget gate" f computed based on input and previous hidden state:
    f = σ(W^{xf} x_t + W^{hf} h_{t−1})
‣ Sigmoid: elements of f are in (0, 1)
[Diagram: h_{t−1} ⊙ f = h_t, elementwise]
‣ If f ≈ 1, we simply sum up a function of all inputs, so the gradient doesn't vanish!
LSTMs
‣ "Cell" c in addition to hidden state h:
    c_t = c_{t−1} ⊙ f + func(x_t, h_{t−1})
‣ Vector-valued forget gate f depends on the hidden state h:
    f = σ(W^{xf} x_t + W^{hf} h_{t−1})
‣ Basic communication flow: x → c → h → output; each step of this process is gated, in addition to gates from previous timesteps
LSTMs
[Diagram of the LSTM cell: input x_j and previous hidden state h_{j−1} feed gates f, i, o and candidate g; the cell state flows c_{j−1} → c_j; the output is h_j]
‣ f, i, o are gates that control information flow
‣ g reflects the main computation of the cell
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg lecture notes
LSTMs
‣ Can we ignore the old value of c for this timestep?
‣ Can an LSTM sum up its inputs x?
‣ Can we ignore a particular input x?
‣ Can we output something without changing c?
LSTMs
‣ Ignoring recurrent state entirely: lets us get a feedforward layer over the token
‣ Ignoring input: lets us discard stopwords
‣ Summing inputs: lets us compute a bag-of-words representation
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg lecture notes
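‣ A NumPy sketch of one LSTM step in the standard formulation (biases omitted for brevity; each W* maps the concatenated [h_{j−1}; x_j] to the hidden size), with comments tying each gate back to the questions above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wg):
    """One LSTM step: gates f, i, o control flow; g is the main computation."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ hx)      # forget gate: f ~ 0 ignores the old value of c
    i = sigmoid(Wi @ hx)      # input gate: i ~ 0 ignores this particular input
    o = sigmoid(Wo @ hx)      # output gate: controls what c exposes through h
    g = np.tanh(Wg @ hx)      # main computation of the cell
    c = f * c_prev + i * g    # f ~ 1 and i ~ 1 lets the cell sum up its inputs
    h = o * np.tanh(c)        # with f = 1, i = 0: output without changing c
    return h, c
```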
LSTMs
[Figure: gradient flowing backward along the cell state stays close in magnitude: gradient ← similar gradient]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ Gradient still diminishes, but in a controlled way and generally by less; usually initialize the forget gate to 1 to remember everything to start
GRUs
[Diagram: GRU cell carrying state s_{j−1} → s_j, built from two σ gates (z and r), a tanh candidate, and a (1−z)/z mixing step; shown alongside the LSTM cell for comparison]
‣ GRU: faster, a bit simpler
‣ LSTM: more complex and slower, may work a bit better
‣ Two gates: z (forget, mixes s and h) and r (mixes h and x)
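‣ A NumPy sketch of one GRU step in the standard Cho et al. (2014) formulation (biases omitted; note conventions differ on whether z or 1−z multiplies the old state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, s_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: s is the recurrent state, h the candidate state."""
    z = sigmoid(Wz @ x + Uz @ s_prev)      # update/forget gate: mixes s and h
    r = sigmoid(Wr @ x + Ur @ s_prev)      # reset gate: mixes state and input
    h = np.tanh(W @ x + U @ (r * s_prev))  # candidate state
    return (1 - z) * s_prev + z * h        # mix old state and candidate
```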
What do RNNs produce?
    the movie was great
‣ Encoding of the sentence: can pass this to a decoder or make a classification decision about the sentence
‣ Encoding of each word: can pass this to another layer to make a prediction (can also pool these to get a different sentence encoding)
‣ RNN can be viewed as a transformation of a sequence of vectors into a sequence of context-dependent vectors
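‣ A sketch with PyTorch's nn.LSTM (sizes are made up), which returns both products at once:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
x = torch.randn(1, 4, 50)   # "the movie was great" as 4 word vectors

outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)   # (1, 4, 64): one context-dependent vector per word
print(h_n.shape)       # (1, 1, 64): final hidden state = sentence encoding
```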
Multilayer Bidirectional RNN
[Diagram: stacked forward and backward RNNs over "the movie was great"]
‣ Sentence classification based on concatenation of both final outputs
‣ Token classification based on concatenation of both directions' token representations
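‣ A sketch of a 2-layer bidirectional LSTM and the two concatenations above, assuming PyTorch:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=50, hidden_size=64, num_layers=2,
                 bidirectional=True, batch_first=True)
x = torch.randn(1, 4, 50)
outputs, (h_n, _) = bilstm(x)

# Token classification: outputs already concatenates both directions per token
print(outputs.shape)                           # (1, 4, 128) = 2 * hidden_size

# Sentence classification: concatenate the final forward and backward states
# of the top layer (h_n is ordered by layer, then direction)
sent = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (1, 128)
```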
Training RNNs
‣ Example: sentiment analysis
    the movie was great → P(y|x)
‣ Loss = negative log likelihood of probability of gold label (or use SVM or other loss)
‣ Backpropagate through entire network
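‣ A sketch of this loss, assuming PyTorch (cross_entropy computes the negative log likelihood of the gold label from the raw scores):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
clf = nn.Linear(64, 2)                       # 2 sentiment classes

x = torch.randn(1, 4, 50)                    # "the movie was great"
gold = torch.tensor([1])                     # gold label: positive
_, (h_n, _) = lstm(x)
loss = nn.functional.cross_entropy(clf(h_n[-1]), gold)  # -log P(gold | x)
loss.backward()                              # backpropagate through everything
```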
Training RNNs
    the movie was great → P(t_i|x)
‣ Loss = negative log likelihood of probability of gold predictions, summed over the tags
‣ Loss terms filter back through network
‣ Example: language modeling (predict next word given context)
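‣ A sketch of the transducer version, assuming PyTorch (tag inventory and ids are made up): one negative log likelihood term per position, summed over the tags.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
tagger = nn.Linear(64, 5)                    # e.g., 5 POS tags

x = torch.randn(1, 4, 50)                    # "the movie was great"
gold_tags = torch.tensor([[0, 1, 2, 3]])     # DT NN VBD JJ as tag ids
outputs, _ = lstm(x)                         # one vector per token
scores = tagger(outputs)                     # (1, 4, 5)
loss = nn.functional.cross_entropy(scores.view(-1, 5), gold_tags.view(-1),
                                   reduction='sum')  # summed over positions
loss.backward()                              # loss terms filter back through
```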
Applications
What can LSTMs model?
‣ Sentiment: encode one sentence, predict
‣ Language models: move left-to-right, per-token prediction
‣ Translation: encode sentence + then decode, use token predictions for attention weights (later in the course)
Visualizing LSTMs
‣ Train a character LSTM language model (predict next character based on history) over two datasets: War and Peace and Linux kernel source code
‣ Visualize activations of specific cells (components of c) to see what they track
‣ Counter: knows when to generate \n
‣ Binary switch: tells us if we're in a quote or not
‣ Stack: activation based on indentation
‣ Uninterpretable: probably doing double duty, or only makes sense in the context of another activation
Karpathy et al. (2015)
What can LSTMs model?
‣ Sentiment: encode one sentence, predict
‣ Language models: move left-to-right, per-token prediction
‣ Translation: encode sentence + then decode, use token predictions for attention weights (next lecture)
‣ Textual entailment: encode two sentences, predict
Natural Language Inference
Premise → Hypothesis (label):
‣ A boy plays in the snow → A boy is outside (entails)
‣ A man inspects the uniform of a figure → The man is sleeping (contradicts)
‣ An older and younger man smiling → Two men are smiling and laughing at cats playing (neutral)
‣ Long history of this task: "Recognizing Textual Entailment" challenge in 2006 (Dagan, Glickman, Magnini)
‣ Early datasets: small (hundreds of pairs), very ambitious (lots of world knowledge, temporal reasoning, etc.)
SNLI Dataset
‣ Show people captions for (unseen) images and solicit entailed/neutral/contradictory statements
‣ >500,000 sentence pairs
‣ Encode each sentence and process
    100D LSTM: 78% accuracy
    300D LSTM: 80% accuracy (Bowman et al., 2016)
    300D BiLSTM: 83% accuracy (Liu et al., 2016)
‣ Later: better models for this
Bowman et al. (2015)
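‣ A sketch of "encode each sentence and process" (a hypothetical architecture in PyTorch, not the exact Bowman et al. model): share one LSTM encoder across premise and hypothesis, concatenate the two encodings, and classify into the 3 labels.

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    def __init__(self, d_word=50, d_hidden=100):
        super().__init__()
        self.encoder = nn.LSTM(d_word, d_hidden, batch_first=True)
        self.clf = nn.Linear(2 * d_hidden, 3)   # entails/contradicts/neutral

    def encode(self, sent):
        _, (h_n, _) = self.encoder(sent)
        return h_n[-1]                          # final hidden state

    def forward(self, premise, hypothesis):     # (batch, len, d_word) each
        pair = torch.cat([self.encode(premise),
                          self.encode(hypothesis)], dim=-1)
        return self.clf(pair)                   # scores over the 3 labels
```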
Takeaways
‣ RNNs can transduce inputs (produce one output for each input) or compress the whole input into a vector
‣ Useful for a range of tasks with sequential input: sentiment analysis, language modeling, natural language inference, machine translation
‣ Next time: CNNs and neural CRFs