Deep Learning for NLP, Part 3
CS224N, Christopher Manning
(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)
Backpropagation Training
Part 1.5: The Basics
Backprop
• Compute gradient of example-wise loss wrt parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule
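As a reminder, the single-path case: if z depends on x only through an intermediate variable y, with z = f(y) and y = g(x), then

dz/dx = (dz/dy)(dy/dx)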
Multiple Paths Chain Rule
Multiple Paths Chain Rule - General
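With multiple paths, the contributions along each path add: if z depends on x through intermediate variables y1, …, yn, then

∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)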
Chain Rule in Flow Graph
Flow graph: any directed acyclic graph
  node = computation result
  arc = computation dependency
{y1, y2, …, yn} = successors of x
∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)
Back-Prop in Multi-Layer Net
h = sigmoid(Vx)
[Figure: forward and backward pass through the layers of a multi-layer net]
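A minimal NumPy sketch of this computation, assuming a linear output layer U and a squared-error loss (both are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fprop_bprop(x, y, V, U):
    # Forward pass: h = sigmoid(Vx), prediction s = U h
    a = V @ x
    h = sigmoid(a)
    s = U @ h
    loss = 0.5 * np.sum((s - y) ** 2)
    # Backward pass: apply the chain rule layer by layer
    d_s = s - y                      # dL/ds
    d_U = np.outer(d_s, h)           # dL/dU
    d_h = U.T @ d_s                  # dL/dh
    d_a = d_h * h * (1 - h)          # sigmoid'(a) = h(1-h)
    d_V = np.outer(d_a, x)           # dL/dV
    return loss, d_V, d_U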
Back-Prop in General Flow Graph
Single scalar output z
{y1, y2, …, yn} = successors of x
1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order: compute gradient wrt each node using gradient wrt successors
Automatic Differentiation
• The gradient computation can be automatically inferred from the symbolic expression of the fprop.
• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
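A toy sketch of this node-centric scheme: each node type implements forward() for its output and backward() for the gradients wrt its inputs given the gradient wrt its output; fprop visits nodes in topo-sort order, bprop in reverse. All names here are illustrative.

class Node:
    """A node in the flow graph; leaves just hold a value."""
    def __init__(self, inputs=()):
        self.inputs, self.value, self.grad = list(inputs), None, 0.0
    def forward(self): pass
    def backward(self): pass

class Add(Node):
    def forward(self):
        self.value = self.inputs[0].value + self.inputs[1].value
    def backward(self):  # d(a+b)/da = d(a+b)/db = 1
        for c in self.inputs:
            c.grad += self.grad

class Mul(Node):
    def forward(self):
        self.value = self.inputs[0].value * self.inputs[1].value
    def backward(self):  # d(ab)/da = b, d(ab)/db = a
        self.inputs[0].grad += self.grad * self.inputs[1].value
        self.inputs[1].grad += self.grad * self.inputs[0].value

def grad(topo_order):
    for n in topo_order:            # 1. fprop: visit nodes in topo-sort order
        n.forward()
    topo_order[-1].grad = 1.0       # 2. bprop: initialize output gradient = 1
    for n in reversed(topo_order):  # visit nodes in reverse order
        n.backward()

# z = (x + 3) * x at x = 2: z = 10, dz/dx = 2x + 3 = 7
x, c = Node(), Node()
x.value, c.value = 2.0, 3.0
s = Add([x, c]); z = Mul([s, x])
grad([x, c, s, z])                  # now z.value == 10.0 and x.grad == 7.0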
Deep Learning General Strategy and Tricks
General Strategy
1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model “larger”
   2. If you can overfit: regularize
Gradient Checks are Awesome!
• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)
• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10⁻⁴), and estimating derivatives
  3. Compare the two and make sure they are almost the same
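A sketch of steps 2 and 3, assuming loss(theta) returns the scalar loss for a flat parameter vector theta (both names are placeholders for your own code):

import numpy as np

def numerical_gradient(loss, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        f_plus = loss(theta)
        theta[i] -= 2 * eps
        f_minus = loss(theta)
        theta[i] += eps                          # restore the parameter
        grad[i] = (f_plus - f_minus) / (2 * eps) # central difference estimate
    return grad

# Step 3: compare with your analytic gradient, e.g.
# assert np.allclose(numerical_gradient(loss, theta), analytic_grad, atol=1e-6)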
Parameter Initialization
• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): r = sqrt(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
• Make initialization slightly positive for ReLU, to avoid dead units
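In code, the recipe above looks like this (sketch):

import numpy as np

def init_layer(fan_in, fan_out, unit='tanh'):
    r = np.sqrt(6.0 / (fan_in + fan_out))   # r shrinks with fan-in and fan-out
    if unit == 'sigmoid':
        r *= 4                               # 4x bigger for sigmoid units
    W = np.random.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)                    # hidden biases start at 0
    return W, b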
Stochastic Gradient Descent (SGD)
• Gradient descent uses total gradient over all examples per update
• Shouldn’t be used. Very slow.
• SGD updates after each example: θ ← θ − ε_t ∇_θ L(z_t, θ)
• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small … more in following slide
• Important: apply all SGD updates at once after backprop pass
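The update in code (sketch; grad_loss(z, theta) is an assumed function returning the gradient of the loss on example z):

def sgd_epoch(theta, examples, grad_loss, lr=0.01):
    for z_t in examples:
        theta -= lr * grad_loss(z_t, theta)  # theta <- theta - eps_t * grad
    return theta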
Stochastic Gradient Descent (SGD)
• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
• You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster if you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
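A minibatch variant (sketch, reusing grad_loss from above): gradients are summed over the batch and the step scaled down; a real implementation pushes the whole batch through matrix-matrix multiplies instead of a Python-level sum.

def minibatch_sgd_epoch(theta, data, grad_loss, lr=0.01, batch_size=64):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        g = sum(grad_loss(z, theta) for z in batch)  # summed minibatch gradient
        theta -= (lr / len(batch)) * g               # scaled-down learning rate
    return theta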
Learning Rates
• Setting α correctly is tricky
• Simplest recipe: α fixed and same for all parameters
• Or start with learning rate just small enough to be stable in first pass through data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but may converge too soon – try resetting accumulated gradients]
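AdaGrad in a few lines (sketch, reusing grad_loss from above): each parameter's effective learning rate shrinks as its squared gradients accumulate.

import numpy as np

def adagrad(theta, examples, grad_loss, lr=0.1, eps=1e-8):
    acc = np.zeros_like(theta)                  # accumulated squared gradients
    for z_t in examples:
        g = grad_loss(z_t, theta)
        acc += g * g
        theta -= lr * g / (np.sqrt(acc) + eps)  # per-parameter adaptive step
    return theta
# the slide's fix for too-early convergence: periodically reset acc to zero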
Attempt to overfit training data
Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)
• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it’s time to regularize the network
Prevent Overfitting: Model Size and Regularization
• Simple first step: reduce model size by lowering number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early stopping: use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., add a penalty such as the L1 norm of the hidden activations to the cost
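For example, L2 regularization just adds a λθ term to every gradient step (sketch, continuing the notation above; lam is the regularization strength):

lam = 1e-4
g = grad_loss(z_t, theta) + lam * theta           # L2 weight decay
# L1 instead: g = grad_loss(z_t, theta) + lam * np.sign(theta)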
Prevent Feature Co-adaptation
Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html
• Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. (2013) http://arxiv.org/abs/1307.1493
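A sketch of dropout for one layer (all names illustrative): mask half the inputs during training, halve the weights at test time.

import numpy as np

def layer(W, x, train=True, p_drop=0.5):
    if train:
        mask = np.random.rand(*x.shape) > p_drop  # randomly zero 50% of inputs
        return W @ (x * mask)
    return (W * (1 - p_drop)) @ x                 # test time: halve the weights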
Deep Learning Tricks of the Trade
• Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures” http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters: learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, …
• Y. Bengio, I. Goodfellow, and A. Courville (in press), “Deep Learning”. MIT Press, ms. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper
Part 1.7: Sharing Statistical Strength
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning
Multi-Task Learning
• Generalizing better to new tasks is crucial to approach AI
• Deep architectures learn good intermediate representations that can be shared across tasks
• Good representations make sense for many tasks
[Figure: task outputs y1, y2, y3 each computed from a shared intermediate representation h over raw input x]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
[Figure: purely supervised case]
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
[Figure: semi-supervised case]
Advantages of Deep Learning, Part 2
Part 1.1: The Basics
#4 Unsupervised feature learning
Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
But almost all data is unlabeled
Most information must be acquired unsupervised
Fortunately, a good model of observed data can really help you learn classification decisions
Commentary: this is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets
#5 Handling the recursivity of language
Human sentences are composed from words and phrases
We need compositionality in our ML models
Recursion: the same operator (same parameters) is applied repeatedly on different components
Recurrent models: recursion along a temporal sequence
[Figure: a parse tree for “A small crowd quietly enters the historic church” (S → NP VP) mapped to semantic representations, and a recurrent model unrolled over inputs x_{t−1}, x_t, x_{t+1} with states z_{t−1}, z_t, z_{t+1}]
#6 Why now?
Despite prior investigation and understanding of many of the algorithmic techniques …
Before 2006, training deep architectures was unsuccessful ☹
What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training were developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power
Deep Learning models have already achieved impressive results for HLT
Neural Language Model [Mikolov et al. Interspeech 2011]
MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]
“The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product.”
Model (WSJ ASR task)          Eval WER
KN5 baseline                  17.2
Discriminative LM             16.9
Recurrent NN combination      14.4

Acoustic model & training           Recognition      WER: RT03S FSH   WER: Hub5 SWB
GMM 40-mix, BMMI, SWB 309h          1-pass, −adapt   27.4             23.6
DBN-DNN 7 layers x 2048, SWB 309h   1-pass, −adapt   18.5 (−33%)      16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h         k-pass, +adapt   18.6             17.1
Deep Learning Models Have Interesting Performance Characteristics
Deep learning models can now be very fast in some circumstances
• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1
Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster
These trends are even stronger with multi-core CPUs and GPUs
TREE STRUCTURES WITH CONTINUOUS VECTORS
Compositionality: We need more than word vectors and bags! What of larger semantic units?
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Representing Phrases as Vectors
Vectors for single words are useful as features but limited
Can we extend ideas of word vector spaces to phrases?
[Figure: 2-d vector space (axes x1, x2) with word vectors Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and nearby phrase vectors (1, 5) and (1.1, 4) for “the country of my birth” and “the place where I was born”]
If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation
How should we map phrases into a vector space?
Use the principle of compositionality!
The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
The model jointly learns compositional vector representations and tree structure.
[Figure: a tree over “the country of my birth” composing word vectors bottom-up into phrase vectors, plotted in the same 2-d space as Monday, Tuesday, France, Germany, and “the place where I was born”]
Tree Recursive Neural Networks (TreeRNNs)
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)
Computational unit: a simple neural network layer, applied recursively
[Figure: the phrase “on the mat” built bottom-up; at each node a neural network combines two child vectors into a parent vector and a score]
Version 1: Simple concatenation TreeRNN
p = tanh(W [c1; c2] + b), where tanh is applied elementwise
score = V^T p
Only a single weight matrix = the composition function!
No real interaction between the input words!
Not adequate as a composition function for human language
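One composition step in code (sketch): children c1, c2 are d-dimensional vectors, W is d x 2d, b is d-dimensional, and v is the scoring vector.

import numpy as np

def compose(c1, c2, W, b, v):
    c = np.concatenate([c1, c2])   # stack children: [c1; c2]
    p = np.tanh(W @ c + b)         # parent vector p = tanh(W [c1; c2] + b)
    score = v @ p                  # score = v^T p: plausibility of this merge
    return p, score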
Version 4: Recursive Neural Tensor Network
Allows the two word or phrase vectors to interact multiplicatively
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• The sentiment is that sentiment is “easy”
• Detection accuracy for longer documents ~90%, BUT
… loved … great … impressed … marvelous …
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Better Dataset Helped All Models
• Hard negation cases are still mostly incorrect
• We also need a more powerful model!
[Figure: accuracy (75–84%) of BiNB, RNN, and MV-RNN when training with sentence labels vs. training with the treebank]
Version 4: Recursive Neural Tensor Network
Idea: Allow both additive and mediated multiplicative interactions of vectors
Recursive Neural Tensor Network
• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
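A sketch of the RNTN composition (following Socher et al. 2013): each slice V[k] of a tensor lets the children interact multiplicatively, on top of the usual additive TreeRNN term.

import numpy as np

def rntn_compose(c1, c2, V, W, b):
    # c1, c2: (d,); V: (d, 2d, 2d); W: (d, 2d); b: (d,)
    c = np.concatenate([c1, c2])
    bilinear = np.array([c @ V[k] @ c for k in range(V.shape[0])])
    return np.tanh(bilinear + W @ c + b)  # multiplicative + additive interactions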
Positive/Negative Results on Treebank
Classifying sentences: accuracy improves to 85.4
[Figure: accuracy (74–86%) of BiNB, RNN, MV-RNN, and RNTN when training with sentence labels vs. training with the treebank]
Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)
Experimental Results on Treebank
• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)
Negation Results
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
Conclusion
Developing intelligent machines involves being able to recognize and exploit compositional structure
It also involves other things like top-down prediction, of course
Work is now underway on how to do more complex tasks than straight classification inside deep learning systems