
Deep Learning for NLP Part 3

CS224N, Christopher Manning

(Many slides borrowed from ACL 2012 / NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio)

Backpropagation Training
Part 1.5: The Basics


Backprop

• Compute the gradient of the example-wise loss with respect to the parameters

• Simply applying the derivative chain rule wisely

• If computing the loss (example, parameters) is an O(n) computation, then so is computing the gradient


Simple Chain Rule
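In symbols, for z = f(y) with y = g(x):

\frac{dz}{dx} = \frac{dz}{dy} \, \frac{dy}{dx}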


Multiple Paths Chain Rule


Multiple Paths Chain Rule - General
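In symbols, when z depends on x through intermediate variables y_1, ..., y_n:

\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \, \frac{\partial y_i}{\partial x}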


Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency

{y1, y2, ..., yn} = successors of x

∂z/∂x = Σ_i (∂z/∂y_i) (∂y_i/∂x)


Back-Prop in Multi-Layer Net


h = sigmoid(Vx)
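A minimal NumPy sketch of one such network (my own example, assuming a linear scoring layer s = u·h on top of h = sigmoid(Vx) and a squared-error loss; it is not necessarily the exact network on the slide):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(V, u, x, target):
    # Forward pass: hidden layer h = sigmoid(Vx), score s = u . h
    a = V @ x                      # pre-activations of the hidden layer
    h = sigmoid(a)
    s = u @ h
    loss = 0.5 * (s - target) ** 2

    # Backward pass: apply the chain rule from the loss down to the parameters
    ds = s - target                # dloss/ds
    du = ds * h                    # dloss/du
    dh = ds * u                    # dloss/dh
    da = dh * h * (1.0 - h)        # dloss/da, using sigmoid'(a) = h(1-h)
    dV = np.outer(da, x)           # dloss/dV
    return loss, dV, du

# Tiny usage example with random data
rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((3, 4))
u = 0.1 * rng.standard_normal(3)
x = rng.standard_normal(4)
loss, dV, du = forward_backward(V, u, x, target=1.0)

Note that the backward pass costs about as much as the forward pass, matching the O(n) claim above.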

Back-Prop in General Flow Graph

{y1, y2, ..., yn} = successors of x

1. Fprop: visit nodes in topo-sort order
   - Compute the value of each node given its predecessors

2. Bprop:
   - initialize the output gradient = 1
   - visit nodes in reverse order:
     compute the gradient wrt each node using the gradient wrt its successors

(assumes a single scalar output)


Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output (a toy sketch follows below).

• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)
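A toy pure-Python sketch of this idea (my own example, not the API of Theano, TensorFlow, or Autograd): each node type implements fprop to compute its output and bprop to turn the gradient wrt its output into gradients wrt its inputs.

class Multiply:
    def fprop(self, a, b):
        self.a, self.b = a, b
        return a * b
    def bprop(self, grad_out):
        # d(a*b)/da = b and d(a*b)/db = a
        return grad_out * self.b, grad_out * self.a

class Add:
    def fprop(self, a, b):
        return a + b
    def bprop(self, grad_out):
        return grad_out, grad_out

# fprop in topological order for z = x*y + x, then bprop in reverse order
x, y = 2.0, 3.0
mul, add = Multiply(), Add()
m = mul.fprop(x, y)
z = add.fprop(m, x)

dz = 1.0                          # initialize the output gradient to 1
dm, dx_from_add = add.bprop(dz)
dx_from_mul, dy = mul.bprop(dm)
dx = dx_from_mul + dx_from_add    # sum the gradients over all paths to x
print(dx, dy)                     # 4.0 2.0, i.e. dz/dx = y + 1, dz/dy = x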


Deep Learning General Strategy and Tricks


General Strategy

1. Select a network structure appropriate for the problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change the model structure or make the model "larger"
   2. If you can overfit: regularize

Gradient Checks are Awesome!

• Allows you to know that there are no bugs in your neural network implementation! (But it makes it run really slowly.)

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~10^-4), and estimating derivatives (sketched below)
  3. Compare the two and make sure they are almost the same
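A minimal NumPy sketch of steps 2 and 3 (my own example; f is the loss as a function of the parameter vector, and grad_analytic is the gradient your backprop code produced):

import numpy as np

def gradient_check(f, grad_analytic, theta, eps=1e-4):
    # Central finite differences: perturb each parameter by +/- eps
    grad_numeric = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps
        loss_plus = f(theta)
        theta.flat[i] = old - eps
        loss_minus = f(theta)
        theta.flat[i] = old                     # restore the parameter
        grad_numeric.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    # "Almost the same": look at the worst relative error over all parameters
    diff = np.abs(grad_numeric - grad_analytic)
    denom = np.maximum(np.abs(grad_numeric) + np.abs(grad_analytic), 1e-8)
    return np.max(diff / denom)

# Usage on a toy loss f(theta) = 0.5*||theta||^2, whose gradient is theta
theta = np.random.randn(5)
print(gradient_check(lambda t: 0.5 * np.sum(t * t), theta.copy(), theta))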

Parameter Initialization

• Parameter initialization can be very important for success!

• Initialize hidden layer biases to 0, and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g., the mean target, or the inverse sigmoid of the mean target).

• Initialize weights ~ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size), for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010] (see the sketch below)

• Make initialization slightly positive for ReLU, to avoid dead units
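A minimal NumPy sketch of this recipe (my own example; the uniform range r = sqrt(6 / (fan-in + fan-out)) is the tanh-unit formula from Glorot & Bengio, AISTATS 2010, with the 4x larger range for sigmoid units):

import numpy as np

def init_layer(fan_in, fan_out, unit="tanh", seed=0):
    rng = np.random.default_rng(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))   # Glorot/Xavier uniform range
    if unit == "sigmoid":
        r *= 4.0                            # 4x bigger for sigmoid units
    W = rng.uniform(-r, r, size=(fan_out, fan_in))
    b = np.zeros(fan_out)                   # hidden-layer biases start at 0
    return W, b

W1, b1 = init_layer(fan_in=300, fan_out=100, unit="tanh")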

Stochastic Gradient Descent (SGD)

• Gradient descent uses the total gradient over all examples per update
  • Shouldn't be used. Very slow.

• SGD updates after each example: θ_new = θ_old − ε_t ∇_θ L(z_t, θ_old)
  • L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.

• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient

• ε_t should be small ... more in a following slide

• Important: apply all SGD updates at once after the backprop pass

Stochastic Gradient Descent (SGD), continued

• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
  • You sum the gradients in the minibatch (and scale down the learning rate); see the loop sketched below

• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one.

• Major advantage: code can run much faster iff you can do a whole minibatch at once via matrix-matrix multiplies

• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
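A minimal sketch of such a minibatch SGD loop (my own example; grad_minibatch stands in for whatever backprop routine returns the summed gradient over a batch):

import numpy as np

def sgd(theta, data, grad_minibatch, lr=0.01, batch_size=64, epochs=5):
    n = len(data)
    for epoch in range(epochs):
        order = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            g = grad_minibatch(theta, batch)      # summed gradient over batch
            theta -= (lr / len(batch)) * g        # scale down by batch size
    return theta

# Toy usage: least-squares fit, with per-example gradient (theta.x - y) x
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
data = list(zip(X, y))
grad = lambda th, batch: sum((th @ x_i - y_i) * x_i for x_i, y_i in batch)
theta = sgd(np.zeros(3), data, grad, lr=0.1)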

Learning Rates

• Setting the learning rate correctly is tricky
• Simplest recipe: keep it fixed and the same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ (one such schedule is sketched below)
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but it may converge too soon; try resetting the accumulated gradients]
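Two of these recipes in a minimal NumPy sketch (my own example: the O(1/t) schedule is assumed to be ε_t = ε_0 τ / (τ + t), one common choice, and the AdaGrad update follows Duchi, Hazan & Singer 2011):

import numpy as np

def lr_schedule(t, eps0=0.1, tau=1000.0):
    # Roughly eps0 for t << tau, then decays like O(1/t)
    return eps0 * tau / (tau + t)

class Adagrad:
    def __init__(self, shape, eps0=0.01):
        self.eps0 = eps0
        self.hist = np.zeros(shape)    # accumulated squared gradients; resetting can help
    def update(self, theta, grad):
        self.hist += grad ** 2
        return theta - self.eps0 * grad / (np.sqrt(self.hist) + 1e-8)

opt = Adagrad(shape=3)
theta = opt.update(np.zeros(3), np.array([0.5, -1.0, 2.0]))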

Attempt to overfit training data

Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)

• If not:
  • Change the architecture
  • Make the model bigger (bigger vectors, more layers)
  • Fix the optimization

• If yes:
  • Now it's time to regularize the network

Prevent Overfitting: Model Size and Regularization

• Simple first step: reduce model size by lowering the number of units and layers and other parameters

• Standard L1 or L2 regularization on weights
• Early stopping: use the parameters that gave the best validation error
• Sparsity constraints on hidden activations, e.g., add a sparsity penalty on the activations to the cost (one concrete form is written out below)
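One concrete form of such a cost (my own illustration, assuming an L2 penalty on the weights W and an L1 sparsity penalty on the hidden activations h):

J(\theta) = L(\theta) + \lambda_2 \sum_{i,j} W_{ij}^{2} + \lambda_1 \sum_{j} |h_j|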

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html

• Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0

• Test time: halve the model weights (now twice as many inputs are active); see the sketch below
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features

• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)

• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. (2013) http://arxiv.org/abs/1307.1493
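A minimal NumPy sketch of this train/test recipe (my own example with illustrative names; modern implementations usually use "inverted dropout", scaling at training time instead, but this follows the description above):

import numpy as np

def dropout_train(h, p_drop=0.5, rng=np.random.default_rng(0)):
    # Training time: randomly zero out a fraction p_drop of the inputs
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test_weights(W, p_drop=0.5):
    # Test time: no masking; instead halve the weights that consume h
    return W * (1.0 - p_drop)

h = np.ones(6)
print(dropout_train(h))          # roughly half the entries are zeroed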

Deep Learning Tricks of the Trade

• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures" http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
    • Learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, ...

• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning". MIT Press. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper

Sharing Statistical Strength
Part 1.7

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI

• Deep architectures learn good intermediate representations that can be shared across tasks

• Good representations make sense for many tasks

[Diagram: raw input x feeds a shared intermediate representation h, which feeds task outputs y1, y2, y3]

Semi-Supervised Learning

• Hypothesis: P(c|x) can be computed more accurately using shared structure with P(x)

[Figures: the same labeled data with a decision boundary learned purely supervised vs. semi-supervised, also using the unlabeled points]

Advantages of Deep Learning, Part 2
Part 1.1: The Basics

#4 Unsupervised feature learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

But almost all data is unlabeled

Most information must be acquired unsupervised

Fortunately, a good model of observed data can really help you learn classification decisions

Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets

#5 Handling the recursivity of language

Human sentences are composed from words and phrases

We need compositionality in our ML models

Recursion: the same operator (same parameters) is applied repeatedly on different components

Recurrent models: recursion along a temporal sequence

[Figure: parse tree for "A small crowd quietly enters the historic church" (S → NP VP; NP → Det. Adj. N. "A small crowd"; VP → "quietly enters" NP "the historic church"), mapped to semantic representations; and a recurrent chain over inputs x_{t-1}, x_t, x_{t+1} with states z_{t-1}, z_t, z_{t+1}]

#6 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques ...

Before 2006, training deep architectures was unsuccessful ☹

What has changed?
• New methods were developed for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and for training deep models
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power

Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al., Interspeech 2011]

Model (WSJ ASR task)        Eval WER
KN5 baseline                17.2
Discriminative LM           16.9
Recurrent NN combination    14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]

"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."

Acoustic model & training            Recognition      WER, RT03S FSH   WER, Hub5 SWB
GMM 40-mix, BMMI, SWB 309h           1-pass −adapt    27.4             23.6
DBN-DNN 7 layer x 2048, SWB 309h     1-pass −adapt    18.5 (−33%)      16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h          k-pass +adapt    18.6             17.1

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances

• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
  • WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster

These trends are even stronger with multi-core CPUs and GPUs

TREE STRUCTURES WITH CONTINUOUS VECTORS


Compositionality

We need more than word vectors and bags! What of larger semantic units?

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements

(Slides: Socher, Manning, Ng)

Representing Phrases as Vectors

Can we extend ideas of word vector spaces to phrases?

Vectors for single words are useful as features but limited

If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation

[Figure: 2D vector space (axes x1, x2) with Monday and Tuesday plotted near each other, France and Germany near each other, and the phrases "the country of my birth" and "the place where I was born" as nearby points]


How should we map phrases into a vector space?

Use the principle of compositionality! The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

The model jointly learns compositional vector representations and tree structure.

[Figure: the phrases "the country of my birth" and "the place where I was born" are composed bottom-up from word vectors and land near each other in the same 2D space as Monday, Tuesday, France, and Germany]


Tree Recursive Neural Networks (TreeRNNs)

Computational unit: a simple neural network layer, applied recursively

[Figure: the phrase "on the mat." parsed into a binary tree; at each node the same neural network layer maps the two child vectors to a parent vector and a score]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)


Version 1: Simple concatenation TreeRNN

p = tanh(W [c1; c2] + b), where tanh applies elementwise
score = V^T p

• Only a single weight matrix = the composition function!
• No real interaction between the input words!
• Not adequate as a composition function for human language
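A minimal NumPy sketch of this composition function (my own example; d is the word-vector dimension, W is d x 2d, and b and V are length-d):

import numpy as np

def compose(c1, c2, W, b, V):
    # Version 1 TreeRNN: parent p = tanh(W [c1; c2] + b), score = V^T p
    children = np.concatenate([c1, c2])   # stack the two child vectors
    p = np.tanh(W @ children + b)
    score = V @ p                         # plausibility of this merge
    return p, score

d = 4
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((d, 2 * d))
b = np.zeros(d)
V = rng.standard_normal(d)
p, s = compose(rng.standard_normal(d), rng.standard_normal(d), W, b, V)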


Version 4: Recursive Neural Tensor Network

Allows the two word or phrase vectors to interact multiplicatively


Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The sentiment is that sentiment is "easy"
• Detection accuracy for longer documents ~90%, BUT

...... loved ...... great ...... impressed ...... marvelous ......


Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/


Better Dataset Helped All Models

• Hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Bar chart: positive/negative classification accuracy (roughly 75-84%) for BiNB, RNN, and MV-RNN, comparing training with sentence labels vs. training with the Treebank; the Treebank helps all models]


Version 4: Recursive Neural Tensor Network

Idea: allow both additive and mediated multiplicative interactions of vectors
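Written out as in Socher et al. (2013), the RNTN composes child vectors c_1 and c_2 (with f = tanh) as:

p = f\left( \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}^{\top} V^{[1:d]} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \right)

Each slice V^{[i]} of the d x 2d x 2d tensor contributes one mediated multiplicative interaction between the two children; the W term is the additive part inherited from the simple TreeRNN.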



Recursive Neural Tensor Network

• Use the resulting vectors in the tree as input to a classifier like logistic regression

• Train all weights jointly with gradient descent


Positive/Negative Results on Treebank

[Bar chart: positive/negative sentence classification accuracy (roughly 74-86%) for BiNB, RNN, MV-RNN, and RNTN, comparing training with sentence labels vs. training with the Treebank]

Classifying sentences: accuracy improves to 85.4

Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), and Tai et al. (2015)


Experimental Results on Treebank

• RNTN can capture constructions like "X but Y"
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)


Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Conclusion

Developing intelligent machines involves being able to recognize and exploit compositional structure

It also involves other things like top-down prediction, of course

Work is now under way on how to do more complex tasks than straight classification inside deep learning systems