Maximum Entropy Models / Softmax Classifiers
Christopher Manning
Today's plan
1. Generative vs. discriminative models [15 mins]
2. Optimizing softmax/maxent model parameters [20 mins]
3. Named Entity Recognition [10 mins]
4. Maximum entropy sequence models [10 mins]
Generative vs. Discriminative Models
Introduction
• So far we've mainly looked at "generative models"
  • Language models, IBM alignment models, PCFGs
• But there is much use of conditional or discriminative models in NLP, Speech, IR, and ML generally
• Because:
  • They give high-accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow easy building of language-independent, retargetable NLP modules
Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both the observed data and the hidden stuff: P(c, d)
  • They generate the observed data from the hidden stuff
  • All the classic 1990s StatNLP models: n-gram language models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
Joint vs. Conditional Models
• Discriminative (conditional) models take the data as given, and put a probability/score over hidden structure given the data: P(c | d)
  • Logistic regression, maximum entropy models, conditional random fields
  • Also, SVMs, (averaged) perceptron, feed-forward neural networks, etc. are discriminative classifiers
    • but not directly probabilistic
Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d, c) = P(c)P(d|c) and tries to maximize this joint likelihood.
  • It ends up trivial to choose weights: just count!
  • Relative frequencies give maximum joint likelihood on categorical data
• A conditional model gives probabilities P(c | d). It models only the conditional probability of the class.
  • We seek to maximize conditional likelihood.
  • Harder to do (as we'll see…)
  • More closely related to classification error.
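As a tiny illustration of "just count": for a joint model such as Naive Bayes over categorical data, the maximum-joint-likelihood parameters are simply relative frequencies. A minimal sketch follows; the toy corpus is a made-up example, not data from the lecture:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (document words, class) pairs, for illustration only.
data = [(["rice", "chopsticks"], "food"),
        (["parse", "grammar"], "nlp"),
        (["rice", "grammar"], "nlp")]

# Maximum joint likelihood for a Naive Bayes model is obtained by counting.
class_counts = Counter(c for _, c in data)
word_counts = defaultdict(Counter)
for words, c in data:
    word_counts[c].update(words)

# Relative frequencies are the maximum-likelihood estimates on categorical data.
p_class = {c: n / len(data) for c, n in class_counts.items()}
p_word_given_class = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
                      for c, cnt in word_counts.items()}
print(p_class)                    # {'food': 0.33..., 'nlp': 0.66...}
print(p_word_given_class["nlp"])  # relative frequencies of words within class 'nlp'
```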
Conditional models work well: Word Sense Disambiguation
• Even with exactly the same features, changing from joint to conditional estimation increases performance
• That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters)
Training Set
Objective Accuracy
Joint Like. 86.8
Cond. Like. 98.5
Test Set
Objective Accuracy
Joint Like. 73.6
Cond. Like. 76.1
(Klein and Manning 2002, using Senseval-1 Data)
PCFGs Maximize Joint, not Conditional Likelihood
1. What parse for "eat rice with chopsticks"?
2. How can you get the other parse?
Optimizing softmax/maxent model parameters
Their likelihood and derivatives
Background: Feature Expectations
• We will crucially make use of two expectations: the actual and predicted counts of a feature firing:
  • Empirical expectation (count) of a feature:

$$E_{\text{empirical}}(f_i) = \sum_{(c,d)\in\text{observed}(C,D)} f_i(c,d)$$

  • Model expectation of a feature:

$$E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)\, f_i(c,d)$$
Maxent / Softmax Model Likelihood
• Maximum (Conditional) Likelihood Models
  • Given a model form, we choose values of the parameters λi to maximize the (conditional) likelihood of the data.
• For any given feature weights, we can calculate:
  • The conditional likelihood of the training data
  • The derivative of the likelihood w.r.t. each feature weight

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d)\in(C,D)} P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda)$$
The Likelihood Value
• The (log) conditional likelihood of iid* data (C, D) according to a maxent model is a function of the data and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d)\in(C,D)} P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$
*A fancy statistics term meaning “independent and identically distributed”. You normally need to assume this for anything formal to be derivable, even though it’s never quite true in practice.
The Likelihood Value
• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log \exp \sum_i \lambda_i f_i(c,d) \;-\; \sum_{(c,d)\in(C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$

• We can maximize it by finding where the derivative is 0
• The derivative is the difference between the derivatives of each component
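To make the N(λ) − M(λ) form concrete, here is a minimal Python sketch with made-up indicator features; the feature set, class set, and data below are illustrative assumptions, not the course's code:

```python
import numpy as np

CLASSES = ["PER", "O"]

def features(c, d):
    """Toy indicator features f_i(c, d); here each datum d is a single word."""
    return np.array([
        1.0 if (c == "PER" and d[0].isupper()) else 0.0,  # capitalized word labeled PER
        1.0 if (c == "O" and d.islower()) else 0.0,       # lowercase word labeled O
    ])

def cond_log_likelihood(lam, data):
    """Conditional log-likelihood, accumulated as N(lambda) - M(lambda)."""
    ll = 0.0
    for d, c in data:
        scores = {cp: lam @ features(cp, d) for cp in CLASSES}
        n_term = scores[c]                                          # N(lambda) contribution
        m_term = np.log(sum(np.exp(s) for s in scores.values()))    # M(lambda) contribution
        ll += n_term - m_term
    return ll

data = [("Grace", "PER"), ("fell", "O"), ("Road", "PER")]
print(cond_log_likelihood(np.array([1.0, 1.0]), data))
```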
The Derivative I: Numerator

$$\frac{\partial N(\lambda)}{\partial \lambda_i}
= \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \exp \sum_i \lambda_i f_i(c,d)
= \sum_{(c,d)\in(C,D)} \frac{\partial \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i}
= \sum_{(c,d)\in(C,D)} f_i(c,d)$$

The derivative of the numerator is the empirical count(f_i, C).
The Derivative II: Denominator

$$\frac{\partial M(\lambda)}{\partial \lambda_i}
= \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$= \sum_{(c,d)\in(C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)}
\sum_{c'} \frac{\partial \exp \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$$

$$= \sum_{(c,d)\in(C,D)} \sum_{c'} \frac{\exp \sum_i \lambda_i f_i(c',d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)}
\frac{\partial \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$$

$$= \sum_{(c,d)\in(C,D)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c',d)
= \text{predicted count}(f_i, \lambda)$$
The Derivative III

$$\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$$

• The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
  • Always unique (but parameters may not be unique)
  • Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

$$E_p(f_j) = E_{\tilde{p}}(f_j) \quad \forall j$$
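Continuing the toy sketch above (and reusing its `features`, `CLASSES`, and `data`), the gradient is just empirical minus predicted feature counts:

```python
import numpy as np

def gradient(lam, data):
    """Gradient of the conditional log-likelihood w.r.t. each weight lambda_i."""
    grad = np.zeros_like(lam)
    for d, c in data:
        # Empirical (actual) counts: the features that actually fired on the gold label.
        grad += features(c, d)
        # Predicted counts: expectation of the features under P(c' | d, lambda).
        scores = np.array([lam @ features(cp, d) for cp in CLASSES])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        for p, cp in zip(probs, CLASSES):
            grad -= p * features(cp, d)
    return grad  # zero exactly when predicted expectations match empirical ones
```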
Finding the optimal parameters
• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data:

$$\text{CLogLik}(C, D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$$

• To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient)
A likelihood surface
[Figure: a plot of the likelihood as a function of the parameters.]
Finding the optimal parameters
• Use your favorite numerical optimization package….
  • Commonly (and in our code), you minimize the negative of CLogLik
1. Gradient descent (GD); stochastic gradient descent (SGD)
   • Improved variants like Adagrad, Adadelta, RMSprop, NAG
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited-memory variable metric (LMVM) methods, in particular L-BFGS
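As a minimal sketch of option 4 (not the course's code), the toy model from the earlier sketches can be fit with an off-the-shelf L-BFGS optimizer by minimizing the negative conditional log-likelihood; `cond_log_likelihood`, `gradient`, and `data` are reused from above:

```python
import numpy as np
from scipy.optimize import minimize

result = minimize(
    fun=lambda lam: -cond_log_likelihood(lam, data),  # minimize the negative CLogLik
    x0=np.zeros(2),                                   # start from all-zero weights
    jac=lambda lam: -gradient(lam, data),             # supply the gradient too
    method="L-BFGS-B",
)
# On this perfectly separable toy data the fitted weights come out large --
# a preview of the infinite-weight / overfitting issue the smoothing section addresses.
print(result.x)
```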
Named Entity Recognition
Named Entity Recognition (NER)
• A very important NLP sub-task: find and classify names in text, for example:
  • The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
• Entity classes highlighted in the example: Person, Date, Location, Organization
Named Entity Recognition (NER)
• The uses:
  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products
  • A lot of relations (employs, won, born-in) are between named entities
  • For question answering, answers are often named entities.
• Concretely:
  • Many web pages tag various entities, with links to bio or topic pages, etc.
  • Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
  • Apple/Google/Microsoft/… smart recognizers for document content
Named Entity Recognition Evaluation
Task: Predict entities in a text

  Foreign    ORG
  Ministry   ORG
  spokesman  O
  Shen       PER
  Guofang    PER
  told       O
  Reuters    ORG
  :          :

} Standard evaluation is per entity, not per token
The Named Entity Recognition Task
Token-level labeling, shown here in BIO/IOB notation (B- marks the first token of an entity, I- a continuation, O a non-entity token), which lets adjacent entities of the same type be kept apart:

  We          O
  should      O
  show        O
  Neha        B-PER
  Eric        B-PER
  King        I-PER
  's          O
  assignment  O
  :           :
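A small illustrative sketch (not part of the lecture) of reading entities off a BIO/IOB-labeled sequence; note that the two B-PER tags let us recover two separate person entities ("Neha" and "Eric King"):

```python
def bio_to_entities(labels):
    """Turn a BIO/IOB label sequence into (type, start, end) entity spans."""
    entities, start = [], None
    for i, lab in enumerate(labels + ["O"]):            # sentinel flushes the last entity
        if start is not None and (lab == "O" or lab.startswith("B-")
                                  or lab[2:] != labels[start][2:]):
            entities.append((labels[start][2:], start, i))
            start = None
        if lab.startswith("B-"):
            start = i
    return entities

labels = ["O", "O", "O", "B-PER", "B-PER", "I-PER", "O", "O"]
print(bio_to_entities(labels))   # [('PER', 3, 4), ('PER', 4, 6)]
```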
Precision/Recall/F1 for NER
• Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
• The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
  • First Bank of Chicago announced earnings…
• This counts as both a false positive and a false negative
• Selecting nothing would have been better
• Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
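A minimal sketch of entity-level scoring, reusing `bio_to_entities` from the previous sketch; the hypothetical gold/guess sequences reproduce the boundary-error situation described above (gold "First Bank of Chicago", system guesses only "Bank of Chicago"):

```python
gold  = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
guess = ["O",     "B-ORG", "I-ORG", "I-ORG", "O", "O"]

gold_set  = set(bio_to_entities(gold))
guess_set = set(bio_to_entities(guess))

tp = len(gold_set & guess_set)                      # only exact-match entities count
precision = tp / len(guess_set) if guess_set else 0.0
recall    = tp / len(gold_set) if gold_set else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(precision, recall, f1)  # 0.0 0.0 0.0 -- the boundary error is both a FP and a FN
```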
Maximum entropy sequence models
Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models
Sequence problems
• Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences…
• We can think of our task as one of labeling each item

POS tagging:
  VBG     NN          IN DT NN  IN NN
  Chasing opportunity in an age of upheaval

Word segmentation:
  B B I I B I B I B B
  而 相 对 于 这 些 品 牌 的 价

Named entity recognition:
  PERS    O         O      O  ORG  ORG
  Murdoch discusses future of News Corp.

Text segmentation:
  Q A Q A A A Q A
MEMM inference in systems
• For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
• A larger space of sequences is usually explored via search

Decision point (local context):
  Position:  -3     -2     -1    0     +1
  Class:     ORG    ORG    O     ???   ???
  Word:      Xerox  Corp.  fell  22.6  %

Local context features:
  W0         22.6
  W+1        %
  W-1        fell
  C-1        O
  C-1-C-2    ORG-O
  hasDigit?  true
  …          …

(Borthwick 1999, Klein et al. 2003, etc.)
Example: NER
• Scoring individual labeling decisions is no more complex than standard classification decisions
  • We have some assumed labels to use for prior positions
  • We use features of those and the observed data (which can include current, previous, and next words) to predict the current label
• The decision point and local-context features are the same as on the previous slide (predicting the class of "22.6" in "Xerox Corp. fell 22.6 %"); a small feature-extraction sketch follows below.
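A small illustrative sketch of extracting the local-context features above for the decision point at "22.6". The function and its string encodings are assumptions for illustration, not the cited systems' code:

```python
def local_features(words, classes, i):
    """Local-context features for the MEMM decision at position i."""
    w_prev = words[i - 1] if i > 0 else "<s>"
    w_next = words[i + 1] if i + 1 < len(words) else "</s>"
    c_prev = classes[i - 1] if i > 0 else "<s>"
    c_prev2 = classes[i - 2] if i > 1 else "<s>"
    return {
        "W0=" + words[i]: 1.0,
        "W+1=" + w_next: 1.0,
        "W-1=" + w_prev: 1.0,
        "C-1=" + c_prev: 1.0,
        "C-1-C-2=" + c_prev2 + "-" + c_prev: 1.0,
        "hasDigit?": 1.0 if any(ch.isdigit() for ch in words[i]) else 0.0,
    }

words = ["Xerox", "Corp.", "fell", "22.6", "%"]
classes = ["ORG", "ORG", "O"]        # decisions already made for the previous positions
print(local_features(words, classes, 3))   # includes 'C-1-C-2=ORG-O' and hasDigit? = 1.0
```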
Inference in Systems
[Diagram: at the local level, Local Data → Feature Extraction → Features → classifier (a maximum entropy model, trained with smoothing/regularization and numerical optimization) → Label; at the sequence level, the local classifier is applied across the Sequence Data and sequence model inference (search) combines the local decisions into the output label sequence.]
Greedy Inference
• Greedy inference:
  • We just start at the left, and use our classifier at each position to assign a label
  • The classifier can depend on previous labeling decisions as well as observed data
• Advantages:
  • Fast, no extra memory requirements
  • Very easy to implement
  • With rich features including observations to the right, it can perform quite well
• Disadvantage:
  • Greedy. We make commitments we cannot recover from

[Diagram: Sequence Model → Inference → Best Sequence]
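A minimal sketch of greedy decoding; `score` stands in for any local classifier returning P(label | observations, previous labels), and the toy scorer below is a made-up stand-in just so the example runs:

```python
def greedy_decode(words, score):
    chosen = []
    for i in range(len(words)):
        probs = score(words, chosen, i)            # local P(label | context, previous labels)
        chosen.append(max(probs, key=probs.get))   # commit to the best local label
    return chosen

def toy_score(words, prev, i):
    """Crude stand-in classifier: capitalized words look like PER."""
    return ({"PER": 0.9, "O": 0.1} if words[i][0].isupper()
            else {"PER": 0.1, "O": 0.9})

print(greedy_decode("Murdoch discusses future of News Corp.".split(), toy_score))
# ['PER', 'O', 'O', 'O', 'PER', 'PER'] with this crude stand-in scorer
```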
Beam Inference
• Beam inference:
  • At each position keep the top k complete sequences.
  • Extend each sequence in each local way.
  • The extensions compete for the k slots at the next position.
• Advantages:
  • Fast; beam sizes of 3–5 are almost as good as exact inference in many cases.
  • Easy to implement (no dynamic programming required).
• Disadvantage:
  • Inexact: the globally best sequence can fall off the beam.
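A minimal sketch of beam inference with beam size k, reusing the `toy_score` stand-in from the greedy sketch:

```python
import math

def beam_decode(words, score, k=3):
    beam = [(0.0, [])]                               # (log-prob, partial label sequence)
    for i in range(len(words)):
        candidates = []
        for lp, path in beam:
            for lab, p in score(words, path, i).items():
                candidates.append((lp + math.log(p), path + [lab]))
        beam = sorted(candidates, reverse=True)[:k]  # extensions compete for the k slots
    return beam[0][1]                                # labels of the best surviving sequence

print(beam_decode("Murdoch discusses future of News Corp.".split(), toy_score, k=3))
```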
Viterbi Inference
• Viterbi inference:
  • Dynamic programming or memoization.
  • Requires a small window of state influence (e.g., past two states are relevant).
• Advantage:
  • Exact: the global best sequence is returned.
• Disadvantage:
  • Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
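A minimal sketch of first-order Viterbi decoding over an assumed local classifier that conditions only on the previous label:

```python
import math

def viterbi_decode(words, score, start="<s>"):
    """score(words, prev_label, i) -> {label: P(label | prev_label, observations)}."""
    best = {start: (0.0, [])}                        # previous label -> (log-prob, best path)
    for i in range(len(words)):
        new_best = {}
        for prev, (lp, path) in best.items():
            for lab, p in score(words, prev, i).items():
                cand = (lp + math.log(p), path + [lab])
                if lab not in new_best or cand[0] > new_best[lab][0]:
                    new_best[lab] = cand             # keep only the best path ending in lab
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

def toy_local_score(words, prev, i):                 # stand-in classifier, just so this runs
    return ({"PER": 0.8, "O": 0.2} if words[i][0].isupper()
            else {"PER": 0.1, "O": 0.9})

print(viterbi_decode("Murdoch discusses future of News Corp.".split(), toy_local_score))
```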
CRFs [Lafferty, McCallum, and Pereira 2001]
• Another sequence model: Conditional Random Fields (CRFs)
• A whole-sequence conditional model rather than a chaining of local models:

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

• The space of c's is now the space of sequences
  • But if the features fi remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
• Training is slower, but CRFs avoid causal-competition biases
• These (or a variant using a max-margin criterion) are seen as the state of the art these days … but in practice they usually work much the same as MEMMs.
CoNLL 2003 NER shared task: Results on English Dev set
[Bar chart comparing MEMM, 1st CRF, and MMMN systems on Overall, Loc, Misc, Org, and Person scores, roughly in the 82–96 range.]
Smoothing / Priors / Regularization for Maxent Models
Smoothing: Issues of Scale
• Lots of features:
  • NLP maxent models can have ten million features.
  • Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
  • Overfitting is very easy – we need smoothing!
  • Many features seen in training will never occur again at test time.
• Optimization problems:
  • Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Smoothing: Issues
• Assume the following empirical distribution:

  Heads  Tails
  h      t

• Features: {Heads}, {Tails}
• We'll have the following softmax model distribution:

$$p_{\text{HEADS}} = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \qquad p_{\text{TAILS}} = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}$$

• Really, only one degree of freedom (λ = λH − λT):

$$p_{\text{HEADS}} = \frac{e^{\lambda_H}e^{-\lambda_T}}{e^{\lambda_H}e^{-\lambda_T} + e^{\lambda_T}e^{-\lambda_T}} = \frac{e^{\lambda}}{e^{\lambda} + e^{0}} = \frac{e^{\lambda}}{e^{\lambda} + 1} \qquad p_{\text{TAILS}} = \frac{e^{0}}{e^{\lambda} + e^{0}} = \frac{1}{1 + e^{\lambda}}$$

• Logistic regression!
Smoothing: Issues
• The data likelihood in this model is:

$$\log P(h, t \mid \lambda) = h \log p_{\text{HEADS}} + t \log p_{\text{TAILS}}$$
$$\log P(h, t \mid \lambda) = h\lambda - (h + t)\log(1 + e^{\lambda})$$

Three empirical distributions and the resulting log-likelihood as a function of λ:

  Heads 2, Tails 2  |  Heads 3, Tails 1  |  Heads 4, Tails 0

[Plots of log P versus λ for each case.]
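A quick numeric check of the formula above (illustrative only): for 2/2 and 3/1 the log-likelihood peaks at a finite λ, while for 4/0 it keeps climbing toward 0 as λ grows, so the maximum-likelihood weight is infinite:

```python
import math

def coin_log_lik(h, t, lam):
    """log P(h, t | lambda) = h*lambda - (h+t)*log(1 + e^lambda)."""
    return h * lam - (h + t) * math.log(1 + math.exp(lam))

for h, t in [(2, 2), (3, 1), (4, 0)]:
    values = [round(coin_log_lik(h, t, lam), 2) for lam in (0, 1, 2, 5, 10)]
    print((h, t), values)
# (2,2) and (3,1) peak at lambda = 0 and lambda = log 3 respectively;
# (4,0) increases monotonically toward 0, so its optimum lies at lambda = infinity.
```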
Smoothing: Early Stopping
• In the 4/0 case, there were two problems:
  • The optimal value of λ was ∞, which is a long trip for an optimization procedure
  • The learned distribution is just as spiked as the empirical one – no smoothing
• One way to solve both issues is to just stop the optimization early, after a few iterations:
  • The value of λ will be finite (but presumably big)
  • The optimization won't take forever (clearly)
  • Commonly used in early maxent work
  • Has seen a revival in deep learning ☺

  Input (empirical counts):      Heads 4, Tails 0
  Output (learned distribution): Heads 1, Tails 0
Smoothing: Priors (MAP)
• What if we had a prior expectation that parameter values wouldn't be very large?
• We could then balance evidence suggesting large parameters (or infinite) against our prior.
• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
• We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

$$\underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} = \underbrace{\log P(\lambda)}_{\text{Prior}} + \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}}$$
Smoothing: Priors
• Gaussian, or quadratic, or L2 priors:
  • Intuition: parameters shouldn't be large.
  • Formalization: prior expectation that each parameter will be distributed according to a Gaussian with mean μ and variance σ²:

$$P(\lambda_i) = \frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$$

  • Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
  • 2σ² = 1 works surprisingly well.
• ("They don't even capitalize my name anymore!")

[Plot of the Gaussian prior for 2σ² = 1, 2σ² = 10, and 2σ² = ∞.]
Smoothing: Priors
• If we use Gaussian priors / L2 regularization:
  • Trade off some expectation-matching for smaller parameters.
  • When multiple features can be recruited to explain a data point, the more common ones generally receive more weight.
  • Accuracy generally goes up!
• Change the objective:

$$\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)
= \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k$$

• Change the derivative:

$$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i - \mu_i}{\sigma^2}$$
Smoothing: Priors
• With Gaussian priors / L2 regularization, taking the prior mean as 0:
• Change the objective:

$$\log P(C, \lambda \mid D) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{\lambda_i^2}{2\sigma_i^2} + k$$

• Change the derivative:

$$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i}{\sigma^2}$$
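A minimal sketch of the regularized objective and gradient, continuing the earlier toy softmax sketches (reusing `cond_log_likelihood`, `gradient`, and `data`) and taking μ = 0 with 2σ² = 1:

```python
import numpy as np
from scipy.optimize import minimize

SIGMA2 = 0.5   # so that 2*sigma^2 = 1, the value the slide says works surprisingly well

def regularized_objective(lam, data):
    """Conditional log-likelihood plus the Gaussian (L2) log-prior, up to a constant."""
    return cond_log_likelihood(lam, data) - np.sum(lam ** 2) / (2 * SIGMA2)

def regularized_gradient(lam, data):
    return gradient(lam, data) - lam / SIGMA2

# Fit as before; the penalty keeps the weights finite and small even on separable data.
fit = minimize(lambda l: -regularized_objective(l, data), np.zeros(2),
               jac=lambda l: -regularized_gradient(l, data), method="L-BFGS-B")
print(fit.x)
```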
Example: NER Smoothing

Local context:
            Prev   Cur    Next
  State:    Other  ???    ???
  Word:     at     Grace  Road
  Tag:      IN     NNP    NNP
  Sig:      x      Xx     Xx

Feature weights:
  Feature Type           Feature    PERS    LOC
  Previous word          at         -0.73   0.94
  Current word           Grace       0.03   0.00
  Beginning bigram       <G          0.45  -0.04
  Current POS tag        NNP         0.47   0.45
  Prev and cur tags      IN NNP     -0.10   0.14
  Previous state         Other      -0.70  -0.92
  Current signature      Xx          0.80   0.46
  Prev state, cur sig    O-Xx        0.68   0.37
  Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
  P. state - p-cur sig   O-x-Xx     -0.20   0.82
  …
  Total:                            -0.58   2.68

Because of smoothing, the more common prefix and single-tag features have larger weights, even though entire-word and tag-pair features are more specific.
Example: POS Tagging
• From (Toutanova et al., 2003):

  Dev Test Performance   Overall Accuracy   Unknown Word Acc
  Without Smoothing      96.54              85.20
  With Smoothing         97.10              88.20

• Smoothing helps:
  • Softens distributions.
  • Pushes weight onto more explanatory features.
  • Allows many features to be dumped safely into the mix.
  • Speeds up convergence (if both are allowed to converge)!
Smoothing / Regularization
• Talking of "priors" and "MAP estimation" is Bayesian language
• In frequentist statistics, people will instead talk about using "regularization", and in particular, a Gaussian prior is "L2 regularization"
• The choice of names makes no difference to the math
• Recently, L1 regularization has also become very popular
  • Gives sparse solutions – most parameters become zero [Yay!]
  • Harder optimization problem (non-continuous derivative)
Smoothing: Virtual Data
• Another option: smooth the data, not the parameters.
• Example:

  Heads 4, Tails 0  →  Heads 5, Tails 1

  • Equivalent to adding two extra data points.
  • Similar to add-one smoothing for generative models.
• For feature-based models, it's hard to know what artificial data to create!
Smoothing: Count Cutoffs
• In NLP, features with low empirical counts are often dropped.
  • A very weak and indirect smoothing method.
  • Equivalent to locking their weight to be zero.
  • Equivalent to assigning them Gaussian priors with mean zero and variance zero.
  • Dropping low counts does remove the features which were most in need of smoothing…
  • … and speeds up the estimation by reducing model size…
  • … but count cutoffs generally hurt accuracy in the presence of proper smoothing.
• Don't use count cutoffs unless necessary for memory usage reasons. Prefer L1 regularization for finding features to drop.