Maximum Entropy Models / Softmax Classifiers
Christopher Manning
Today's plan
1. Generative vs. discriminative models [15 mins]
2. Optimizing softmax/maxent model parameters [20 mins]
3. Named Entity Recognition [10 mins]
4. Maximum entropy sequence models [10 mins]
Generative vs. Discriminative Models
Introduction
• So far we've mainly looked at "generative models"
  • Language models, IBM alignment models, PCFGs
• But there is much use of conditional or discriminative models in NLP, Speech, IR, and ML generally
• Because:
  • They give high-accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow easy building of language-independent, retargetable NLP modules
Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both the observed data and the hidden stuff: P(c, d)
  • They generate the observed data from the hidden stuff
  • All the classic 1990s StatNLP models: n-gram language models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
Joint vs. Conditional Models
• Discriminative (conditional) models take the data as given, and put a probability/score over hidden structure given the data: P(c | d)
  • Logistic regression, maximum entropy models, conditional random fields
  • Also, SVMs, (averaged) perceptron, feed-forward neural networks, etc. are discriminative classifiers
    • but not directly probabilistic
Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d, c) = P(c)P(d|c) and tries to maximize this joint likelihood.
  • It ends up trivial to choose weights: just count!
  • Relative frequencies give maximum joint likelihood on categorical data
• A conditional model gives probabilities P(c | d). It models only the conditional probability of the class.
  • We seek to maximize conditional likelihood.
  • Harder to do (as we'll see…)
  • More closely related to classification error.
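As a tiny illustration of "just count": for a joint model such as Naive Bayes over categorical data, the maximum-joint-likelihood parameters are simply relative frequencies. A minimal sketch follows; the toy corpus is a made-up example, not data from the lecture:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (document words, class) pairs, for illustration only.
data = [(["rice", "chopsticks"], "food"),
        (["parse", "grammar"], "nlp"),
        (["rice", "grammar"], "nlp")]

# Maximum joint likelihood for a Naive Bayes model is obtained by counting.
class_counts = Counter(c for _, c in data)
word_counts = defaultdict(Counter)
for words, c in data:
    word_counts[c].update(words)

# Relative frequencies are the maximum-likelihood estimates on categorical data.
p_class = {c: n / len(data) for c, n in class_counts.items()}
p_word_given_class = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
                      for c, cnt in word_counts.items()}
print(p_class)                    # {'food': 0.33..., 'nlp': 0.66...}
print(p_word_given_class["nlp"])  # relative frequencies of words within class 'nlp'
```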
Conditional models work well: Word Sense Disambiguation
• Even with exactly the same features, changing from joint to conditional estimation increases performance
• That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters)
Training Set
Objective Accuracy
Joint Like. 86.8
Cond. Like. 98.5
Test Set
Objective Accuracy
Joint Like. 73.6
Cond. Like. 76.1
(Klein and Manning 2002, using Senseval-1 Data)
PCFGs Maximize Joint, not Conditional Likelihood
1. What parse for "eat rice with chopsticks"?
2. How can you get the other parse?
Optimizing softmax/maxent model parameters
Their likelihood and derivatives
Background: Feature Expectations
• We will crucially make use of two expectations: the actual and predicted counts of a feature firing:
  • Empirical expectation (count) of a feature:

$$E_{\text{empirical}}(f_i) = \sum_{(c,d)\in\text{observed}(C,D)} f_i(c,d)$$

  • Model expectation of a feature:

$$E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)\, f_i(c,d)$$
Maxent / Softmax Model Likelihood
• Maximum (Conditional) Likelihood Models
  • Given a model form, we choose values of the parameters λi to maximize the (conditional) likelihood of the data.
• For any given feature weights, we can calculate:
  • The conditional likelihood of the training data
  • The derivative of the likelihood w.r.t. each feature weight

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d)\in(C,D)} P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda)$$
The Likelihood Value
• The (log) conditional likelihood of iid* data (C, D) according to a maxent model is a function of the data and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d)\in(C,D)} P(c \mid d, \lambda) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$
*A fancy statistics term meaning “independent and identically distributed”. You normally need to assume this for anything formal to be derivable, even though it’s never quite true in practice.
The Likelihood Value
• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d)\in(C,D)} \log \exp \sum_i \lambda_i f_i(c,d) \;-\; \sum_{(c,d)\in(C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$

• We can maximize it by finding where the derivative is 0
• The derivative is the difference between the derivatives of each component
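To make the N(λ) − M(λ) form concrete, here is a minimal Python sketch with made-up indicator features; the feature set, class set, and data below are illustrative assumptions, not the course's code:

```python
import numpy as np

CLASSES = ["PER", "O"]

def features(c, d):
    """Toy indicator features f_i(c, d); here each datum d is a single word."""
    return np.array([
        1.0 if (c == "PER" and d[0].isupper()) else 0.0,  # capitalized word labeled PER
        1.0 if (c == "O" and d.islower()) else 0.0,       # lowercase word labeled O
    ])

def cond_log_likelihood(lam, data):
    """Conditional log-likelihood, accumulated as N(lambda) - M(lambda)."""
    ll = 0.0
    for d, c in data:
        scores = {cp: lam @ features(cp, d) for cp in CLASSES}
        n_term = scores[c]                                          # N(lambda) contribution
        m_term = np.log(sum(np.exp(s) for s in scores.values()))    # M(lambda) contribution
        ll += n_term - m_term
    return ll

data = [("Grace", "PER"), ("fell", "O"), ("Road", "PER")]
print(cond_log_likelihood(np.array([1.0, 1.0]), data))
```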
The Derivative I: Numerator

$$\frac{\partial N(\lambda)}{\partial \lambda_i}
= \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \exp \sum_i \lambda_i f_i(c,d)
= \sum_{(c,d)\in(C,D)} \frac{\partial \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i}
= \sum_{(c,d)\in(C,D)} f_i(c,d)$$

The derivative of the numerator is the empirical count(f_i, C).
The Derivative II: Denominator

$$\frac{\partial M(\lambda)}{\partial \lambda_i}
= \frac{\partial}{\partial \lambda_i} \sum_{(c,d)\in(C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$= \sum_{(c,d)\in(C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)}
\sum_{c'} \frac{\partial \exp \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$$

$$= \sum_{(c,d)\in(C,D)} \sum_{c'} \frac{\exp \sum_i \lambda_i f_i(c',d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)}
\frac{\partial \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$$

$$= \sum_{(c,d)\in(C,D)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c',d)
= \text{predicted count}(f_i, \lambda)$$
The Derivative III

$$\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$$

• The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
  • Always unique (but parameters may not be unique)
  • Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

$$E_p(f_j) = E_{\tilde{p}}(f_j) \quad \forall j$$
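Continuing the toy sketch above (and reusing its `features`, `CLASSES`, and `data`), the gradient is just empirical minus predicted feature counts:

```python
import numpy as np

def gradient(lam, data):
    """Gradient of the conditional log-likelihood w.r.t. each weight lambda_i."""
    grad = np.zeros_like(lam)
    for d, c in data:
        # Empirical (actual) counts: the features that actually fired on the gold label.
        grad += features(c, d)
        # Predicted counts: expectation of the features under P(c' | d, lambda).
        scores = np.array([lam @ features(cp, d) for cp in CLASSES])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        for p, cp in zip(probs, CLASSES):
            grad -= p * features(cp, d)
    return grad  # zero exactly when predicted expectations match empirical ones
```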
Finding the optimal parameters
• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data:

$$\text{CLogLik}(C, D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$$

• To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient)
A likelihood surface
[Figure: a plot of the likelihood as a function of the parameters.]
Finding the optimal parameters
• Use your favorite numerical optimization package….
  • Commonly (and in our code), you minimize the negative of CLogLik
1. Gradient descent (GD); stochastic gradient descent (SGD)
   • Improved variants like Adagrad, Adadelta, RMSprop, NAG
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited-memory variable metric (LMVM) methods, in particular L-BFGS
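As a minimal sketch of option 4 (not the course's code), the toy model from the earlier sketches can be fit with an off-the-shelf L-BFGS optimizer by minimizing the negative conditional log-likelihood; `cond_log_likelihood`, `gradient`, and `data` are reused from above:

```python
import numpy as np
from scipy.optimize import minimize

result = minimize(
    fun=lambda lam: -cond_log_likelihood(lam, data),  # minimize the negative CLogLik
    x0=np.zeros(2),                                   # start from all-zero weights
    jac=lambda lam: -gradient(lam, data),             # supply the gradient too
    method="L-BFGS-B",
)
# On this perfectly separable toy data the fitted weights come out large --
# a preview of the infinite-weight / overfitting issue the smoothing section addresses.
print(result.x)
```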
Named Entity Recognition
Named Entity Recognition (NER)
• A very important NLP sub-task: find and classify names in text, for example:
  • The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
• Entity classes highlighted in the example: Person, Date, Location, Organization
Named Entity Recognition (NER)
• The uses:
  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products
  • A lot of relations (employs, won, born-in) are between named entities
  • For question answering, answers are often named entities.
• Concretely:
  • Many web pages tag various entities, with links to bio or topic pages, etc.
  • Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
  • Apple/Google/Microsoft/… smart recognizers for document content
Named Entity Recognition Evaluation
Task: Predict entities in a text

  Foreign    ORG
  Ministry   ORG
  spokesman  O
  Shen       PER
  Guofang    PER
  told       O
  Reuters    ORG
  :          :

} Standard evaluation is per entity, not per token
The Named Entity Recognition Task
Token-level labeling, shown here in BIO/IOB notation (B- marks the first token of an entity, I- a continuation, O a non-entity token), which lets adjacent entities of the same type be kept apart:

  We          O
  should      O
  show        O
  Neha        B-PER
  Eric        B-PER
  King        I-PER
  's          O
  assignment  O
  :           :
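A small illustrative sketch (not part of the lecture) of reading entities off a BIO/IOB-labeled sequence; note that the two B-PER tags let us recover two separate person entities ("Neha" and "Eric King"):

```python
def bio_to_entities(labels):
    """Turn a BIO/IOB label sequence into (type, start, end) entity spans."""
    entities, start = [], None
    for i, lab in enumerate(labels + ["O"]):            # sentinel flushes the last entity
        if start is not None and (lab == "O" or lab.startswith("B-")
                                  or lab[2:] != labels[start][2:]):
            entities.append((labels[start][2:], start, i))
            start = None
        if lab.startswith("B-"):
            start = i
    return entities

labels = ["O", "O", "O", "B-PER", "B-PER", "I-PER", "O", "O"]
print(bio_to_entities(labels))   # [('PER', 3, 4), ('PER', 4, 6)]
```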
Precision/Recall/F1 for NER
• Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
• The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
  • First Bank of Chicago announced earnings…
• This counts as both a false positive and a false negative
• Selecting nothing would have been better
• Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
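A minimal sketch of entity-level scoring, reusing `bio_to_entities` from the previous sketch; the hypothetical gold/guess sequences reproduce the boundary-error situation described above (gold "First Bank of Chicago", system guesses only "Bank of Chicago"):

```python
gold  = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
guess = ["O",     "B-ORG", "I-ORG", "I-ORG", "O", "O"]

gold_set  = set(bio_to_entities(gold))
guess_set = set(bio_to_entities(guess))

tp = len(gold_set & guess_set)                      # only exact-match entities count
precision = tp / len(guess_set) if guess_set else 0.0
recall    = tp / len(gold_set) if gold_set else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(precision, recall, f1)  # 0.0 0.0 0.0 -- the boundary error is both a FP and a FN
```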
Maximum entropy sequence models
Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models
Sequence problems
• Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences…
• We can think of our task as one of labeling each item

POS tagging:
  VBG     NN          IN DT NN  IN NN
  Chasing opportunity in an age of upheaval

Word segmentation:
  B B I I B I B I B B
  而 相 对 于 这 些 品 牌 的 价

Named entity recognition:
  PERS    O         O      O  ORG  ORG
  Murdoch discusses future of News Corp.

Text segmentation:
  Q A Q A A A Q A
MEMM inference in systems
• For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
• A larger space of sequences is usually explored via search

Decision point (local context):
  Position:  -3     -2     -1    0     +1
  Class:     ORG    ORG    O     ???   ???
  Word:      Xerox  Corp.  fell  22.6  %

Local context features:
  W0         22.6
  W+1        %
  W-1        fell
  C-1        O
  C-1-C-2    ORG-O
  hasDigit?  true
  …          …

(Borthwick 1999, Klein et al. 2003, etc.)
Example: NER
• Scoring individual labeling decisions is no more complex than standard classification decisions
  • We have some assumed labels to use for prior positions
  • We use features of those and the observed data (which can include current, previous, and next words) to predict the current label
• The decision point and local-context features are the same as on the previous slide (predicting the class of "22.6" in "Xerox Corp. fell 22.6 %"); a small feature-extraction sketch follows below.
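A small illustrative sketch of extracting the local-context features above for the decision point at "22.6". The function and its string encodings are assumptions for illustration, not the cited systems' code:

```python
def local_features(words, classes, i):
    """Local-context features for the MEMM decision at position i."""
    w_prev = words[i - 1] if i > 0 else "<s>"
    w_next = words[i + 1] if i + 1 < len(words) else "</s>"
    c_prev = classes[i - 1] if i > 0 else "<s>"
    c_prev2 = classes[i - 2] if i > 1 else "<s>"
    return {
        "W0=" + words[i]: 1.0,
        "W+1=" + w_next: 1.0,
        "W-1=" + w_prev: 1.0,
        "C-1=" + c_prev: 1.0,
        "C-1-C-2=" + c_prev2 + "-" + c_prev: 1.0,
        "hasDigit?": 1.0 if any(ch.isdigit() for ch in words[i]) else 0.0,
    }

words = ["Xerox", "Corp.", "fell", "22.6", "%"]
classes = ["ORG", "ORG", "O"]        # decisions already made for the previous positions
print(local_features(words, classes, 3))   # includes 'C-1-C-2=ORG-O' and hasDigit? = 1.0
```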
Inference in Systems
[Diagram: at the local level, Local Data → Feature Extraction → Features → classifier (a maximum entropy model, trained with smoothing/regularization and numerical optimization) → Label; at the sequence level, the local classifier is applied across the Sequence Data and sequence model inference (search) combines the local decisions into the output label sequence.]
Greedy Inference
• Greedy inference:
  • We just start at the left, and use our classifier at each position to assign a label
  • The classifier can depend on previous labeling decisions as well as observed data
• Advantages:
  • Fast, no extra memory requirements
  • Very easy to implement
  • With rich features including observations to the right, it can perform quite well
• Disadvantage:
  • Greedy. We make commitments we cannot recover from

[Diagram: Sequence Model → Inference → Best Sequence]
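A minimal sketch of greedy decoding; `score` stands in for any local classifier returning P(label | observations, previous labels), and the toy scorer below is a made-up stand-in just so the example runs:

```python
def greedy_decode(words, score):
    chosen = []
    for i in range(len(words)):
        probs = score(words, chosen, i)            # local P(label | context, previous labels)
        chosen.append(max(probs, key=probs.get))   # commit to the best local label
    return chosen

def toy_score(words, prev, i):
    """Crude stand-in classifier: capitalized words look like PER."""
    return ({"PER": 0.9, "O": 0.1} if words[i][0].isupper()
            else {"PER": 0.1, "O": 0.9})

print(greedy_decode("Murdoch discusses future of News Corp.".split(), toy_score))
# ['PER', 'O', 'O', 'O', 'PER', 'PER'] with this crude stand-in scorer
```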
Beam Inference
• Beam inference:
  • At each position keep the top k complete sequences.
  • Extend each sequence in each local way.
  • The extensions compete for the k slots at the next position.
• Advantages:
  • Fast; beam sizes of 3–5 are almost as good as exact inference in many cases.
  • Easy to implement (no dynamic programming required).
• Disadvantage:
  • Inexact: the globally best sequence can fall off the beam.
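A minimal sketch of beam inference with beam size k, reusing the `toy_score` stand-in from the greedy sketch:

```python
import math

def beam_decode(words, score, k=3):
    beam = [(0.0, [])]                               # (log-prob, partial label sequence)
    for i in range(len(words)):
        candidates = []
        for lp, path in beam:
            for lab, p in score(words, path, i).items():
                candidates.append((lp + math.log(p), path + [lab]))
        beam = sorted(candidates, reverse=True)[:k]  # extensions compete for the k slots
    return beam[0][1]                                # labels of the best surviving sequence

print(beam_decode("Murdoch discusses future of News Corp.".split(), toy_score, k=3))
```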
Viterbi Inference
• Viterbi inference:
  • Dynamic programming or memoization.
  • Requires a small window of state influence (e.g., past two states are relevant).
• Advantage:
  • Exact: the global best sequence is returned.
• Disadvantage:
  • Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
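A minimal sketch of first-order Viterbi decoding over an assumed local classifier that conditions only on the previous label:

```python
import math

def viterbi_decode(words, score, start="<s>"):
    """score(words, prev_label, i) -> {label: P(label | prev_label, observations)}."""
    best = {start: (0.0, [])}                        # previous label -> (log-prob, best path)
    for i in range(len(words)):
        new_best = {}
        for prev, (lp, path) in best.items():
            for lab, p in score(words, prev, i).items():
                cand = (lp + math.log(p), path + [lab])
                if lab not in new_best or cand[0] > new_best[lab][0]:
                    new_best[lab] = cand             # keep only the best path ending in lab
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

def toy_local_score(words, prev, i):                 # stand-in classifier, just so this runs
    return ({"PER": 0.8, "O": 0.2} if words[i][0].isupper()
            else {"PER": 0.1, "O": 0.9})

print(viterbi_decode("Murdoch discusses future of News Corp.".split(), toy_local_score))
```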
CRFs [Lafferty, McCallum, and Pereira 2001]
• Another sequence model: Conditional Random Fields (CRFs)
• A whole-sequence conditional model rather than a chaining of local models:

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

• The space of c's is now the space of sequences
  • But if the features fi remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
• Training is slower, but CRFs avoid causal-competition biases
• These (or a variant using a max-margin criterion) are seen as the state of the art these days … but in practice they usually work much the same as MEMMs.
CoNLL 2003 NER shared task: Results on English Dev set
[Bar chart comparing MEMM, 1st CRF, and MMMN systems on Overall, Loc, Misc, Org, and Person scores, roughly in the 82–96 range.]
Smoothing / Priors / Regularization for Maxent Models
Smoothing: Issues of Scale
• Lots of features:
  • NLP maxent models can have ten million features.
  • Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
  • Overfitting is very easy – we need smoothing!
  • Many features seen in training will never occur again at test time.
• Optimization problems:
  • Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Smoothing: Issues
• Assume the following empirical distribution:

  Heads  Tails
  h      t

• Features: {Heads}, {Tails}
• We'll have the following softmax model distribution:

$$p_{\text{HEADS}} = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \qquad p_{\text{TAILS}} = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}$$

• Really, only one degree of freedom (λ = λH − λT):

$$p_{\text{HEADS}} = \frac{e^{\lambda_H}e^{-\lambda_T}}{e^{\lambda_H}e^{-\lambda_T} + e^{\lambda_T}e^{-\lambda_T}} = \frac{e^{\lambda}}{e^{\lambda} + e^{0}} = \frac{e^{\lambda}}{e^{\lambda} + 1} \qquad p_{\text{TAILS}} = \frac{e^{0}}{e^{\lambda} + e^{0}} = \frac{1}{1 + e^{\lambda}}$$

• Logistic regression!
Smoothing: Issues
• The data likelihood in this model is:

$$\log P(h, t \mid \lambda) = h \log p_{\text{HEADS}} + t \log p_{\text{TAILS}}$$
$$\log P(h, t \mid \lambda) = h\lambda - (h + t)\log(1 + e^{\lambda})$$

Three empirical distributions and the resulting log-likelihood as a function of λ:

  Heads 2, Tails 2  |  Heads 3, Tails 1  |  Heads 4, Tails 0

[Plots of log P versus λ for each case.]
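A quick numeric check of the formula above (illustrative only): for 2/2 and 3/1 the log-likelihood peaks at a finite λ, while for 4/0 it keeps climbing toward 0 as λ grows, so the maximum-likelihood weight is infinite:

```python
import math

def coin_log_lik(h, t, lam):
    """log P(h, t | lambda) = h*lambda - (h+t)*log(1 + e^lambda)."""
    return h * lam - (h + t) * math.log(1 + math.exp(lam))

for h, t in [(2, 2), (3, 1), (4, 0)]:
    values = [round(coin_log_lik(h, t, lam), 2) for lam in (0, 1, 2, 5, 10)]
    print((h, t), values)
# (2,2) and (3,1) peak at lambda = 0 and lambda = log 3 respectively;
# (4,0) increases monotonically toward 0, so its optimum lies at lambda = infinity.
```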
Smoothing: Early Stopping
• In the 4/0 case, there were two problems:
  • The optimal value of λ was ∞, which is a long trip for an optimization procedure
  • The learned distribution is just as spiked as the empirical one – no smoothing
• One way to solve both issues is to just stop the optimization early, after a few iterations:
  • The value of λ will be finite (but presumably big)
  • The optimization won't take forever (clearly)
  • Commonly used in early maxent work
  • Has seen a revival in deep learning ☺

  Input (empirical counts):      Heads 4, Tails 0
  Output (learned distribution): Heads 1, Tails 0
Smoothing: Priors (MAP)
• What if we had a prior expectation that parameter values wouldn't be very large?
• We could then balance evidence suggesting large parameters (or infinite) against our prior.
• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
• We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

$$\underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} = \underbrace{\log P(\lambda)}_{\text{Prior}} + \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}}$$
Smoothing: Priors
• Gaussian, or quadratic, or L2 priors:
  • Intuition: parameters shouldn't be large.
  • Formalization: prior expectation that each parameter will be distributed according to a Gaussian with mean μ and variance σ²:

$$P(\lambda_i) = \frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$$

  • Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
  • 2σ² = 1 works surprisingly well.
• ("They don't even capitalize my name anymore!")

[Plot of the Gaussian prior for 2σ² = 1, 2σ² = 10, and 2σ² = ∞.]
Smoothing: Priors
• If we use Gaussian priors / L2 regularization:
  • Trade off some expectation-matching for smaller parameters.
  • When multiple features can be recruited to explain a data point, the more common ones generally receive more weight.
  • Accuracy generally goes up!
• Change the objective:

$$\log P(C, \lambda \mid D) = \log P(C \mid D, \lambda) + \log P(\lambda)
= \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} + k$$

• Change the derivative:

$$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i - \mu_i}{\sigma^2}$$
Smoothing: Priors
• With Gaussian priors / L2 regularization, taking the prior mean as 0:
• Change the objective:

$$\log P(C, \lambda \mid D) = \sum_{(c,d)\in(C,D)} \log P(c \mid d, \lambda) - \sum_i \frac{\lambda_i^2}{2\sigma_i^2} + k$$

• Change the derivative:

$$\frac{\partial \log P(C, \lambda \mid D)}{\partial \lambda_i} = \text{actual}(f_i, C) - \text{predicted}(f_i, \lambda) - \frac{\lambda_i}{\sigma^2}$$
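A minimal sketch of the regularized objective and gradient, continuing the earlier toy softmax sketches (reusing `cond_log_likelihood`, `gradient`, and `data`) and taking μ = 0 with 2σ² = 1:

```python
import numpy as np
from scipy.optimize import minimize

SIGMA2 = 0.5   # so that 2*sigma^2 = 1, the value the slide says works surprisingly well

def regularized_objective(lam, data):
    """Conditional log-likelihood plus the Gaussian (L2) log-prior, up to a constant."""
    return cond_log_likelihood(lam, data) - np.sum(lam ** 2) / (2 * SIGMA2)

def regularized_gradient(lam, data):
    return gradient(lam, data) - lam / SIGMA2

# Fit as before; the penalty keeps the weights finite and small even on separable data.
fit = minimize(lambda l: -regularized_objective(l, data), np.zeros(2),
               jac=lambda l: -regularized_gradient(l, data), method="L-BFGS-B")
print(fit.x)
```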
Example: NER Smoothing

Local context:
            Prev   Cur    Next
  State:    Other  ???    ???
  Word:     at     Grace  Road
  Tag:      IN     NNP    NNP
  Sig:      x      Xx     Xx

Feature weights:
  Feature Type           Feature    PERS    LOC
  Previous word          at         -0.73   0.94
  Current word           Grace       0.03   0.00
  Beginning bigram       <G          0.45  -0.04
  Current POS tag        NNP         0.47   0.45
  Prev and cur tags      IN NNP     -0.10   0.14
  Previous state         Other      -0.70  -0.92
  Current signature      Xx          0.80   0.46
  Prev state, cur sig    O-Xx        0.68   0.37
  Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
  P. state - p-cur sig   O-x-Xx     -0.20   0.82
  …
  Total:                            -0.58   2.68

Because of smoothing, the more common prefix and single-tag features have larger weights, even though entire-word and tag-pair features are more specific.
Example: POS Tagging
• From (Toutanova et al., 2003):

  Dev Test Performance   Overall Accuracy   Unknown Word Acc
  Without Smoothing      96.54              85.20
  With Smoothing         97.10              88.20

• Smoothing helps:
  • Softens distributions.
  • Pushes weight onto more explanatory features.
  • Allows many features to be dumped safely into the mix.
  • Speeds up convergence (if both are allowed to converge)!
Smoothing / Regularization
• Talking of "priors" and "MAP estimation" is Bayesian language
• In frequentist statistics, people will instead talk about using "regularization", and in particular, a Gaussian prior is "L2 regularization"
• The choice of names makes no difference to the math
• Recently, L1 regularization has also become very popular
  • Gives sparse solutions – most parameters become zero [Yay!]
  • Harder optimization problem (non-continuous derivative)
Smoothing: Virtual Data
• Another option: smooth the data, not the parameters.
• Example:

  Heads 4, Tails 0  →  Heads 5, Tails 1

  • Equivalent to adding two extra data points.
  • Similar to add-one smoothing for generative models.
• For feature-based models, it's hard to know what artificial data to create!
Smoothing: Count Cutoffs
• In NLP, features with low empirical counts are often dropped.
  • A very weak and indirect smoothing method.
  • Equivalent to locking their weight to be zero.
  • Equivalent to assigning them Gaussian priors with mean zero and variance zero.
  • Dropping low counts does remove the features which were most in need of smoothing…
  • … and speeds up the estimation by reducing model size…
  • … but count cutoffs generally hurt accuracy in the presence of proper smoothing.
• Don't use count cutoffs unless necessary for memory usage reasons. Prefer L1 regularization for finding features to drop.