Andrew Rosenberg- Lecture 22: Evaluation


1/32

Lecture 22: Evaluation

April 24, 2010


2/32

Last Time

Spectral Clustering


3/32

Today

Evaluation Measures

Accuracy

Significance Testing

F-Measure

Error Types

ROC Curves

Equal Error Rate

AIC/BIC


4/32

How do you know that you have a good classifier?

Is a feature contributing to overall performance?

Is classifier A better than classifier B?

Internal Evaluation: Measure the performance of the classifier

External Evaluation: Measure the performance on a downstream task


5/32

Accuracy

Easily the most common and intuitive measure of classification performance

$$\text{Accuracy} = \frac{\#\text{correct}}{N}$$
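As a minimal sketch of the definition above, accuracy can be computed by counting matching labels. The function name `accuracy` and the toy labels are illustrative assumptions, not part of the lecture.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels (#correct / N)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels: three of the four predictions are correct.
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```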


6/32

Significance Testing

Say I have two classifiers

A = 50% accuracy

B = 75% accuracy

B is better, right?


7/32

Significance Testing

Say I have another two classifiers

A = 50% accuracy

B = 50.5% accuracy

Is B better?


8/32

Basic Evaluation

Training data: used to identify model parameters

Testing data: used for evaluation

Optionally: Development/tuning data used to identify model hyperparameters

Difficult to get significance or confidence values


9/32

Cross validation

Identify n folds of the available data

Train on n-1 folds

Test on the remaining fold

In the extreme (n=N) this is known as leave-one-out cross validation

n-fold cross validation (xval) gives n samples of the performance of the classifier
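The procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: `n_fold_indices`, `cross_validate`, `train_majority`, and `eval_accuracy` are hypothetical helper names, and the "classifier" is a trivial majority-class predictor chosen so the example stays self-contained.

```python
import random
from collections import Counter


def n_fold_indices(num_points, n_folds, seed=0):
    """Shuffle the indices 0..num_points-1 and deal them into n_folds folds."""
    indices = list(range(num_points))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(data, labels, n_folds, train_fn, eval_fn, seed=0):
    """Train on n-1 folds, test on the remaining fold; return one score per fold."""
    scores = []
    for held_out in n_fold_indices(len(data), n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(eval_fn(model,
                              [data[i] for i in held_out],
                              [labels[i] for i in held_out]))
    return scores


# Hypothetical stand-in classifier: always predict the majority training class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def eval_accuracy(model, xs, ys):
    return sum(model == y for y in ys) / len(ys)


data = list(range(20))
labels = ["pos"] * 5 + ["neg"] * 15
scores = cross_validate(data, labels, 5, train_majority, eval_accuracy)
print(scores)  # five accuracy samples, one per fold
```

Setting `n_folds` equal to `len(data)` gives the leave-one-out case.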


10/32

Significance Testing

Is the performance of two classifiers different with statistical significance?

Means testing: If we have two samples of classifier performance (accuracy), we want to determine if they are

drawn from the same distribution (no difference)

or two different distributions


11/32

T-test

One Sample t-test

Independent t-test: unequal variances and sample sizes

Once you have a t-value, look up the significance level on a table, keyed on the t-value and degrees of freedom
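A rough sketch of the two statistics named above, using their standard textbook forms (assumed here, since the formulas are not in the transcript): the one-sample statistic is the difference between the sample mean and a reference value divided by the standard error, and the unequal-variance two-sample statistic is Welch's t with the Welch-Satterthwaite degrees of freedom. The function names and the example numbers are hypothetical.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for comparing a sample mean with a fixed value mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1  # statistic and degrees of freedom


def welch_t(sample_a, sample_b):
    """t statistic for two independent samples with unequal variances and sizes."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof


# Made-up per-fold accuracies for two classifiers.
acc_a = [0.71, 0.74, 0.69, 0.73, 0.72]
acc_b = [0.75, 0.78, 0.74, 0.77, 0.79]
print(one_sample_t(acc_a, 0.70))  # compare against a fixed reference level
print(welch_t(acc_a, acc_b))      # compare the two classifiers
```

The returned t-value and degrees of freedom are what would be looked up in a t table.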


12/32

Significance Testing

Run cross-validation to get n samples of the classifier mean

Use this distribution to compare against either:

A known (published) level of performance: one sample t-test

Another distribution of performance: two sample t-test

If at all possible, results should include information about the variance of classifier performance


13/32

Significance Testing

Caveat: including more samples of the classifier performance can artificially inflate the significance measure

If $\bar{x}$ and $s$ are constant (the sample represents the population mean and variance) then raising $n$ will increase $t$

If these samples are real, then this is fine

Often cross-validation fold assignment is not truly random

Thus subsequent xval runs only resample the same information
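To make the caveat concrete, assume the one-sample statistic takes its usual form with a reference value $\mu_0$ (a symbol introduced here for illustration). The dependence on $n$ is then explicit:

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s}
```

With $\bar{x}$ and $s$ held fixed, quadrupling $n$ doubles $t$, even if the extra samples carry no new information.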


14/32

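Confidence Bars

Variance information can be included in plots of classifier performance to ease visualization

Plot standard deviation, standard error, or confidence interval?

$$\mu = 10 \qquad \sigma = 1 \qquad n = 10$$

$$\text{SD} = \sigma \qquad \text{SE} = \frac{\sigma}{\sqrt{n}} \qquad \text{CI}_{95\%} = 1.96\,\frac{\sigma}{\sqrt{n}}$$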
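The three candidate error bars can be computed from a list of performance samples as in the sketch below. It assumes the usual definitions (standard error as the standard deviation divided by the square root of n, and a normal-approximation 95% half-width of 1.96 standard errors); `error_bars` is a hypothetical helper name and the scores are made up.

```python
import math
import statistics


def error_bars(scores):
    """Return (mean, SD, SE, CI95 half-width) for a list of performance scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    se = sd / math.sqrt(n)         # standard error of the mean
    ci95 = 1.96 * se               # normal-approximation 95% half-width
    return mean, sd, se, ci95


mean, sd, se, ci95 = error_bars([9.0, 10.0, 11.0])
print(f"mean={mean:.2f} SD={sd:.2f} SE={se:.2f} CI95=+/-{ci95:.2f}")
```

The standard deviation describes the spread of individual runs, while the standard error and confidence interval shrink as n grows, which is why a plot needs to say which one it shows.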


15/32

Confidence Bars

Most important to be clear about what is plotted

95% confidence interval has the clearest interpretation

[Figure: error bars labeled SD, SE, and CI, with axis ticks from 8 to 11.5]


16/32

Baseline Classifiers

Majority Class baseline: Every data point is classified as the class that is most frequently represented in the training data

Random baseline: Randomly assign one of the classes to each data point

with an even distribution

with the training class distribution
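Both baselines are simple enough to sketch directly. The function names and the `match_training` flag below are illustrative assumptions; the flag switches the random baseline between an even distribution over classes and the training class distribution.

```python
import random
from collections import Counter


def majority_baseline(train_labels, num_test):
    """Predict the most frequent training class for every test point."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * num_test


def random_baseline(train_labels, num_test, match_training=False, seed=0):
    """Predict a random class for every test point.

    match_training=False draws classes uniformly; match_training=True
    draws them in proportion to their frequency in the training labels.
    """
    rng = random.Random(seed)
    counts = Counter(train_labels)
    classes = sorted(counts)
    weights = [counts[c] for c in classes] if match_training else None
    return rng.choices(classes, weights=weights, k=num_test)


train = ["neg"] * 100 + ["pos"] * 10
print(majority_baseline(train, 3))
print(random_baseline(train, 5))
print(random_baseline(train, 5, match_training=True))
```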


17/32

Problems with accuracy

Contingency Table

                         True Values
                         Positive          Negative
Hyp Values   Positive    True Positive     False Positive
             Negative    False Negative    True Negative

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$


18/32


Information Retrieval Example: Find the 10 documents related to a query in a set of 110 documents

                         True Values
                         Positive   Negative
Hyp Values   Positive    0          0
             Negative    10         100

Accuracy = 100/110 ≈ 90.9%


19/32


Precision: how many hypothesized events were true events

Recall: how many of the true events were identified

F-Measure: Harmonic mean of precision and recall

                         True Values
                         Positive   Negative
Hyp Values   Positive    0          0
             Negative    10         100

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F = \frac{2PR}{P + R}$$
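Applied to the retrieval table above, these definitions show why accuracy is misleading here. The sketch below assumes one convention that the slide does not state: a ratio with a zero denominator (no hypothesized positives) is reported as 0.0, although it is strictly undefined. All function names are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 by convention when nothing is hypothesized positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 by convention when there are no positive examples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall, 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Contingency table from the retrieval example: nothing is retrieved.
tp, fp, fn, tn = 0, 0, 10, 100
acc = (tp + tn) / (tp + fp + fn + tn)
p, r = precision(tp, fp), recall(tp, fn)
print(f"accuracy={acc:.3f} precision={p} recall={r} F={f_measure(p, r)}")
```

The system finds none of the 10 relevant documents, so recall and F-measure are zero even though accuracy is above 90%.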


20/32

F-Measure

F-measure can be weighted to favor Precision or Recall

$\beta > 1$ favors recall

$\beta < 1$ favors precision


21/32

F-Measure

                         True Values
                         Positive   Negative
Hyp Values   Positive    1          0
             Negative    9          100

$$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$$

$$P = 1 \qquad R = \frac{1}{10} \qquad F_1 = 0.18$$


22/32

F-Measure

                         True Values
                         Positive   Negative
Hyp Values   Positive    10         50
             Negative    0          50

$$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$$

$$P = \frac{10}{60} \qquad R = 1 \qquad F_1 = 0.29$$


23/32

F-Measure

                         True Values
                         Positive   Negative
Hyp Values   Positive    9          1
             Negative    1          99

$$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$$

$$P = 0.9 \qquad R = 0.9 \qquad F_1 = 0.9$$
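The three worked tables can be checked with a short sketch of the weighted F-measure, computed as (1 + beta squared) times P times R, divided by (beta squared times P, plus R). The function name `f_beta` and the zero-denominator convention (return 0.0) are assumptions made for this example.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Weighted F-measure from contingency counts; beta > 1 favors recall."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# (TP, FP, FN) for the three F-measure tables, in order.
tables = [(1, 0, 9), (10, 50, 0), (9, 1, 1)]
for tp, fp, fn in tables:
    print(tp, fp, fn, round(f_beta(tp, fp, fn), 2))

# Weighting the second table: high recall, low precision.
print(f_beta(10, 50, 0, beta=2.0))  # recall-weighted score is higher than F1
print(f_beta(10, 50, 0, beta=0.5))  # precision-weighted score is lower than F1
```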


24/32

F-Measure

Accuracy is weighted towards majority class performance

F-measure is useful for measuring the performance on minority classes


25/32

Types of Errors

False Positives: The system predicted TRUE but the value was FALSE

aka False Alarms or Type I error

False Negatives: The system predicted FALSE but the value was TRUE

aka Misses or Type II error


26/32

ROC curves

It is common to plot classifier performance at a variety of settings or thresholds

Receiver Operating Characteristic (ROC) curves plot true positives against false positives

The overall performance is calculated by the Area Under the Curve (AUC)
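One way to trace such a curve is to sweep a threshold over the classifier's scores and record the true positive rate and false positive rate at each setting, then integrate with the trapezoid rule. This is a sketch under assumptions: it plots rates rather than raw counts, treats a score at or above the threshold as a positive prediction, and uses made-up scores. `roc_points` and `auc` are hypothetical names.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels are True for positive examples; a point is predicted positive
    when its score is at or above the threshold.
    """
    num_pos = sum(1 for y in labels if y)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / num_neg, tp / num_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores: higher means "more likely positive".
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

For this toy data the area equals the fraction of positive/negative pairs that the scores rank correctly, 8 of 9.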


27/32

ROC Curves

Equal Error Rate (EER) is commonly reported

EER represents the highest accuracy of the classifier

Curves provide more detail about performance

Gauvain et al. 1995
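The sketch below estimates an equal error rate under the commonly used definition, assumed here rather than taken from the slide: the operating point where the false positive (false alarm) rate equals the false negative (miss) rate. With a finite set of thresholds the two rates may never match exactly, so the code picks the threshold where they are closest and reports their average. The function name and scores are made up.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the threshold where false-alarm and miss rates are closest.

    Returns (error rate at that threshold, threshold).
    """
    num_pos = sum(1 for y in labels if y)
    num_neg = len(labels) - num_pos
    best = None
    for threshold in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        fpr, fnr = fp / num_neg, fn / num_pos
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, (fpr + fnr) / 2, threshold)
    return best[1], best[2]


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [True, True, False, True, False, False]
eer, threshold = equal_error_rate(scores, labels)
print(eer, threshold)
```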


28/32

Goodness of Fit

Another view of model performance

Measure the model likelihood of the unseen data

However, we've seen that model likelihood is likely to improve by adding parameters

Two information criteria measures include a cost term for the number of parameters in the model

$$l(x; \theta)$$


29/32

Akaike Information Criterion

Akaike Information Criterion (AIC): based on entropy

The best model has the lowest AIC

Greatest model likelihood

Fewest free parameters

$$\text{AIC} = 2k - 2\ln(l(x; \theta))$$

$2k$: Information in the parameters

$-2\ln(l(x; \theta))$: Information lost by the modeling
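A small sketch of how the criterion trades likelihood against parameter count. The two models and their log-likelihoods are invented for illustration; `aic` is a hypothetical helper that takes the log-likelihood directly.

```python
def aic(log_likelihood, num_params):
    """AIC = 2k - 2 ln(likelihood); the model with the lowest value is preferred."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical models: B fits slightly better but uses twice the parameters.
aic_a = aic(log_likelihood=-120.0, num_params=3)
aic_b = aic(log_likelihood=-118.5, num_params=6)
print(aic_a, aic_b)
```

Here the small gain in likelihood does not pay for the three extra parameters, so model A has the lower AIC.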


30/32

Bayesian Information Criterion

Another penalization term based on Bayesian arguments

Select the model that is a posteriori most probable, with a constant penalty term for wrong models

$$\text{BIC} = k\ln(n) - 2\ln(l(x; \theta))$$

If errors are normally distributed (dividing by $n$ and dropping constants):

$$\text{BIC} = \ln(\sigma_e^2) + \frac{k}{n}\ln(n)$$

Note: compares estimated models when $x$ is constant
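The general form of the criterion can be sketched the same way. The function name `bic` and the example numbers are assumptions for illustration only.

```python
import math


def bic(log_likelihood, num_params, num_points):
    """BIC = k ln(n) - 2 ln(likelihood); the model with the lowest value is preferred."""
    return num_params * math.log(num_points) - 2 * log_likelihood


# Hypothetical models fit to the same n = 100 data points.
bic_a = bic(log_likelihood=-120.0, num_params=3, num_points=100)
bic_b = bic(log_likelihood=-118.5, num_params=6, num_points=100)
print(bic_a, bic_b)
```

Each parameter costs ln(n) rather than a constant 2, so for data sets with more than about 7 points (n greater than e squared) this criterion penalizes extra parameters more heavily than AIC does.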


31/32

Today

Accuracy

Significance Testing

F-Measure

AIC/BIC


32/32

Next Time

Regression Evaluation

Cluster Evaluation
