Andrew Rosenberg- Lecture 22: Evaluation

  • 8/3/2019 Andrew Rosenberg- Lecture 22: Evaluation

    Lecture 22: Evaluation

    April 24, 2010

    Last Time

    Spectral Clustering

    Today

    Evaluation Measures: Accuracy, Significance Testing, F-Measure

    Error Types: ROC Curves, Equal Error Rate

    AIC/BIC

    How do you know that you have a good classifier?

    Is a feature contributing to overall performance?

    Is classifier A better than classifier B?

    Internal Evaluation: measure the performance of the classifier.

    External Evaluation: measure the performance on a downstream task.

    Accuracy

    Easily the most common and intuitive measure of classification performance.

    $$\text{Accuracy} = \frac{\#\text{correct}}{N}$$

    Significance Testing

    Say I have two classifiers: A = 50% accuracy, B = 75% accuracy.

    B is better, right?

    Significance Testing

    Say I have another two classifiers: A = 50% accuracy, B = 50.5% accuracy.

    Is B better?

    Basic Evaluation

    Training data: used to identify model parameters.

    Testing data: used for evaluation.

    Optionally: development/tuning data, used to identify model hyperparameters.

    Difficult to get significance or confidence values.

    Cross Validation

    Identify n folds of the available data.

    Train on n-1 folds.

    Test on the remaining fold.

    In the extreme (n = N) this is known as leave-one-out cross validation.

    n-fold cross validation (xval) gives n samples of the performance of the classifier.
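The fold procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_folds`, `cross_validate`, and the `train_fn` callback (which takes training inputs and labels and returns a prediction function) are hypothetical names introduced here, not part of the lecture.

```python
import random

def make_folds(num_items, n_folds, seed=0):
    """Shuffle the indices 0..num_items-1 and deal them into n_folds folds."""
    if not 1 < n_folds <= num_items:
        raise ValueError("need 1 < n_folds <= num_items")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(xs, ys, n_folds, train_fn, seed=0):
    """Return one accuracy value per fold (n samples of classifier performance)."""
    accuracies = []
    for held_out in make_folds(len(xs), n_folds, seed):
        held = set(held_out)
        # Train on the n-1 folds that are not held out.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = train_fn(train_x, train_y)
        # Test on the remaining fold.
        correct = sum(predict(xs[i]) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies
```

Setting `n_folds` equal to the number of data points gives the leave-one-out case.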

    Significance Testing

    Is the performance of two classifiers different with statistical significance?

    Means testing: if we have two samples of classifier performance (accuracy), we want to determine whether they are drawn from the same distribution (no difference) or from two different distributions.

    T-test

    One-sample t-test

    Independent t-test: unequal variances and sample sizes

    Once you have a t-value, look up the significance level on a table, keyed on the t-value and degrees of freedom.
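For reference, the two tests named above have the following standard textbook forms. The notation (sample mean $\bar{x}$, sample standard deviation $s$, sample size $n$, reference value $\mu_0$, and subscripts 1 and 2 for the two samples) is assumed here rather than taken from the slide.

```latex
% One-sample t-test against a reference value \mu_0
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mathrm{df} = n - 1

% Independent two-sample t-test with unequal variances and sample sizes (Welch's t-test)
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad
\mathrm{df} \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
                         {\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}
```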

    Significance Testing

    Run cross-validation to get n samples of the classifier mean.

    Use this distribution to compare against either:

      A known (published) level of performance: one-sample t-test

      Another distribution of performance: two-sample t-test

    If at all possible, results should include information about the variance of classifier performance.
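Both comparisons can be sketched with the Python standard library alone. The function names `one_sample_t` and `welch_t` are hypothetical, and the two-sample version assumes the unequal-variance (Welch) form of the test. Each function returns the t-value and degrees of freedom, which would then be looked up in a t table.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t-value and degrees of freedom for per-fold accuracies vs. a published level mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # stdev uses n-1
    return (statistics.mean(sample) - mu0) / standard_error, n - 1

def welch_t(a, b):
    """t-value and approximate degrees of freedom for two samples of accuracies."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, `one_sample_t([0.70, 0.72, 0.74, 0.76, 0.78], 0.70)` compares five fold accuracies (illustrative values) against a reference accuracy of 0.70.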

    Significance Testing

    Caveat: including more samples of the classifier performance can artificially inflate the significance measure.

    If $\bar{x}$ and $s$ are constant (the sample represents the population mean and variance), then raising n will increase t.

    If these samples are real, then this is fine.

    Often cross-validation fold assignment is not truly random.

    Thus subsequent xval runs only resample the same information.
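A short worked form of the caveat, assuming the one-sample t statistic with reference value $\mu_0$ (notation assumed here). The numbers are illustrative only and do not come from the lecture.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s}

% With \bar{x} - \mu_0 = 0.01 and s = 0.05 held fixed:
%   n = 10:   t = \sqrt{10} \cdot 0.2 \approx 0.63
%   n = 100:  t = \sqrt{100} \cdot 0.2 = 2.0
```

The t-value grows with $\sqrt{n}$ even though the measured difference has not changed.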

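    Confidence Bars

    Variance information can be included in plots of classifier performance to ease visualization.

    Plot standard deviation, standard error or confidence interval?

    $$SD = \sigma \qquad SE = \frac{\sigma}{\sqrt{n}} \qquad CI_{95\%} = 1.96\,\frac{\sigma}{\sqrt{n}}$$

    Example values: $\mu = 10$, $\sigma = 1$, $n = 10$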
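To see how different the three error bars are, the sketch below computes each half-width. It assumes a mean of 10, a standard deviation of 1, and n = 10, which is how the slide's example values appear to read. The function name `error_bar_half_widths` is hypothetical.

```python
import math

def error_bar_half_widths(sigma, n):
    """Half-widths of the SD, SE, and 95% CI error bars for n samples."""
    se = sigma / math.sqrt(n)
    return {"SD": sigma, "SE": se, "CI95": 1.96 * se}

bars = error_bar_half_widths(sigma=1.0, n=10)
for name, half_width in bars.items():
    print(f"{name}: 10 ± {half_width:.2f}")
# SD: 10 ± 1.00
# SE: 10 ± 0.32
# CI95: 10 ± 0.62
```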

    Confidence Bars

    Most important to be clear about what is plotted.

    95% confidence interval has the clearest interpretation.

    [Figure: SD, SE, and CI error bars compared on an axis running from 8 to 11.5]

    Baseline Classifiers

    Majority Class baseline: every data point is classified as the class that is most frequently represented in the training data.

    Random baseline: randomly assign one of the classes to each data point,

      with an even distribution, or

      with the training class distribution.
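Both baselines are simple enough to sketch directly. The names `majority_baseline` and `random_baseline`, and the `match_training_distribution` flag, are hypothetical choices made for this illustration. Each function returns a predictor that ignores its input.

```python
import random
from collections import Counter

def majority_baseline(train_labels):
    """Predict the most frequent training class for every data point."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: majority

def random_baseline(train_labels, match_training_distribution, seed=0):
    """Predict a random class, drawn evenly or in proportion to the training labels."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    classes = sorted(counts)
    # weights=None makes random.choices draw uniformly over the classes.
    weights = [counts[c] for c in classes] if match_training_distribution else None
    return lambda _x: rng.choices(classes, weights=weights, k=1)[0]
```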

    Problems with accuracy

    Contingency Table

                              True Values
                              Positive          Negative
    Hyp Values    Positive    True Positive     False Positive
                  Negative    False Negative    True Negative

    $$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

    Problems with accuracy

    Information Retrieval Example: find the 10 documents related to a query in a set of 110 documents.

                              True Values
                              Positive    Negative
    Hyp Values    Positive    0           0
                  Negative    10          100

    Accuracy = 100/110 ≈ 91%

    Problems with accuracy

    Precision: how many hypothesized events were true events.

    Recall: how many of the true events were identified.

    F-Measure: harmonic mean of precision and recall.

                              True Values
                              Positive    Negative
    Hyp Values    Positive    0           0
                  Negative    10          100

    $$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F = \frac{2PR}{P + R}$$
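Applying these definitions to the table above (TP = 0, FP = 0, FN = 10, TN = 100) shows what accuracy hides. Treating the undefined 0/0 precision as 0 is a common convention assumed here, not something the slide states.

```latex
R = \frac{TP}{TP + FN} = \frac{0}{0 + 10} = 0

P = \frac{TP}{TP + FP} = \frac{0}{0 + 0} \quad \text{(undefined; taken as 0 by convention)}

F = 0 \quad \text{(no true events were found, despite the high accuracy)}
```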

    F-Measure

    F-measure can be weighted to favor Precision or Recall:

      beta > 1 favors recall

      beta < 1 favors precision

    F-Measure

                              True Values
                              Positive    Negative
    Hyp Values    Positive    1           0
                  Negative    9           100

    $$F_\beta = \frac{(1 + \beta^2)PR}{(\beta^2 P) + R}$$

    $$P = 1 \qquad R = \frac{1}{10} \qquad F_1 = .18$$

    F-Measure

                              True Values
                              Positive    Negative
    Hyp Values    Positive    10          50
                  Negative    0           50

    $$F_\beta = \frac{(1 + \beta^2)PR}{(\beta^2 P) + R}$$

    $$P = \frac{10}{60} \qquad R = 1 \qquad F_1 = .29$$

    F-Measure

                              True Values
                              Positive    Negative
    Hyp Values    Positive    9           1
                  Negative    1           99

    $$F_\beta = \frac{(1 + \beta^2)PR}{(\beta^2 P) + R}$$

    $$P = .9 \qquad R = .9 \qquad F_1 = .9$$
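The three worked examples can be reproduced with a small helper. The names `precision_recall` and `f_measure` are hypothetical. Returning 0 when a denominator is zero is an assumed convention for the undefined cases.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from contingency-table counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_measure(tp, fp, fn, beta=1.0):
    """Weighted F-measure: beta > 1 favors recall, beta < 1 favors precision."""
    p, r = precision_recall(tp, fp, fn)
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0

print(round(f_measure(tp=1, fp=0, fn=9), 2))    # 0.18
print(round(f_measure(tp=10, fp=50, fn=0), 2))  # 0.29
print(round(f_measure(tp=9, fp=1, fn=1), 2))    # 0.9
```

With `beta=2`, the second table (perfect recall, low precision) scores 0.5 instead of 0.29, which illustrates how beta > 1 rewards recall.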

    F-Measure

    Accuracy is weighted towards majority class performance.

    F-measure is useful for measuring the performance on minority classes.

    Types of Errors

    False Positives: the system predicted TRUE but the value was FALSE.

      aka False Alarms or Type I error

    False Negatives: the system predicted FALSE but the value was TRUE.

      aka Misses or Type II error

    ROC Curves

    It is common to plot classifier performance at a variety of settings or thresholds.

    Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate.

    The overall performance is summarized by the Area Under the Curve (AUC).
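A minimal sketch of how an ROC curve and its AUC can be computed from classifier scores, using only the standard library. The names `roc_points` and `auc` are hypothetical. The sketch assumes higher scores mean "more positive", that at least one positive and one negative example are present, and that trapezoidal integration is an acceptable way to measure the area.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    num_pos = sum(1 for y in labels if y)
    num_neg = len(labels) - num_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_positive) in enumerate(ranked):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only once every item tied at this score has been counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / num_neg, tp / num_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```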

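    ROC Curves

    Equal Error Rate (EER), the error rate at the threshold where the false positive and false negative rates are equal, is commonly reported.

    EER represents the highest accuracy of the classifier.

    Curves provide more detail about performance.

    Gauvain et al. 1995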
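The EER can be approximated by sweeping thresholds and picking the one where the false positive rate and false negative rate are closest. The sketch below is one simple way to do that. The name `equal_error_rate` is hypothetical, and it assumes an example is predicted positive when its score is at or above the threshold.

```python
def equal_error_rate(scores, labels):
    """Return (approximate EER, threshold) where FPR and FNR are closest."""
    num_pos = sum(1 for y in labels if y)
    num_neg = len(labels) - num_pos
    best = None
    for threshold in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        fpr, fnr = false_pos / num_neg, false_neg / num_pos
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            # Report the average of the two rates at the closest crossing.
            best = (gap, (fpr + fnr) / 2, threshold)
    return best[1], best[2]
```

With a finite test set the two rates rarely match exactly, which is why the sketch averages them at the closest point.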

    Goodness of Fit

    Another view of model performance.

    Measure the model likelihood of the unseen data: $l(x; \theta)$

    However, we've seen that model likelihood is likely to improve by adding parameters.

    Two information criteria measures include a cost term for the number of parameters in the model.

    Akaike Information Criterion

    Akaike Information Criterion (AIC), based on entropy.

    The best model has the lowest AIC:

      Greatest model likelihood

      Fewest free parameters

    $$AIC = 2k - 2\ln(l(x; \theta))$$

    $2k$: information in the parameters

    $-2\ln(l(x; \theta))$: information lost by the modeling

    Bayesian Information Criterion

    Another penalization term, based on Bayesian arguments.

    Select the model that is a posteriori most probable, with a constant penalty term for wrong models.

    $$BIC = k\ln(n) - 2\ln(l(x; \theta))$$

    If errors are normally distributed:

    $$BIC = \ln(\sigma_e^2) + \frac{k}{n}\ln(n)$$

    Note: compares estimated models when x is constant.
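Both criteria are one-line computations once a model's log-likelihood is known. The sketch below uses the general forms, with k free parameters and n data points. The function names and the example numbers are hypothetical and only illustrate how the penalty terms trade off against likelihood.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(likelihood). Lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(likelihood). Lower is better."""
    return k * math.log(n) - 2 * log_likelihood

# Illustrative comparison: model B fits slightly better but uses twice the parameters.
print(aic(-120.0, k=3), aic(-118.0, k=6))  # 246.0 248.0
print(bic(-120.0, k=3, n=100) < bic(-118.0, k=6, n=100))  # True
```

In this made-up comparison both criteria prefer the smaller model, because the gain in likelihood does not cover the extra parameter cost.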

    Today

    Accuracy, Significance Testing, F-Measure, AIC/BIC

    Next Time

    Regression Evaluation, Cluster Evaluation