CPSC 340: Machine Learning and Data Mining, fwood/CS340/lectures/L10.pdf · Other Clustering Methods...
CPSC 340: Machine Learning and Data Mining
Outlier Detection (Fall 2020)
Last Time: Hierarchical Clustering
• We discussed hierarchical clustering:
– Performs clustering at multiple scales.
– Output is usually a tree diagram (“dendrogram”).
– Reveals much more structure in the data.
– Usually non-parametric:
• At the finest scale, every point is its own cluster.
• We discussed some application areas:
– Animals (phylogenetics).
– Languages.
– Stories.
– Fashion.
http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_F10.html
Biclustering
• Biclustering:
– Cluster the training examples and the features.
– Also gives feature relationship information.
• Simplest and most popular method:
– Run a clustering method on ‘X’ (examples).
– Run a clustering method on ‘XT’ (features).
• Often plotted with ‘X’ as a heatmap.
– Where rows/columns are arranged by clusters.
– Helps you ‘see’ why things are clustered.
http://openi.nlm.nih.gov/detailedresult.php?img=2731891_gkp491f3&req=4
Biclustering
• Visualization: hierarchical biclustering + heatmap + dendrograms.
– Popular in biology/medicine.
https://arxiv.org/pdf/1408.0856v1.pdf
Application: Medical Data
• Hierarchical clustering is very common in medical data analysis.
– Biclustering different samples of breast cancer:
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Finetti2008Sixteen-kinase.pdf
Other Clustering Methods
• Mixture models:
– Probabilistic clustering.
• Mean-shift clustering:
– Finds local “modes” in the density of points.
– Alternative approach to vector quantization.
• Bayesian clustering:
– A variant on ensemble methods.
– Averages over models/clusterings, weighted by “prior” belief in the model/clustering.
Graph-Based Clustering
• Spectral clustering and graph-based clustering:
– Clustering of data described by graphs.
https://griffsgraphs.wordpress.com/tag/clustering/
http://ascr-discovery.science.doe.gov/2013/09/sifting-genomes/
https://www.hackdiary.com/2012/04/05/extracting-a-social-graph-from-wikipedia-people-pages/
(pause)
Motivating Example: Finding Holes in the Ozone Layer
• The huge Antarctic ozone hole was “discovered” in 1985.
• It had been in satellite data since 1976:
– But it was flagged and filtered out by a quality-control algorithm.
https://en.wikipedia.org/wiki/Ozone_depletion
Outlier Detection
• Outlier detection:
– Find observations that are “unusually different” from the others.
– Also known as “anomaly detection”.
– May want to remove outliers, or be interested in the outliers themselves (security).
• Some sources of outliers:
– Measurement errors.
– Data entry errors.
– Contamination of data from different sources.
– Rare events.
http://mathworld.wolfram.com/Outlier.html
Applications of Outlier Detection
• Data cleaning.
• Security and fault detection (network intrusion, DOS attacks).
• Fraud detection (credit cards, stocks, voting irregularities).
• Detecting natural disasters (underwater earthquakes).
• Astronomy (finding new classes of stars/planets).
• Genetics (identifying individuals with new/ancient genes).
Classes of Methods for Outlier Detection
1. Model-based methods.
2. Graphical approaches.
3. Cluster-based methods.
4. Distance-based methods.
5. Supervised-learning methods.
• Warning: this is the topic with the most ambiguous “solutions”.
But first…
• Usually it’s good to do some basic sanity checking…
– Would any values in the column cause a Python/Julia “type” error?
– What is the range of numerical features?
– What are the unique entries for a categorical feature?
– Does it look like parts of the table are duplicated?
• These types of simple errors are VERY common in real data.
Egg   Milk  Fish  Wheat   Shellfish  Peanuts  Peanuts  Sick?
0     0.7   0     0.3     0          0        0        1
0.3   0.7   0     0.6     -1         3        3        1
0     0     0     “sick”  0          1        1        0
0.3   0.7   1.2   0       0.10       0        0        2
900   0     1.2   0.3     0.10       0        0        1
(Note the deliberate errors: a duplicated Peanuts column, a negative value, a string in a numeric column, Sick? = 2, and Egg = 900.)
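These sanity checks are easy to automate. A minimal sketch in plain Python, using data mirroring the table above (the second Peanuts column is renamed “Peanuts2” so the duplicate can be referred to; the “non-negative feature” assumption and function names are illustrative, not from the slides):

```python
# Deliberately broken table from the slides (second Peanuts column renamed).
columns = ["Egg", "Milk", "Fish", "Wheat", "Shellfish", "Peanuts", "Peanuts2", "Sick?"]
rows = [
    [0,   0.7, 0,   0.3,     0,    0, 0, 1],
    [0.3, 0.7, 0,   0.6,    -1,    3, 3, 1],
    [0,   0,   0,   "sick",  0,    1, 1, 0],
    [0.3, 0.7, 1.2, 0,       0.10, 0, 0, 2],
    [900, 0,   1.2, 0.3,     0.10, 0, 0, 1],
]

def check_column(name, values):
    """Report type problems, range, and issues for one column."""
    issues = []
    numeric = [v for v in values if isinstance(v, (int, float))]
    if len(numeric) < len(values):
        issues.append("non-numeric entry (would cause a 'type' error)")
    # Assumes allergy fractions should be non-negative.
    if numeric and min(numeric) < 0:
        issues.append("negative value in a non-negative feature")
    return {"name": name, "range": (min(numeric), max(numeric)), "issues": issues}

reports = [check_column(c, [r[i] for r in rows]) for i, c in enumerate(columns)]

# Are any two columns exact duplicates of each other?
dup_pairs = [(columns[i], columns[j])
             for i in range(len(columns)) for j in range(i + 1, len(columns))
             if all(r[i] == r[j] for r in rows)]
```

Here the range check exposes Egg = 900, the type check catches “sick” in the Wheat column, and the duplicate check flags the repeated Peanuts column.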
Model-Based Outlier Detection
• Model-based outlier detection:
1. Fit a probabilistic model.
2. Outliers are examples with low probability.
• Example:
– Assume the data follows a normal distribution.
– The z-score for 1D data is given by:
  z_i = (x_i − μ) / σ, where μ is the mean and σ is the standard deviation.
– “Number of standard deviations away from the mean”.
– Say “outlier” if |z| > 4, or some other threshold.
http://mathworld.wolfram.com/Outlier.html
Problems with Z-Score
• Unfortunately, the mean and variance are sensitive to outliers.
– Possible fixes: use quantiles, or sequentially remove the worst outlier.
• The z-score also assumes that the data is “uni-modal”:
– Data is concentrated around the mean.
http://mathworld.wolfram.com/Outlier.html
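The quantile fix mentioned above can be sketched by replacing the mean with the median and the standard deviation with the interquartile range, both of which are insensitive to a few extreme values. This is a crude approximation (the quartiles are taken by simple index, and the function name is illustrative):

```python
from statistics import median

def robust_scores(x):
    """Like a z-score, but centred at the median and scaled by the
    interquartile range (IQR), so extreme values barely affect the scale."""
    s = sorted(x)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # crude quartiles by index
    iqr = q3 - q1
    med = median(x)
    return [(xi - med) / iqr for xi in x]
```

On `list(range(1, 21)) + [1000]` the extreme point scores about 99, while the inliers stay near 0; with the ordinary z-score, the outlier would have dragged the mean and variance toward itself.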
Globalvs.LocalOutliers• Istheredpoint anoutlier?
Global vs. Local Outliers
• Is the red point an outlier? What if we add the blue points?
• The red point has the lowest z-score.
– In the first case it was a “global” outlier.
– In this second case it’s a “local” outlier:
• Within the normal data range, but far from other points.
• It’s hard to precisely define “outliers”.
– Can we have outlier groups? What about repeating patterns?
Graphical Outlier Detection
• Graphical approach to outlier detection:
1. Look at a plot of the data.
2. A human decides whether a point is an outlier.
• Examples:
1. Box plot:
• Visualization of quantiles/outliers.
• Only 1 variable at a time.
http://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/boxplot/
2. Scatter plot:
• Can detect complex patterns.
• Only 2 variables at a time.
http://mathworld.wolfram.com/Outlier.html
3. Scatter plot array:
• Look at all combinations of variables.
• But laborious in high dimensions.
• Still only 2 variables at a time.
https://randomcriticalanalysis.wordpress.com/2015/05/25/standardized-tests-correlations-within-and-between-california-public-schools/
4. Scatter plot of 2-dimensional PCA:
• ‘See’ high-dimensional structure.
• But loses information and is sensitive to outliers.
http://scienceblogs.com/gnxp/2008/08/14/the-genetic-map-of-europe/
Cluster-Based Outlier Detection
• Detect outliers based on clustering:
1. Cluster the data.
2. Find points that don’t belong to clusters.
• Examples:
1. K-means:
• Find points that are far away from any mean.
• Find clusters with a small number of points.
2. Density-based clustering:
• Outliers are points not assigned to any cluster.
http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap10_anomaly_detection.pdf
3. Hierarchical clustering:
• Outliers take longer to join other groups.
• Also good for outlier groups.
http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_F10.html
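The k-means variant of cluster-based detection (flag points far from every cluster mean) can be sketched in a few lines, assuming the means have already been found by k-means (all names and the threshold are illustrative):

```python
def nearest_mean_distance(point, means):
    """Euclidean distance from `point` to the closest cluster mean."""
    return min(sum((p - m) ** 2 for p, m in zip(point, mu)) ** 0.5
               for mu in means)

def kmeans_outliers(points, means, threshold):
    """Flag points whose distance to every cluster mean exceeds `threshold`."""
    return [x for x in points if nearest_mean_distance(x, means) > threshold]
```

With means at (0, 0) and (10, 10), a point at (5, 5) is flagged at threshold 2 while points near either mean are not.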
Distance-Based Outlier Detection
• Most outlier detection approaches are based on distances.
• Can we skip the model/plot/clustering and just measure distances?
– How many points lie within a radius ‘epsilon’?
– What is the distance to the kth nearest neighbour?
• UBC connection (the first paper on this topic):
Global Distance-Based Outlier Detection: KNN
• KNN outlier detection:
– For each point, compute the average distance to its k nearest neighbours.
– Choose the points with the biggest values (or values above a threshold) as outliers.
• “Outliers” are points that are far from their KNNs.
• Goldstein and Uchida [2016]:
– Compared 19 methods on 10 datasets.
– KNN was best for finding “global” outliers.
– “Local” outliers are best found with local distance-based methods…
http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0152173
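The KNN outlier score above is straightforward to compute directly with a brute-force O(n²) distance pass (fine for small n; the function name is illustrative):

```python
def knn_outlier_scores(points, k):
    """For each point, the average Euclidean distance to its k nearest
    neighbours; the biggest scores correspond to 'global' outliers."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    scores = []
    for i, x in enumerate(points):
        # Distances to every other point, smallest first.
        d = sorted(dist(x, y) for j, y in enumerate(points) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores
```

On four points near the unit square plus one at (10, 10), the isolated point gets by far the largest score.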
Local Distance-Based Outlier Detection
• As with density-based clustering, there is a problem with differing densities:
• Outlier o2 has a similar density as the elements of cluster C1.
• Basic idea behind local distance-based methods:
– Outlier o2 is “relatively” far compared to its neighbours.
http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
Local Distance-Based Outlier Detection
• “Outlierness” ratio of example ‘i’:
  outlierness(xi) = Dk(xi) / ( (1/k) Σ_{j ∈ Nk(xi)} Dk(xj) ),
  where Dk(xi) is the average distance from xi to its k nearest neighbours Nk(xi).
• If outlierness > 1, xi is further away from its neighbours than expected.
http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdfhttps://en.wikipedia.org/wiki/Local_outlier_factor
Isolation Forests
• A recent method based on random trees is isolation forests.
– Grow a tree where each stump uses a random feature and a random split.
– Stop when each example is “isolated” (each leaf has one example).
– The “isolation score” is the depth before an example gets isolated.
• Outliers should be isolated quickly; inliers should need lots of rules to isolate.
– Repeat for different random trees, and take the average score.
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
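The random-split idea can be sketched by following a single query point down one random tree at a time and averaging its isolation depth. This is a simplified sketch of the idea, not the full algorithm from the paper (which subsamples the data and normalizes scores); all names are illustrative:

```python
import random

def isolation_depth(x, points, rng, depth=0):
    """Depth at which `x` is the only point left after repeated
    random single-feature, random-threshold splits (one random tree)."""
    if len(points) <= 1:
        return depth
    f = rng.randrange(len(x))                 # pick a random feature
    lo = min(p[f] for p in points)
    hi = max(p[f] for p in points)
    if lo == hi:                              # cannot split this feature further
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    # Keep only the points on the same side of the split as x.
    side = [p for p in points if (p[f] < split) == (x[f] < split)]
    return isolation_depth(x, side, rng, depth + 1)

def isolation_score(x, points, n_trees=200, seed=0):
    """Average isolation depth over many random trees (lower = more outlying)."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, points, rng) for _ in range(n_trees)) / n_trees
```

On a tight cluster plus one distant point, the distant point is typically cut off by the very first random split, so its average depth is close to 1, while cluster points need several splits.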
Problem with Unsupervised Outlier Detection
• Why wasn’t the hole in the ozone layer discovered for 9 years?
• It can be hard to decide when to report an outlier:
– If you report too many non-outliers, users will turn you off.
– Most antivirus programs do not use ML methods (see the “base-rate fallacy”).
https://en.wikipedia.org/wiki/Ozone_depletion
Supervised Outlier Detection
• A final approach to outlier detection is to use supervised learning:
• yi = 1 if xi is an outlier.
• yi = 0 if xi is a regular point.
• We can then use our methods for supervised learning:
– We can find very complicated outlier patterns.
– Classic credit card fraud detection methods used decision trees.
• But it needs supervision:
– We need to know what outliers look like.
– We may not detect new “types” of outliers.
(pause)
Motivation: Product Recommendation
• A customer comes to your website looking to buy an item:
• You want to find similar items that they might also buy:
User-Product Matrix
Amazon Product Recommendation
• Amazon’s product recommendation method:
• Return the KNNs across columns.
– Find the ‘j’ values minimizing ||xi − xj||.
– Products that were bought by similar sets of users.
• But first divide each column by its norm, xi/||xi||.
– This is called normalization.
– It reflects whether a product is bought by many people or few people.
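The normalize-then-KNN-across-columns idea can be sketched as follows, for a small user-product matrix with users as rows and products as columns (a sketch of the idea described on the slide, not Amazon's actual implementation; the function names are illustrative):

```python
def normalize(col):
    """Divide a column by its Euclidean norm, x / ||x||."""
    norm = sum(v * v for v in col) ** 0.5
    return [v / norm for v in col]

def similar_products(X, i, k):
    """The k product columns of X closest to column i after normalization,
    so popular and niche products are compared on the same scale."""
    cols = [normalize([row[j] for row in X]) for j in range(len(X[0]))]
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    d = [(dist(cols[i], cols[j]), j) for j in range(len(cols)) if j != i]
    return [j for _, j in sorted(d)[:k]]
```

For example, if products 0 and 1 were bought by overlapping sets of users while product 2 was bought by someone else entirely, `similar_products(X, 0, 1)` returns product 1.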
End of Part 2: Key Concepts
• We focused on 3 unsupervised learning tasks:
– Clustering.
• Partitioning (k-means) vs. density-based.
• “Flat” vs. hierarchical (agglomerative).
• Vector quantization.
• Label switching.
– Outlier detection.
• Surveyed common approaches (and noted that the problem is ill-defined).
Summary
• Biclustering: clustering of both the examples and the features.
• Outlier detection is the task of finding unusually different examples.
– A concept that is very difficult to define.
– Model-based methods find unlikely examples given a model of the data.
– Graphical methods plot the data and use a human to find outliers.
– Cluster-based methods check whether examples belong to clusters.
– Distance-based outlier detection: measure the (relative) distance to neighbours.
– Supervised learning for outlier detection: turns the task into supervised learning.
• Next time: supervised learning with continuous labels.
Application: Medical Data
• Hierarchical clustering is very common in medical data analysis.
– Clustering different samples of colorectal cancer:
– This plot is different; it’s not a biclustering:
• The matrix is ‘n’ by ‘n’.
• Each matrix element gives a correlation.
• Clusters should look like “blocks” on the diagonal.
• The order of examples is reversed in the columns.
– This is why the diagonal goes from bottom to top.
– Please don’t do this reversal; it’s confusing to me.
https://gut.bmj.com/content/gutjnl/66/4/633.full.pdf
“Quality Control”: Outlier Detection in Time Series
• A field primarily focusing on outlier detection is quality control.
• One of the main tools is plotting z-score thresholds over time:
• We usually don’t do tests like “|zi| > 3”, since this happens normally.
• Instead, identify problems with tests like “|zi| > 2 twice in a row”.
https://en.wikipedia.org/wiki/Laboratory_quality_control
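A run rule like “|z| > 2 twice in a row” can be implemented by tracking a streak over the sequence of z-scores (a minimal sketch; the function name and defaults are illustrative):

```python
def run_rule_alarms(z, threshold=2.0, run=2):
    """Indices where |z_i| > threshold for `run` consecutive measurements,
    e.g. the quality-control rule '|z| > 2 twice in a row'."""
    alarms = []
    streak = 0
    for i, zi in enumerate(z):
        streak = streak + 1 if abs(zi) > threshold else 0
        if streak >= run:
            alarms.append(i)
    return alarms
```

A single excursion above 2 (which happens normally) raises no alarm; only two in a row do.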
Outlierness (Symbol Definition)
• Let Nk(xi) be the k-nearest neighbours of xi.
• Let Dk(xi) be the average distance to the k-nearest neighbours:
  Dk(xi) = (1/k) Σ_{j ∈ Nk(xi)} ||xi − xj||.
• Outlierness is the ratio of Dk(xi) to the average Dk(xj) over its neighbours ‘j’:
  outlierness(xi) = Dk(xi) / ( (1/k) Σ_{j ∈ Nk(xi)} Dk(xj) ).
• If outlierness > 1, xi is further away from its neighbours than expected.
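These definitions translate directly into code (a brute-force sketch for small data; all function names are illustrative):

```python
def _dist(a, b):
    """Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def knn_indices(points, i, k):
    """Nk(xi): indices of the k nearest neighbours of points[i]."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: _dist(points[i], points[j]))
    return order[:k]

def avg_knn_distance(points, i, k):
    """Dk(xi): average distance from points[i] to its k nearest neighbours."""
    return sum(_dist(points[i], points[j])
               for j in knn_indices(points, i, k)) / k

def outlierness(points, i, k):
    """Ratio of Dk(xi) to the average Dk(xj) over its neighbours j."""
    neighbours = knn_indices(points, i, k)
    return avg_knn_distance(points, i, k) / (
        sum(avg_knn_distance(points, j, k) for j in neighbours) / k)
```

On four points near the unit square plus one at (5, 5), the distant point's ratio is well above 1, while a corner point's is about 1 (its Dk matches its neighbours').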
Outlierness with Close Clusters
• If clusters are close, outlierness gives unintuitive results:
• In this example, ‘p’ has higher outlierness than ‘q’ and ‘r’:
– The green points are not part of the KNN list of ‘p’ for small ‘k’.
http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf
Outlierness with Close Clusters
• The ‘influenced outlierness’ (INFLO) ratio:
– Includes in the denominator the ‘reverse’ k-nearest neighbours:
• Points that have ‘p’ in their KNN list.
– This adds ‘s’ and ‘t’ from the bigger cluster that includes ‘p’:
• But it still has problems:
– Dealing with hierarchical clusters.
– It yields many false positives if you have “global” outliers.
– Goldstein and Uchida [2016] recommend just using KNN.
http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf
Training/Validation/Testing (Supervised)
• A typical supervised learning setup:
– Train parameters on dataset D1.
– Validate hyper-parameters on dataset D2.
– Test error is evaluated on dataset D3.
• What should we choose for D1, D2, and D3?
• Usual answer: they should all be IID samples from the data distribution Ds.
Training/Validation/Testing (Outlier Detection)
• A typical outlier detection setup:
– Train parameters on dataset D1 (there may be no “training” to do).
• For example, find z-scores.
– Validate hyper-parameters on dataset D2 (for outlier detection).
• For example, see which z-score threshold separates D1 and D2.
– Test error is evaluated on dataset D3 (for outlier detection).
• For example, check whether the z-score recognizes D3 as outliers.
• D1 will still be samples from Ds (the data distribution).
• D2 could use IID samples from another distribution Dm.
– Dm represents the “none” or “outlier” class.
– Tune parameters so that Dm samples are outliers and Ds samples aren’t.
• You could just fit a binary classifier here.
• D1 will still be samples from Ds (the data distribution).
• D2 could use IID samples from another distribution Dm.
• D3 could use IID samples from Dm.
– How well do you do at recognizing “data” samples from “none” samples?
Training/Validation/Testing (Outlier Detection)
• This seems like a reasonable setup:
– D1 will still be samples from Ds (the data distribution).
– D2 could use IID samples from another distribution Dm.
– D3 could use IID samples from Dm.
• What can go wrong?
• You needed to pick a distribution Dm to represent “none”.
– But in the wild, your outliers might follow another “none” distribution.
– This procedure can overfit to your Dm.
• You can overestimate your ability to detect outliers.
OD-Test: a better way to evaluate outlier detection
• A reasonable setup:
– D1 will still be samples from Ds (the data distribution).
– D2 could use IID samples from another distribution Dm.
– D3 could use IID samples from yet another distribution Dt (instead of Dm).
• “How do you perform at detecting different types of outliers?”
– This seems like a harder problem, but it is arguably closer to reality.
https://arxiv.org/pdf/1809.04729.pdf