UVA CS 4501: Machine Learning
Lecture 22: Unsupervised Clustering (I)
Dr. Yanjun Qi, University of Virginia, Department of Computer Science
4/24/18


Where are we? Major sections of this course

•  Regression (supervised)
•  Classification (supervised)
•  Feature selection
•  Unsupervised models
    –  Dimension reduction (PCA)
    –  Clustering (K-means, GMM/EM, hierarchical)
•  Learning theory
•  Graphical models

An unlabeled dataset X

A data matrix of n observations on p variables x1, x2, …, xp

•  Data / points / instances / examples / samples / records: [rows]
•  Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns]

Unsupervised learning = learning from raw (unlabeled, unannotated, etc.) data, as opposed to supervised data, where the label of each example is given.

Today: What is clustering?

•  Are there any "groups"?
•  What is each group?
•  How many?
•  How to identify them?

What is clustering?

•  Find groups (clusters) of data points such that data points in a group will be similar (or related) to one another and different from (or unrelated to) the data points in other groups
    –  Intra-cluster distances are minimized
    –  Inter-cluster distances are maximized

What is clustering?

•  Clustering: the process of grouping a set of objects into classes of similar objects
    –  High intra-class similarity
    –  Low inter-class similarity
    –  It is the most common form of unsupervised learning
•  A common and important task that finds many applications in science, engineering, information science, and other places, e.g.
    –  Group genes that perform the same function
    –  Group individuals that have similar political views
    –  Categorize documents of similar topics
    –  Identify similar objects from pictures


Toy examples

•  People
•  Images
•  Language
•  Species

Application (I): Search result clustering

Application (II): Navigation

Issues for clustering

•  What is a natural grouping among these objects?
    –  Definition of "groupness"
•  What makes objects "related"?
    –  Definition of "similarity/distance"
•  Representation for objects
    –  Vector space? Normalization?
•  How many clusters?
    –  Fixed a priori?
    –  Completely data driven?
    –  Avoid "trivial" clusters: too large or too small
•  Clustering algorithms
    –  Partitional algorithms
    –  Hierarchical algorithms
•  Formal foundation and convergence

Today's roadmap: clustering

•  Definition of "groupness"
•  Definition of "similarity/distance"
•  Representation for objects
•  How many clusters?
•  Clustering algorithms
    –  Partitional algorithms
    –  Hierarchical algorithms
•  Formal foundation and convergence

What is a natural grouping among these objects?

Another example: clustering is subjective

[Figure: objects labeled A and B]

Two possible solutions…


What is similarity?

Hard to define! But we know it when we see it.

•  The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
•  Depends on representation and algorithm. For many representations/algorithms, it is easier to think in terms of a distance (rather than similarity) between vectors.

What properties should a distance measure have?

•  D(A,B) = D(B,A)    Symmetry
•  D(A,A) = 0    Constancy of self-similarity
•  D(A,B) = 0 iff A = B    Positivity (separation)
•  D(A,B) ≤ D(A,C) + D(B,C)    Triangle inequality

Intuitions behind desirable properties of a distance measure

•  D(A,B) = D(B,A)    Symmetry
    –  Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex"
•  D(A,A) = 0    Constancy of self-similarity
    –  Otherwise you could claim "Alex looks more like Bob than Bob does"
•  D(A,B) = 0 iff A = B    Positivity (separation)
    –  Otherwise there are objects in your world that are different, but you cannot tell them apart.
•  D(A,B) ≤ D(A,C) + D(B,C)    Triangle inequality
    –  Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl"
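The four properties can be checked empirically on a finite sample of points. The sketch below is only an illustration: the helper name `check_metric_properties`, the sample points, and the tolerance are assumptions, not part of the lecture. It contrasts Euclidean distance with squared Euclidean distance, which breaks the triangle inequality.

```python
from itertools import product


def check_metric_properties(points, dist, tol=1e-12):
    """Test the four distance properties on a finite sample (not a proof)."""
    pairs = list(product(points, repeat=2))
    symmetry = all(abs(dist(a, b) - dist(b, a)) <= tol for a, b in pairs)
    self_similarity = all(abs(dist(a, a)) <= tol for a in points)
    # Separation: distinct points must be a strictly positive distance apart.
    separation = all(dist(a, b) > tol for a, b in pairs if a != b)
    triangle = all(
        dist(a, b) <= dist(a, c) + dist(b, c) + tol
        for a, b, c in product(points, repeat=3)
    )
    return {
        "symmetry": symmetry,
        "self_similarity": self_similarity,
        "separation": separation,
        "triangle": triangle,
    }


def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


# Hypothetical sample: three collinear points plus one off the line.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
euclid_report = check_metric_properties(sample, euclidean)
squared_report = check_metric_properties(sample, squared_euclidean)
print(euclid_report)
print(squared_report)
```

With these sample points, squared Euclidean distance fails only the triangle inequality: going from (0,0) to (2,0) costs 4, while the detour through (1,0) costs 1 + 1.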

Distance measures: Minkowski metric

•  Suppose two objects x and y both have p features:

$$x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p)$$

•  The Minkowski metric is defined by

$$d(x, y) = \sqrt[r]{\sum_{i=1}^{p} |x_i - y_i|^r}$$

•  Most common Minkowski metrics

1.  r = 2 (Euclidean distance):

$$d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$$

2.  r = 1 (Manhattan distance):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

3.  r = +∞ ("sup" distance):

$$d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$$

An example

[Figure: two points x and y whose coordinates differ by 4 along one axis and by 3 along the other]

1.  Euclidean distance: $\sqrt{4^2 + 3^2} = 5$
2.  Manhattan distance: $4 + 3 = 7$
3.  "sup" distance: $\max\{4, 3\} = 4$
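The three special cases can be reproduced with one small function. This is a minimal sketch: the function name `minkowski` and the concrete coordinates (0, 0) and (4, 3) are assumptions, chosen only so that the coordinate differences are 4 and 3 as in the example.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the "sup" distance."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Assumed coordinates: only the differences (4 and 3) come from the example.
x = (0.0, 0.0)
y = (4.0, 3.0)

euclid = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
sup = minkowski(x, y, math.inf)
print(euclid, manhattan, sup)
```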


Hamming distance: discrete features

•  Manhattan distance is called Hamming distance when all features are binary or discrete.

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

    –  E.g., gene expression levels under 17 conditions (1 = high, 0 = low)

Condition   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Gene A      0  1  1  0  0  1  0  0  1  0  0  1  1  1  0  0  1
Gene B      0  1  1  1  0  0  0  0  1  1  1  1  1  1  0  1  1

Hamming distance: #(01) + #(10) = 4 + 1 = 5
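As a quick check of the count above, the sketch below computes the Hamming distance for the two 17-condition profiles. The bit vectors are transcribed from the slide's table as best it could be read, so treat the exact ordering of the conditions as an assumption; the distance does not depend on that ordering.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


# Profiles as read from the slide (condition order is an assumption).
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

n01 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (0, 1))
n10 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (1, 0))
manhattan = sum(abs(a - b) for a, b in zip(gene_a, gene_b))

print(hamming(gene_a, gene_b), n01, n10, manhattan)
```

For binary features the Manhattan sum and the mismatch count coincide, which is the point the slide makes.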

Edit distance: a generic technique for measuring similarity

•  To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.

[Figure: Marge, Patty, and Selma]

The distance between Patty and Selma:
    –  Change dress color, 1 point
    –  Change earring shape, 1 point
    –  Change hair part, 1 point
D(Patty, Selma) = 3

The distance between Marge and Selma:
    –  Change dress color, 1 point
    –  Add earrings, 1 point
    –  Decrease height, 1 point
    –  Take up smoking, 1 point
    –  Lose weight, 1 point
D(Marge, Selma) = 5

This is called the edit distance or the transformation distance.
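The slide describes edit distance informally. One common concrete instance, swapped in here for illustration and not taken from the lecture, is the Levenshtein distance between strings, where each insertion, deletion, or substitution of a character costs 1 point. The function name `levenshtein` is an assumption.

```python
def levenshtein(source, target):
    """Minimum number of single-character edits turning source into target."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```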

Similarity measures: correlation coefficient

•  Pearson correlation coefficient

$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

where $\bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p}\sum_{i=1}^{p} y_i$, and $|s(x, y)| \le 1$.

•  Special case: cosine similarity (the Pearson formula when both means are zero)

$$s(x, y) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| \cdot |\vec{y}|}$$

•  Measures the linear correlation between two sequences, x and y,
•  giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.
•  Correlation is unit independent.

Similarity measures: e.g., correlation coefficient on time series samples

[Figure: three panels plotting expression level against time for Gene A and Gene B]

Correlation is unit independent: if you scale one of the objects ten times, you will get different Euclidean distances but the same correlation distances.
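The unit-independence claim is easy to verify numerically. In the sketch below the two expression profiles are made-up values, not data from the lecture, and the helper names are assumptions. Scaling one profile by ten changes the Euclidean distance but leaves the Pearson correlation unchanged.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if either sequence is constant.
    return covariance / (spread_x * spread_y)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression levels over five time points.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [2.0, 3.0, 5.0, 3.0, 2.0]
gene_b_scaled = [10.0 * value for value in gene_b]

print(pearson(gene_a, gene_b), pearson(gene_a, gene_b_scaled))
print(euclidean(gene_a, gene_b), euclidean(gene_a, gene_b_scaled))
```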


Clustering algorithms

•  Partitional algorithms
    –  Usually start with a random (partial) partitioning
    –  Refine it iteratively
        •  K-means clustering
        •  Mixture-model based clustering
•  Hierarchical algorithms
    –  Bottom-up, agglomerative
    –  Top-down, divisive


Hierarchical clustering

•  Build a tree-based hierarchical taxonomy (dendrogram) from a set of objects, e.g. organisms, documents.

[Figure: example taxonomy with branches "With backbone" and "Without backbone"]

•  Note that hierarchies are commonly used to organize information, for example in a web portal.
    –  The Yahoo! hierarchy is manually created; we will focus on automatic creation of hierarchies.


(How-to) Hierarchical clustering

Clustering: the process of grouping a set of objects into classes of similar objects → high intra-class similarity, low inter-class similarity

The number of dendrograms with n leaves:

$$\frac{(2n - 3)!}{2^{\,n-2}\,(n - 2)!}$$

Number of leaves    Number of possible dendrograms
2                   1
3                   3
4                   15
5                   105
...                 …
10                  34,459,425

Bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

A greedy, locally optimal solution.
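The table can be reproduced directly from the formula, reading the denominator as 2^(n−2) · (n−2)!. This is a small sketch; the function name `num_dendrograms` is an assumption.

```python
from math import factorial


def num_dendrograms(n):
    """Number of distinct dendrograms with n leaves: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least two leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for leaves in (2, 3, 4, 5, 10):
    print(leaves, num_dendrograms(leaves))
```

The count grows so quickly that exhaustive search over dendrograms is hopeless even for modest n, which motivates the greedy bottom-up strategy.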


We begin with a distance matrix which contains the distances between every pair of objects in our database.

0   8   8   7   7
    0   2   4   4
        0   3   3
            0   1
                0

[Figure: two example pairs of objects, one pair with D = 8 and one pair with D = 1]


Bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

•  Consider all possible merges… choose the best
•  Consider all possible merges… choose the best
•  Consider all possible merges… choose the best
•  …

But how do we compute distances between clusters rather than objects?

How to decide the distances between clusters?

•  Single-link
    –  Nearest neighbor: their closest members.
•  Complete-link
    –  Furthest neighbor: their furthest members.
•  Average
    –  Average of all cross-cluster pairs.

Computing distance between clusters: single link

•  Cluster distance = distance between the two closest members, one from each cluster
    (−)  Potentially long and skinny clusters

Computing distance between clusters: complete link

•  Cluster distance = distance between the two farthest members, one from each cluster
    (+)  Tight clusters

Computing distance between clusters: average link

•  Cluster distance = average distance over all cross-cluster pairs
    –  The most widely used measure
    –  Robust against noise
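The three linkage rules differ only in how they summarize the cross-cluster pairwise distances. The sketch below makes that explicit; the function names and the toy one-dimensional layout of the two clusters are assumptions for illustration.

```python
import math
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cross_distances(cluster_1, cluster_2, dist):
    """All distances between a member of cluster_1 and a member of cluster_2."""
    return [dist(a, b) for a, b in product(cluster_1, cluster_2)]


def single_link(cluster_1, cluster_2, dist=euclidean):
    return min(cross_distances(cluster_1, cluster_2, dist))


def complete_link(cluster_1, cluster_2, dist=euclidean):
    return max(cross_distances(cluster_1, cluster_2, dist))


def average_link(cluster_1, cluster_2, dist=euclidean):
    distances = cross_distances(cluster_1, cluster_2, dist)
    return sum(distances) / len(distances)


# Hypothetical clusters on a line; cross distances are 3, 5, 2 and 4.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(left, right), complete_link(left, right), average_link(left, right))
```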

Example: single link

Pairwise distance matrix for five objects:

        1    2    3    4    5
1       0
2       2    0
3       6    3    0
4      10    9    7    0
5       9    8    5    4    0

[Figure: dendrogram over objects 1 to 5]


Example: single link

Step 1: the closest pair is objects 1 and 2 (distance 2), so they are merged into cluster (1,2). Distances from the new cluster to the remaining objects:

$$d_{(1,2),3} = \min\{d_{1,3}, d_{2,3}\} = \min\{6, 3\} = 3$$
$$d_{(1,2),4} = \min\{d_{1,4}, d_{2,4}\} = \min\{10, 9\} = 9$$
$$d_{(1,2),5} = \min\{d_{1,5}, d_{2,5}\} = \min\{9, 8\} = 8$$

Updated distance matrix:

         (1,2)   3    4    5
(1,2)      0
3          3     0
4          9     7    0
5          8     5    4    0

Example: single link

Step 2: the closest pair is now cluster (1,2) and object 3 (distance 3), so they are merged into cluster (1,2,3). Distances from the new cluster to the remaining objects:

$$d_{(1,2,3),4} = \min\{d_{(1,2),4}, d_{3,4}\} = \min\{9, 7\} = 7$$
$$d_{(1,2,3),5} = \min\{d_{(1,2),5}, d_{3,5}\} = \min\{8, 5\} = 5$$

Updated distance matrix:

           (1,2,3)   4    5
(1,2,3)       0
4             7      0
5             5      4    0

Example: single link

Step 3: the closest pair is now objects 4 and 5 (distance 4), so they are merged into cluster (4,5). The last two clusters are then joined at

$$d_{(1,2,3),(4,5)} = \min\{d_{(1,2,3),4}, d_{(1,2,3),5}\} = \min\{7, 5\} = 5$$

[Figure: final dendrogram over objects 1 to 5, with merges at heights 2, 3, 4, and 5]
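The worked example can be replayed with a naive single-link agglomerative loop. This is a deliberately simple sketch (it rescans all cluster pairs at every step, so it is not the efficient version); the function name and the dictionary encoding of the distance matrix are assumptions.

```python
from itertools import combinations


def single_link_agglomerative(labels, pair_distance):
    """Naive single-link clustering; returns the merge history.

    pair_distance maps frozenset({a, b}) to the distance between objects a and b.
    """
    def cluster_distance(c1, c2):
        return min(pair_distance[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Consider all possible merges and choose the best (closest) pair.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Distance matrix from the example, one entry per unordered pair.
distances = {
    frozenset((1, 2)): 2, frozenset((1, 3)): 6, frozenset((2, 3)): 3,
    frozenset((1, 4)): 10, frozenset((2, 4)): 9, frozenset((3, 4)): 7,
    frozenset((1, 5)): 9, frozenset((2, 5)): 8, frozenset((3, 5)): 5,
    frozenset((4, 5)): 4,
}

merge_history = single_link_agglomerative([1, 2, 3, 4, 5], distances)
for first, second, height in merge_history:
    print(first, second, height)
```

The merge heights come out as 2, 3, 4, 5, matching the three steps and the final join in the example.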

[Figure: dendrograms of 30 objects under average linkage and single linkage; the height axis runs from 1 to 7]

•  Height represents distance between objects / clusters
•  Partitions are obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
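Cutting a dendrogram at a given height amounts to keeping only the merges that happened at or below that height. The sketch below assumes merge heights never decrease along the history, which holds for single, complete, and average linkage. It reuses the merge history of the five-object single-link example rather than the 30-object figure, and the function name `cut_dendrogram` is an assumption.

```python
def cut_dendrogram(labels, merges, height):
    """Clusters obtained by cutting at `height`.

    merges is a list of (members_a, members_b, merge_height) in merge order,
    with non-decreasing merge heights.
    """
    cluster_of = {label: frozenset([label]) for label in labels}
    for members_a, members_b, merge_height in merges:
        if merge_height <= height:
            merged = frozenset(members_a) | frozenset(members_b)
            for label in merged:
                cluster_of[label] = merged
    return sorted(sorted(cluster) for cluster in set(cluster_of.values()))


# Merge history of the five-object single-link example.
example_merges = [
    ([1], [2], 2),
    ([1, 2], [3], 3),
    ([4], [5], 4),
    ([1, 2, 3], [4, 5], 5),
]
objects = [1, 2, 3, 4, 5]

print(cut_dendrogram(objects, example_merges, 3.5))
print(cut_dendrogram(objects, example_merges, 4.5))
```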

Hierarchical clustering

•  Bottom-up agglomerative clustering
    –  Starts with each object in a separate cluster,
    –  then repeatedly joins the closest pair of clusters,
    –  until there is only one cluster.
    –  The history of merging forms a binary tree or hierarchy (dendrogram).
•  Top-down divisive
    –  Starting with all the data in a single cluster,
    –  consider every possible way to divide the cluster into two. Choose the best division,
    –  and recursively operate on both sides.

Computational complexity

•  In the first iteration, all hierarchical agglomerative clustering (HAC) methods need to compute the similarity of all pairs of n individual instances, which is O(n²p).
•  In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
•  For the subsequent steps, in order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time. Otherwise the cost is O(n² log n), or O(n³) if done naively.
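The constant-time requirement can be met because, for the three linkages discussed earlier, the distance from any other cluster k to a newly merged cluster (i ∪ j) depends only on the two old distances and the cluster sizes. The sketch below shows such an update, in the spirit of the Lance-Williams recurrence; the function name and argument layout are assumptions.

```python
def merged_distance(d_ki, d_kj, size_i, size_j, linkage):
    """Distance from cluster k to the merge of clusters i and j, in O(1).

    d_ki and d_kj are the distances from k to i and from k to j before merging.
    """
    if linkage == "single":
        return min(d_ki, d_kj)
    if linkage == "complete":
        return max(d_ki, d_kj)
    if linkage == "average":
        # Size-weighted mean equals the average over all cross-cluster pairs.
        return (size_i * d_ki + size_j * d_kj) / (size_i + size_j)
    raise ValueError(f"unknown linkage: {linkage}")


# Step 1 of the single-link example: d(1,3) = 6 and d(2,3) = 3.
print(merged_distance(6, 3, 1, 1, "single"))
print(merged_distance(6, 3, 1, 1, "complete"))
print(merged_distance(6, 3, 1, 1, "average"))
```

Without such an update, each cluster-to-cluster distance would have to be recomputed from the original pairwise distances, which is where the extra factor in the naive bound comes from.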

Summary of hierarchical clustering methods

•  No need to specify the number of clusters in advance.
•  Hierarchical structure maps nicely onto human intuition for some domains.
•  They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
•  Like any heuristic search algorithm, local optima are a problem.
•  Interpretation of results is (very) subjective.

Hierarchical clustering

Task                   Clustering
Representation         n/a
Score function         No clearly defined loss
Search/Optimization    Greedy bottom-up (or top-down)
Models, parameters     Dendrogram (tree)

References

•  Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer, 2009.
•  Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
•  Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides