Clustering and community detection · 2017-10-17

Clustering and community detection. Social and Technological Networks. Rik Sarkar, University of Edinburgh, 2017.

Clustering and community detection

Social and Technological Networks

Rik Sarkar

University of Edinburgh, 2017.

•  Plan/proposal guidelines are up
•  Office hours
   –  Wednesdays 12:00–13:00
   –  (May change in future. Always check web page for times and announcements.)

Community detection

•  Given a network
•  What are the “communities”?
   –  Closely connected groups of nodes
   –  Relatively few edges to outside the community
•  Similar to clustering in data sets
   –  Group together points that are closer or more similar to each other than to other points

Community detection by clustering

•  First, define a metric between nodes
   –  Either compute intrinsic metrics like all-pairs shortest paths [Floyd-Warshall algorithm, O(n³)]
   –  Or embed the nodes in a Euclidean space, and use the metric there
      •  We will later study embedding methods
•  Apply a clustering algorithm with the metric
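
The intrinsic-metric option can be sketched as follows. This is a minimal Floyd-Warshall implementation, assuming an undirected graph with nodes labelled 0..n-1 and edges given as (u, v, weight) tuples; the function name and input format are illustrative choices, not part of the slides.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected weighted graph.

    n: number of nodes, labelled 0..n-1 (assumed labelling)
    edges: iterable of (u, v, weight) tuples
    """
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Allow paths through intermediate node k, one k at a time: O(n^3) overall.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Path graph 0 - 1 - 2 - 3 with unit weights.
path_dist = floyd_warshall(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
```

The resulting matrix can be handed to any clustering method that only needs pairwise distances.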

Clustering

•  A core problem of machine learning:
   –  Which items are in the same group?
•  Identifies items that are similar relative to rest of data
•  Simplifies information by grouping similar items
   –  Helps in all types of other problems

Clustering

•  Outline approach:
•  Given a set of items
   –  Define a distance between them
      •  E.g. Euclidean distance between points in a plane; Euclidean distance between other attributes; non-Euclidean distances; path lengths in a network; tie strengths in a network…
   –  Determine a grouping (partitioning) that optimises some function (prefers ‘close’ items in same group).
•  Reference for clustering:
   –  Charu Aggarwal: The Data Mining Textbook, Springer
      •  Free on Springer site (from university network)
   –  Blum et al. Foundations of Data Science (free online)

K-means clustering

•  Find k clusters
   –  With centers
   –  That minimize the sum of squared distances of nodes to their cluster centers (called the k-means cost)

K-means clustering: Lloyd’s algorithm

•  There are n items
•  Select k ‘centers’
   –  May be random k locations in space
   –  May be location of k of the items selected randomly
   –  May be chosen according to some method
•  Iterate till convergence:
   –  Assign each item to the cluster for its closest center
   –  Recompute location of center as the mean location of all elements in the cluster
   –  Repeat
•  Warning: Lloyd’s algorithm is a heuristic. Does not guarantee that the k-means cost is minimised
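
The iteration above can be sketched in a few lines. This is a minimal version, assuming points are tuples of floats and that the initial centers are k of the items picked at random (one of the options listed); the names `lloyd_kmeans`, `seed` and `max_iter` are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def lloyd_kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k of the items, chosen at random
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each item to the cluster of its closest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no item changed cluster
        assignment = new_assignment
        # Step 2: move each center to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_point(members)
    cost = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, cost


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment, cost = lloyd_kmeans(points, k=2)
```

On this well-separated toy input the two groups are recovered, but in general the result depends on the starting centers, which is why the warning above applies.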

K-means

•  Visualisations
•  http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
•  http://shabal.in/visuals/kmeans/1.html

K-means

•  Ward’s algorithm (also heuristic)
   –  Start with each node as its own cluster
   –  At each round, find the two clusters such that merging them will increase the k-means cost the least (a merge can never reduce the cost)
   –  Merge these two clusters
   –  Repeat until there are k clusters
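
A minimal sketch of Ward’s merging rule, assuming points are tuples of floats. It relies on the standard identity that merging clusters A and B raises the k-means cost by |A||B|/(|A|+|B|) times the squared distance between their centroids; the function names are illustrative, and the quadratic search per round is chosen for clarity rather than speed.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def merge_cost(a, b):
    """Increase in k-means cost caused by merging clusters a and b."""
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist(centroid(a), centroid(b))


def ward(points, k):
    clusters = [[p] for p in points]  # every item starts as its own cluster
    while len(clusters) > k:
        # Find the pair whose merge increases the k-means cost the least.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


ward_clusters = ward([(0.0,), (1.0,), (10.0,), (11.0,), (20.0,)], k=3)
```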

K-means: discussion

•  Tries to minimise sum of squared distances of items to cluster centers
   –  Computationally hard problem
   –  Algorithm gives local optimum
•  Depends on initialisation (starting set of centers)
   –  Can give poor results
   –  Slow speed
•  The right ‘k’ may be unknown
   –  Possible strategy: try different possibilities and take the best
•  Can be improved by heuristics like choosing centers carefully
   –  E.g. choosing centers to be as far apart as possible: choose one, choose the point farthest from it, choose the point farthest from both (maximise the minimum distance to the existing set), etc.
   –  Try multiple times and take best result
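
The “as far apart as possible” heuristic can be sketched as a farthest-first traversal. This assumes points are tuples of floats and that the first center is simply the item at index `first` (an arbitrary, illustrative choice).

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k, first=0):
    centers = [points[first]]
    # min_d[i]: squared distance from point i to its nearest chosen center
    min_d = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Pick the point that maximises the minimum distance to the chosen set.
        idx = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(points[idx])
        min_d = [min(d, sq_dist(p, points[idx])) for d, p in zip(min_d, points)]
    return centers


pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
init_centers = farthest_first_centers(pts, k=3)
```

The returned centers could then be used as the starting set for Lloyd’s algorithm in place of a random choice.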

K-medoids

•  Similar, but now each center must be one of the given items
   –  In each cluster, find the item that is the best ‘center’ and repeat
•  Useful when there is no ambient space (extrinsic metric)
   –  E.g. a distance between items can be computed between nodes, but they are not in any particular Euclidean space, so the ‘center’ is not a meaningful point
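
A minimal sketch of this idea that works purely from a distance matrix, so no coordinates are needed. It assumes `dist[i][j]` is a symmetric matrix with positive distances between distinct items, and it starts from the first k items as medoids (an arbitrary, illustrative initialisation). Like Lloyd’s algorithm, it is a heuristic and may stop at a local optimum.

```python
def assign(dist, medoids):
    """Map each medoid to the list of items closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    medoids = list(range(k))  # assumed initialisation: the first k items
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # In each cluster, the best 'center' is the member with the
        # smallest total distance to the other members.
        new_medoids = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Distances between six items that sit at positions 0, 1, 2, 10, 11, 12.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, medoid_clusters = k_medoids(dist, k=2)
```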

Other center based methods

•  K-center: Minimise maximum distance to center:
•  K-median: Minimise sum of distances:
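
The formulas for these two objectives did not survive in the transcript. Under the usual definitions, and assuming the notation X for the set of items, c_1, …, c_k for the centers and d for the distance (none of which is taken from the slides), they can be written as follows, with the k-means cost shown for comparison:

```latex
\begin{align*}
\text{k-center:} \quad & \min_{c_1,\dots,c_k} \; \max_{x \in X} \; \min_{1 \le i \le k} d(x, c_i) \\
\text{k-median:} \quad & \min_{c_1,\dots,c_k} \; \sum_{x \in X} \; \min_{1 \le i \le k} d(x, c_i) \\
\text{k-means:}  \quad & \min_{c_1,\dots,c_k} \; \sum_{x \in X} \; \min_{1 \le i \le k} d(x, c_i)^2
\end{align*}
```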

Hierarchical clustering

•  Hierarchically group items

Hierarchical clustering

•  Top down (divisive):
   –  Start with everything in 1 cluster
   –  Make the best division, and repeat in each subcluster
•  Bottom up (agglomerative):
   –  Start with n different clusters
   –  Merge two at a time by finding pairs that give the best improvement

Hierarchical clustering

•  Gives many options for a flat clustering
•  Problem: what is a good ‘cut’ of the dendrogram?

Density based clustering

•  Group dense regions together
•  Better at non-linear separations
•  Works with unknown number of clusters

DBSCAN

•  Density at a data point:
   –  Number of data points within radius Eps
•  A core point:
   –  Point with density at least τ
•  Border point
   –  Density less than τ, but at least one core point within radius Eps
•  Noise point
   –  Neither core nor border. Far from dense regions
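
The three point types can be computed directly from these definitions. The sketch below covers only this labelling step, not the full DBSCAN cluster-growing procedure, and it assumes that a point counts itself in its own density (conventions differ). The names `classify_points`, `eps` and `tau` are illustrative.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify_points(points, eps, tau):
    """Label each point 'core', 'border' or 'noise'."""
    n = len(points)
    eps_sq = eps * eps
    # neighbours[i]: indices of points within radius eps of point i,
    # including i itself (assumed convention).
    neighbours = [
        [j for j in range(n) if sq_dist(points[i], points[j]) <= eps_sq]
        for i in range(n)
    ]
    is_core = [len(nb) >= tau for nb in neighbours]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels


sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (8.0, 8.0)]
labels = classify_points(sample, eps=1.5, tau=4)
```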

DBSCAN: Discussions

•  Requires knowledge of suitable radius and density parameters (Eps and τ)
•  Does not allow for possibility that different clusters may have different densities

Density based clustering

•  Single linkage (same as Kruskal’s MST algorithm)
   –  Start with n clusters
   –  Merge two clusters with the shortest bridging link
   –  Repeat until k clusters
•  Other, more robust methods exist
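
Single linkage can be sketched as Kruskal’s algorithm stopped early, when k components remain. This assumes nodes labelled 0..n-1 and edges given as (weight, u, v) tuples; the union-find helper and the function name are illustrative.

```python
def single_linkage(n, weighted_edges, k):
    parent = list(range(n))  # union-find forest: every node starts alone

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n
    # Process links from shortest to longest, exactly as Kruskal does.
    for w, u, v in sorted(weighted_edges):
        if clusters == k:
            break
        ru, rv = find(u), find(v)
        if ru != rv:  # the link bridges two different clusters: merge them
            parent[ru] = rv
            clusters -= 1
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


linkage_clusters = single_linkage(
    5, [(1, 0, 1), (1, 1, 2), (5, 2, 3), (1, 3, 4)], k=2
)
```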

Communities

•  Groups of friends
•  Colleagues/collaborators
•  Web pages on similar topics
•  Biological reaction groups
•  Similar customers/users…

Other applications

•  A coarser representation of networks
•  One or more meta-node for each community
•  Identify bridges/weak-links
•  Structural holes

Community detection in networks

•  A simple strategy:
   –  Choose a suitable distance measure based on available data
      •  E.g. path lengths; distance based on inverse tie strengths; size of largest enclosing group or common attribute; distance in a spectral (eigenvector) embedding; etc.
   –  Apply a standard clustering algorithm

Clustering is not always suitable in networks

•  Small world networks have small diameter
   –  And sometimes integer distances
   –  A distance based method does not have a lot of options to represent similarities/dissimilarities
•  High degree nodes are common
   –  Connect different communities
   –  Hard to separate communities
•  Edge densities vary across the network
   –  Same threshold does not work well everywhere

Definitions of communities

•  Varies. Depending on application
•  General idea: Dense subgraphs: More links within community, few links outside
•  Some types and considerations:
   –  Partitions: Each node in exactly one community
   –  Overlapping: Each node can be in multiple communities

Finding dense subgraphs is hard in general

•  Finding largest clique
   –  NP-hard
   –  Computationally intractable
   –  Polynomial time (efficient) algorithms unlikely to exist
•  Decision version: Does a clique of size k exist?
   –  NP-complete
   –  Computationally intractable
   –  Polynomial time (efficient) algorithms unlikely to exist

Dense subgraphs: Few preliminary definitions

•  For S, T subsets of V
•  e(S,T): Set of edges from S to T
   –  e(S) = e(S,S): Edges within S
•  d_S(v): number of edges from v to S
•  Edge density of S: |e(S)|/|S|
   –  Largest for complete graphs or cliques
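
These definitions translate directly into code. The sketch below assumes a simple undirected graph given as a list of (u, v) pairs with no self-loops or repeated edges; the function names are illustrative.

```python
def edges_between(edges, S, T):
    """e(S, T): edges with one endpoint in S and the other in T."""
    S, T = set(S), set(T)
    return [(u, v) for u, v in edges
            if (u in S and v in T) or (u in T and v in S)]


def degree_into(edges, v, S):
    """d_S(v): number of edges from v to nodes of S."""
    S = set(S)
    return sum(1 for a, b in edges
               if (a == v and b in S) or (b == v and a in S))


def edge_density(edges, S):
    """|e(S)| / |S|, where e(S) = e(S, S)."""
    return len(edges_between(edges, S, S)) / len(S)


# A 4-clique on {0, 1, 2, 3} with one extra node 4 attached to node 3.
graph_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
clique_density = edge_density(graph_edges, {0, 1, 2, 3})
whole_density = edge_density(graph_edges, {0, 1, 2, 3, 4})
```

Here the clique alone is denser than the whole graph, which is the kind of subset a densest-subgraph method looks for.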

Dense subgraph

•  The subgraph with largest edge density
•  There also exists a decision version:
   –  Is there a subgraph with edge density > α
•  Can be solved using Max Flow algorithms
   –  O(n²m): inefficient in large datasets
   –  Finds the one densest subgraph
•  Variant: Find densest S containing given subset X
•  Other versions: Find subgraphs of size k or less
   –  NP-hard

Efficient approximation for finding dense S containing X

•  Gives a 1/2 approximation
•  Edge density of output set S is at least half of optimal set S*
•  (Proof in Kempe 2011).
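
The algorithm itself did not survive in the transcript. A standard greedy “peeling” procedure, which we assume is the one meant here, repeatedly removes the node outside X with the fewest edges into the current set and returns the densest set seen along the way. The sketch below assumes a simple undirected graph with comparable node labels; all names are illustrative.

```python
def greedy_dense_subgraph(nodes, edges, X=()):
    """Greedy peeling; returns (best set found, its edge density)."""
    X = set(X)
    S = set(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(edges)  # number of edges inside the current S
    best_S, best_density = set(S), m / len(S)
    while S - X:
        # Remove the node outside X with the fewest edges into S
        # (ties broken by node label).
        v = min(S - X, key=lambda u: (len(adj[u] & S), u))
        m -= len(adj[v] & S)
        S.remove(v)
        if S and m / len(S) > best_density:
            best_S, best_density = set(S), m / len(S)
    return best_S, best_density


# A 4-clique on {0, 1, 2, 3} with a tail 3 - 4 - 5 - 6.
tail_nodes = list(range(7))
tail_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
              (3, 4), (4, 5), (5, 6)]
free_S, free_density = greedy_dense_subgraph(tail_nodes, tail_edges)
forced_S, forced_density = greedy_dense_subgraph(tail_nodes, tail_edges, X=[6])
```

With no constraint the clique is returned; forcing node 6 into the set changes the answer, since every candidate must now contain it.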

Modularity

•  We want to find the many communities, not just one
•  Clustering a graph
•  Problem: What is the right clustering?
•  Idea: Maximize a quantity called modularity

Modularity of subset S

•  Given graph G
•  Consider a random graph G’ with same node degrees (remember configuration model)
   –  Number of edges in S in G: |e(S)|_G
   –  Expected number of edges in S in G’: E[|e(S)|_G’]
   –  Modularity of S: |e(S)|_G - E[|e(S)|_G’]
   –  More coherent communities have more edges inside than would be expected in a random graph with same degrees
   –  Note: modularity can be negative

Modularity of a clustering

•  Take a partition (clustering) of V:
•  Write d(S_i) for sum of degrees of all nodes in S_i
•  Can be shown that E[|e(S_i)|_G’] ~ d(S_i)²
•  Definition: Sum over the partition:
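
The formula for the definition is missing from the transcript. A common normalised form, consistent with the quantities above, is Q = Σ_i [ |e(S_i)|/m - (d(S_i)/(2m))² ], where m is the total number of edges; the expected number of edges inside S_i in the random graph is then approximately d(S_i)²/(4m). The sketch below computes this form, assuming a simple undirected edge list and a partition given as a list of node sets; treat the normalisation as an assumption rather than the slide’s exact definition.

```python
def modularity(edges, partition):
    """Q = sum_i [ |e(S_i)| / m - (d(S_i) / (2m))^2 ] (assumed normalisation)."""
    m = len(edges)
    community = {v: i for i, S in enumerate(partition) for v in S}
    internal = [0] * len(partition)    # |e(S_i)|: edges inside each part
    degree_sum = [0] * len(partition)  # d(S_i): sum of degrees in each part
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(internal, degree_sum))


# Two triangles joined by the single edge (2, 3).
tri_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(tri_edges, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(tri_edges, [{0, 1, 2, 3, 4, 5}])
```

Splitting along the bridge scores higher than keeping everything in one cluster, which always scores zero under this form.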

Modularity based clustering

•  Modularity is meant for use more as a measure of quality, not so much as a clustering method
•  Finding clustering with highest modularity is NP-hard
•  Heuristic:
   –  Use modularity matrix
   –  Take its first eigenvector
•  Note: Modularity is a relative measure for comparing community structure.
•  Not entirely clear in which cases it may or may not give good results
•  A threshold of 0.3 or more is sometimes considered to give good clustering
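
The eigenvector heuristic can be sketched as follows, assuming the usual modularity matrix B_ij = A_ij - k_i k_j / (2m) and a split of the nodes by the sign of the leading eigenvector’s entries. The eigenvector is found here by plain power iteration on a shifted matrix; the shift, the start vector and the iteration count are illustrative choices, and a numerical library would normally be used instead.

```python
def leading_eigenvector_split(n, edges, iterations=500):
    """Split nodes 0..n-1 in two by the sign of B's leading eigenvector."""
    degree = [0] * n
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1.0
        A[v][u] += 1.0
        degree[u] += 1
        degree[v] += 1
    two_m = sum(degree)
    # Modularity matrix: observed adjacency minus expected under same degrees.
    B = [[A[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Adding shift * I makes every eigenvalue non-negative, so power
    # iteration converges to the eigenvector of B's largest eigenvalue.
    shift = max(sum(abs(x) for x in row) for row in B)
    x = [float(i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(iterations):
        y = [sum(B[i][j] * x[j] for j in range(n)) + shift * x[i]
             for i in range(n)]
        norm = sum(val * val for val in y) ** 0.5
        x = [val / norm for val in y]
    side_a = [i for i in range(n) if x[i] >= 0]
    side_b = [i for i in range(n) if x[i] < 0]
    return side_a, side_b


# Two triangles joined by the single edge (2, 3).
tri_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
side_a, side_b = leading_eigenvector_split(6, tri_edges)
```

This gives only a two-way split; finding more communities would need the step to be repeated inside each part, which is not shown here.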

•  Can be used as a stopping criterion (or finding right level of partitioning) in other methods
   –  E.g. Girvan-Newman

Karate club hierarchic clustering

•  Shape of nodes gives actual split in the club due to internal conflicts
   –  Newman 2003

Overlapping communities

Non-overlapping communities

Overlapping communities

Affiliation graph model

•  Generative model:
•  Each node belongs to some communities
•  If both a and b are in community c
   –  Edge (a,b) is created with probability p_c
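
The generative process can be sketched as a small sampler. It assumes communities are given as a dict from community name to member list and probabilities as a dict from community name to p_c; both formats, and the `seed` parameter, are illustrative. A pair that shares several communities gets one independent chance per shared community.

```python
import random
from itertools import combinations


def sample_agm(communities, p, seed=0):
    """Sample an edge set from the affiliation graph model.

    communities: dict mapping community name -> list of member nodes
    p: dict mapping community name -> edge probability p_c
    """
    rng = random.Random(seed)
    edges = set()
    for c, members in communities.items():
        # Every pair inside community c is linked with probability p_c.
        for a, b in combinations(sorted(members), 2):
            if rng.random() < p[c]:
                edges.add((a, b))
    return edges


# Node 2 belongs to both communities.
agm_edges = sample_agm({"A": [0, 1, 2], "B": [2, 3]}, {"A": 1.0, "B": 0.0})
```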

Affiliation graph model

•  Problem:
•  Given the network, recover:
   –  Communities: C
   –  Memberships or Affiliations: M
   –  Probabilities: p_c


Maximum likelihood estimation

•  Given data X
•  Assume data is generated by some model f with parameters Θ
•  Express probability P[f(X|Θ)]: f generates X, given specific values of Θ.
•  Compute argmax_Θ (P[f(X|Θ)])
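
A toy instance, not taken from the slides: assume a single community in which each of n node pairs is linked independently with unknown probability p, and k of the pairs are observed to be linked. The likelihood is p^k (1-p)^(n-k), and a simple grid search over candidate values of p recovers the familiar estimate k/n.

```python
import math


def log_likelihood(p, k, n):
    """Log of p^k * (1 - p)^(n - k): k linked pairs out of n."""
    return k * math.log(p) + (n - k) * math.log(1 - p)


# Candidate parameter values 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]

# argmax over the candidates for k = 7 linked pairs out of n = 10.
p_hat = max(candidates, key=lambda p: log_likelihood(p, 7, 10))
```

Working with the logarithm does not change the argmax and avoids very small products when there are many observations.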

MLE for AGM: The BIGCLAM method

•  Finding the best possible bipartite network is computationally hard (too many possibilities)
•  Instead, take a model where memberships are real numbers: Membership strengths
   –  F_uA: Strength of membership of u in A
   –  P_A(u,v) = 1 - exp(-F_uA · F_vA): Each community links independently, by product of strengths
   –  Total probability of an edge existing:
      •  P(u,v) = 1 - Π_C (1 - P_C(u,v))
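
The two formulas above can be evaluated directly. The sketch assumes F is a dict mapping each node to a list of non-negative membership strengths, one entry per community (an illustrative layout). Since each factor 1 - P_C(u,v) equals exp(-F_uC · F_vC), the product collapses to P(u,v) = 1 - exp(-Σ_C F_uC · F_vC), which the usage lines check numerically.

```python
import math


def community_edge_prob(F, u, v, c):
    """P_c(u, v) = 1 - exp(-F_uc * F_vc)."""
    return 1.0 - math.exp(-F[u][c] * F[v][c])


def edge_prob(F, u, v):
    """P(u, v) = 1 - prod_c (1 - P_c(u, v))."""
    p_no_edge = 1.0
    for c in range(len(F[u])):
        p_no_edge *= 1.0 - community_edge_prob(F, u, v, c)
    return 1.0 - p_no_edge


# Three nodes, two communities (strengths are made-up example values).
F = {0: [1.0, 0.0], 1: [2.0, 0.5], 2: [0.0, 1.0]}
p01 = edge_prob(F, 0, 1)
p02 = edge_prob(F, 0, 2)
p12 = edge_prob(F, 1, 2)
dot_12 = sum(a * b for a, b in zip(F[1], F[2]))
```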

BIGCLAM

•  Find the F that maximizes the likelihood that exactly the right set of edges exist.
•  Details omitted
•  Optionally, see
•  Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.

Correlation clustering

•  Some edges are known to be similar/friends/trusted
   •  marked “+”
•  Some edges are known to be dissimilar/enemies/distrusted
   •  marked “-”
•  Maximize the number of + edges inside clusters and
•  Maximize the number of - edges between clusters

Applications

•  Community detection based on similar people/users
•  Document clustering based on known similarity or dissimilarity between documents

Features

•  Clustering without need to know number of clusters
   –  k-means, medians, clusters etc need to know number of clusters or other parameters like threshold
   –  Number of clusters depends on network structure
•  Actually, does not need any parameter
•  NP hard
•  Note that graph may be complete or not complete
   –  In some applications with unlabeled edges, it may be reasonable to change edges to “+” edges and non-edges to “-” edges

Approximation

•  Naive 1/2 approximation (not very useful):
   –  If there are more + edges
      •  Put them all in 1 cluster
   –  If there are more - edges
      •  Put nodes in n different clusters
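
The naive rule and the agreement score it is measured against can be sketched as follows, assuming edges are given as (u, v, sign) tuples with sign "+" or "-" (an illustrative format). Whichever sign is in the majority is fully satisfied, so at least half of all edges agree with the output, and no clustering can agree with more than all of them.

```python
def agreements(signed_edges, cluster_of):
    """Count + edges inside clusters plus - edges between clusters."""
    count = 0
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == "+" and same) or (sign == "-" and not same):
            count += 1
    return count


def naive_clustering(nodes, signed_edges):
    """Return a dict node -> cluster id following the naive rule."""
    plus = sum(1 for _, _, s in signed_edges if s == "+")
    minus = len(signed_edges) - plus
    if plus >= minus:
        return {v: 0 for v in nodes}            # everything in 1 cluster
    return {v: i for i, v in enumerate(nodes)}  # n singleton clusters


signed = [(0, 1, "+"), (1, 2, "+"), (0, 2, "+"), (2, 3, "-"), (0, 3, "-")]
naive = naive_clustering([0, 1, 2, 3], signed)
naive_score = agreements(signed, naive)
better_score = agreements(signed, {0: 0, 1: 0, 2: 0, 3: 1})
```

On this input the naive rule satisfies 3 of 5 edges, while separating node 3 satisfies all 5, which illustrates why the guarantee is not very useful in practice.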

Better approximations

•  2 ways of looking at it:
   –  Maximize agreement or Minimize disagreement
   –  Similar idea, but we know different approximation algorithms
•  Nikhil Bansal et al. develop PTAS (polynomial time approximation scheme) for maximizing agreement:
   –  (1-ε) approximation, running time

Approximation

•  Min-disagree:
   –  4-approximation