Social and Technological Network Data Analytics Lecture 5: Structure of the Web, Search and Power Laws Prof Cecilia Mascolo

In This Lecture

• We describe power law networks and their properties, and show examples of networks that are power law in nature, including the Web.

• We present the preferential attachment model, which allows the generation of power law networks.

• We study prediction of power laws.

• We introduce search and PageRank.

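The Web is a Graph…

[Figure: a chain of linked pages. This course page links (via "Me") to my web page, which links (via "My Profile") to my profile page, which links (via "Link to NSPCC") to the NSPCC page.]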

Precursor of hypertexts

• Citation networks of books and articles.

• Difference: links point only backwards in time.

Web is a Directed Graph

• Path: A path from A to B exists if there is a sequence of nodes beginning with A and ending with B such that each consecutive pair of nodes is connected by an edge pointing in the forward direction.

[Figure: example directed graph on nodes A, B, C, D and E]

Strongly Connected Component

• A strongly connected component (SCC) in a directed graph is a subset of nodes such that:

i) Every pair of nodes in the subset has a path to each other.

ii) The subset is not part of some larger subset with property i).

• A weakly connected component (WCC) is a connected component of the undirected graph derived from the directed graph.

– Two nodes can be in the same WCC even if there is no directed path between them.
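
The SCC definition above can be made concrete with a short sketch. The code below uses Kosaraju's two-pass depth-first search, which is one standard way to compute SCCs and is not named on the slides; the function name and the five-node graph are assumptions chosen for illustration.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Hypothetical graph: A -> B -> C -> A form a cycle; C -> D -> E is a tail.
example_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
sccs = strongly_connected_components(example_edges)
print(sccs)
```

In this example A, B and C can all reach each other, so they form one SCC, while D and E each form an SCC of their own. All five nodes belong to a single WCC once edge directions are ignored.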

SCC example

The Web

• Broder ’00

• Data from AltaVista (200 million pages)

• 186M nodes in the WCC (about 90% of the pages)

Popularity of Web Pages

• How do we expect the popularity of web pages to be distributed?

– What fraction of web pages have k in-links?

– If each page decides independently at random whether to link to any given other page, then the number of in-links of a page is the sum of independent random quantities, so it is approximately normally distributed.

– In this case, the number of pages with k in-links decreases exponentially in k.

– Is this true for the Web?

Degree distribution for the Web

• Finding: the degree distribution is approximately proportional to $1/k^2$.

• $1/k^2$ decreases much more slowly than a normal distribution.

Power Law vs Exponential

• Power law: $p(x) = x^{-\alpha}$

• Exponential: $p(x) = e^{-\lambda x}$
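
To get a feel for how differently the two functions decay, the sketch below evaluates both at a few values of x. The exponent 2 mirrors the roughly 1/k^2 finding for the Web; the rate λ = 1 and the function names are arbitrary assumptions made only for illustration.

```python
import math

ALPHA = 2.0   # power-law exponent, matching the ~1/k^2 finding for the Web
LAMBDA = 1.0  # exponential rate; an arbitrary value chosen for illustration

def power_law(x, alpha=ALPHA):
    """Unnormalised power law p(x) = x^(-alpha)."""
    return x ** -alpha

def exponential(x, lam=LAMBDA):
    """Unnormalised exponential p(x) = e^(-lambda * x)."""
    return math.exp(-lam * x)

for x in (1, 10, 100, 1000):
    print(f"x={x:>5}  power law={power_law(x):.3e}  exponential={exponential(x):.3e}")
```

The power law loses two orders of magnitude each time x grows tenfold, whereas the exponential collapses far faster, which is why very high-degree pages remain plausible under a power law.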

Distribution of WCC and SCC

Reachability

• Followed links backwards and forwards

Diameter of the Web

• 75% of the time there is no directed path between two random nodes

• Average distance of existing paths: 16

• Average distance of undirected paths: 6.83

• Diameter in the SCC is at least 28

Power Laws aka Scale Free Networks

• We have seen that the degree distribution follows a straight line on a log-log plot.

• α defines the slope of the line.

• α is typically between 2 and 3.

$\ln p_k = -\alpha \ln k + c$

$p_k = C k^{-\alpha}$
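
Since the relation is a straight line in log-log space, α can be read off as the negative slope. Below is a minimal sketch on synthetic, noise-free data; the function name and the chosen values of α and C are assumptions for illustration. On real data, Clauset et al. (listed in the references) caution that a plain least-squares fit of this kind can be unreliable.

```python
import math

def fit_power_law_exponent(ks, pks):
    """Least-squares fit of ln p_k = -alpha * ln k + c; returns (alpha, C)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # alpha is minus the slope; C = e^c is recovered from the intercept.
    return -slope, math.exp(intercept)

# Synthetic, noise-free data generated from p_k = C * k^(-alpha).
TRUE_ALPHA, TRUE_C = 2.5, 0.7
degrees = list(range(1, 51))
fractions = [TRUE_C * k ** -TRUE_ALPHA for k in degrees]

alpha_hat, c_hat = fit_power_law_exponent(degrees, fractions)
print(alpha_hat, c_hat)
```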

Power Laws in various domains

What does it mean?

Random vs Power Law Networks

Example

What’s a good model for scale free networks?

• Let’s use the web network as an example:

• Pages are created in order (1, 2, 3, ...).

• Page j is created and links to an earlier page in the following way:

– With prob. p, j chooses a page i at random and links to it;

– With prob. 1-p, j chooses a page i at random and links to the page that i points to.

– Repeat.

• The middle step is essentially a copy of node i’s behaviour…

Preferential attachment

• Pages are created in order (1, 2, 3, ...).

• Page j is created and links to an earlier page in the following way:

– With prob. p, j chooses a page i at random and links to it;

– With prob. 1-p, j chooses a page z with probability proportional to z’s current number of in-links and links to z (i.e. proportional to in-degree).

– Repeat.

• Rich-get-richer model: if we run this for many pages, the fraction of pages with k in-links will be distributed approximately according to a power law $1/k^c$, where c depends on p.
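
The process above can be simulated in a few lines. This is a minimal sketch under stated assumptions: each new page creates exactly one link, the very first link is chosen uniformly because no page has any in-links yet, and the function name, seed and parameter values are illustrative choices rather than anything from the slides.

```python
import random
from collections import Counter

def simulate_rich_get_richer(num_pages, p, seed=0):
    """Simulate the model; returns the in-link count of pages 1..num_pages."""
    rng = random.Random(seed)
    in_links = [0] * (num_pages + 1)   # index 0 unused; pages are 1..num_pages
    link_targets = []                  # one entry per link created so far
    for j in range(2, num_pages + 1):
        if not link_targets or rng.random() < p:
            # Uniform choice over the earlier pages 1..j-1.
            target = rng.randint(1, j - 1)
        else:
            # Picking a past link endpoint uniformly selects a page with
            # probability proportional to its current number of in-links.
            target = rng.choice(link_targets)
        in_links[target] += 1
        link_targets.append(target)
    return in_links[1:]

in_link_counts = simulate_rich_get_richer(num_pages=20000, p=0.3)
histogram = Counter(in_link_counts)
for k in sorted(histogram)[:10]:
    print(k, histogram[k] / len(in_link_counts))
```

Plotting the resulting fractions against k on log-log axes should show a roughly straight line for larger k, with a few early pages accumulating a very large number of in-links.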

Intuition

• With probability 1-p, page j chooses a page i with probability proportional to i’s number of in-links and creates a link to i.

• This mechanism predicts that the growth happens so that:

– A page’s popularity grows at a rate proportional to its current value.

– The rich-get-richer effect amplifies the larger values.
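
The growth argument can be turned into a heuristic calculation. The sketch below is a deterministic approximation rather than a proof; it assumes that the number of in-links $x_j(t)$ of page j is treated as a continuous function of time t (measured in pages created so far), that each page creates one link, and that page j starts with no in-links.

```latex
\begin{align*}
\frac{dx_j}{dt} &= \frac{p}{t} + \frac{(1-p)\,x_j}{t}, \qquad x_j(j) = 0 \\
x_j(t) &= \frac{p}{1-p}\left[\left(\frac{t}{j}\right)^{1-p} - 1\right] \\
x_j(t) \ge k &\iff \frac{j}{t} \le \left[\frac{1-p}{p}\,k + 1\right]^{-1/(1-p)} \\
\Pr[\text{in-links} = k] &\approx -\frac{d}{dk}\left[\frac{1-p}{p}\,k + 1\right]^{-1/(1-p)}
  \;\propto\; k^{-c} \ \text{for large } k, \qquad c = 1 + \frac{1}{1-p}
\end{align*}
```

Under these assumptions the exponent c grows as p approaches 1 (mostly uniform linking) and approaches 2 as p approaches 0 (mostly popularity-driven linking).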

Preferential Attachment

• What have we shown?

• There is a “copying” behaviour happening in these networks, where nodes seem to emulate other nodes.

• This has been shown to hold for the selection of books, songs, web pages, movies etc.

How predictable is the rich-get-richer process?

• Is the popularity of items in the power law predictable?

• Would a popular book still be popular if we went back in time and started the process again?

• Experiments show it would not…

Unpredictability [Salganik et al 06]

• 48 songs, 14,000 participants, 8 servers

View of the curve

• The way we have seen the curve so far…

We concentrated on this

Let’s transform the function

• If the initial function is a power law, this one is too (we do not prove this)

Sale ranking

Niche tastes

Popularity means this

Search

– Information retrieval problem: synonyms (jump/leap), polysemy (Leopard), etc.

– Now with the Web: diversity in authoring introduces issues of common criteria for ranking documents.

– The Web offers an abundance of information: whom do we trust as a source?

• Still one issue: static content versus real time

– World Trade Center query on 11/9/01

– Twitter helps to solve these issues these days

Automate the Search

• When searching “Computer Laboratory” on Google, the first link is for the department’s page.

• How does Google know this is the best answer?

• We could collect a large sample of pages relevant to “computer laboratory” and collect their votes through their links.

• The pages receiving more in-links are ranked first.

• But if we use the network structure more deeply we can improve results.

Example: Query “newspaper”

Authorities

• Links are seen as votes.

• Authorities are established: the highly endorsed pages.

A Refinement: Hubs

• Numbers are reported back on the source page and aggregated.

• Hubs are high-value lists.

Principle of Repeated Improvement

• And we are now reweighting the authorities

• When do we stop?

Repeating and Normalizing

• The process can be repeated

• Normalization:

– Each authority score is divided by the sum of all authority scores

– Each hub score is divided by the sum of all hub scores

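More Formally: does the process converge?

• Each page has an authority score $a_i$ and a hub score $h_i$

• Initially $a_i = h_i = 1$

• At each step:

$a_i = \sum_{j \to i} h_j$ (sum over the pages j that link to i)

$h_j = \sum_{j \to i} a_i$ (sum over the pages i that j links to)

• Normalize:

$\sum_i a_i = 1$

$\sum_j h_j = 1$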
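
The update and normalization steps can be sketched as follows. The code assumes the authority update is applied first and the hub update then uses the fresh authority scores, which is one common ordering; the function name, the step count and the small example graph are illustrative assumptions.

```python
def hubs_and_authorities(edges, steps=50):
    """Run the repeated-improvement updates; edges are (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(steps):
        # Authority update: a_i = sum of h_j over pages j that link to i.
        new_auth = {n: 0.0 for n in nodes}
        for j, i in edges:
            new_auth[i] += hub[j]
        # Hub update: h_j = sum of a_i over pages i that j links to.
        new_hub = {n: 0.0 for n in nodes}
        for j, i in edges:
            new_hub[j] += new_auth[i]
        # Normalize so that each set of scores sums to 1.
        auth_total, hub_total = sum(new_auth.values()), sum(new_hub.values())
        auth = {n: v / auth_total for n, v in new_auth.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: two list pages (h1, h2) pointing at three sites.
example_edges = [("h1", "x"), ("h1", "y"), ("h2", "y"), ("h2", "z"), ("h2", "x")]
authority, hub = hubs_and_authorities(example_edges)
print(authority)
print(hub)
```

Here x and y are endorsed by both list pages and end up with equal authority above z, while h2 scores higher as a hub than h1 because it points to everything h1 does and more.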

The process converges

PageRank

• We have seen hubs and authorities

– Hubs can “collect” links to important authorities who do not point to each other

– There are other models, better suited to the Web, where one prominent page can endorse another.

• The PageRank model is based on transferable importance.

PageRank Concepts

• Pages pass endorsements on outgoing links as fractions which depend on out-degree

• Initial PageRank value of each node in a network of n nodes: 1/n.

• Choose a number of steps k.

• [Basic] Update rule: each page divides its PageRank equally over the outgoing links and passes an equal share to the pointed pages. Each page’s new rank is the sum of received PageRanks.
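
A minimal sketch of the basic update rule, using exact fractions so the shares are easy to follow. It assumes every page has at least one outgoing link (otherwise the share is undefined); the function name and the three-page network are hypothetical and are not the eight-page example from the lecture.

```python
from fractions import Fraction

def basic_pagerank(out_links, steps):
    """Apply the basic update rule `steps` times; out_links maps page -> list of pages."""
    n = len(out_links)
    rank = {page: Fraction(1, n) for page in out_links}   # initial value 1/n
    for _ in range(steps):
        new_rank = {page: Fraction(0) for page in out_links}
        for page, targets in out_links.items():
            share = rank[page] / len(targets)   # equal share per outgoing link
            for target in targets:
                new_rank[target] += share       # new rank = sum of received shares
        rank = new_rank
    return rank

# Hypothetical 3-page network in which every page has at least one out-link.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
after_one_step = basic_pagerank(example_links, steps=1)
after_two_steps = basic_pagerank(example_links, steps=2)
print(after_one_step)
print(after_two_steps)
```

After one step C holds 1/2 because it receives half of A's rank and all of B's; after two steps that rank has flowed on to A. The total stays at 1 throughout, since rank is only passed along, never created or destroyed.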

Example

• All pages start with PageRank = 1/8

A becomes important, and B and C benefit too at step 2

Convergence

• Except for some special cases, the PageRank values of all nodes converge to limiting values when the number of steps goes to infinity.

• The convergence case is one where the PageRank of each page does not change any more, i.e., the values regenerate themselves.

Example of Equilibrium

Problems with the basic PageRank

Dead ends

• F, G converge to ½ and all the other nodes to 0

Solution: The REAL PageRank

• [Scaled] Update Rule:

– Apply the basic update rule. Then, scale down all values by a scaling factor s [chosen between 0 and 1].

– [Total network PageRank value changes from 1 to s]

– Divide the 1-s residual units of PageRank equally over all nodes: (1-s)/n each.

• It can be proven that the values converge again.

• Scaling factor usually chosen between 0.8 and 0.9
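
A minimal sketch of the scaled update rule follows. It assumes every page has at least one outgoing link, and uses s = 0.85 as one value from the usual 0.8 to 0.9 range; the function name, the step count and the four-page graph (where F and G link only to each other and so would absorb all rank under the basic rule) are illustrative assumptions.

```python
def scaled_pagerank(out_links, s=0.85, steps=100):
    """Apply the scaled update rule `steps` times; out_links maps page -> list of pages."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(steps):
        # Basic update rule: each page splits its rank over its outgoing links.
        new_rank = {page: 0.0 for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                new_rank[target] += rank[page] / len(targets)
        # Scale everything down by s, then share the residual 1 - s equally.
        rank = {page: s * value + (1.0 - s) / n for page, value in new_rank.items()}
    return rank

# Hypothetical graph: A and B link to each other, B also leaks into F,
# and F and G link only to each other.
example_links = {"A": ["B"], "B": ["A", "F"], "F": ["G"], "G": ["F"]}
ranks = scaled_pagerank(example_links)
print(ranks)
```

F and G still rank highest, but A and B keep a non-zero share because every page receives at least (1-s)/n at each step.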

Search Ranking is very important to business

• A change in results in the search pages might mean loss of business

– I.e., not appearing on the first page.

• Ranking algorithms are kept very secret and changed continuously.

Examples of Google Bombs

Random Walks

• Starting from a node, follow one outgoing link with equal probability

PageRank as Random Walk

• The probability of being at a page X after k steps of a random walk is precisely the PageRank of X after k applications of the Basic PageRank Update Rule

• Scaled Update Rule equivalent: follow a random outgoing link with probability s, while with probability 1-s jump to a random node in the network.
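
The equivalence for the basic rule can be checked exactly on a small graph by enumerating every possible walk. The sketch assumes the walk starts at a page chosen uniformly at random (matching the initial PageRank of 1/n) and that every page has at least one outgoing link; the function names and the three-page network are hypothetical.

```python
from fractions import Fraction

def walk_distribution(out_links, k):
    """Exact probability of each page after k random-walk steps, by enumerating walks."""
    pages = list(out_links)
    prob = {page: Fraction(0) for page in pages}

    def extend(page, steps_left, weight):
        if steps_left == 0:
            prob[page] += weight
            return
        targets = out_links[page]
        for target in targets:   # each outgoing link is followed with equal probability
            extend(target, steps_left - 1, weight / len(targets))

    for start in pages:          # assumed: uniformly random starting page
        extend(start, k, Fraction(1, len(pages)))
    return prob

def basic_pagerank(out_links, k):
    """PageRank values after k applications of the basic update rule."""
    rank = {page: Fraction(1, len(out_links)) for page in out_links}
    for _ in range(k):
        new_rank = {page: Fraction(0) for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                new_rank[target] += rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical 3-page network in which every page has at least one out-link.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for k in range(5):
    print(k, walk_distribution(example_links, k) == basic_pagerank(example_links, k))
```

Enumerating walks is exponential in k and only practical for tiny graphs; it is used here purely to make the correspondence visible.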

References

• Chapter 13, 14 and 18

• Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the Web. In Proc. 9th International World Wide Web Conference, pages 309-320, 2000.

• A. Clauset, C. R. Shalizi and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, Vol. 51, No. 4 (2 Feb 2009), 661.

• Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509-512, October 15, 1999.

• Matthew Salganik, Peter Dodds, and Duncan Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854-856, 2006.

Barabasi’s book has a good chapter on scale free networks too!