Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita...

21
Universit ` a degli Studi di Milano Dipartimento di Scienze dell’Informazione Rapporto interno N. 305-05 The Choice of a Damping Function for Propagating Importance in Link-Based Ranking Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo

Transcript of Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita...

Page 1: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

Universita degli Studi di Milano

Dipartimento di Scienze dellrsquoInformazione

Rapporto interno N 305-05

amp

$

The Choice of a Damping Function forPropagating Importance in Link-Based

Ranking

Ricardo Baeza-Yates Paolo Boldi Carlos Castillo

The Choice of a Damping Functionfor Propagating Importance in Link-Based Ranking

Ricardo Baeza-YatesICREA Professor

Depto de TecnologıaUniversitat Pompeu Fabra

Spain

Paolo Boldilowast

Dipartimento di ScienzedellrsquoInformazione

Universita degli Studi di MilanoItaly

Carlos CastilloCatedra Telefonica

Depto de TecnologıaUniversitat Pompeu Fabra

Spain

Abstract

This paper studies a family of link-based algorithms that propagate page importance through linksIn these algorithms there is a damping function that decreases with the distance so a direct link impliesmore endorsement than a link through a long path PageRank is the most widely known ranking functionof this family

We focus on three damping functions having linear exponential and hyperbolic decay on the lengthsof the paths The exponential decay corresponds to PageRank and the other functions are new Ouranalysis includes a comparison among them and experiments for studying their behavior under differentparameters

1 Introduction

One of the measures of importance of a scientific paper is the number of citations that the article receivesFollowing this idea several authors proposed to use links for ranking web pages [Mar97 JM98 Li98] how-ever it quickly become clear that just counting the links was not a very reliable measure of authoritativeness(it was not in scientific citations either) because it is very easy to manipulate in the context of the webwhere creating a page costs nearly nothing

The PageRank technique introduced by Pageet al [PBMW98] actually tries to mend this problem bylooking at the importance of a page in a recursive manner In other words ldquoa page with high PageRank is apage referenced by many pages with high PageRankrdquo PageRank not only counts the direct links to a pagebut also includes indirect links The same is valid for scientific citations

PageRank turns out to be a simple robust and reliable way to measure the importance of web pages andit can be computed in a very efficient way For these reasons most of todayrsquos commercial search enginesare believed to use PageRank as an important source of ranking

In this paper we

bull describe general ranking functions that depend on incoming paths of varying lengths

bull show that PageRank belongs to this class of functions

bull show how to compute these rankings

lowastPartially supported by MIUR COFIN Project ldquoLinguaggi formali e automirdquo

1

bull suggest why 085 is a good choice of the parameter for PageRank and

bull compare the ranking orders induced by different ranking functions finding ways of approximatingPageRank up to very high precision

The rest of this paper is organized as followsSection2 introduces the notion of functional rankingSec-tion 3 describes three damping functionsSection4 compares them analytically andSection5 experimentswith different parameters for each function Finallythe last sectionpresents our conclusions

2 Functional Rankings

In this section we introduce the notion offunctional ranking a general family of ranking functions thatincludes PageRank To describe PageRank formally consider a web graph ofN pages LetANtimesN be thelink matrix in this graphai j = 1 iff there is a link from pagei to pagej This link matrix is seldom used asit is mainly for two reasons

Normalization In the Web creating an out-link is free so there is an incentive for web page authors tocreate pages with many out-links this is the reason why a metaphor of ldquovotingrdquo is enforced [Lif00]in which each page has only one ldquovoterdquo that has to be split among its linked pages This is typicallydone in link-based ranking by normalizingA row-wise the normalization process means that everyweb page can only decide how to divide its own score among the pages it leads to but it cannot assignmore score than it has Another way to look at normalization is that the matrix is turned into thetransition matrix of a stochastic process

The normalization does not need to give each out-link the same value as there is evidence that weblinks have different purposes such as navigating in a multi-page set expanding the contents of the cur-rent page pointing to another resource etc [HG98] Also links within the same site can be consideredself-links and as such do not confer as much authority as a link between different sites indeed thereare ranking methods like BHITS [BH98] or Altavistarsquos Eigenrank that treat them differently Othercharacteristics of links such as if they appear at the beginning or the bottom of the page or if theyappear within a certain HTML element can also be used for non-uniform normalization [BYD04]

We will assume uniform normalization so if a page hasd out-links each of those links has a weightof 1d but the results of this paper can be applied to other forms of normalization

Dangling nodesSpecial attention should be paid to the possible presence of nodes with no outgoing arcs(also known as ldquodangling nodesrdquo) in fact dangling nodes fail to produce a row-stochastic matrixbecause the rows of dangling nodes are filled with zeroes Dangling nodes can be dealt with byadding an extra node that is linked to and from all other nodes or by introducing new arcs from eachdangling node to every node in the graph In our analysis we shall assume that all dangling nodeshave been eliminated already in some way so that we do not have to worry about their presence Allthe algorithms we will present can be modified so that dangling nodes can be dealt with explicitly andwith virtually no additional cost

Let P be the row-normalized link matrix of the graph withN nodes PageRankr(α) is defined as thestationary distribution of the matrix

αP+(1minusα)1Tv

whereα isin [01) is a parameter calleddamping factor(sometimes also called a dampening factor) andv isa fixedpreference vectorthat may represent the interests of a particular user or another ranking vector that

2

is used for weighting pages Note that the above matrix is ergodic (at least if every entry ofv is strictlypositive) so it has exactly one stationary distribution Even though most of our results can be easily restatedwith a non-uniform preference vectorv for the sake of clarity we shall only consider the uniform preference1N in the rest of the paper

As observed in [BSV05] the PageRank vectorr(α) can be written as

r(α) = (1minusα)infin

sumt=0

αt 1N

1Pt

Or in matricial form

r(α) = (1minusα)1N

1(I minusαP)minus1 ||αP||lt 1

There is an equivalent and actually very intriguing way of rewriting this formula mentioned in [NSW01]that leads to a conclusion similar to those of [Bri06] given a pathp= 〈x1x2 xk〉 in the graph we defineits branching contributionas follows

branching(p) =1

d1d2 middot middot middotdkminus1

whered j is the outdegree this is the number of outgoing arcs of nodex j Then the ranking of nodei according to PageRank is

r i(α) = sumpisinPath(minusi)

(1minusα)α|p|

Nbranching(p)

where Path(minus i) is the set of all paths into nodei and |p| is the length of pathp this is because(Pt)i j

contains the sum of the branching contributions of all paths of lengtht from i to j as one can easily show byinduction ont This way of expressing the PageRank of a node is interesting because it highlights the factthat the rank of a node is essentially obtained as a weighted sum of contributions coming from every pathentering into the node with weights that decay exponentially in the length of the path

A natural generalization of this idea consists in taking into consideration a rankingR of the generalform

R =infin

sumt=0

damping(t)1N

1middotPt

or equivalently

Ri = sumpisinPath(minusi)

damping(|p|) 1N

branching(p)

where the damping function is a suitable choice of weights We will refer to this form of ranking asfunc-tional ranking because it is parametrized by the damping function As we have seen generic PageRank isa functional ranking where the damping function

damping(t) = (1minusα)αt

decays exponentially fast

3

3 Damping Functions

In this section we show several functional rankings by describing their damping functions First we showwhich class of damping functions generate well-defined functional rankings

As shown in [Bri06 Corollary 24] for every pair of nodesi and j and for every lengtht

sumpisinPath(i j)|p|=t

branching(p)le 1

A more general property holds

Theorem 1 For every node i and every length t

sumpisinPath(iminus)|p|=t

branching(p) = 1

Proof By induction ont For t = 0 there is only one path fromi of length 0 and its branching is 1 Forthe inductive stepsumpisinPath(iminus)|p|=t+1branching(p) can be rewritten by observing that ifi has outdegreedi every path fromi of lengtht +1 is the concatenation ofi with a path of lengtht from an out-neighbor ofi

sumpisinPath(iminus)|p|=t+1

branching(p) = sumjirarr j

1di

sumpisinPath( jminus)|p|=t

branching(p) = sumjirarr j

1di

= 1

As a consequence to guarantee that the functional ranking is well-defined and normalized (ie that rankvalues sum to 1) we need

N

sumi=1

sumpisinPath(minusi)

damping(|p|) 1N

branching(p) = 1

that isinfin

sumt=0

damping(t)1N sum

pisinPath(minusminus)branching(p) = 1

UsingTheorem1 sumpisinPath(minusminus) branching(p) = N so the latter equality is equivalent to

infin

sumt=0

damping(t) = 1

Hence every choice of the damping function such thatsuminfint=0damping(t) = 1 yields a well-defined nor-

malized functional ranking Nonetheless not all choices are equivalent so we have to find out whichfunctions generate better rankings Since a direct link should be more valuable as a source of evidence thana distant link we focus on damping functions that are decreasing ont the length of the paths

31 Linear damping

Letrsquos start by considering a simple damping function such as

damping(t) =

2(Lminust)L(L+1) t lt L

0 t ge L

4

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 2: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

The Choice of a Damping Functionfor Propagating Importance in Link-Based Ranking

Ricardo Baeza-YatesICREA Professor

Depto de TecnologıaUniversitat Pompeu Fabra

Spain

Paolo Boldilowast

Dipartimento di ScienzedellrsquoInformazione

Universita degli Studi di MilanoItaly

Carlos CastilloCatedra Telefonica

Depto de TecnologıaUniversitat Pompeu Fabra

Spain

Abstract

This paper studies a family of link-based algorithms that propagate page importance through linksIn these algorithms there is a damping function that decreases with the distance so a direct link impliesmore endorsement than a link through a long path PageRank is the most widely known ranking functionof this family

We focus on three damping functions having linear exponential and hyperbolic decay on the lengthsof the paths The exponential decay corresponds to PageRank and the other functions are new Ouranalysis includes a comparison among them and experiments for studying their behavior under differentparameters

1 Introduction

One of the measures of importance of a scientific paper is the number of citations that the article receivesFollowing this idea several authors proposed to use links for ranking web pages [Mar97 JM98 Li98] how-ever it quickly become clear that just counting the links was not a very reliable measure of authoritativeness(it was not in scientific citations either) because it is very easy to manipulate in the context of the webwhere creating a page costs nearly nothing

The PageRank technique introduced by Pageet al [PBMW98] actually tries to mend this problem bylooking at the importance of a page in a recursive manner In other words ldquoa page with high PageRank is apage referenced by many pages with high PageRankrdquo PageRank not only counts the direct links to a pagebut also includes indirect links The same is valid for scientific citations

PageRank turns out to be a simple robust and reliable way to measure the importance of web pages andit can be computed in a very efficient way For these reasons most of todayrsquos commercial search enginesare believed to use PageRank as an important source of ranking

In this paper we

bull describe general ranking functions that depend on incoming paths of varying lengths

bull show that PageRank belongs to this class of functions

bull show how to compute these rankings

lowastPartially supported by MIUR COFIN Project ldquoLinguaggi formali e automirdquo

1

bull suggest why 085 is a good choice of the parameter for PageRank and

bull compare the ranking orders induced by different ranking functions finding ways of approximatingPageRank up to very high precision

The rest of this paper is organized as followsSection2 introduces the notion of functional rankingSec-tion 3 describes three damping functionsSection4 compares them analytically andSection5 experimentswith different parameters for each function Finallythe last sectionpresents our conclusions

2 Functional Rankings

In this section we introduce the notion offunctional ranking a general family of ranking functions thatincludes PageRank To describe PageRank formally consider a web graph ofN pages LetANtimesN be thelink matrix in this graphai j = 1 iff there is a link from pagei to pagej This link matrix is seldom used asit is mainly for two reasons

Normalization In the Web creating an out-link is free so there is an incentive for web page authors tocreate pages with many out-links this is the reason why a metaphor of ldquovotingrdquo is enforced [Lif00]in which each page has only one ldquovoterdquo that has to be split among its linked pages This is typicallydone in link-based ranking by normalizingA row-wise the normalization process means that everyweb page can only decide how to divide its own score among the pages it leads to but it cannot assignmore score than it has Another way to look at normalization is that the matrix is turned into thetransition matrix of a stochastic process

The normalization does not need to give each out-link the same value as there is evidence that weblinks have different purposes such as navigating in a multi-page set expanding the contents of the cur-rent page pointing to another resource etc [HG98] Also links within the same site can be consideredself-links and as such do not confer as much authority as a link between different sites indeed thereare ranking methods like BHITS [BH98] or Altavistarsquos Eigenrank that treat them differently Othercharacteristics of links such as if they appear at the beginning or the bottom of the page or if theyappear within a certain HTML element can also be used for non-uniform normalization [BYD04]

We will assume uniform normalization so if a page hasd out-links each of those links has a weightof 1d but the results of this paper can be applied to other forms of normalization

Dangling nodesSpecial attention should be paid to the possible presence of nodes with no outgoing arcs(also known as ldquodangling nodesrdquo) in fact dangling nodes fail to produce a row-stochastic matrixbecause the rows of dangling nodes are filled with zeroes Dangling nodes can be dealt with byadding an extra node that is linked to and from all other nodes or by introducing new arcs from eachdangling node to every node in the graph In our analysis we shall assume that all dangling nodeshave been eliminated already in some way so that we do not have to worry about their presence Allthe algorithms we will present can be modified so that dangling nodes can be dealt with explicitly andwith virtually no additional cost

Let P be the row-normalized link matrix of the graph withN nodes PageRankr(α) is defined as thestationary distribution of the matrix

αP+(1minusα)1Tv

whereα isin [01) is a parameter calleddamping factor(sometimes also called a dampening factor) andv isa fixedpreference vectorthat may represent the interests of a particular user or another ranking vector that

2

is used for weighting pages Note that the above matrix is ergodic (at least if every entry ofv is strictlypositive) so it has exactly one stationary distribution Even though most of our results can be easily restatedwith a non-uniform preference vectorv for the sake of clarity we shall only consider the uniform preference1N in the rest of the paper

As observed in [BSV05] the PageRank vectorr(α) can be written as

r(α) = (1minusα)infin

sumt=0

αt 1N

1Pt

Or in matricial form

r(α) = (1minusα)1N

1(I minusαP)minus1 ||αP||lt 1

There is an equivalent and actually very intriguing way of rewriting this formula mentioned in [NSW01]that leads to a conclusion similar to those of [Bri06] given a pathp= 〈x1x2 xk〉 in the graph we defineits branching contributionas follows

branching(p) =1

d1d2 middot middot middotdkminus1

whered j is the outdegree this is the number of outgoing arcs of nodex j Then the ranking of nodei according to PageRank is

r i(α) = sumpisinPath(minusi)

(1minusα)α|p|

Nbranching(p)

where Path(minus i) is the set of all paths into nodei and |p| is the length of pathp this is because(Pt)i j

contains the sum of the branching contributions of all paths of lengtht from i to j as one can easily show byinduction ont This way of expressing the PageRank of a node is interesting because it highlights the factthat the rank of a node is essentially obtained as a weighted sum of contributions coming from every pathentering into the node with weights that decay exponentially in the length of the path

A natural generalization of this idea consists in taking into consideration a rankingR of the generalform

R =infin

sumt=0

damping(t)1N

1middotPt

or equivalently

Ri = sumpisinPath(minusi)

damping(|p|) 1N

branching(p)

where the damping function is a suitable choice of weights We will refer to this form of ranking asfunc-tional ranking because it is parametrized by the damping function As we have seen generic PageRank isa functional ranking where the damping function

damping(t) = (1minusα)αt

decays exponentially fast

3

3 Damping Functions

In this section we show several functional rankings by describing their damping functions First we showwhich class of damping functions generate well-defined functional rankings

As shown in [Bri06 Corollary 24] for every pair of nodesi and j and for every lengtht

sumpisinPath(i j)|p|=t

branching(p)le 1

A more general property holds

Theorem 1 For every node i and every length t

sumpisinPath(iminus)|p|=t

branching(p) = 1

Proof By induction ont For t = 0 there is only one path fromi of length 0 and its branching is 1 Forthe inductive stepsumpisinPath(iminus)|p|=t+1branching(p) can be rewritten by observing that ifi has outdegreedi every path fromi of lengtht +1 is the concatenation ofi with a path of lengtht from an out-neighbor ofi

sumpisinPath(iminus)|p|=t+1

branching(p) = sumjirarr j

1di

sumpisinPath( jminus)|p|=t

branching(p) = sumjirarr j

1di

= 1

As a consequence to guarantee that the functional ranking is well-defined and normalized (ie that rankvalues sum to 1) we need

N

sumi=1

sumpisinPath(minusi)

damping(|p|) 1N

branching(p) = 1

that isinfin

sumt=0

damping(t)1N sum

pisinPath(minusminus)branching(p) = 1

UsingTheorem1 sumpisinPath(minusminus) branching(p) = N so the latter equality is equivalent to

infin

sumt=0

damping(t) = 1

Hence every choice of the damping function such thatsuminfint=0damping(t) = 1 yields a well-defined nor-

malized functional ranking Nonetheless not all choices are equivalent so we have to find out whichfunctions generate better rankings Since a direct link should be more valuable as a source of evidence thana distant link we focus on damping functions that are decreasing ont the length of the paths

31 Linear damping

Letrsquos start by considering a simple damping function such as

damping(t) =

2(Lminust)L(L+1) t lt L

0 t ge L

4

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 3: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

bull suggest why 085 is a good choice of the parameter for PageRank and

bull compare the ranking orders induced by different ranking functions finding ways of approximatingPageRank up to very high precision

The rest of this paper is organized as followsSection2 introduces the notion of functional rankingSec-tion 3 describes three damping functionsSection4 compares them analytically andSection5 experimentswith different parameters for each function Finallythe last sectionpresents our conclusions

2 Functional Rankings

In this section we introduce the notion offunctional ranking a general family of ranking functions thatincludes PageRank To describe PageRank formally consider a web graph ofN pages LetANtimesN be thelink matrix in this graphai j = 1 iff there is a link from pagei to pagej This link matrix is seldom used asit is mainly for two reasons

Normalization In the Web creating an out-link is free so there is an incentive for web page authors tocreate pages with many out-links this is the reason why a metaphor of ldquovotingrdquo is enforced [Lif00]in which each page has only one ldquovoterdquo that has to be split among its linked pages This is typicallydone in link-based ranking by normalizingA row-wise the normalization process means that everyweb page can only decide how to divide its own score among the pages it leads to but it cannot assignmore score than it has Another way to look at normalization is that the matrix is turned into thetransition matrix of a stochastic process

The normalization does not need to give each out-link the same value as there is evidence that weblinks have different purposes such as navigating in a multi-page set expanding the contents of the cur-rent page pointing to another resource etc [HG98] Also links within the same site can be consideredself-links and as such do not confer as much authority as a link between different sites indeed thereare ranking methods like BHITS [BH98] or Altavistarsquos Eigenrank that treat them differently Othercharacteristics of links such as if they appear at the beginning or the bottom of the page or if theyappear within a certain HTML element can also be used for non-uniform normalization [BYD04]

We will assume uniform normalization so if a page hasd out-links each of those links has a weightof 1d but the results of this paper can be applied to other forms of normalization

Dangling nodesSpecial attention should be paid to the possible presence of nodes with no outgoing arcs(also known as ldquodangling nodesrdquo) in fact dangling nodes fail to produce a row-stochastic matrixbecause the rows of dangling nodes are filled with zeroes Dangling nodes can be dealt with byadding an extra node that is linked to and from all other nodes or by introducing new arcs from eachdangling node to every node in the graph In our analysis we shall assume that all dangling nodeshave been eliminated already in some way so that we do not have to worry about their presence Allthe algorithms we will present can be modified so that dangling nodes can be dealt with explicitly andwith virtually no additional cost

Let P be the row-normalized link matrix of the graph withN nodes PageRankr(α) is defined as thestationary distribution of the matrix

αP+(1minusα)1Tv

whereα isin [01) is a parameter calleddamping factor(sometimes also called a dampening factor) andv isa fixedpreference vectorthat may represent the interests of a particular user or another ranking vector that

2

is used for weighting pages Note that the above matrix is ergodic (at least if every entry ofv is strictlypositive) so it has exactly one stationary distribution Even though most of our results can be easily restatedwith a non-uniform preference vectorv for the sake of clarity we shall only consider the uniform preference1N in the rest of the paper

As observed in [BSV05] the PageRank vectorr(α) can be written as

r(α) = (1minusα)infin

sumt=0

αt 1N

1Pt

Or in matricial form

r(α) = (1minusα)1N

1(I minusαP)minus1 ||αP||lt 1

There is an equivalent and actually very intriguing way of rewriting this formula mentioned in [NSW01]that leads to a conclusion similar to those of [Bri06] given a pathp= 〈x1x2 xk〉 in the graph we defineits branching contributionas follows

branching(p) =1

d1d2 middot middot middotdkminus1

whered j is the outdegree this is the number of outgoing arcs of nodex j Then the ranking of nodei according to PageRank is

r i(α) = sumpisinPath(minusi)

(1minusα)α|p|

Nbranching(p)

where Path(minus i) is the set of all paths into nodei and |p| is the length of pathp this is because(Pt)i j

contains the sum of the branching contributions of all paths of lengtht from i to j as one can easily show byinduction ont This way of expressing the PageRank of a node is interesting because it highlights the factthat the rank of a node is essentially obtained as a weighted sum of contributions coming from every pathentering into the node with weights that decay exponentially in the length of the path

A natural generalization of this idea consists in taking into consideration a rankingR of the generalform

R =infin

sumt=0

damping(t)1N

1middotPt

or equivalently

Ri = sumpisinPath(minusi)

damping(|p|) 1N

branching(p)

where the damping function is a suitable choice of weights We will refer to this form of ranking asfunc-tional ranking because it is parametrized by the damping function As we have seen generic PageRank isa functional ranking where the damping function

damping(t) = (1minusα)αt

decays exponentially fast

3

3 Damping Functions

In this section we show several functional rankings by describing their damping functions First we showwhich class of damping functions generate well-defined functional rankings

As shown in [Bri06 Corollary 24] for every pair of nodesi and j and for every lengtht

sumpisinPath(i j)|p|=t

branching(p)le 1

A more general property holds

Theorem 1 For every node i and every length t

sumpisinPath(iminus)|p|=t

branching(p) = 1

Proof By induction ont For t = 0 there is only one path fromi of length 0 and its branching is 1 Forthe inductive stepsumpisinPath(iminus)|p|=t+1branching(p) can be rewritten by observing that ifi has outdegreedi every path fromi of lengtht +1 is the concatenation ofi with a path of lengtht from an out-neighbor ofi

sumpisinPath(iminus)|p|=t+1

branching(p) = sumjirarr j

1di

sumpisinPath( jminus)|p|=t

branching(p) = sumjirarr j

1di

= 1

As a consequence to guarantee that the functional ranking is well-defined and normalized (ie that rankvalues sum to 1) we need

N

sumi=1

sumpisinPath(minusi)

damping(|p|) 1N

branching(p) = 1

that isinfin

sumt=0

damping(t)1N sum

pisinPath(minusminus)branching(p) = 1

UsingTheorem1 sumpisinPath(minusminus) branching(p) = N so the latter equality is equivalent to

infin

sumt=0

damping(t) = 1

Hence every choice of the damping function such thatsuminfint=0damping(t) = 1 yields a well-defined nor-

malized functional ranking Nonetheless not all choices are equivalent so we have to find out whichfunctions generate better rankings Since a direct link should be more valuable as a source of evidence thana distant link we focus on damping functions that are decreasing ont the length of the paths

31 Linear damping

Letrsquos start by considering a simple damping function such as

damping(t) =

2(Lminust)L(L+1) t lt L

0 t ge L

4

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 4: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

is used for weighting pages Note that the above matrix is ergodic (at least if every entry ofv is strictlypositive) so it has exactly one stationary distribution Even though most of our results can be easily restatedwith a non-uniform preference vectorv for the sake of clarity we shall only consider the uniform preference1N in the rest of the paper

As observed in [BSV05] the PageRank vectorr(α) can be written as

r(α) = (1minusα)infin

sumt=0

αt 1N

1Pt

Or in matricial form

r(α) = (1minusα)1N

1(I minusαP)minus1 ||αP||lt 1

There is an equivalent and actually very intriguing way of rewriting this formula mentioned in [NSW01]that leads to a conclusion similar to those of [Bri06] given a pathp= 〈x1x2 xk〉 in the graph we defineits branching contributionas follows

branching(p) =1

d1d2 middot middot middotdkminus1

whered j is the outdegree this is the number of outgoing arcs of nodex j Then the ranking of nodei according to PageRank is

r i(α) = sumpisinPath(minusi)

(1minusα)α|p|

Nbranching(p)

where Path(minus i) is the set of all paths into nodei and |p| is the length of pathp this is because(Pt)i j

contains the sum of the branching contributions of all paths of lengtht from i to j as one can easily show byinduction ont This way of expressing the PageRank of a node is interesting because it highlights the factthat the rank of a node is essentially obtained as a weighted sum of contributions coming from every pathentering into the node with weights that decay exponentially in the length of the path

A natural generalization of this idea consists in taking into consideration a rankingR of the generalform

R =infin

sumt=0

damping(t)1N

1middotPt

or equivalently

Ri = sumpisinPath(minusi)

damping(|p|) 1N

branching(p)

where the damping function is a suitable choice of weights We will refer to this form of ranking asfunc-tional ranking because it is parametrized by the damping function As we have seen generic PageRank isa functional ranking where the damping function

damping(t) = (1minusα)αt

decays exponentially fast

3

3 Damping Functions

In this section we show several functional rankings by describing their damping functions First we showwhich class of damping functions generate well-defined functional rankings

As shown in [Bri06 Corollary 24] for every pair of nodesi and j and for every lengtht

sumpisinPath(i j)|p|=t

branching(p)le 1

A more general property holds

Theorem 1 For every node i and every length t

sumpisinPath(iminus)|p|=t

branching(p) = 1

Proof By induction ont For t = 0 there is only one path fromi of length 0 and its branching is 1 Forthe inductive stepsumpisinPath(iminus)|p|=t+1branching(p) can be rewritten by observing that ifi has outdegreedi every path fromi of lengtht +1 is the concatenation ofi with a path of lengtht from an out-neighbor ofi

sumpisinPath(iminus)|p|=t+1

branching(p) = sumjirarr j

1di

sumpisinPath( jminus)|p|=t

branching(p) = sumjirarr j

1di

= 1

As a consequence to guarantee that the functional ranking is well-defined and normalized (ie that rankvalues sum to 1) we need

N

sumi=1

sumpisinPath(minusi)

damping(|p|) 1N

branching(p) = 1

that isinfin

sumt=0

damping(t)1N sum

pisinPath(minusminus)branching(p) = 1

UsingTheorem1 sumpisinPath(minusminus) branching(p) = N so the latter equality is equivalent to

infin

sumt=0

damping(t) = 1

Hence every choice of the damping function such thatsuminfint=0damping(t) = 1 yields a well-defined nor-

malized functional ranking Nonetheless not all choices are equivalent so we have to find out whichfunctions generate better rankings Since a direct link should be more valuable as a source of evidence thana distant link we focus on damping functions that are decreasing ont the length of the paths

31 Linear damping

Letrsquos start by considering a simple damping function such as

damping(t) =

2(Lminust)L(L+1) t lt L

0 t ge L

4

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 5: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

3 Damping Functions

In this section we show several functional rankings by describing their damping functions First we showwhich class of damping functions generate well-defined functional rankings

As shown in [Bri06 Corollary 24] for every pair of nodesi and j and for every lengtht

sumpisinPath(i j)|p|=t

branching(p)le 1

A more general property holds

Theorem 1 For every node i and every length t

sumpisinPath(iminus)|p|=t

branching(p) = 1

Proof By induction ont For t = 0 there is only one path fromi of length 0 and its branching is 1 Forthe inductive stepsumpisinPath(iminus)|p|=t+1branching(p) can be rewritten by observing that ifi has outdegreedi every path fromi of lengtht +1 is the concatenation ofi with a path of lengtht from an out-neighbor ofi

sumpisinPath(iminus)|p|=t+1

branching(p) = sumjirarr j

1di

sumpisinPath( jminus)|p|=t

branching(p) = sumjirarr j

1di

= 1

As a consequence to guarantee that the functional ranking is well-defined and normalized (ie that rankvalues sum to 1) we need

N

sumi=1

sumpisinPath(minusi)

damping(|p|) 1N

branching(p) = 1

that isinfin

sumt=0

damping(t)1N sum

pisinPath(minusminus)branching(p) = 1

UsingTheorem1 sumpisinPath(minusminus) branching(p) = N so the latter equality is equivalent to

infin

sumt=0

damping(t) = 1

Hence every choice of the damping function such thatsuminfint=0damping(t) = 1 yields a well-defined nor-

malized functional ranking Nonetheless not all choices are equivalent so we have to find out whichfunctions generate better rankings Since a direct link should be more valuable as a source of evidence thana distant link we focus on damping functions that are decreasing ont the length of the paths

31 Linear damping

Letrsquos start by considering a simple damping function such as

damping(t) =

2(Lminust)L(L+1) t lt L

0 t ge L

4

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 6: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

this is a damping function that decreases linearly with distance and reaches zero at distanceL The trivialcaseL = 1 gives a uniform ranking andL = 2 is ranking by indegree as in the latter case all paths of lengthge 2 are not considered

From the definition

R =infin

sumt=0

damping(t)vPt

=L

sumt=0

2(Lminus t)L(L+1)

vPt

=2

L(L+1)v

Lminus1

sumt=0

(Lminus t)Pt

=2

L(L+1)v(L(I minusP)minusP(I minusPL))((I minusP)2)minus1

provided that(I minusP)2 is not singularAn advantage of this type of ranking is that only the first few levels are considered so the computation

is fast and the number of iterations is fixed

Computation For computing this functional ranking we can define the following sequence

R(0) =2

L+1v

R(k+1) =(Lminuskminus1)

(Lminusk)RkP

The functional ranking with linear damping issumLminus1k=0 R(k) An algorithm for computing this ranking

shown inFigure1 arises directly from this summation

32 Exponential damping PageRank

As we already noted PageRank can be seen as a functional ranking where the damping function decaysexponentially

damping(t) = (1minusα)αt

As longer paths have less importance in the calculation of PageRank it could be approximated by usingonly a few levels of links In [CGS04] it is shown that by using only the nodes at distance 1 from a targetnode (equivalent to linear damping withL = 2) PageRank can be approximated with 30 of average errorUsing nodes at distance 2 the average error drops to 20 and at distance 3 to 10 After that there are nosignificant improvements by adding more levels and the cost (the number of nodes to be explored) is muchhigher

Computation Since PageRank is the principal eigenvector of the modified graph matrix it can be easilyapproximated by the iterative Power Method algorithm as suggested by Pageet al in their original pa-per [PBMW98] this iterative algorithm gives good approximations (both in norm and with respect to theinduced node order) in few iterations even though convergence speed and numerical stability decay whenαgets close to 1 [HK03b HK03a] Other methods to compute PageRank have been proposed some of them

5

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 7: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

Require L maximum path length N number of nodesv preference vector1 for i 1 N do Initialization2 S[i] larr R[i] larr 2v[i](L+1)3 end for4 for step 1 Lminus1 do Iteration step5 Auxlarr 06 for i 1 N do Follow links in the graph7 for all j such that there is a link fromi to j do8 Aux[j] larr Aux[j] + R[i]outdegree(i)9 end for

10 end for11 for i 1 N do Add to ranking value12 R[i] larr Aux[i] times (Lminusstep)

(Lminus(stepminus1))13 S[i] larr S[i] + R[i]14 end for15 end for16 return R

Figure 1 Algorithm for computing a functional ranking with linear damping

using techniques for the solution of systems of linear equations some other concentrating on some specificfeatures of the web as a graph that determine forms of locality in the computation of PageRank (see forexample [PBMW98 Hav99 GG04 LGZ04 KHMG03b KHMG03a])

33 Quadratic hyperbolic damping TotalRank

Recently a ranking method called TotalRank [Bol05] has been proposed The method aims at eliminatingthe necessity for an arbitrary parameter by integrating PageRank over the entire range ofα If r(α) is thevector of PageRank then TotalRank is defined as

T =int 1

0r(α)dα

T can be written as

int 1

0r(α)dα =

1N

infin

sumt=0

int 1

0(1minusα)αt1middotPtdα

=1N

infin

sumt=0

1(t +1)(t +2)

1middotPt

By using the definition of the logarithm of a matrix

ln(I minusP) =minusinfin

sumk=1

Pk

k

we can write TotalRank as

T = Pminus1(I +(I minusPminus1) ln(I minusP))

6

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 8: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

provided thatPminus1 is not singular andP 6= I TotalRank is as a weighted sum of the score associated with paths of varying lengths in which the

weights are hyperbolically decreasing on the lengths of the paths In other words TotalRank is a functionalranking with damping function

damping(t) =1

(t +1)(t +2)

and it is well defined assuminfint=0damping(t) = 1

Computation It is known that the cost of calculating TotalRank is the same as the cost of calculatingPageRank via the Power Method [BSV05] even though some more iterations are required to obtain thesame precision

34 General hyperbolic damping HyperRank

TotalRank is part of a more general family of weighting schemes for paths of different lengths that can beapproximated using

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β 1middotPt

Again this way of ranking follows the general scheme with damping function chosen as

damping(t) =1

ζ(β)(t +1)β

Here we are using Riemannrsquos zeta functionζ(β) = suminfint=1

1tβ for normalization and we needβ gt 1 for it

to converge Note that whenβ = 2 we get weights similar to those of TotalRank in which thet-th coefficientis 1(t +1)(t +2) whereas here it is 1ζ(2)(t +1)2

A meaningful choice forβ should be done considering the distribution of paths of different lengths in ascale-free graph A largeα in PageRank or a smallβ in HyperRank means increasing the effect of longerpaths in the score

Computation Figure2 shows an algorithm for approximating HyperRank Let us define a vector se-quenceR(t) as follows

R(0) =1

Nζ(β)

R(k+1) =(

k+1k+2

)βR(k)P

It is easy to see thatsuminfint=0R(k) = s(β) becauseR(k) = 1(N middotζ(β)(k+1)β))1middotPk this observation leads

to the algorithm Note that convergence speed is much slower than ordinary PageRank especially whenβis close to 1 the norm of thek-th summand being bound by 1(1+ 1k)β Interestingly enough thoughconvergence speed is reasonable ifβ is sufficiently large

7

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 9: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

Require N number of nodesv preference vectorβ damping parameter1 for i 1 N do Initialization2 S[i] larr R[i] larr v[i]ζ(β)3 end for4 klarr 05 while not convergeddo Iteration step6 klarr k + 17 Auxlarr 08 for i 1 N do Follow links in the graph9 for all j such that there is a link fromi to j do

10 Aux[j] larr Aux[j] + R[i]outdegree(i)11 end for12 end for13 for i 1 N do Add to ranking value14 R[i] larr Aux[i] times(k(k+1))β

15 S[i] larr S[i] + R[i]16 end for17 end while18 return S

Figure 2 An algorithm to compute general hyperbolic rank

35 An empirical damping

An empirical damping function would consider how much the value of an endorsement decreases by fol-lowing longer paths in the real web graph This cannot be known exactly but we can attempt to measureit indirectly Pages that link to each other are more similar than pages chosen at random [Dav00] evi-dence from topical crawlers [SPM05] shows that when doing breadth-first exploring the topic ldquodriftsrdquo asthe distance increases On the same line of though we propose to use the decrease of text similarity as anapproximation to an ldquoempiricalrdquo damping function

To find out which is the correlation between link-distance and similarity we performed the followingexperiment we considered a web graph corresponding to a partial snapshot of theuk domain and sampled200 nodes at random For each sampled node we followed links backwards to obtain nodes at a minimumdistance of 1 2 3 4 or 5 links Then we sampled 12000 pairs at each minimum distance at randomand computed their similarities with the original nodes Similarity was measured using the normalization ofTFIDF [BYRN99] without stemming nor stopwords removal

The resulting averages are shown inFigure3 with standard deviation error bars Text similarity clearlydecreases with distance and in some applications the empirical distribution of text similarity versus distancecould be used as an ldquoempiricalrdquo damping function Different measures of text similarity can yield differentdistributions for instance [WHAT04] uses the number of repeated words and phrases between pages andobtains a faster decrease in similarity

8

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 10: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

02

03

04

05

06

07

1 2 3 4 5

Ave

rage

text

sim

ilari

ty

Link distance

Figure 3 Link distance vs average text similarity

4 Comparing Damping Functions

A comparison of the damping functions described in the previous section is shown inFigure4 of coursehyperbolic damping functions decay asymptotically more slowly than exponential damping but notice thatfor short paths the latter may dominate the former in many cases

000

005

010

015

020

025

1 2 3 4 5 6 7

Wei

ght

Length of the path (t)

PageRank α=075PageRank α=050

TotalRankHyperRank β=2HyperRank β=3

Linear damping L=10

Figure 4 Weights given by the different damping functions for some values ofαandβ

In this section we aim at analyzing how similar are these functional rankings and how could we useone of the damping functions to approximate another with a suitable choice of parameters

9

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 11: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

41 Approximating TotalRank with PageRank

It has been shown experimentally that the rank correlation (Kendallrsquosτ) between TotalRank and PageRankis maximal whenα asymp 07 [Bol05] the maximum value forτ is over 095 meaning that for that specificchoice ofα PageRank and TotalRank induce almost equivalent ranking orders

We now want to approach the same problem in an analytic fashion more precisely we aim at study-ing the difference between TotalRank and PageRank by calculating the difference between their respectivedamping functions

dampingTotalRank(t) =1

(t +1)(t +2)dampingPageRank(α)(t) = (1minusα)αt

As they are normalized both damping functions have the same summation over the entire range oftOur approach is to consider the summation of their differences up to a maximum length for a path As thetwo functions are decreasing the difference in the first levels makes most of the difference in the rankingsIf ` is the maximum path length we are interested in we aim at minimizing this sum

`

sumt=0

(1

(t +1)(t +2)minus (1minusα)αt

)= α`+1minus 1

`+2

The minimum absolute value is 0 and it is obtained whenα is equal to

αlowast(`) =1

`+1radic

`+2= 1minus log`

`+O

(log2`

`2

)

Figure5 showsαlowast(`) as a function of Recall that for the World-Wide Web graph the average lengthof a path between two nodes when a path exists has been estimated in about 16 [BKM+00] or 19 [AJB99]but clearly today is over 20 Now in the range of path lengths between 15 and 20 the value ofαlowast(`)parameters that minimizes the difference between the exponentially decaying weights of PageRank and thehyperbolically decaying weights of TotalRank is roughly 085

Difference Optimal value

0506

0708

09

α0 5 10 15 20 25

Length

-02

-01

0

01

02

03

04

045

050

055

060

065

070

075

080

085

090

0 5 10 15 20 25

α that

min

imiz

es th

e di

ffer

ence

of

sum

of w

eigh

ts w

ith T

otal

Ran

k

Maximum length of the paths considered

Figure 5 Bestαlowast(`) for minimizing the difference of the sum of weights betweenPageRank and TotalRank

10

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 12: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

42 Approximating HyperRank with PageRank

Now we want to approximate the weights of

s(β) =1

Nζ(β)

infin

sumt=0

1

(t +1)β Pt

using the weights of

r(α) =1minusα

N

infin

sumt=0

αtPt

and we proceed again by considering paths up to a certain length

`

sumt=0

(1

ζ(β)(t +1)β minus (1minusα)αt

)The minimum can be zero and it is attained at

αlowast(`β) =

radic1minus 1

ζ(β)

`

sumt=0

1

(t +1)β

Theα that minimizes the difference of weights for different values ofβ and of the maximum path lengths` is shown inFigure6 In the case ofβ = 2 for instance for path lengths up to 10 to 20 the bestα is between075 and 085

α

15202530β 5 10 15 20 25

Length

03

04

05

06

07

08

09

1

00

01

02

03

04

05

06

07

08

09

10

1 15 2 25 3 35 4

α th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent β

Max path length=25Max path length=20Max path length=15Max path length=10

Max path length=5

Figure 6 Bestα for minimizing the difference of the sum of weights betweenPageRank and HyperRank with exponentβ for various path lengths

11

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 13: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

43 Approximating PageRank with LinearRank

For approximating the damping function of PageRank with the damping function of LinearRank we con-sider the summation of the differences up to a certain path length If`le L

`

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)And if ` gt L

Lminus1

sumt=0

((1minusα)αt minus 2(Lminus t)

L(L+1)

)+

`

sumt=L

(1minusα)αt

We will assume thatle L so the evaluation of the difference between the two rankings is done in an areain which both rankings have non-zero values TheL that minimizes the difference for a given combinationof α and` is

Llowast(α `) = `+2`α`+1 +α`+1 +1+

radic(1+α`+1)2 +4`(`+2)α`+1

2(1minusα`+1)

= `+1+O(`α(`+1)2

)and we have plotted it for different values ofα and` in Figure7

L

0405

0607

0809

α5 10 15 20 25

Length

5

10

15

20

25

30

5

10

15

20

25

30

35

40

05 06 07 08 09

L th

at m

inim

izes

the

diff

eren

ce o

fsu

m o

f wei

ghts

Exponent α

Max path length=25Max path length=20Max path length=15Max path length=10Max path length=5

Figure 7 BestL for minimizing the difference of the sum of weights betweenLinearRank and LinearRank for various path lengths

12

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 14: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

5 Parameters for the Damping Functions

For our experiments we used several snapshots from the Web including theuk it andeuint domainsFor comparison we also considered a synthetic scale-free network produced according to the evolving modeldescribed by Kumaret al [KRR+00] (a combination of preferential attachment and random links) with theparameters suggested by Panduranganet al [PRU02] As far as the latter is concerned in the generatedgraph the exponents for the power-law in the center part of the distributions are -21 for in-degree andPageRank and -27 for out-degree we generated a 100000-nodes graph without disconnected nodes

In this section we study the behavior of the ranking functions for varying values of their parameters

51 Characteristic path lengths

In scale-free networks the distances between pairs of nodes follow a Gaussian distribution [AJB99] (theaverage is not given in their paper) Analytic estimations for the average distance of a graph of scale-freenetwork ofn nodes include

bull O(log(n)) [WS98]

bull O(log(n) log(np)) in sparse graphs withp links [CL01]

bull 1+ log(nz1) log(z2z1) wherez1 is the average indegree andz2 is the average number of nodes atdistance 2 [NSW01] and

bull O(log(n) log(log(n))) [BR04]

We did the following experiment starting from a node picked at random we followed the links back-wards and counted the number of nodes at different distancesFigure8 plots the average distances foundwhich appear to be growing (sub)logarithmically with the size of the graphFigure9 shows the distributionobtained in each sample For this experiment we are not counting the pages without in-links

3456789

10111213141516

105 106 107 108

Ave

rage

dis

tanc

e

Number of nodes (N)

Figure 8 Average distances versus number of nodes

If graphs of different sizes show different path lengths what is the effect of this in the ranking calcu-lation Letrsquos suppose that for a graph withN1 nodes it is found by experimental or analytic means that agood parameter for PageRank isαlowast1 Now we would like to have a good parameterαlowast2 for a graph with thesame properties except that the size of the new graph isN2 lt N1

13

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 15: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

it Web graph uk Web graph40 mill nodes 18 mill nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 149 Avg distance 148

euint Web graph Synthetic graph800000 nodes 100000 nodes

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

000

005

010

015

020

025

030

5 10 15 20 25 30

Freq

uenc

y

Distance

Avg distance 100 Avg distance 42

Figure 9 Distribution of the average number of nodes at a certain distance froma given node

One possible approach consistent with what we have done so far is to consider that the sum of theweights up to the average path lengths of the graphs (L1 L2) have to be similar for both rankings to behavein a similar way If we take this approach the solution is

1minus (αlowast1)L1+1 = 1minus (αlowast2)

L2+1

αlowast2 = (αlowast1)L1+1L2+1

αlowast2 asymp (αlowast1)log(N1)log(N2)

An example that can be used in practice is the following letrsquos consider a web graph withN1 = 115times109

pages (the size of the full Web estimated by [GS05]) and another graph with onlyN2 = 50times106 pages (thesize of the Web of a large country) the second graph is roughly 3 orders of magnitude smaller

If it is shown empirically thatαlowast1 = 085 is a good value for the PageRank parameter for the whole Webthenαlowast2 = 081 should have a similar behavior in the 50-million page set which is natural as the path lengthsare shorter If the subset of web pages were even smaller for instanceN2 = 106 pages (the size of the webof a large organization) thenαlowast2 = 076

14

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 16: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

52 Damping parameters and in-degree

In this section we are using data from theuk Web graph and a 8500-nodes synthetic graph with the sameindegree and outdegree distribution than the previous synthetic graph We first measured the variance ofthe values from the ranking function as we consider that a high variance is good in a ranking functionas the relative values differ more We also measured the relationship between the ranking function andin-degree for different values of the parameters in terms of covariance correlation coefficient and rankingorders (Kendallrsquosτ) The results are shown inFigure10

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00200e-13400e-13600e-13800e-13100e-12120e-12140e-12160e-12180e-12200e-12

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

00 times 100

50 times 10-5

10 times 10-4

15 times 10-4

20 times 10-4

25 times 10-4

30 times 10-4

35 times 10-4

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

07000710072007300740075007600770078007900800

01 02 03 04 05 06 07 08 09

Corr

elat

ion

coef

ficie

nt w

ith in

-deg

ree

Damping factor α

042

044

046

048

050

052

054

056

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

005

010

015

020

025

030

035

040

045

050

055

01 02 03 04 05 06 07 08 09

Var

ianc

e of

Pag

eRan

k

Damping factor α

0 times 100

1 times 10-3

2 times 10-3

3 times 10-3

4 times 10-3

5 times 10-3

6 times 10-3

01 02 03 04 05 06 07 08 09

Cov

aria

nce

betw

een

Page

Ran

k an

d in

-deg

ree

Damping factor α

0995

0996

0997

0998

0999

1000

01 02 03 04 05 06 07 08 09

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor α

079

080

081

082

083

084

085

086

087

01 02 03 04 05 06 07 08 09Ken

dallrsquo

s τ b

etw

een

Page

Rank

and

in-d

egre

e

Damping factor α

Figure 10 Variance of PageRank and its relationship with indegree in terms ofcovariance correlation coefficient and Kendallrsquosτ coefficient for varying valuesof the parameterα Top uk web graph bottom synthetic graph

The variance is higher asα increases In regard to the relationship with indegree for company homepages it has been shown that the logarithm of the in-degree is correlated with PageRank [UCH03] Ourresults are consistent with this observation The covariance monotonically increases withα both in the caseof the synthetic graph and in the web graph Not surprisingly using the generative model the correlationsare higher We observe a maximum correlation atα = 07 in the synthetic graph and atα = 05 in the webgraph We also notice that the correlation drops significativelly asα increases because a largeα means thatlonger paths have an effect in the calculation note however that this phenomenon does not significantlyimpact on the correlation coefficient that is still very large

A high correlation between PageRank and in-degree is bad from the point of view of a search enginebecause it makes link-spam easier In particular as the correlation coefficient is higher in theuk web graphnear 05 if we chooseα close to this value we are helping link spammers Note however that a highcorrelation was foreseeable because as shown in[CGS04] even approximating PageRank with just only 1level of links gets 70 of accuracy

The behavior of the Kendallrsquosτ coefficient which measures the similarity between ranking orders isthe opposite than the one observed in the real graph This also happens for HyperRank as inFigure11we made the same measurements for this functional ranking and the results were consistent The graphseems inverted because a low value ofβ has the same effect as a high value ofα longer paths have moreimportance in the calculation

The differences in the behavior of the ranking order in the synthetic graph might be explained by thefact that the generative model we are using does not capture some properties that might be relevant for theranking order under a functional ranking such as the clustering coefficient Also the synthetic graph is

15

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 17: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

Variance Covariance Correlation coefficient Kendallrsquosτ

000e+00

100e-12

200e-12

300e-12

400e-12

500e-12

600e-12

700e-12

800e-12

900e-12

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

100e-04

200e-04

300e-04

400e-04

500e-04

600e-04

700e-04

800e-04

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0730

0740

0750

0760

0770

0780

0790

0800

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

04600

04700

04800

04900

05000

05100

05200

05300

1 15 2 25 3

Ken

dallrsquo

s τ

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

000e+00

200e-05

400e-05

600e-05

800e-05

100e-04

120e-04

140e-04

160e-04

1 15 2 25 3 35 4

Var

ianc

e of

Gen

eral

Hyp

erbo

lic R

ank

Damping factor β

000e+00

200e-03

400e-03

600e-03

800e-03

100e-02

120e-02

1 15 2 25 3 35 4Cov

aria

nce

betw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

0997

0997

0998

0998

0999

0999

1 15 2 25 3 35 4

Cor

rela

tion

coef

fici

ent w

ith in

-deg

ree

Damping factor β

0815008200082500830008350084000845008500085500860008650

1 15 2 25 3 35 4

Ken

dallrsquo

s τ b

etw

een

Gen

eral

Hyp

erbo

lic R

ank

and

in-d

egre

e

Damping factor β

Figure 11 Variance of general hyperbolic rank for some values ofβ and itsrelation with in-degree Experiments have been performed on theuk graph (top)and synthetic graph (bottom)

assortative (highly linked pages are mostly linked to other highly linked pages) while the real web graph isdisassortative (most of the neighbours of highly linked pages have small indegree)

53 Experimental comparison of ranking orders

In this section we present experimental results about the similarity between the ranking orders induced bysome of the functional rankings discussed in the previous sections To perform the comparison we usedKendallrsquosτ and data from the U K Web graphFigure12shows how PageRank compares with HyperRankfor various pairs ofα andβ In the limit αβrarr 1 both rankings are equivalent and they remain similar in alarge region of the parameter space

α

β

07

08

09

1

τ ge 095

τ

02 03

04 05

06 07

08 09

2 25

3 35

4

01

02

03

04

05

06

07

08

09

15 2 25 3 35 4

α th

at m

axim

izes

Ken

dallrsquo

s τ

Exponent β

Actual optimumPredicted optimum with length=5

Figure 12 Comparison (using Kendallrsquosτ) between PageRank and HyperRankwith various damping parameters in the UK web graph The optimum predictedin the analysis with = 5 is very close to the real one

In the figure we can see that the rankings obtained with HyperRank and PageRank can be almostequivalent (Kendallrsquosτ ge 095) moreover the analysis shown insection42 considering only paths of

16

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 18: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

lengths less than 5 provides a very good approximation for the optimum combination of parameters Thismeans that in fact the difference in the damping functions in the first few levels is crucial

The exponentsβ required for giving a good approximation of PageRank are very small whenα ge 07limiting the practical applicability of HyperRank as its convergence is not faster than the one of PageRank

As far as LinearRank and PageRank are concerned long paths and largeα should be considered toobtain a sufficiently similar ranking as shown inFigure13

05 06

07 08

09

5 10

15 20

25

080085090095100

τ

τ ge 095

αL

5

10

15

20

25

05 06 07 08 09L

that

max

imiz

es K

enda

llrsquos

τExponent α

Actual optimumPredicted optimum with length=5

Figure 13 Comparison (using Kendallrsquosτ) between PageRank and LinearRankin the UK web graph with various damping parameters Again the predictedoptimum with` = 5 is very close to the actual optimum

The predicted optimum given insection43 with ` = 5 this is considering only the summation of thedifferences between both damping functions up to paths of length 5 is very close to what was obtained inpractice Forα = 08 calculating LinearRank withL = 10 (which means the same number of iterations)givesτ ge 098 for α = 09 calculating LinearRank withL = 15 also givesτ ge 098 In both cases theranking order of PageRank is approximated by the ranking order of LinearRank with very high precision

6 Conclusions

In this paper we have defined a broad class of link-based ranking algorithms based on the contribution ofdamping factors along all different paths reaching a page We introduce four particular damping decays lin-ear exponential quadratic hyperbolic and general hyperbolic where exponential is equivalent to PageRankand quadratic hyperbolic to TotalRank

We studied the differences and similarities among these ranking algorithms and we have found that

bull Functional rankings using different damping functions can provide similar orderings if the parametersare chosen carefully

bull LinearRank can be used for calculating a ranking that is as good as PageRank but with a fixed andsmaller number of iterations

bull The parameters for the damping functions depend on the characteristic of path lengths in the graphwhich is known to grow sub-logarithmically on the size of the graph

More work is needed to find other damping functions that compute rankings similar to PageRank butare easier and faster to compute We use global ranking similarity but another measure could be the ranking

17

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 19: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

similarity in the top 20 results of real queries In this setting our results can change so future work willinclude this variation

Because of their high cost link-based ranking methods that involve iterative calculations at query timeare probably not used by large-scale search engines at this moment but the functional ranking with lineardamping we have presented can provide a good approximation with few iterations Also the approach wehave presented could be also applied to multivalued ranking functions such as HITS [Kle99] and topic-sensitive PageRank [Hav02] to obtain for instance a method for approximating the hubs and authorityscores using less iterations and a linear damping function

Our approach also helps to understand how easy or difficult is to collude many pages to modify theranking of a given page Clearly there are many different factors path lengths damping function branchingdegrees and number of colluded pages The graph structure of the collusion will affect those factors and weplan to analyze them

Acknowledgements

We would like to thank Daniel Fogaras for a valuable discussion about TotalRank that motivated part of thisresearch

References

[AJB99] Reka Albert Hawoong Jeong and Albert L Barabasi Diameter of the world wide webNature 401130ndash131 1999

[BH98] Krishna Bharat and Monika R HenzingerImproved algorithms for topic distillation in ahyperlinked environment In Proceedings of the 21st Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval pages 104ndash111 MelbourneAustralia August 1998 ACM Press New York

[BKM +00] Andrei Broder Ravi Kumar Farzin Maghoul Prabhakar Raghavan Sridhar RajagopalanRaymie Stata Andrew Tomkins and Janet WienerGraph structure in the web Experimentsand models In Proceedings of the Ninth Conference on World Wide Web pages 309ndash320Amsterdam Netherlands May 2000 ACM Press

[Bol05] Paolo BoldiTotalrank ranking without damping In Poster proceedings of the 14th interna-tional conference on World Wide Web pages 898ndash899 Chiba Japan 2005 ACM Press

[BR04] Bela Bollobas and Oliver RiordanThe diameter of a scale-free random graph Combinator-ica 24(1)5ndash34 2004

[Bri06] Michael Brinkmeier PageRank revisited ACM Transaction on Internet Technologies6(3)257ndash279 2006

[BSV05] Paolo Boldi Massimo Santini and Sebastiano VignaPagerank as a function of the dampingfactor In Proceedings of the 14th international conference on World Wide Web pages 557ndash566 Chiba Japan 2005 ACM Press

[BYD04] Ricardo Baeza-Yates and Emilio DavisWeb page ranking using link attributes In Alternatetrack papers amp posters of the 13th international conference on World Wide Web pages 328ndash329 New York NY USA 2004 ACM Press

18

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 20: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-NetoModern Information Retrieval AddisonWesley May 1999

[CGS04] Yen-Yu Chen Qingqing Gan and Torsten SuelLocal methods for estimating pagerank val-ues In Proceedings of the thirteenth ACM conference on Information and knowledge man-agement (CIKM) pages 381ndash389 New York NY USA 2004 ACM Press

[CL01] Fan Chung and Linyuan LuThe diameter of random sparse graphs Adv Appl Math26257ndash279 2001

[Dav00] Brian D DavisonTopical locality in the web In Proceedings of the 23rd annual internationalACM SIGIR conference on research and development in information retrieval pages 272ndash279Athens Greece 2000 ACM Press

[GG04] Gene H Golub and Chen GreifArnoldi-type algorithms for computing stationary distributionvectors with application to pagerank Technical Report SCCM-04-15 Stanford University2004

[GS05] Antonio Gulli and Alessio SignoriniThe indexable Web is more than 115 billion pages InPoster proceedings of the 14th international conference on World Wide Web pages 902ndash903Chiba Japan 2005 ACM Press

[Hav99] Taher HaveliwalaEfficient computation of pagerank Technical report Stanford University1999

[Hav02] Taher H HaveliwalaTopic-sensitive pagerank In Proceedings of the Eleventh World WideWeb Conference pages 517ndash526 Honolulu Hawaii USA May 2002 ACM Press

[HG98] Stephanie W Haas and Erika S GramsPage and link classifications connecting diverseresources In DL rsquo98 Proceedings of the third ACM conference on Digital libraries pages99ndash107 New York NY USA 1998 ACM Press

[HK03a] Taher Haveliwala and Sepandar KamvarThe condition number of the pagerank problemTechnical Report 36 Stanford University 2003

[HK03b] Taher Haveliwala and Sepandar KamvarThe second eigenvalue of the google matrix Tech-nical Report 20 Stanford University 2003

[JM98] W-K Joo and S H Myaeng Improving retrieval effectiveness with hyperlink informationIn Proceedings of International Workshop on Information Retrieval with Asian Languages(IRAL) Singapore October 1998

[KHMG03a] S Kamvar T Haveliwala C Manning and G GolubExploiting the block structure of theweb for computing pagerank 2003

[KHMG03b] Sepandar D Kamvar Taher H Haveliwala Christopher D Manning and Gene H GolubExtrapolation methods for accelerating pagerank computations In Proceedings of the twelfthinternational conference on World Wide Web pages 261ndash270 ACM Press 2003

[Kle99] Jon M KleinbergAuthoritative sources in a hyperlinked environment Journal of the ACM46(5)604ndash632 1999

19

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions
Page 21: Universita degli Studi di Milanoboldi.di.unimi.it/download/TRdampingWithCover.pdf · Universita degli Studi di Milano` Italy Carlos Castillo Catedra Telef´ onica´ Depto. de Tecnolog´ıa

[KRR+00] R Kumar P Raghavan S Rajagopalan D Sivakumar A Tomkins and E UpfalStochasticmodels for the web graph In Proceedings of the 41st Annual Symposium on Foundations ofComputer Science (FOCS) pages 57ndash65 Redondo Beach CA USA 2000 IEEE CS Press

[LGZ04] Chris P Lee Gene H Golub and Stefanos A ZeniosA fast two-stage algorithm for com-puting pagerank and its extensions Technical report Stanford University 2004

[Li98] Yanhong LiToward a qualitative search engine IEEE Internet Computing July 1998

[Lif00] Maxim LifantsevVoting model for ranking Web pages In Peter Graham and MuthucumaruMaheswaran editorsProceedings of the International Conference on Internet Computingpages 143ndash148 Las Vegas Nevada USA June 2000 CSREA Press

[Mar97] Massimo MarchioriThe quest for correct information of the Web hyper search engines InProc of the sixth international conference on the Web Santa Clara USA April 1997

[NSW01] M E Newman S H Strogatz and D J WattsRandom graphs with arbitrary degree distri-butions and their applicationsPhys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2) August2001

[PBMW98] Lawrence Page Sergey Brin Rajeev Motwani and Terry WinogradThe PageRank citationranking bringing order to the Web Technical report Stanford Digital Library TechnologiesProject 1998

[PRU02] Gopal Pandurangan Prabhakara Raghavan and Eli UpfalUsing Pagerank to characterizeWeb structure In Proceedings of the 8th Annual International Computing and CombinatoricsConference (COCOON) volume 2387 ofLecture Notes in Computer Science pages 330ndash390Singapore August 2002 Springer

[SPM05] Padmini Srinivasan Gautam Pant and Filippo MenczerA general evaluation framework fortopical crawlers Information Retrieval 8(3)417ndash447 2005

[UCH03] Trystan Upstill Nick Craswell and David HawkingPredicting fame and fortune Page-rank or indegree In Proceedings of the Australasian Document Computing SymposiumADCS2003 pages 31ndash40 Canberra Australia December 2003

[WHAT04] Fang Wu Bernardo A Huberman Lada A Adamic and Joshua R TylerInformation flow insocial groups Physica A Statistical and Theoretical Physics 337(1-2)327ndash335 June 2004

[WS98] Duncan J Watts and Steven H StrogatzCollective dynamics of small-world networksNa-ture 393(6684)440ndash442 June 1998

20

  • Introduction
  • Functional Rankings
  • Damping Functions
    • Linear damping
    • Exponential damping PageRank
    • Quadratic hyperbolic damping TotalRank
    • General hyperbolic damping HyperRank
    • An empirical damping
      • Comparing Damping Functions
        • Approximating TotalRank with PageRank
        • Approximating HyperRank with PageRank
        • Approximating PageRank with LinearRank
          • Parameters for the Damping Functions
            • Characteristic path lengths
            • Damping parameters and in-degree
            • Experimental comparison of ranking orders
              • Conclusions