
arXiv:1002.2858v3 [cs.IR] 14 Aug 2010

PageRank: Standing on the shoulders of giants

Massimo Franceschet
Department of Mathematics and Computer Science
University of Udine
Via delle Scienze 206, 33100 Udine, Italy
[email protected]

Keywords

PageRank, Web information retrieval, Bibliometrics, Sociometry, Econometrics.

1. INTRODUCTION

PageRank [3] is a Web page ranking technique that has been a fundamental ingredient in the development and success of the Google search engine. The method is still one of the many signals that Google uses to determine which pages are most important.¹ The main idea behind PageRank is to determine the importance of a Web page in terms of the importance assigned to the pages hyperlinking to it. In fact, this thesis is not new, and has been previously successfully exploited in different contexts. We review the PageRank method and link it to some renowned previous techniques that we have found in the fields of Web information retrieval, bibliometrics, sociometry, and econometrics.

2. WEB INFORMATION RETRIEVAL

In 1945 Vannevar Bush wrote a now celebrated article in The Atlantic Monthly entitled "As We May Think", describing a futuristic device he called Memex [5]. Bush writes:

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the Memex and there amplified.

Bush's prediction came true in 1989, when Tim Berners-Lee proposed the Hypertext Markup Language (HTML) to keep track of experimental data at the European Organization for Nuclear Research (CERN). In the original far-sighted proposal, in which Berners-Lee attempts to persuade CERN management to adopt the new global hypertext system, we can read the following paragraph²:

¹ http://www.google.com/corporate/tech.html
² http://www.w3.org/History/1989/proposal.html

We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities. The aim would be to allow a place to be found for any information or reference which one felt was important, and a way of finding it afterwards. The result should be sufficiently attractive to use that the information contained would grow past a critical threshold.

As we all know, the proposal was accepted and later implemented in a mesh (this was the only name that Berners-Lee originally used to describe the Web) of interconnected documents that rapidly grew beyond the CERN threshold, as Berners-Lee anticipated, and became the World Wide Web.

Today, the Web is a huge, dynamic, self-organized, and hyperlinked data source, very different from traditional document collections which are nonlinked, mostly static, centrally collected and organized by specialists. These features make Web information retrieval quite different from traditional information retrieval and call for new search abilities, like automatic crawling and indexing of the Web. Moreover, early search engines ranked responses using only a content score, which measures the similarity between the page and the query. One simple example is just a count of the number of times the query words occur on the page, or perhaps a weighted count with more weight on title words. These traditional query-dependent techniques suffered under the gigantic size of the Web and the death grip of spammers.

In 1998, Sergey Brin and Larry Page revolutionised the field of Web information retrieval by introducing the notion of an importance score, which gauges the status of a page, independently from the user query, by analysing the topology of the Web graph. The method was implemented in the famous PageRank algorithm and both the traditional content score and the new importance score were efficiently combined in a new search engine named Google.

3. RANKING WEB PAGES USING PAGERANK

We briefly recall how the PageRank method works, keeping the mathematical machinery to the minimum. Interested readers can more thoroughly investigate the topic in a recent book of Langville and Meyer which elegantly describes the science of search engine rankings in a rigorous yet playful style [15].

We start by providing an intuitive interpretation of PageRank in terms of random walks on graphs [21]. The Web is viewed as a directed graph of pages connected by hyperlinks. A random surfer starts from an arbitrary page and simply keeps clicking on successive links at random, bouncing from page to page. The PageRank value of a page corresponds to the relative frequency that the random surfer visits that page, assuming that the surfer goes on infinitely. The more time spent by the random surfer on a page, the higher the PageRank importance of the page.

A little more formally, the method can be described as follows. Let us denote by $q_i$ the number of distinct outgoing (hyper)links of page $i$. Let $H = (h_{i,j})$ be a square matrix of size equal to the number $n$ of Web pages such that $h_{i,j} = 1/q_i$ if there exists a link from page $i$ to page $j$ and $h_{i,j} = 0$ otherwise. The value $h_{i,j}$ can be interpreted as the probability that the random surfer moves from page $i$ to page $j$ by clicking on one of the distinct links of page $i$. The PageRank $\pi_j$ of page $j$ is recursively defined as:

$$\pi_j = \sum_i \pi_i h_{i,j}$$

or, in matrix notation, $\pi = \pi H$. Hence, the PageRank of page $j$ is the sum of the PageRank scores of pages $i$ linking to $j$, weighted by the probability of going from $i$ to $j$. In words, the PageRank thesis reads as follows:

A Web page is important if it is pointed to by other important pages.

There are in fact three distinct factors that determine the PageRank of a page: (i) the number of links it receives, (ii) the link propensity, that is, the number of outgoing links, of the linking pages, and (iii) the PageRank of the linking pages. The first factor is not surprising: the more links a page receives, the more important it is perceived to be. Reasonably, the link value depreciates proportionally to the number of links given out by a page: endorsements coming from parsimonious pages are worthier than those emanating from spendthrift ones. Finally, not all pages are created equal: links from important pages are more valuable than those from obscure ones.

Unfortunately, this ideal model has two problems that prevent the solution of the system. The first one is due to the presence of dangling nodes, that is, pages with no forward links.³ These pages capture the random surfer indefinitely. Notice that a dangling node corresponds to a row in matrix $H$ with all entries equal to 0. To tackle the problem of dangling nodes, the corresponding rows in $H$ are replaced by the uniform probability vector $u = \frac{1}{n} e$, where $e$ is a vector of length $n$ with all components equal to 1. Alternatively, one may use any fixed probability vector in place of $u$. This means that the random surfer escapes from the dangling page by jumping to a randomly chosen page. We call $S$ the resulting matrix.

³ The term dangling refers to the fact that many dangling nodes are in fact pendent Web pages found by the crawling spiders but whose links have not yet been explored.

Figure 1: A PageRank instance with solution. Each node is labelled with its PageRank score. Scores have been normalized to sum to 100. We assumed $\alpha = 0.85$. [Graph not reproduced; node scores: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9, G 1.6, H 1.6, I 1.6, L 1.6, M 1.6.]
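The construction of $H$ and the dangling-node patch that yields $S$ can be sketched in a few lines of Python. This is only an illustration: the four-page graph and the function names `hyperlink_matrix` and `patch_dangling` are hypothetical and do not come from the paper, and dense lists are used for readability rather than efficiency.

```python
def hyperlink_matrix(links, n):
    """Build H, where H[i][j] = 1/q_i if page i links to page j, else 0."""
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = set(links.get(i, []))   # distinct outgoing links of page i
        for j in targets:
            H[i][j] = 1.0 / len(targets)  # q_i = len(targets)
    return H


def patch_dangling(H):
    """Build S: each all-zero (dangling) row of H becomes the uniform vector u."""
    n = len(H)
    return [row[:] if any(row) else [1.0 / n] * n for row in H]


# Hypothetical 4-page Web graph; page 3 has no forward links (dangling node).
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
H = hyperlink_matrix(links, 4)
S = patch_dangling(H)

for row in S:
    print(row)
```

In this toy graph only the last row changes: it goes from all zeros in $H$ to the uniform vector in $S$, so every row of $S$ sums to 1.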

The second problem with the ideal model is that the surfer can get trapped into a bucket of the Web graph, which is a reachable strongly connected component without outgoing edges towards the rest of the graph. The solution proposed by Brin and Page is to replace matrix $S$ by the Google matrix

$$G = \alpha S + (1 - \alpha) E$$

where $E$ is the teleportation matrix with identical rows each equal to the uniform probability vector $u$, and $\alpha$ is a free parameter of the algorithm often called the damping factor. Alternatively, a fixed personalization probability vector $v$ can be used in place of $u$. In particular, the personalization vector can be exploited to bias the result of the method towards certain topics. The interpretation of the new system is that, with probability $\alpha$ the random surfer moves forward by following links, and, with the complementary probability $1 - \alpha$, the surfer gets bored of following links and enters a new destination in the browser's URL line, possibly unrelated to the current page. The surfer is hence teleported, like a Star Trek character, to that page, even if there exists no link connecting the current and the destination pages in the Web universe. The inventors of PageRank propose to set the damping factor $\alpha = 0.85$, meaning that after about five link clicks the random surfer chooses a random page.

The PageRank vector $\pi$ is then defined as the solution of the equation:

$$\pi = \pi G \qquad (1)$$

An example is provided in Figure 1. Node A is a dangling node, while nodes B and C form a bucket. Notice the dynamics of the method: page C receives just one link but from the most important page B; its importance is much higher than that of page E, which receives many more links, but from anonymous pages. Pages G, H, I, L, and M do not receive endorsements; their scores correspond to the minimum amount of status of each page.

Typically, the normalization condition $\sum_i \pi_i = 1$ is also added. In this case Equation 1 becomes $\pi = \alpha \pi S + (1 - \alpha) u$. The latter distinguishes two factors contributing to the PageRank vector: an endogenous factor equal to $\pi S$, which takes into consideration the real topology of the Web graph, and an exogenous factor equal to the uniform probability vector $u$, which can be interpreted as a minimal amount of status assigned to each page independently of the hyperlink graph. The parameter $\alpha$ balances between these two factors.
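The step from Equation 1 to the normalized form can be spelled out as follows. The derivation treats $\pi$, $e$ and $u$ as row vectors, so that the teleportation matrix can be written as $E = e^{T} u$; this notation is an assumption made here for the sake of the derivation and is not used in the text.

```latex
\begin{align*}
\pi &= \pi G = \pi\bigl(\alpha S + (1-\alpha)E\bigr) \\
    &= \alpha\,\pi S + (1-\alpha)\,\pi e^{T} u \\
    &= \alpha\,\pi S + (1-\alpha)\Bigl(\sum_i \pi_i\Bigr) u \\
    &= \alpha\,\pi S + (1-\alpha)\,u
\end{align*}
```

The last equality uses the normalization condition $\sum_i \pi_i = 1$.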

4. COMPUTING THE PAGERANK VECTOR

Does Equation 1 have a solution? Is the solution unique? Can we efficiently compute it? The success of the PageRank method rests on the answers to these queries. Luckily, all these questions have nice answers.

Thanks to the dangling nodes patch, matrix $S$ is a stochastic matrix⁴, and clearly the teleportation matrix $E$ is also stochastic. It follows that $G$ is stochastic as well, since it is defined as a convex combination of stochastic matrices $S$ and $E$. It is easy to show that, if $G$ is stochastic, Equation 1 always has at least one solution. Hence, we have got at least one PageRank vector. Having two independent PageRank vectors, however, would be already too much: which one should we use to rank Web pages? Here, a fundamental result of algebra comes to the rescue: the Perron-Frobenius theorem [23, 6]. It states that, if $A$ is an irreducible⁵ nonnegative square matrix, then there exists a unique vector $x$, called the Perron vector, such that $xA = rx$, $x > 0$, and $\sum_i x_i = 1$, where $r$ is the maximum eigenvalue of $A$ in absolute value, that algebraists call the spectral radius of $A$. The Perron vector is the left dominant eigenvector of $A$, that is, the left eigenvector associated with the largest eigenvalue in magnitude.

The matrix $S$ is most likely reducible, since experiments have shown that the Web has a bow-tie structure fragmented into four main continents that are not mutually reachable, as first observed in [4]. Thanks to the teleportation trick, however, the graph of matrix $G$ is strongly connected. Hence $G$ is irreducible and the Perron-Frobenius theorem applies⁶. Therefore, a positive PageRank vector exists and is furthermore unique.

Interestingly, we can arrive at the same result using Markov theory [19]. The above described random walk on the Web graph, modified with the teleportation jumps, naturally induces a finite-state Markov chain, whose transition matrix is the stochastic matrix $G$. Since $G$ is irreducible, the chain has a unique stationary distribution corresponding to the PageRank vector.

⁴ This simply means that all rows sum up to 1.
⁵ A matrix is irreducible if and only if the directed graph associated with it is strongly connected, that is, for every pair $i$ and $j$ of graph nodes there are paths leading from $i$ to $j$ and from $j$ to $i$.
⁶ Since $G$ is stochastic, its spectral radius is 1.

Year  Author                            Contribution
1906  Markov                            Markov theory [19]
1907  Perron                            Perron theorem [23]
1912  Frobenius                         Perron-Frobenius theorem [6]
1929  von Mises & Pollaczek-Geiringer   Power method [30]
1941  Leontief                          Econometric model [17]
1949  Seeley                            Sociometric model [28]
1952  Wei                               Sport ranking model [31]
1953  Katz                              Sociometric model [10]
1965  Hubbell                           Sociometric model [9]
1976  Pinski & Narin                    Bibliometric model [25]
1998  Kleinberg                         HITS [13]
1998  Brin & Page                       PageRank [3]

Table 1: PageRank history.

A last crucial question remains: can we efficiently compute the PageRank vector? The success of PageRank is largely due to the existence of a fast method to compute its values: the power method, a simple iteration method to find the dominant eigenpair of a matrix developed by von Mises and Pollaczek-Geiringer [30]. It works as follows on the Google matrix $G$. Let $\pi^{(0)} = u = \frac{1}{n} e$. Repeatedly compute $\pi^{(k+1)} = \pi^{(k)} G$ until $\|\pi^{(k+1)} - \pi^{(k)}\| < \epsilon$, where $\|\cdot\|$ measures the distance between the two successive PageRank vectors and $\epsilon$ is the desired precision.

The convergence rate of the power method is approximately the rate at which $\alpha^k$ approaches 0: the closer $\alpha$ is to unity, the lower the convergence speed of the power method. If, for instance, $\alpha = 0.85$, then 43 iterations are sufficient to gain 3 digits of accuracy, and 142 iterations are enough for 10 digits of accuracy. Notice that the power method applied to matrix $G$ can be easily expressed in terms of matrix $H$, which, unlike $G$, is a very sparse matrix that can be stored using a linear amount of memory with respect to the size of the Web.
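A minimal Python sketch of the power method, written directly on the sparse link lists instead of on the dense matrix $G$, may help to fix ideas. The edge list is an assumption: the text does not enumerate the edges of Figure 1, so the graph below was chosen to agree with its description (A dangling, B and C forming a bucket, G, H, I, L and M without incoming links). The function names and the use of the 1-norm as distance are also illustrative choices.

```python
import math


def pagerank(links, n, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power method for pi = pi G that only touches the sparse link lists.

    links maps a page index to the pages it links to; a page with no
    outgoing links is treated as a dangling node (uniform row of S).
    Returns the PageRank vector and the number of iterations performed.
    """
    out = [sorted(set(links.get(i, []))) for i in range(n)]
    pi = [1.0 / n] * n                                    # pi^(0) = u
    for k in range(1, max_iter + 1):
        dangling = sum(pi[i] for i in range(n) if not out[i])
        # Uniform share: dangling mass (weighted by alpha) plus teleportation.
        # The (1 - alpha) term relies on pi summing to 1, which the update preserves.
        new = [(alpha * dangling + (1.0 - alpha)) / n] * n
        for i in range(n):
            for j in out[i]:
                new[j] += alpha * pi[i] / len(out[i])     # alpha * pi_i * h_{i,j}
        delta = sum(abs(a - b) for a, b in zip(new, pi))  # 1-norm distance
        pi = new
        if delta < eps:
            return pi, k
    return pi, max_iter


def iterations_for_digits(digits, alpha=0.85):
    """Smallest k such that alpha**k <= 10**(-digits)."""
    return math.ceil(-digits / math.log10(alpha))


# ASSUMED edge list, consistent with the description of Figure 1.
names = "ABCDEFGHILM"
edges = {"B": "C", "C": "B", "D": "AB", "E": "BDF", "F": "BE",
         "G": "BE", "H": "BE", "I": "BE", "L": "E", "M": "E"}
idx = {p: k for k, p in enumerate(names)}
links = {idx[p]: [idx[q] for q in targets] for p, targets in edges.items()}

pi, k = pagerank(links, len(names))
scores = {p: round(100 * pi[idx[p]], 1) for p in names}
print(scores)
print(iterations_for_digits(3), iterations_for_digits(10))  # 43 142
```

With this assumed graph the rounded scores coincide with those reported in Figure 1, and `iterations_for_digits` recovers the iteration counts quoted above from the bound $\alpha^k \leq 10^{-d}$.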

5. STANDING ON THE SHOULDERS OF GIANTS

Dwarfs standing on the shoulders of giants is a Western metaphor meaning "One who develops future intellectual pursuits by understanding the research and works created by notable thinkers of the past".⁷ The metaphor was famously uttered by Isaac Newton: "If I have seen a little further it is by standing on the shoulders of Giants". Moreover, "Stand on the shoulders of giants" is Google Scholar's motto: "the phrase is our acknowledgement that much of scholarly research involves building on what others have already discovered".

There are many giants upon whose shoulders PageRank firmly stands: Markov [19], Perron [23], Frobenius [6], von Mises and Pollaczek-Geiringer [30] provided at the beginning of the 1900s the necessary mathematical machinery to investigate and effectively solve the PageRank problem. Moreover, the circular PageRank thesis has been previously exploited in different contexts, including Web information retrieval, bibliometrics, sociometry, and econometrics. In the following, we review these contributions and link them to the PageRank method. Table 1 contains a brief summary of PageRank history. All the ranking techniques surveyed in this paper have been implemented in R [26] and the code is freely available at the author's Web page.

⁷ From the Wikipedia page for "Standing on the shoulders of giants".

Figure 2: A HITS instance with solution (compare with PageRank scores in Figure 1). Each node is labelled with its authority (top) and hub (bottom) scores. Scores have been normalized to sum to 100. The dominant eigenvalue for both authority and hub matrices is 10.7. [Graph not reproduced; node scores follow.]

Node  Authority  Hub
A       4.7       0.0
B      45.9       0.0
C       0.0       8.1
D       5.3       8.9
E      38.9       9.9
F       5.3      14.9
G       0.0      14.9
H       0.0      14.9
I       0.0      14.9
L       0.0       6.8
M       0.0       6.8

5.1 Hubs and authorities on the Web

Hypertext Induced Topic Search (HITS) is a Web page ranking method proposed by Jon Kleinberg [13, 14]. The connections between HITS and PageRank are striking. Despite the close conceptual, temporal and even geographical proximity of the two approaches, it appears that HITS and PageRank have been developed independently. In fact, both papers presenting PageRank [3] and HITS [14] are today citational blockbusters: the PageRank article collected 6167 citations, while the HITS paper has been cited 4617 times.⁸

HITS thinks of Web pages as authorities and hubs. The HITS circular thesis reads as follows:

Good authorities are pages that are pointed to by good hubs and good hubs are pages that point to good authorities.

Let $L = (l_{i,j})$ be the adjacency matrix of the Web graph, i.e., $l_{i,j} = 1$ if page $i$ links to page $j$ and $l_{i,j} = 0$ otherwise. We denote with $L^T$ the transpose of $L$. HITS defines a pair of recursive equations as follows, where $x$ is the authority vector containing the authority scores and $y$ is the hub vector containing the hub scores:

⁸ Source: Google Scholar on February 5, 2010.

$$x^{(k)} = L^T y^{(k-1)}$$
$$y^{(k)} = L x^{(k)} \qquad (2)$$

where $k \geq 1$ and $y^{(0)} = e$, the vector of all ones. The first equation tells us that authoritative pages are those pointed to by good hub pages, while the second equation claims that good hubs are pages that point to authoritative pages. Notice that Equation 2 is equivalent to:

$$x^{(k)} = L^T L x^{(k-1)}$$
$$y^{(k)} = L L^T y^{(k-1)} \qquad (3)$$

It follows that the authority vector $x$ is the dominant right eigenvector of the authority matrix $A = L^T L$, and the hub vector $y$ is the dominant right eigenvector of the hub matrix $H = L L^T$. This is very similar to the PageRank method, except for the use of the authority and hub matrices instead of the Google matrix.

To compute the dominant eigenpair (eigenvector and eigenvalue) of the authority matrix we can again exploit the power method as follows: let $x^{(0)} = e$. Repeatedly compute $x^{(k)} = A x^{(k-1)}$ and normalize $x^{(k)} = x^{(k)}/m(x^{(k)})$, where $m(x^{(k)})$ is the signed component of maximal magnitude, until the desired precision is achieved. It follows that $x^{(k)}$ converges to the dominant eigenvector $x$ (the authority vector) and $m(x^{(k)})$ converges to the dominant eigenvalue (the spectral radius, which is not necessarily 1). The hub vector $y$ is then given by $y = Lx$. While the convergence of the power method is guaranteed, the computed solution is not necessarily unique, since the authority and hub matrices are not necessarily irreducible. A modification similar to the teleportation trick used for the PageRank method can be applied to HITS to recover the uniqueness of the solution [34].
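The iteration just described can be sketched in Python as follows. The sketch never forms $A = L^T L$ explicitly: it applies $L$ and then $L^T$ through the link lists. As before, the edge list is an assumption chosen to be consistent with the figures (the text does not list the edges), and the function name `hits` and the stopping rule are illustrative.

```python
def hits(links, n, eps=1e-12, max_iter=1000):
    """Power method on the authority matrix A = L^T L, without forming A.

    Returns (authority vector x, hub vector y = L x, dominant eigenvalue).
    Assumes the graph has at least one link.
    """
    out = [sorted(set(links.get(i, []))) for i in range(n)]
    x = [1.0] * n                                          # x^(0) = e
    lam = 0.0
    for _ in range(max_iter):
        y = [sum(x[j] for j in out[i]) for i in range(n)]  # y = L x
        new = [0.0] * n
        for i in range(n):
            for j in out[i]:
                new[j] += y[i]                             # new = L^T y = A x
        lam = max(new, key=abs)       # signed component of maximal magnitude
        new = [v / lam for v in new]
        delta = max(abs(a - b) for a, b in zip(new, x))
        x = new
        if delta < eps:
            break
    y = [sum(x[j] for j in out[i]) for i in range(n)]      # hub vector y = L x
    return x, y, lam


# ASSUMED edge list, consistent with the description of Figures 1 and 2.
names = "ABCDEFGHILM"
edges = {"B": "C", "C": "B", "D": "AB", "E": "BDF", "F": "BE",
         "G": "BE", "H": "BE", "I": "BE", "L": "E", "M": "E"}
idx = {p: k for k, p in enumerate(names)}
links = {idx[p]: [idx[q] for q in targets] for p, targets in edges.items()}

x, y, lam = hits(links, len(names))
authority = {p: 100 * x[idx[p]] / sum(x) for p in names}
hub = {p: 100 * y[idx[p]] / sum(y) for p in names}
print({p: round(s, 1) for p, s in authority.items()})
print({p: round(s, 1) for p, s in hub.items()})
print(round(lam, 1))
```

With this assumed graph the normalized scores and the dominant eigenvalue agree with Figure 2 to the displayed precision; in particular, the authority of C and the hub score of B decay to zero along the iteration.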

An example of HITS is given in Figure 2. We stress the difference among importance, as computed by PageRank, and authority and hubness, as computed by HITS. Page B is both important and authoritative, but it is not a good hub. Page C is important but by no means authoritative. Pages G, H, I are neither important nor authoritative, but they are the best hubs of the network, since they point to good authorities only. Notice that the hub score of B is 0 although B has one outgoing edge; unfortunately for B, the only page C linked by B has no authority. Similarly, C has no authority because it is pointed to only by B, whose hub score is zero. This shows the difference between indegree and authority, as well as between outdegree and hubness. Finally, we observe that nodes with null authority scores (respectively, null hub scores) correspond to isolated nodes in the graph whose adjacency matrix is the authority matrix $A$ (respectively, the hub matrix $H$).

An advantage of HITS with respect to PageRank is that it provides two scores at the price of one. The user is hence provided with two rankings: the most authoritative pages about the research topic, which can be exploited to investigate in depth a research subject, and the most hubby pages, which correspond to portal pages linking to the research topic from which a broad search can be started. A disadvantage of HITS is the higher susceptibility of the method to spamming: while it is difficult to add incoming links to our favourite page, the addition of outgoing links is much easier. This leads to the possibility of purposely inflating the hub score of a page, indirectly influencing also the authority scores of the pointed pages.

A subsequent algorithm that incorporates ideas from both PageRank and HITS is SALSA [16]: like HITS, SALSA computes both authority and hub scores, and like PageRank, these scores are obtained from Markov chains.

5.2 Bibliometrics

Bibliometrics, also known as scientometrics, is the quantitative study of the process of scholarly publication of research achievements. The most mundane aspect of this branch of information and library science is the design and application of bibliometric indicators to determine the influence of bibliometric units like scholars and academic journals. The Impact Factor is, undoubtedly, the most popular and controversial journal bibliometric indicator available at the moment. It is defined, for a given journal and a fixed year, as the mean number of citations in the year to papers published in the two previous years. It was proposed in 1963 by Eugene Garfield, the founder of the Institute for Scientific Information (ISI), working together with Irv Sher [7]. Journal Impact Factors are currently published in the popular Journal Citation Reports by Thomson-Reuters, the new owner of the ISI.

The Impact Factor does not take into account the importance of the citing journals: citations from highly reputed journals are weighted the same as those from obscure journals. In 1976 Gabriel Pinski and Francis Narin developed an innovative journal ranking method [25]. The method measures the influence of a journal in terms of the influence of the citing journals. The Pinski and Narin thesis is:

A journal is influential if it is cited by other influential journals.

This is the same circular thesis of the PageRank method. Given a source time window $T_1$ and a previous target time window $T_2$, the journal citation system can be viewed as a weighted directed graph in which nodes are journals and there is an edge from journal $i$ to journal $j$ if there is some article published in $i$ during $T_1$ that cites an article published in $j$ during $T_2$. The edge is weighted with the number $c_{i,j}$ of such citations from $i$ to $j$. Let $c_i = \sum_j c_{i,j}$ be the total number of cited references of journal $i$.

Figure 3: An instance with solution of the journal ranking method proposed by Pinski and Narin. Nodes are labelled with influence scores and edges with the citation flow between journals. Scores have been normalized to sum to 100. [Graph not reproduced; node scores: A 28.0, B 44.7, C 12.48, D 14.95.]

In the method described by Pinski and Narin, a citation matrix $H = (h_{i,j})$ is constructed such that $h_{i,j} = c_{i,j}/c_j$. The coefficient $h_{i,j}$ is the amount of citations received by journal $j$ from journal $i$ per reference given out by journal $j$. For each journal an influence score is determined which measures the relative journal performance per given reference. The influence score $\pi_j$ of journal $j$ is defined as:

$$\pi_j = \sum_i \frac{\pi_i c_{i,j}}{c_j} = \sum_i \pi_i h_{i,j}$$

or, in matrix notation:

$$\pi = \pi H \qquad (4)$$

Hence, journals $j$ with a large total influence $\pi_j c_j$ are those that receive significant endorsements from influential journals. Notice that the influence per reference score $\pi_j$ of a journal $j$ is a size independent measure, since the formula normalizes by the number of cited references $c_j$ contained in articles of the journal, which is an estimation of the size of the journal. Moreover, the normalization neutralizes the effect of journal self-citations, that is, citations between articles in the same journal. These citations are indeed counted both at the numerator and at the denominator of the influence score formula. This avoids over inflating journals that engage in the practice of opportunistic self-citations.

It can be proved that the spectral radius of matrix $H$ is 1, hence the influence score vector corresponds to the dominant eigenvector of $H$ [8]. In principle, the uniqueness of the solution and the convergence of the power method to it are not guaranteed. Nevertheless, both properties are not difficult to obtain in real cases. If the citation graph is strongly connected, then the solution is unique. When journals belong to the same research field, this condition is typically satisfied. Moreover, if there exists a self-loop in the graph, that is an article that cites an article in the same journal, then the power method converges.
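The influence scores can be computed with the same power iteration used for PageRank. The following Python sketch uses a hypothetical three-journal citation matrix (it is not the instance of Figure 3, whose edges are not listed in the text); the citation graph is strongly connected and has a self-loop, so the conditions stated above for uniqueness and convergence hold. The function name `pinski_narin` is illustrative.

```python
def pinski_narin(C, eps=1e-12, max_iter=10000):
    """Influence scores pi = pi H with h[i][j] = c[i][j] / c_j.

    C[i][j] is the number of citations from journal i to journal j.
    Every journal is assumed to give out at least one reference.
    """
    n = len(C)
    c = [sum(row) for row in C]     # c_i: references given out by journal i
    H = [[C[i][j] / c[j] for j in range(n)] for i in range(n)]
    pi = [1.0] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
        delta = max(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if delta < eps:
            break
    total = sum(pi)
    return [100.0 * p / total for p in pi]   # normalized to sum to 100


# Hypothetical citation counts for three journals (not taken from Figure 3).
C = [[2, 4, 0],
     [3, 0, 3],
     [1, 1, 0]]
scores = pinski_narin(C)
print([round(s, 2) for s in scores])
```

In this toy instance the third journal receives the fewest citations but also gives out the fewest references, and it ends up with the highest influence per reference, which illustrates the size independence of the measure.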

Figure 3 provides an example of the Pinski and Narin method. Notice that the graph is strongly connected and has a self-loop, hence the solution is unique and can be computed with the power method. Both journals A and C receive the same number of citations and give out the same number of references. Nevertheless, the influence of A is bigger, since it is cited by a more influential journal (B instead of D). Furthermore, A and D receive the same number of citations from the same journals, but D is larger than A, since it contains more references, hence the influence of A is higher.

Similar recursive methods have been independently proposed by [18] and [22] in the context of ranking of economics journals. Recently, various PageRank-inspired bibliometric indicators to evaluate the importance of journals using the academic citation network have been proposed and extensively tested: journal PageRank [2], Eigenfactor [33], and SCImago [27].

5.3 Sociometry

Sociometry, the quantitative study of social relationships, contains remarkably old PageRank predecessors. Sociologists were the first to use the network approach to investigate the properties of groups of people related in some way. They devised measures like indegree, closeness, betweenness, as well as eigenvector centrality which are still used today in modern (not necessarily social) network analysis [20]. In particular, eigenvector centrality uses the same central ingredient of PageRank applied to a social network:

A person is prestigious if he is endorsed by prestigious people.

John R. Seeley in 1949 is probably the first in this context to use the circular argument of PageRank [28]. Seeley reasons in terms of social relationships among children: each child chooses other children in a social group with a non-negative strength. The author notices that the total choice strength received by each child is inadequate as an index of popularity, since it does not consider the popularity of the chooser. Hence, he proposes to define the popularity of a child as a function of the popularity of those children who chose the child, and the popularity of the choosers as a function of the popularity of those who chose them, and so on in an "indefinitely repeated reflection". Seeley states the problem in terms of linear equations and uses Cramer's rule to solve the linear system. He does not discuss the issue of uniqueness.

Another model was proposed in 1953 by Leo Katz [10]. Katz views a social network as a directed graph where nodes are people and person $i$ is connected by an edge to person $j$ if $i$ chooses, or endorses, $j$. The status of member $j$ is defined as the number of weighted paths reaching $j$ in the network, a generalization of the indegree measure. Long paths are weighted less than short ones, since endorsements devalue over long chains. Notice that this method indirectly takes account of who endorses as well as how many endorse an individual: if a node $i$ points to a node $j$ and $i$ is reached by many paths, then the paths leading to $i$ arrive also at $j$ in one additional step.

Katz builds an adjacency matrix $L = (l_{i,j})$ such that $l_{i,j} = 1$ if person $i$ chooses person $j$ and $l_{i,j} = 0$ otherwise. He defines a matrix $W = \sum_{k=1}^{\infty} (aL)^k$, where $a$ is an attenuation constant. Notice that the $(i,j)$ component of $L^k$ is the number of paths of length $k$ from $i$ to $j$, and this number is attenuated by $a^k$ in the computation of $W$. Hence, the $(i,j)$ component of the limit matrix $W$ is the weighted number of arbitrary paths from $i$ to $j$. Finally, the status of member $j$ is $\pi_j = \sum_i w_{i,j}$, that is, the number of weighted paths reaching $j$. If the attenuation factor $a < 1/\rho(L)$, with $\rho(L)$ the spectral radius of $L$, then the above series for $W$ converges.
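The path summation above can be sketched in Python by accumulating the terms $e\,(aL)^k$ one at a time, so that $W$ is never stored. The edge list is again an assumption chosen to be consistent with the figures (the text does not list the edges), and the function name `katz` and the stopping rule are illustrative.

```python
def katz(links, n, a, eps=1e-12, max_iter=100000):
    """Katz status pi = e * sum_{k>=1} (a L)^k, accumulated term by term.

    The series converges only if a < 1 / rho(L), with rho(L) the
    spectral radius of the adjacency matrix L.
    """
    out = [sorted(set(links.get(i, []))) for i in range(n)]
    term = [1.0] * n                    # e (a L)^0 = e
    status = [0.0] * n
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            for j in out[i]:
                nxt[j] += a * term[i]   # term_k = term_{k-1} (a L)
        term = nxt
        status = [s + t for s, t in zip(status, term)]
        if sum(term) < eps:             # remaining paths contribute negligibly
            break
    return status


# ASSUMED edge list, consistent with the description of the figures.
names = "ABCDEFGHILM"
edges = {"B": "C", "C": "B", "D": "AB", "E": "BDF", "F": "BE",
         "G": "BE", "H": "BE", "I": "BE", "L": "E", "M": "E"}
idx = {p: k for k, p in enumerate(names)}
links = {idx[p]: [idx[q] for q in targets] for p, targets in edges.items()}

status = katz(links, len(names), a=0.1)
total = sum(status)
scores = {p: round(100 * status[idx[p]] / total, 1) for p in names}
print(scores)
```

With this assumed graph and $a = 0.1$ the rounded scores coincide with the values reported for $a = 0.1$ in Figure 4. For a very small attenuation factor the status divided by $a$ approaches the indegree of each node, since only paths of length 1 contribute appreciably.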

Figure 4: An example of the Katz model using two attenuation factors: $a = 0.9$ and $a = 0.1$ (the spectral radius of the adjacency matrix $L$ is 1). Each node is labelled with the Katz score corresponding to $a = 0.9$ (top) and $a = 0.1$ (bottom). Scores have been normalized to sum to 100. [Graph not reproduced; node scores follow.]

Node  a = 0.9  a = 0.1
A       2.7      5.7
B      46.4     39.6
C      41.9      8.8
D       2.9      7.9
E       3.2     30.1
F       2.9      7.9
G       0.0      0.0
H       0.0      0.0
I       0.0      0.0
L       0.0      0.0
M       0.0      0.0

Figure 4 illustrates the method with an example. Notice the important role of the attenuation factor: when it is large (close to $1/\rho(L)$), long paths are devalued smoothly, and Katz scores are strongly correlated with PageRank ones. In the shown example, PageRank and Katz methods provide the same ranking of nodes when the attenuation factor is 0.9. On the other hand, if the attenuation factor is small (close to 0), then the contribution given by paths longer than 1 rapidly declines, and thus Katz scores converge to indegrees, the number of incoming links of nodes. In the example, when the attenuation factor drops to 0.1, nodes C and E switch their positions in the ranking: node E, which receives many short paths, significantly increases its score, while node C, which is the destination of just one short path and many (devalued) long ones, significantly decreases its score.

In 1965 Charles H. Hubbell generalized the proposal of Katz [9]. Given a set of members of a social context, Hubbell defines a matrix $W = (w_{i,j})$ such that $w_{i,j}$ is the strength at which $i$ endorses $j$. Interestingly, these weights can be arbitrary, and in particular, they can be negative. The prestige of a member is recursively defined in terms of the prestige of the endorsers and takes account of the endorsement strengths:

$$\pi = \pi W + v \qquad (5)$$

The term $v$ is an exogenous vector such that $v_i$ is a minimal amount of status assigned to $i$ from outside the system.

Figure 5: An instance of the Hubbell model with solution: each node is labelled with its prestige score and each edge is labelled with the endorsement strength between the connected members; negative strength is highlighted with dashed edges. The minimal amount of status has been fixed to 0.2 for all members. [Graph not reproduced; prestige scores: Alice 0.49, Bob 0.41, Charles 0.2, David -0.9.]

The original aspects of the method are the presence of an exogenous initial input and the possibility of giving negative endorsements. A consequence of negative endorsements is that the status of an actor can also be negative. An actor that receives a positive (respectively, negative) judgment from a member of positive status increases (respectively, decreases) his prestige. On the other hand, and interestingly, receiving a positive judgment from a member of negative status makes a negative contribution to the prestige of the endorsed member (if you are endorsed by some person affiliated to the Mafia your reputation might indeed drop). Moreover, receiving a negative endorsement from a member of negative status makes a positive contribution to the prestige of the endorsed person (if the same Mafioso opposes you, then your reputation might rise).

Figure 5 shows an example for the Hubbell model. Notice that Charles does not receive any endorsement and hence has the minimal amount of status given by default to each member. David receives only negative judgments; interestingly, the fact that he has a positive self opinion further decreases his status. A better strategy for him, knowing in advance of his negative status, would be to negatively judge himself, acknowledging the negative judgment given by the other members.

Equation 5 is equivalent to $\pi (I - W) = v$, where $I$ is the identity matrix, that is $\pi = v (I - W)^{-1} = v \sum_{i=0}^{\infty} W^i$. The series converges if and only if the spectral radius of $W$ is less than 1. It is now clear that the Hubbell model is a generalization of the Katz model to general matrices that adds an initial exogenous input $v$. Indeed, the Katz equation for social status is $\pi = e \sum_{i=1}^{\infty} (aL)^i$, where $e$ is a vector of all ones. In an unpublished note Vigna traces the history of the mathematics of spectral ranking and shows that there is a reduction from the path summation formulation of Hubbell-Katz to the eigenvector formulation with teleportation of PageRank and vice versa [29]. In the mapping the attenuation constant is the counterpart of the PageRank damping factor, and the exogenous vector corresponds to the PageRank personalization vector. The interpretation of PageRank as a sum of weighted paths is also investigated in [1].
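The series form of the Hubbell solution can be illustrated with a minimal Python sketch. The matrix `W`, the vector `v`, and the function name `hubbell_prestige` are hypothetical choices made for illustration; they are not the instance of Figure 5. Starting from $\pi = v$ and repeating $\pi \leftarrow \pi W + v$ reproduces the partial sums of $v \sum_i W^i$, so the loop converges only when the spectral radius of $W$ is below 1.

```python
def hubbell_prestige(W, v, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi W + v, starting from pi = v.

    After k steps pi equals v (I + W + ... + W^k), a partial sum of the series.
    """
    n = len(v)
    pi = list(v)
    for _ in range(max_iter):
        nxt = [sum(pi[i] * W[i][j] for i in range(n)) + v[j] for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence: spectral radius of W may be >= 1")


# Hypothetical endorsement strengths: W[i][j] is how strongly i endorses j.
# Member 2 endorses nobody and receives only negative endorsements.
W = [
    [0.0, 0.5, -0.3],
    [0.2, 0.0, -0.4],
    [0.0, 0.0, 0.0],
]
v = [0.2, 0.2, 0.2]  # minimal exogenous status, assumed equal for all members

pi = hubbell_prestige(W, v)

# Residual of Equation 5: pi - (pi W + v) should be numerically zero.
residual = [
    pi[j] - (sum(pi[i] * W[i][j] for i in range(3)) + v[j]) for j in range(3)
]
print(pi, residual)
```

In this toy instance the member who receives only negative endorsements from positively rated members ends up with negative prestige, in line with the discussion of negative endorsements above.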

Spectral ranking methods have also been exploited to rank sport teams in competitions that involve teams playing in pairs [31, 12]. The underlying idea is that a team is strong if it won against other strong teams. Much of the art of the sport ranking problem is how to define the matrix entries $a_{i,j}$ expressing how much team $i$ is better than team $j$ (e.g., we could pick $a_{i,j}$ to be 1 if $j$ beats $i$, 0.5 if the game ended in a tie, and 0 otherwise) [11].
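As a rough illustration of the example entries just mentioned, the following Python sketch builds the matrix from a handful of made-up results and approximates the dominant left eigenvector by power iteration. The teams, the games, and the helper `rank_teams` are hypothetical; the left-eigenvector convention $r = rA$ is an assumption chosen to match the row-vector notation used elsewhere in this paper, and the iteration is only expected to converge when the matrix is irreducible and aperiodic, as it is here.

```python
teams = ["A", "B", "C"]
# Hypothetical results: (first, second, outcome for the first team).
games = [("A", "B", "win"), ("B", "C", "win"), ("A", "C", "tie")]

idx = {t: k for k, t in enumerate(teams)}
n = len(teams)
A = [[0.0] * n for _ in range(n)]
for first, second, outcome in games:
    f, s = idx[first], idx[second]
    if outcome == "win":
        A[s][f] += 1.0  # a[i][j] = 1 when j beats i (here j = first, i = second)
    else:
        A[f][s] += 0.5  # a tie contributes 0.5 in both directions
        A[s][f] += 0.5


def rank_teams(A, iters=200):
    """Power iteration for the dominant left eigenvector, normalized to sum 1."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(r[i] * A[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        r = [x / total for x in nxt]
    return r


scores = rank_teams(A)
ranking = sorted(teams, key=lambda t: -scores[idx[t]])
print(ranking, scores)
```

With these results team A comes first: it beat B, which in turn beat C, so A's win counts for more than a win against a weak team would.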

5.4 Econometrics

We conclude with a succinct description of the input-output model developed in 1941 by Nobel Prize winner Wassily W. Leontief in the field of econometrics, the quantitative study of economic principles [17]. According to the Leontief input-output model, the economy of a country may be divided into any desired number of sectors, called industries, each consisting of firms producing a similar product. Each industry requires certain inputs in order to produce a unit of its own product, and sells its products to other industries to meet their ingredient requirements. The aim is to find prices for the unit of product produced by each industry that guarantee the reproducibility of the economy, which holds when each sector balances the costs for its inputs with the revenues of its outputs. In 1973, Leontief earned the Nobel Prize in economics for his work on the input-output model. An example is provided in Table 2.

Let $q_{i,j}$ denote the quantity produced by the $i$th industry and used by the $j$th industry, and let $q_i$ be the total quantity produced by sector $i$, that is, $q_i = \sum_j q_{i,j}$. Let $A = (a_{i,j})$ be such that $a_{i,j} = q_{i,j}/q_j$; each coefficient $a_{i,j}$ represents the amount of product (produced by industry) $i$ consumed by industry $j$ that is necessary to produce a unit of product $j$. Let $\pi_j$ be the price for the unit of product produced by each industry $j$. The reproducibility of the economy holds when each sector $j$ balances the costs for its inputs with the revenues of its outputs, that is:

$$\mathrm{cost}_j = \sum_i \pi_i q_{i,j} = \mathrm{revenue}_j = \sum_i \pi_j q_{j,i} = \pi_j \sum_i q_{j,i} = \pi_j q_j$$

By dividing each balance equation by $q_j$ we have

$$\pi_j = \sum_i \pi_i \frac{q_{i,j}}{q_j} = \sum_i \pi_i a_{i,j}$$

or, in matrix notation,

$$\pi = \pi A \qquad (6)$$

Hence, highly remunerated industries (industries $j$ with high total revenue $\pi_j q_j$) are those that receive substantial inputs from highly remunerated industries, a circularity that closely resembles the PageRank thesis [24]. With the same argument used in [8] for the Pinski and Narin bibliometric model we can show that the spectral radius of matrix $A$ is 1, thus the equilibrium price vector is the dominant eigenvector of matrix $A$. Such a solution always exists, although it might not be unique, unless $A$ is irreducible. Notice the striking similarity of the Leontief closed model with that proposed by Pinski and Narin. An open Leontief model adds an exogenous demand and creates a surplus of revenue (profit). It is described by the equation $\pi = \pi A + v$, where $v$ is the profit vector. Hubbell himself observes the similarity between his model and the Leontief open model [9].


              agriculture   industry   family   total   price   revenue
agriculture       7.5           6       16.5      30      20       600
industry         14             6       30        50      15       750
family           80           180       40       300       3       900
cost            600           750      900

Table 2: An input-output table for an economy with three sectors with the balance solution. Each row shows the output of a sector to other sectors of the economy. Each column shows the inputs received by a sector from other sectors. For each sector we also show total quantity produced, equilibrium unitary price, total cost, and total revenue. Notice that each sector balances costs and revenues.
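The figures of Table 2 can be checked against Equation 6 with a short Python sketch. The quantities are copied from the table; the function names and the use of power iteration are illustrative assumptions, not the procedure used to produce the table. Since the eigenvector is defined only up to a scale factor, the sketch rescales it so that the family price equals 3, as in the table.

```python
sectors = ["agriculture", "industry", "family"]
# Q[i][j]: quantity produced by sector i and used by sector j (Table 2).
Q = [
    [7.5, 6.0, 16.5],
    [14.0, 6.0, 30.0],
    [80.0, 180.0, 40.0],
]
n = len(Q)
q = [sum(row) for row in Q]  # total quantity produced by each sector
A = [[Q[i][j] / q[j] for j in range(n)] for i in range(n)]  # a_ij = q_ij / q_j


def left_multiply(x, M):
    """Return the row vector x M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def equilibrium_prices(A, iters=500):
    """Approximate a left eigenvector of A for eigenvalue 1 by power iteration."""
    pi = [1.0] * len(A)
    for _ in range(iters):
        pi = left_multiply(pi, A)
        total = sum(pi)
        pi = [p / total for p in pi]
    return pi


pi = equilibrium_prices(A)
scale = 3.0 / pi[2]  # fix the family price to 3, as in Table 2
prices = [p * scale for p in pi]

costs = [sum(prices[i] * Q[i][j] for i in range(n)) for j in range(n)]
revenues = [prices[j] * q[j] for j in range(n)]
for name, p, c, r in zip(sectors, prices, costs, revenues):
    print(f"{name}: price={p:.4f} cost={c:.2f} revenue={r:.2f}")
```

Power iteration is adequate here because every entry of $A$ is positive; for a reducible matrix the equilibrium need not be unique, as noted above, and a different method would be needed.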

It might seem disputable to juxtapose PageRank and Leontief methods. To be sure, the original motivation of Leontief's work was to give a formal method to find equilibrium prices for the reproducibility of the economy and to use the method to estimate the impact on the entire economy of a change in demand in any sector of the economy. Leontief, to the best of our limited knowledge, was not motivated by an industry ranking problem. On the other hand, the motivation underlying the other methods described in this paper is the ranking of a set of homogeneous entities. Despite the original motivations, however, there are more than coincidental similarities between the Leontief open and closed models and the other ranking methods described in this paper. These connections motivated the discussion of the Leontief contribution, which is probably the least known among the surveyed methods within the computing community.

6. CONCLUSION

The classic notion of quality of information is related to the judgment given by a few field experts. PageRank introduced an original notion of quality of information found on the Web: the collective intelligence of the Web, formed by the opinions of the millions of people that populate this universe, is exploited to determine the importance, and ultimately the quality, of that information.

Consider the difference between expert evaluation and collective evaluation. The former tends to be intrinsic, subjective, deep, slow and expensive. By contrast, the latter is typically extrinsic, democratic, superficial, fast and low-cost. Interestingly, the dichotomy between these two evaluation methodologies is not peculiar to information found on the Web. In the context of assessment of academic research, peer review (the evaluation of scholarly publications given by peer experts working in the same field as the publication) plays the role of expert evaluation. Collective evaluation consists in gauging the importance of a contribution through the bibliometric practice of counting and analysing citations received by the publication from the academic community. Citations generally witness the use of information and acknowledge intellectual debt. Eigenfactor [33], a PageRank-inspired bibliometric indicator, is among the most interesting recent proposals to collectively evaluate the status of academic journals. The consequences of a shift from peer review to bibliometric evaluation are currently vigorously debated in the academic community [32].

Acknowledgements

The author thanks Enrico Bozzo, Sebastiano Vigna, and the anonymous referees for positive and critical comments on the early drafts of this paper. Sebastiano Vigna was the first to point out the contribution of John R. Seeley.

7. REFERENCES

[1] R. A. Baeza-Yates, P. Boldi, and C. Castillo. Generic damping functions for propagating importance in link-based ranking. Internet Mathematics, 3(4):445-478, 2007.

[2] J. Bollen, M. A. Rodriguez, and H. V. de Sompel. Journal status. Scientometrics, 69(3):669-687, 2006.

[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[4] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the Web. Computer Networks, 33(1-6):309-320, 2000.

[5] V. Bush. As we may think. Atlantic Monthly, 176(1):101-108, 1945.

[6] G. Frobenius. Über Matrizen aus nicht negativen Elementen. In Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin, pages 456-477, 1912.

[7] E. Garfield and H. Sher. New factors in the evaluation of scientific literature through citation indexing. American Documentation, 14:195-201, 1963.

[8] N. L. Geller. On the citation influence methodology of Pinski and Narin. Information Processing & Management, 14(2):93-95, 1978.

[9] C. H. Hubbell. An input-output approach to clique identification. Sociometry, 28:377-399, 1965.

[10] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39-43, 1953.

[11] J. P. Keener. The Perron-Frobenius theorem and the ranking of football teams. SIAM Review, 35(1):80-93, 1993.

[12] M. G. Kendall. Further contributions to the theory of paired comparisons. Biometrics, 11(1):43-62, 1955.

[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, 1998.

[14] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[15] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

[16] R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387-401, 2000.

[17] W. W. Leontief. The Structure of American Economy, 1919-1929. Harvard University Press, 1941.

[18] S. J. Liebowitz and J. P. Palmer. Assessing the relative impacts of economics journals. Journal of Economic Literature, 22:77-88, 1984.

[19] A. Markov. Rasprostranenie zakona bolshih chisel na velichiny, zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete, 2-ya seriya 15(94):135-156, 1906.

[20] M. E. J. Newman. Network analysis: An introduction. Oxford University Press, 2010.

[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999. Retrieved June 1, 2010, from http://ilpubs.stanford.edu:8090/422/.

[22] I. Palacios-Huerta and O. Volij. The measurement of intellectual influence. Econometrica, 72:963-977, 2004.

[23] O. Perron. Zur Theorie der Matrices. Mathematische Annalen, 64(2):248-263, 1907.

[24] S. U. Pillai, T. Suel, and S. Cha. The Perron-Frobenius theorem: some of its applications. IEEE Signal Processing Magazine, 22(2):62-75, 2005.

[25] G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management, 12(5):297-312, 1976.

[26] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0. Accessed June 1, 2010, at http://www.R-project.org.

[27] SCImago. SJR: SCImago Journal & Country Rank. Accessed June 1, 2010, at http://www.scimagojr.com, 2007.

[28] J. R. Seeley. The net of reciprocal influence: A problem in treating sociometric data. The Canadian Journal of Psychology, 3:234-240, 1949.

[29] S. Vigna. Spectral ranking, 2010. Retrieved June 1, 2010, from http://arxiv.org/abs/0912.0238.

[30] R. von Mises and H. Pollaczek-Geiringer. Praktische Verfahren der Gleichungsauflösung. Zeitschrift für Angewandte Mathematik und Mechanik, 9:58-77, 152-164, 1929.

[31] T. H. Wei. The algebraic foundations of ranking theory. PhD thesis, Cambridge University, 1952.

[32] P. Weingart. Impact of bibliometrics upon the science system: inadvertent consequences? Scientometrics, 62(1):117-131, 2005.

[33] J. West, B. Althouse, C. Bergstrom, M. Rosvall, and T. Bergstrom. Eigenfactor.org: Ranking and mapping scientific knowledge. Accessed June 1, 2010, at http://www.eigenfactor.org, 2007.

[34] A. X. Zheng, A. Y. Ng, and M. I. Jordan. Stable algorithms for link analysis. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 258-266, 2001.
