Citation Networks as a Multi-layer Graph: Link Prediction and

Citation Networks as a Multi-layer Graph: Link Predictionand Importance Ranking

CS224W Project Report, Group 5

Jingyu CuiElectrical Engineering

Stanford CA [email protected]

Fan WangElectrical Engineering

Stanford CA [email protected]

Jinjian ZhaiComputer ScienceStanford CA 94305

[email protected]

ABSTRACTIn academia, to represent the relationships between researchersor papers, the traditional way is to analyze the citation net-work constructed by paper citation relationships. Our majorcontribution is that we incorporated multiple networks fromthe publication dataset, such as the paper citation network,author citation network, author collaboration network, etc.These networks can provide very useful and interesting infor-mation for analyzing the citation behavior of researchers. Inthis paper, we first analyze how people are citing papers byapplying logistic regression model to paper citation network.With the additional information from the author networks,we can get improved results of link prediction in citationnetwork. Then we run a modified PageRank algorithm onpaper citation network and author citation network, whichcan provide us with a more accurate sense of the paper andresearcher impact.

KeywordsCitation network, Regression, Link prediction, PageRank,Impact metric.

1. INTRODUCTIONWhen researchers are writing a paper, usually they will citeother publications as references. As digital libraries becomemore and more popular recently, almost all the research pub-lications have been made available to the general public.Although there might be a lot of related works available forcitation, researchers usually don’t just randomly pick someof them; it is believed that they are following some specificrules. It might be interesting and meaningful for us to inves-tigate the citation network and try to find these rules. Ourgoal of this project is to analyze the static and dynamic prop-erties of citation network, to further obtain some insightsabout the measurements of research quality, collaborationbehavior, even the evolution of science and technology.

Citation network analysis has been a hot research topic,

and there have been many publications in the related field.In [15], the authors investigated how patterns of citationsvaried between the scientific disciplines and how such pat-terns related to the impact of a paper. A citation projectiongraph was defined, and several metrics and statistics wereproposed to capture the network property. In [13], althoughthe network analyzed was a Blog network, the proposedSusceptible-Infected-Susceptible model was very useful forus to analyze citation network. In [12], how to find the mostimportant nodes to obtain information from the network wasdiscussed as an optimization problem.

The authors of [8] proposed important indices on arcs toidentify the important parts of the citation network. Theauthors of [1] proposed an efficient algorithm for determiningthe arc weights in the SPLC and SPNP weights. The authorsof [9] tried to analyze a citation network constrained in aspecific attainability science domain. The authors of [10]proposed three methods to analyze the large scale dynamicnetwork using EM algorithm, modularity optimization, andeigenvector centrality.

Inspired by the publications above, we are trying to analyzethe citation network in a more accurate and complete man-ner, and we state the problems that we want to investigatein the following sessions.

The rest of the paper is organized as follows: Section 2 de-scribes some basic analysis of the citation network charac-teristics. Section 3 depicts our methods of modeling andpredicting citation behavior in the network. Section 4 givesthe method and results for predicting impact of items in thenetwork.

2. BASIC ANALYSIS OF CITATION NET-WORK

We consider two datasets in our following analysis and ex-periments. The first one is DBLP dataset, which includesthe information on computer science publications listed inthe DBLP Computer Science Bibliography (www.informatik.uni-trier.de/~ley/db/). The second one is Hep-Th dataset,

Each entry of the dataset will contain a piece of publica-tion record, which includes publication id, title, author list,citations, proceeding, year, pages, etc. First we constructthe traditional paper citation network. Note that in DBLPdataset, all of the publications have not been cited, nor

1940 1950 1960 1970 1980 1990 2000 2010

101

102

103

104

105

Year

# pa

pers

Figure 1: The distribution of paper number in eachyear for DBLP data set.

1992 1994 1996 1998 2000 200210

2

103

Year

# pa

pers

Figure 2: The distribution of paper number in eachyear for the HEP-TH data set.

all publications have cited other publications in the samedataset. There are in total 1,466,034 papers in DBLP dataset, while only 22,138 papers which are either citing otherpapers or being cited by others. That is to say, the DBLPdata set doesn’t contain the full citation information; how-ever, we can regard it as sufficient to reflect the true citationpattern. The reason here is that the characteristics and thecitation pattern we found for this network and the Hep-Thnetwork are similar to each other, which will be shown inthe following parts. ).

2.1 Data AnalysisIn this part, we analyzed the characteristics of the tradi-tional paper citation network. Several interesting pointshave been discovered here, as described in the following sec-tions.

2.1.1 More papers are produced as years go byWe first plot the distribution of the number of papers pub-lished in each year as Fig.1 and Fig.2. The x-axis is theyear, and the y-axis is the log scale of the number of paperspublished in the corresponding year. The log-scale papernumbers almost increase linearly as time goes by (the pointcorresponding to the last year can be ignored since the datamight be incomplete), which means, the actual number ofpapers is increasing exponentially! Each field is a reallybooming field, and this could be also because publishingpapers is becoming easier as there are more journals andconferences.

2.1.2 The degree distribution follows power law

100

101

102

100

101

102

103

# Citations

# pa

pers

Figure 3: The in-degree distribution of DBLP dataset.

100

101

102

100

101

102

103

# Citations

# pa

pers

Figure 4: The in-degree distribution of the HEP-THdata set.

We construct a citation graph from each dataset, which is adirected graph. For a paper which is cited by another paper,it has an incoming edge; for a paper which is citing anotherpaper, there is an outcoming edge.

The in-degree distribution and out-degree distribution areplotted in Fig.3,Fig.4, FIg.5, Fig.6 respectively. From thefigures we can see that, in- and out- degrees of both datasets follow power law distribution.

3. CITATION MODEL AND LINK PREDIC-TION

To model a network, the most important thing is to deter-mine how the graph evolves and how to determine there is

100

101

102

100

101

102

# References

# pa

pers

Figure 5: The out-degree distribution of DBLP dataset.

100

101

102

100

101

102

103

# References

# pa

pers

Figure 6: The out-degree distribution of the HEP-TH data set.

link between two nodes. Here we want to decide the prob-ability of connecting two nodes through a function of manyfactors. These factors are those we think might have influ-ence on an author’s decision of citation. If we can model thisfunction accurately, we may explicitly express the rule of ci-tation, and explicitly state some properties of the citationnetwork.

3.1 Citation ModelIn many pervious literatures, how the information is prop-agated or how a link will be created in the network wassimulated by a simple but powerful probabilistic model, inwhich each node spreads its information to (or link to) itsneighborhood nodes with probability β. Thus the choice ofβ is very critical. However, it is somewhat limiting to haveone value of β to describe the behavior of all nodes. In ourproblem, when researchers are thinking about citing otherpublications, they are not making decisions based on a sim-ple and fixed parameter. Many factors will have importantinfluence in this process, such as how famous the target pa-per is, whether or not the target paper was the research’sown previous work, whether the cited paper is recent workor more than 10 years ago, etc. Therefore, it is necessaryand meaningful to adapt the probability value β accordingto the properties of the edge and the two nodes the edge con-nects, i.e., the probability that a link is made depends onthe properties of the source node and the destination nodeand the properties of these two together.

Generally speaking, we try to model the relationship as alinear regression model [2]:

z (s, d) =

M∑i=1

αifi (s) +

N∑j=1

γjfj (d) +

L∑

k=1

ξkfk (s, d)

where fi(s) are M properties of the potential source nodes, fj(d) are N properties of the potential destination noded, and fk(s, d) are L properties decided by the two nodesjointly, i.e. the properties of the edge. αi, γj , and ξk areweights to be determined by the regression algorithm. Thispart is basically a linear combination of all features.

To get a probability parameter β, we transform the real-valued z(s, d) to the (0, 1) interval using the logistic function:

β (s, d) =1

1 + e−z(s,d)

Note that these regressors contain both the properties de-cided by single node (fi(s) and fj(d)), and the propertiesdecided by two nodes of an edge (fk(s, d)).

Specifically, given two papers in our experiment, we firstdetermine the direction of the potential edge. The edgedirection can be simply determined by the publishing yearof the two papers due to the natural causality of citation.The paper published recently is called the source paper A,and the paper published earlier is regarded as a potentialdestination B for citation edge.

1. In-degree of B, meaning how many citations B hasreceived;

2. Out-degree of B, meaning how many papers B hascited;

3. Year difference, i.e. The difference between publicationyear of A and B;

4. Author overlap: The number of overlapping authors ofthe author lists of A and B;

5. Title similarity: the semantic similarity between twopaper titles defined by “WordNet”;

3.2 Multi-Layer GraphAlthough we are analyzing the citation model, it is the au-thors who make the decision of citations. Therefore, wethink it is also useful to incorporate the authors’ propertiesin our citation model.

Given the publication dataset, two more networks could beconstructed besides the paper citation network:

1. Author citation network: if an author X’s papers havebeen cited by another author Y for k times in total,there is a directed edge from Y to X with weight k.

2. Author collaboration network: if an author X has co-authored k papers with author Y , there is an undi-rected edge between X and Y with weight k.

The three networks together constitute a multi-layer hyper-graph structure as shown in Fig.7. These two newly con-structed networks are related to the paper citation network;however, we believe that they could provide extra informa-tion about the authors, who are the creators of papers andtheir personal characteristics might affect the citations.

Therefore, several new regressors could be added to our re-gression model:

1. “MaxACollaberateWithB”: the maximum number ofcollaborations happened between each author of pa-per A and each author of paper B;

2. “MeanACollaberateWithB”: the average number of col-laboration between authors of paper A and authors ofpaper B;

Figure 7: Multi-layer structure of the network con-taining paper citation network, author citation net-work, and author collaboration network.

3. “MaxACiteB”: the maximum number of citations eachauthor of paper B have received from each author ofpaper A;

4. “MeanACiteB”: the average number of citations theauthors of paper B have received from authors of paperA;

5. “AverageCitedNumberB”: the average citations the au-thors of B has received, i.e. the average weight of B’sin-edge in author citation network;

6. “AverageReferenceNumberB”: the average citations theauthors of A has made, i.e. the average weight of B’sout-edge in author citation network;

7. “AverageCitedNumberA”: the average citations the au-thors of paper A has received;

8. “AverageCitedNumberA”: the average citations the au-thors of paper A has made;

3.3 Experimental Results3.3.1 Experiments SetupEach sample in our experiments represents an edge, and thevalue of features (regressors) extracted for this edge is cal-culated based on the network constructed at the time whenthe source paper is published. That is to say, we cannot usefuture information when finding citations for current paper.

We generate two training sample sets from DBLP dataset,one is called “DBLP ’90-’95”, in which the source papers ofall edges were published between the year 1990 and 1995;the other is “DBLP ’95-’00”, meaning the edges are startedfrom papers from 1995 to 2000. The Hep-Th dataset is usedas a whole.

For each source paper A, all the actual out-edges can beused as positive training samples, and we random select nnon-existing edges starting from A as negative samples, inwhich n is the out-degree of A. Therefore, we have the samenumber of positive and negative training samples.

The regression problem can then be solved using the stan-dard Newton-Raphson algorithm [2] to maximize the log-likelihood given the observations.

3.3.2 Performance EvaluationAfter fitting the model, given two papers, we can calculatethe probability that the later one cites the earlier one, i.e.there is a directed edge between them. Since we know thetrue edges in the network as the ground truth, we can eval-uate prediction precision, as follows:

p =#correctly predicted edges

#predicted links

which means, if our model predicts there is an edge, howlikely it is a true edge. We can also have the true edge recallas:

r =#correct predictions

#all true edges

which means, in all the true edges, how many of them havebeen found by our model.

We can first see the performance when only using the re-gressors from the paper citation network like described inSec.3.1. The results are shown in Fig.8.

After the 8 regressors described in Sec.3.2 are included, wehave improved performance shown in Fig.9.

In these two tables, each column means the model is trainedwith the sample set indicated at the top of the column, eachrow means the models are applied onto the sample sets in-dicated at the left end of the row. The upper part of eachtable is precision and the bottom part is recall.

If we look at the four experiments about the models trainedon samples sets “DBLP ’90-’95” and “DBLP ’95-’00” andapplied on these two sets, the performance has been greatlyimproved by using all regressors from multi-graph comparedwith only using the regressors on paper citation network.

When a model was trained with one sample set and appliedonto the other sample set, if we still get good results, themodel should have good generalization ability, which can beshown by the values in the off-diagonal entries.

To visualize the performance better, we have the F-1 score(harmonic mean of precision and recall) plotted in Fig.10.

The small difference between one model applied onto dif-ferent sample sets shows the generalization ability of ourmodel.

We have also included the results trained by Decision Treealgorithm, in the right half of Fig.9 and Fig.10. Gener-ally speaking, decision tree has a better performance thanregression model. However, the generalization ability is alittle weaker, and the tree structure is too complicated tobe interpreted. Therefore, we will only try to interpret themeaning of regression model in next section.

3.3.3 Citation Model InterpretationWe obtained 3 regression models trained on three differentsample sets and those models are compared in Fig.11.

Those weights have been normalized according the scale ofeach feature, so they can reflect the actual importance.

Figure 8: Precision and recall for the regression model with regressors only from paper network.

Figure 9: Precision and recall on different training sets and testing sets.

Figure 10: F1 score for the regression model and decision tree model on different training sets and testingsets.

Figure 11: Regression model trained on different dataset. The length of each bar represents the absolutevalue of weight as the number shown near the right end of each bar; the color of each circle indicates thesign of the weight: red is negative, green is positive, yellow is almost zeros. sets.

Our first observation is that, these three models, althoughtrained on different dataset, show similar patterns. Thismeans the model makes sense, and the dataset does capturethe actual citation characteristics though it’s not a completedataset.

According to the weights of each regressor, the citation pat-terns could be interpreted as:

(1) “MeanACiteB” has the largest weight: People tend tocite papers written by authors that they have cited before

(2) “Title similarity” has almost the second largest weight:Papers with similar content tend to cite each other.

(3) “In-degree of B” has large weight in Hep-Th: Popularpapers tend to get more citations in physics field (rich getsricher).

(4) “Year difference” has negative weight: People tend tocite recent papers.

(5) “Author overlap” has almost zero weight: We originallythought that researchers prefer to cite their own papers, butthe experiments told us that this effect is not significant.

4. MODIFIED PAGE RANK FOR IMPACTPREDICTION

We’d like to investigate the importance of each node (paperor author) in the citation network using the structure ofthe graph. This will enable many interesting applicationsincluding ranking the items according to impact, examiningthe correlation with popular impact indicators, discovery oftrends in the graph, etc.

Inspired by the page rank algorithm to rank web pages ac-

cording to popularity, we propose to rank the entities in themulti-layer hypergraph according to their interconnectionrelationships. This could be better than just using the nodedegree (number of citations) as a measurement of impactsince it takes into account more information.

The original page rank algorithm assumes that the rank ofa web page is the sum of contributions from pages that linkto it, discounted by their out degrees:

r (ni) =∑j→i

r (nj)

d (nj)

where j → i means entity j refers entity i.

A damping factor was further introduced to the model toimprove robustness:

r (ni) =a

N+ (1− α)

∑j→i

r (nj)

d (nj)

This problem is an eigenvector problem of a modified adja-cent matrix, which can be solved by iteration.

We cannot directly use this model for predicting importanceof authors or papers, since the the number of references of apaper is not equivalent to the number of out links of a webpage. The contribution of paper A to paper B by citing itshould be proportional to the impact of paper A, but shouldnot be inverse proportional to the number of references ofpaper A. As a result, we modify the page rank model asfollows:

r (ni) =a

N+ (1− α)

∑j→i

r (nj)

The algorithm converges very fast with our sparse matrix

Figure 12: Log-log plot of paper rank vs. normalizedcitation number.

10−6

10−4

10−2

100

10−6

10−5

10−4

10−3

10−2

10−1

Normalized Citation Number of Author

Aut

hor

Ran

k V

alue

Figure 13: Log-log plot of author rank vs. normal-ized citation number.

implementation, and it’s relatively robust in terms of pa-rameter choice. The relative ranking of the papers stay al-most the same was we vary the parameter α in the rangebetween 0.1 to 0.4.

We compared the proposed paper rank with the normalizedcitation number of the investigated papers, as shown in Fig-ure 12. We can see that the two criterion correlates well.

We also carried out the experiment on the citation networkof the authors, the proposed author rank are plotted againstthe normalized citation number of the authors, and shownin Figure 13. We can see that the two criterion correlateswell, too.

Lastly, we compared the proposed author rank with the h-score [5], which is a popular indicator for author impact. Anauthor has h-score h if he/she has at least h papers that arecited at least h times each. We can see from figure 14 thatthe proposed rank correlates with h-score well.

We can also see from the figures that the correlation behav-ior looks different for different indicators, which means thatmaybe combining indicators will give better performance.

100

101

102

10−6

10−5

10−4

10−3

10−2

10−1

h−Score of Author

Aut

hor

Ran

k V

alue

Figure 14: Log-log plot of author rank vs. h-score.

4.1 What can we infer from the citation net-work

We have already investigated the static property of the ci-tation network, mostly in the macro scale. Furthermore, wecan model the time varying properties of the graph. Morespecifically, we can look at the evolution of the citation fora specific paper over time, or look at how new scientificdiscoveries (e. g., SVM, Compressive Sensing, FFT, etc.)propagate in the academic community.

Take “database” for example, we plot in Figure 15 the num-ber of papers with the word “database” in the title for dif-ferent years, and also the average impact per paper for eachyear. We can see that although there is an exponentialgrowth in the number of papers on the topic of “database”,the actual impact per paper is diminishing after the founda-tional papers were published around 1976. The three papersin 1976 that have highest impact are:

1. System R: Relational Approach to Database Manage-ment.

2. The Notions of Consistency and Predicate Locks in aDatabase System.

3. Differential Files: Their Application to the Mainte-nance of Large Databases.

They’ve received 960, 1750, 259 citations until now accord-ing to Google Scholar search.

5. ACKNOWLEDGMENTSThe Proximity DBLP database is based on data from theDBLP Computer Science Bibliography with additional prepa-ration performed by the Knowledge Discovery Laboratory,University of Massachusetts Amherst.

The authors would like to thank Professor Leskovec and theTAs for their kind assistant and valuable discussion.

6. REFERENCES[1] V. Batagelj. Efficient algorithms for citation network

analysis. Arxiv preprint cs/0309023, pages 1–27, 2003.

1975 1980 1985 1990 1995 2000 2005 20100

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Papers with ’database’ in titleImpact per paper

Figure 15: Trend of the word“database” in the titlesof the papers.

[2] C. M. Bishop. Pattern Recognition and MachineLearning. Springer, 2006.

[3] A. Clauset, C. R. Shalizi, and M. E. J. Newman.Power-law distributions in empirical data. 2007.

[4] D. Easley and J. Kleinberg. Networks, Crowds, andMarkets: Reasoning About a Highly Connected World.Cambridge University Press, 2010.

[5] L. Egghe. Theory and practise of the g-index.Scientometrics, 69:131–152, 2006.

[6] B. H. Hall, A. B. Jaffe, and M. Trajtenberg. The nberpatent citation data file: Lessons, insights andmethodological tools. NBER Working Paper 8498,2001.

[7] J. E. Hirsch. An index to quantify an individual’sscientific research output. Proceedings of the NationalAcademy of Sciences of the United States of America,102(46):pp. 16569–16572, 2005.

[8] N. Hummon and P. Doreian. Connectivity in acitation network: The development of dna theory.Social Networks, 11:39–63, 1989.

[9] Y. Kajikawa, J. Ohno, Y. Takeda, K. Matsushima,and H. Komiyama. Creating an academic landscape ofsustainability science: an analysis of the citationnetwork. Sustainability Science, 2(2):221–231, July2007.

[10] E. a. Leicht, G. Clarkson, K. Shedden, and M. E.Newman. Large-scale structure of time evolvingcitation networks. The European Physical Journal B,59(1):75–83, Oct. 2007.

[11] J. Leskovec and E. Horvitz. Planetary-scale views on alarge instant-messaging network. In WWW ’08:Proceeding of the 17th international conference onWorld Wide Web, pages 915–924, New York, NY,USA, 2008. ACM.

[12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos,J. VanBriesen, and N. Glance. Cost-effective outbreakdetection in networks. In KDD ’07: Proceedings of the13th ACM SIGKDD international conference onKnowledge discovery and data mining, pages 420–429,New York, NY, USA, 2007. ACM.

[13] J. Leskovec, M. Mcglohon, C. Faloutsos, N. Glance,and M. Hurst. Cascading behavior in large blog

graphs. In In SDM, 2007.

[14] G. A. Miller. Wordnet: A lexical database for english.Communications of the ACM, 38(11):39–41, 1995.

[15] X. Shi, J. Leskovec, and D. A. McFarland. Citing forhigh impact. In JCDL ’10: Proceedings of the 10thannual joint conference on Digital libraries, pages49–58, New York, NY, USA, 2010. ACM.

Citation Networks as a Multi-layer Graph: Link Prediction and

Documents

Transcript of Citation Networks as a Multi-layer Graph: Link Prediction and