
arXiv:1403.7315v1 [cs.IR] 28 Mar 2014

HRank: A Path based Ranking Framework in Heterogeneous Information Network

Yitong Li, Chuan Shi, Philip S. Yu, and Qing Chen

ABSTRACT

Recently, there has been a surge of interest in heterogeneous information network analysis. As a newly emerging network model, heterogeneous information networks have many unique features (e.g., complex structure and rich semantics), and a number of interesting data mining tasks have been explored in this kind of network, such as similarity measure, clustering, and classification. Although evaluating the importance of objects has been well studied in homogeneous networks, it has not yet been explored in heterogeneous networks. In this paper, we study the ranking problem in heterogeneous networks and propose the HRank framework to evaluate the importance of multiple types of objects and meta paths. Since the importance of objects depends upon the meta paths in heterogeneous networks, HRank develops a path based random walk process. Moreover, a constrained meta path is proposed to subtly capture the rich semantics in heterogeneous networks. Furthermore, HRank can simultaneously determine the importance of objects and meta paths by applying tensor analysis. Extensive experiments on three real datasets show that HRank can effectively evaluate the importance of objects and paths together. Moreover, the constrained meta path shows its potential for mining subtle semantics by obtaining more accurate ranking results.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

General Terms

Algorithm

Keywords

heterogeneous information network, rank, random walk, tensor analysis

1. INTRODUCTION

Evaluating object importance or popularity is an important research problem, and the result can be used in many data mining tasks. Many methods have been developed to evaluate object importance, such as PageRank [13], HITS [7], and SimRank [4]. In this literature, object ranking is done in a homogeneous network in which objects or relations are of the same type. For example, both PageRank and HITS rank web pages in the WWW.

Y.T. Li and C. Shi are with Beijing University of Posts and Telecommunications, Beijing, China. E-mail: [email protected], [email protected]. P.S. Yu is with University of Illinois at Chicago, IL, USA. E-mail: [email protected]. Q. Chen is with China Mobile Communications Corporation, Beijing, China. E-mail: [email protected].

However, in many real networks, there are many different types of objects and relations, which can be organized as a heterogeneous network. Formally, Heterogeneous Information Networks (HIN) are logical networks involving multiple types of objects as well as multiple types of links denoting different relations [3]. For example, movie recommendation data include multiple types of objects (movies, actors, and directors) and their relations [14]. It is clear that heterogeneous information networks are ubiquitous and form a critical component of modern information infrastructure [3]. Recently, many data mining tasks have been explored in this kind of network, such as similarity measure [16, 14], clustering [19, 17], and classification [5], among which ranking is an important but not yet explored task.

Figure 1: A heterogeneous information network example on bibliographic data. (a) Heterogeneous network: shows heterogeneous objects and their relations. (b) Network schema: shows the network schema.

Figure 1(a) shows a HIN example on bibliographic data and Figure 1(b) illustrates its network schema. This example contains four types of objects: papers (P), authors (A), labels (L, categories of papers), and conferences (C). There are links connecting different types of objects. The link types are defined by the relations between two object types. For example, links exist between authors and papers denoting the writing or written-by relations, and between conferences and papers denoting the publishing or published-in relations. In this network, several interesting, yet seldom explored, ranking problems can be posed.

• One may be interested in the importance of one type of objects, and ask the following questions:
Q. 1.1 Who are the most influential authors?
Q. 1.2 Who are the most influential authors in the data mining field?


• As we know, some object types have an effect on each other. For example, influential authors usually publish papers in reputable conferences. So one may pay attention to the importance of multiple types of objects simultaneously, and ask the following questions:
Q. 2.1 Which are the most influential authors and reputable conferences?
Q. 2.2 Which are the most influential authors and reputable conferences in the data mining field?

• Furthermore, since the importance of objects is affected by many factors, one may wonder which factor affects it most, and ask a question like this:
Q. 3 Who are the most influential authors and which factor makes an author most influential?

Although the ranking problem in homogeneous networks has been well studied, the above ranking problems are unique to HIN (especially Q. 2 and Q. 3) and have seldom been studied until now. Since there are multiple types of objects in HIN, it is possible to analyze the importance of multiple types of objects (i.e., Q. 2) as well as the affecting factors (i.e., Q. 3) together. However, ranking analysis in HIN also faces the following research challenges.

• There are different types of objects and links in HIN. If we simply treat all objects equally and apply a random walk as PageRank does in a homogeneous network, the ranking result will mix different types of objects together. It is then difficult to distinguish the importance of a specific type of objects. Moreover, the result is obviously biased toward object types that have many links.

• Different types of objects and links in heterogeneous networks carry different semantic meanings. Random walks along different meta paths have different semantics, which may lead to different ranking results. Here a meta path [16] means a sequence of relations between object types. For example, the importance of authors should differ between the co-author relation (i.e., the meta path “Author-Paper-Author”) and the co-conference relation (i.e., the meta path “Author-Paper-Conference-Paper-Author”). So a desirable ranking method in HIN should be path-dependent: it should capture the semantics under specific meta paths and return different values based on specific meta paths.

In this paper, we study the ranking problem in HIN and propose a novel ranking framework, HRank, to evaluate the importance of multiple types of objects and meta paths in HIN. For Q. 1 and Q. 2, a path based random walk model is proposed to evaluate the importance of single or multiple types of objects. That is, the random walk process in HIN should follow a meta path designated beforehand. The different meta paths connecting two types (same or different) of objects have different semantics and transition probabilities, and thus lead to different random walk processes and ranking results. Although the meta path has been widely used to capture the semantics in HIN [14, 16], it depicts object relations only coarsely. By employing the meta path, we can answer Q. 1.1 and Q. 2.1, but cannot answer Q. 1.2 and Q. 2.2. For example, “Author-Paper-Author” describes the collaboration relation among authors. However, it cannot depict the fact that Philip S. Yu and Jiawei Han have many collaborations in the data mining field but seldom collaborate in the information retrieval field. In order to overcome this shortcoming of the meta path, we propose the constrained meta path concept, which can effectively describe this kind of subtle semantics. The constrained meta path sets constraint conditions on a meta path. By adopting the constrained meta path, we can answer Q. 1.2 and Q. 2.2.

In HIN, there are many constrained meta paths connecting two types of objects. Based on different paths, the objects have different ranking values. The comprehensive importance of objects should consider all kinds of factors (the factors can be embodied by constrained meta paths). Moreover, these constrained meta paths contribute differently to the importance of objects. For example, although Jiawei Han and W. Bruce Croft are both influential authors in computer science, it is their achievements in the data mining and information retrieval fields, respectively, that contribute to their reputation. In order to evaluate the importance of objects and meta paths simultaneously (i.e., to answer Q. 3), we further propose a co-ranking method which organizes the relation matrices of objects on different constrained meta paths as a tensor. A random walk process is designed on this tensor to co-rank the importance of objects and paths simultaneously. That is, random walkers surf in the tensor, and the stationary visiting probabilities of objects and meta paths are taken as the HRank scores of objects and paths. In addition, in order to accelerate the matrix multiplication process in HRank, we design three fast computation strategies whose effectiveness has been validated by experiments.

In all, this paper has the following three contributions.

• We propose the constrained meta path concept to describe subtle semantic relations in HIN. Compared to the meta path, the constrained meta path can depict object relations at a finer granularity by setting constraint conditions on the meta path.

• We propose a path-based ranking method to evaluate the importance of the same or different types of objects in HIN by setting a constrained meta path. Furthermore, we propose a co-ranking method to simultaneously evaluate the importance of objects and paths. The method not only comprehensively evaluates the importance of objects by considering all constrained meta paths, but also ranks the contributions of different constrained meta paths.

• Extensive experiments evaluate the proposed method. We not only validate that objects have different importance under different constrained meta paths, but also show that the ranking results under constrained meta paths agree better with common sense. In addition, the experimental results illustrate that the proposed method can accurately identify the importance of objects and the corresponding paths.

The rest of the paper is organized as follows. In Section 2, we summarize and compare the related work. In Section 3, we describe the notation used in this paper and some preliminary knowledge. In Section 4, we present the proposed method, and the fast computation strategies are introduced in Section 5. Extensive experiments validating the proposed method are reported in Section 6. Finally, Section 7 concludes this paper.

2. RELATED WORK

Ranking is an important data mining task in network analysis, and many ranking methods have been proposed. For example, PageRank [13] evaluates the importance of objects through a random walk process; HITS [7] ranks objects using authority and hub scores; SimRank [4] evaluates the similarity of two objects by their neighbors’ similarities. The recently proposed RoleSim measures the role similarity between any two nodes from the network structure [6]. These approaches only consider objects of the same type in homogeneous networks, so they cannot be applied to heterogeneous networks. Some studies have begun to pay attention to co-ranking on multiple types of objects. For example, Zhou et al. [21] co-rank authors and their publications by coupling two random walk processes, and co-HITS [2] incorporates the bipartite graph with content information and relevance constraints. Although these methods can rank different types of objects existing in HIN, they are restricted to bipartite graphs. Recently, MultiRank [11] determines the importance of both objects and relations simultaneously for multi-relational data, and HAR [10] is proposed to determine hub and authority scores of objects and relevance scores of relations in multi-relational data for query search. These two methods focus on objects of the same type with multiple relations, not on multiple types of objects.

In recent years, there has been a surge of HIN analysis. Many data mining tasks have been explored in HIN, such as similarity measure [14, 16], clustering [17, 19, 18], and classification [8]. As a unique feature of HIN, the links connecting different types of objects contain semantics. So the meta path [16], which connects object types via a sequence of relations, has been widely used to capture relation semantics. Sun et al. [16] put forward the concept of meta path to describe rich semantic relations, and studied similarity search on symmetric meta paths. As an extension of Sun’s work, Yu et al. [20] use a meta-path-based ranking model ensemble to represent semantic meanings for similarity queries. HeteSim is also proposed by Shi et al. [14] to measure the relevance scores of heterogeneous objects in HIN. PathSelClus [18] integrates meta path selection with user-guided clustering to cluster objects in networks. Kong et al. [8] develop an HCC solution to assign labels to a group of instances that are interconnected through different meta paths. Although the meta path may convey semantic information in HIN, it is too coarse to capture the subtle semantics in some applications.

3. PRELIMINARY

In this section, we describe the notation used in this paper and present some preliminary knowledge.

A heterogeneous information network is a special type of information network whose underlying data structure is a directed graph that contains either multiple types of objects or multiple types of links.

DEFINITION 1. Information Network [16]. Given a schema $S = (\mathcal{A}, \mathcal{R})$ which consists of a set of entity types $\mathcal{A} = \{A\}$ and a set of relations $\mathcal{R} = \{R\}$, an information network is defined as a directed graph $G = (V, E)$ with an object type mapping function $\phi : V \to \mathcal{A}$ and a link type mapping function $\psi : E \to \mathcal{R}$. Each object $v \in V$ belongs to one particular object type $\phi(v) \in \mathcal{A}$, and each link $e \in E$ belongs to a particular relation $\psi(e) \in \mathcal{R}$. When the number of object types $|\mathcal{A}| > 1$ or the number of relation types $|\mathcal{R}| > 1$, the network is called a heterogeneous information network; otherwise, it is a homogeneous information network.

In heterogeneous information networks, there are multiple object types and relation types. We use the network schema to depict the object types and the relations existing among object types. For a relation $R$ existing from type $S$ to type $T$, denoted as $S \xrightarrow{R} T$, $S$ and $T$ are the source type and target type of relation $R$, which are denoted as $R.S$ and $R.T$, respectively. The inverse relation $R^{-1}$ holds naturally for $T \xrightarrow{R^{-1}} S$. Generally, $R$ is not equal to $R^{-1}$, unless $R$ is symmetric and these two types are the same. Figure 1(b) shows a network schema of a bibliographic information network, which describes the object types and their relations in the HIN.

Unlike in homogeneous networks, two objects in a heterogeneous network can be connected via different paths, and these paths have different meanings. As shown in Figure 1(b), authors can be connected via the “Author-Paper-Author” (APA) path, the “Author-Paper-Conference-Paper-Author” (APCPA) path, and so on. These paths are called meta paths, which are defined as follows.

DEFINITION 2. Meta path [16]. A meta path $\mathcal{P}$ is a path defined on a schema $S = (\mathcal{A}, \mathcal{R})$, and is denoted in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \ldots A_{l+1}$), which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations.

The semantics underlying these paths are clearly different. The APA path means authors collaborating on the same papers (i.e., the co-author relation), while the APCPA path means authors publishing papers in the same conferences (i.e., the co-conference relation). Different meta paths induce different relation networks, which may result in different importance of objects. For example, the importance of authors under the APA path is biased toward authors who write many papers with many co-authors, while the importance of authors under the APCPA path emphasizes authors who publish many papers in productive conferences. So the importance of objects depends on the meta path in heterogeneous networks. As an effective method for capturing semantics, the meta path has been widely used in many data mining tasks in HIN, such as similarity measure [14, 16], clustering [18], and classification [8]. However, the meta path fails to capture some subtle semantics. Taking Figure 1(b) as an example, APA cannot reveal the co-author relations in the Data Mining (DM) field.

In order to overcome the shortcomings of the meta path, we propose the concept of constrained meta path, defined as follows.

DEFINITION 3. Constrained meta path. A constrained meta path is a meta path based on a certain constraint, which is denoted as $\mathcal{CP} = \mathcal{P}|\mathcal{C}$. $\mathcal{P} = (A_1 A_2 \ldots A_l)$ is a meta path, while $\mathcal{C}$ represents the constraint on the objects in the meta path.

Note that $\mathcal{C}$ can be one or multiple constraint conditions on objects. Taking Figure 1(b) as an example, the constrained meta path APA|P.L = “DM” represents the co-author relations of authors in the data mining field by constraining the label of papers to DM. Similarly, the constrained meta path APCPA|P.L = “DM” && C = “CIKM” represents the co-conference relations of authors whose data mining papers are published in the CIKM conference. Compared to the meta path, the constrained meta path conveys richer semantics by subdividing meta paths under distinct conditions. In particular, when the length of the meta path is 1 (i.e., a relation), the constrained meta path degrades to a constrained relation. In other words, a constrained relation places conditions on the objects of the relation.

For a relation $A \xrightarrow{R} B$, we can obtain its transition probability matrix as follows.

DEFINITION 4. Transition probability matrix. $W_{AB}$ is an adjacency matrix between types $A$ and $B$ on relation $A \xrightarrow{R} B$. $U_{AB}$ is the normalized matrix of $W_{AB}$ along the row vector, which is the transition probability matrix of $A \xrightarrow{R} B$.

Then we place some constraints on the objects of the relation $A \xrightarrow{R} B$ (i.e., a constrained relation), which gives the following definition.

DEFINITION 5. Constrained transition probability matrix. $W_{AB}$ is an adjacency matrix between types $A$ and $B$ on relation $A \xrightarrow{R} B$. Suppose there is a constraint $\mathcal{C}$ on object type $A$. The constrained transition probability matrix $U'_{AB}$ of the constrained relation $R|\mathcal{C}$ is $U'_{AB} = M_C U_{AB}$, where $M_C$ is the constraint matrix generated by the constraint condition $\mathcal{C}$ on object type $A$.

The constraint matrix $M_C$ is usually a diagonal matrix whose dimension is the number of objects of type $A$. A diagonal element is 1 if the corresponding object satisfies the constraint, and 0 otherwise. Similarly, we can place the constraint on object type $B$ or on both types. Note that the transition probability matrix is a special case of the constrained transition probability matrix, obtained when the constraint matrix $M_C$ is the identity matrix $I$.
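Definitions 4 and 5 can be illustrated with a minimal Python sketch. It assumes dense nested lists and a toy author-paper adjacency matrix; the data and the helper names (`row_normalize`, `diag_constraint`, `matmul`) are illustrative and do not come from the paper. A constraint on the target type is applied by right-multiplication, which is an interpretation of "similarly, we can place the constraint on object type B".

```python
def row_normalize(W):
    """Divide each row by its sum; all-zero rows are left as zeros (assumption)."""
    out = []
    for row in W:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def diag_constraint(flags):
    """Diagonal 0/1 constraint matrix M_C built from per-object booleans."""
    n = len(flags)
    return [[1.0 if (i == j and flags[i]) else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    """Plain dense matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Hypothetical adjacency W_AP: 2 authors x 3 papers.
W_AP = [[1, 1, 0],
        [0, 1, 1]]
U_AP = row_normalize(W_AP)            # transition probability matrix (Def. 4)

# Constraint on the source type A: only author 0 satisfies it (Def. 5).
M_A = diag_constraint([True, False])
U_AP_src = matmul(M_A, U_AP)          # U'_AP = M_C U_AP

# Constraint on the target type P: papers 0 and 1 carry the label "DM".
M_P = diag_constraint([True, True, False])
U_AP_tgt = matmul(U_AP, M_P)          # constraint applied on the right
```

Note that rows of a constrained matrix need not sum to 1, since probability mass leading to objects that violate the constraint is dropped.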

Given a network $G = (V, E)$ following a network schema $S = (\mathcal{A}, \mathcal{R})$, we can define the meta path based reachable probability matrix as follows.

DEFINITION 6. Meta path based reachable probability matrix. For a meta path $\mathcal{P} = (A_1 A_2 \cdots A_{l+1})$, the meta path based reachable probability matrix $PM$ is defined as $PM_{\mathcal{P}} = U_{A_1 A_2} U_{A_2 A_3} \cdots U_{A_l A_{l+1}}$. $PM_{\mathcal{P}}(i, j)$ represents the probability of object $i \in A_1$ reaching object $j \in A_{l+1}$ under the path $\mathcal{P}$.

Similarly, we have the following definition for the constrained meta path.

DEFINITION 7. Constrained meta path based reachable probability matrix. For a constrained meta path $\mathcal{CP} = (A_1 A_2 \cdots A_{l+1}|\mathcal{C})$, the constrained meta path based reachable probability matrix is defined as $PM_{\mathcal{CP}} = U'_{A_1 A_2} U'_{A_2 A_3} \cdots U'_{A_l A_{l+1}}$. $PM_{\mathcal{CP}}(i, j)$ represents the probability of object $i \in A_1$ reaching object $j \in A_{l+1}$ under the constrained meta path $\mathcal{P}|\mathcal{C}$.

In fact, if there is no constraint on the objects of a relation $A_i \xrightarrow{R} A_{i+1}$, $U'_{A_i A_{i+1}}$ is equal to $U_{A_i A_{i+1}}$. If there is a constraint on the objects, we only consider the objects that satisfy the constraint. For simplicity, in the following sections we use the term reachable probability matrix and the symbol $M_{\mathcal{P}}$ to denote the constrained meta path based reachable probability matrix.
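As a hedged illustration of Definitions 6 and 7, the sketch below multiplies transition matrices along the path APA and along the constrained path APA|P.L = “DM”. The adjacency data, the choice of which papers are labeled DM, and the helper names are assumptions made for this example only.

```python
from functools import reduce

def row_normalize(W):
    """Row-normalize a nested-list matrix; zero rows stay zero (assumption)."""
    return [[v / sum(r) if sum(r) else 0.0 for v in r] for r in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def reachable_matrix(factors):
    """Left-to-right product of the (constrained) transition matrices."""
    return reduce(matmul, factors)

# Hypothetical author-paper adjacency: 2 authors x 3 papers.
W_AP = [[1, 1, 0],
        [0, 1, 1]]
U_AP = row_normalize(W_AP)
U_PA = row_normalize(transpose(W_AP))

# Unconstrained path APA (Def. 6): PM = U_AP U_PA.
PM_APA = reachable_matrix([U_AP, U_PA])

# Constrained path APA|P.L = "DM" (Def. 7); papers 0 and 1 are assumed to be DM.
M_P = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
PM_APA_DM = reachable_matrix([U_AP, M_P, M_P, U_PA])
```

In this toy case the second author loses the mass that flowed through the non-DM paper, so the second row of `PM_APA_DM` sums to 0.5 rather than 1.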

4. THE HRANK METHOD

Since the importance of objects is related to the meta path designated by users, we propose the path based ranking method HRank for heterogeneous networks. In order to answer the three ranking problems proposed in Section 1, we design three versions of HRank, respectively.

Figure 2: An example of the computation process of HRank. The blue and red broken lines represent the processes on the symmetric and asymmetric constrained meta paths, respectively.

4.1 Ranking based on symmetric constrained meta paths

In order to evaluate the importance of one type of objects (i.e., Q. 1), we design the HRank-SY method based on symmetric constrained meta paths, since the constrained meta paths connecting one type of objects are usually symmetric, such as APA|P.L = “DM”.

For a symmetric constrained meta path $\mathcal{P} = (A_1 A_2 \ldots A_l|\mathcal{C})$, $\mathcal{P}$ is equal to $\mathcal{P}^{-1}$, and $A_1$ and $A_l$ are the same. Similar to PageRank [13], the importance evaluation of objects of type $A_1$ (i.e., $A_l$) can be considered as a random walk process in which random walkers wander from type $A_1$ to type $A_l$ along the path $\mathcal{P}$. The HRank value of $A_1$, denoted $R(A_1|\mathcal{P})$, is the stable visiting probability of random walkers, which is defined as follows:

$$R(A_1|\mathcal{P}) = \alpha R(A_1|\mathcal{P}) M_{\mathcal{P}} + (1 - \alpha) E \tag{1}$$

where $M_{\mathcal{P}}$ is the constrained meta path based reachable probability matrix as defined above. $E$ is the restart probability for convergence, which is set equally for all objects. $\alpha$ is the weight of restart, which can be set to 0.15 as PageRank does. HRank-SY and PageRank share the idea that the importance of objects is decided by the visiting probability of random surfers. Unlike in PageRank, the random surfers in HRank-SY must wander along the constrained meta path to visit objects.

As shown in Figure 2, the red broken line illustrates an example of the process of calculating the rank value, where $\mathcal{CP}$ is APA|P.L = “DM”. The concrete calculation process is as follows.

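$$\begin{aligned}
R(\text{Author}|\mathcal{CP}) &= \alpha R(\text{Author}|\mathcal{CP}) M_{\mathcal{CP}} + (1 - \alpha) E \\
M_{\mathcal{CP}} &= U'_{AP} U'_{PA} = U_{AP} M_P M_P U_{PA}
\end{aligned} \tag{2}$$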

where $M_P$ is the constraint matrix on object type P (paper).
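The fixed point of Eq. 1 can be approximated by simple power iteration. The following is a minimal sketch rather than the authors' implementation: the stopping rule (L1 change below `tol`), the iteration cap, the uniform starting vector, and the toy reachable probability matrix are all assumptions. The value `alpha=0.15` follows the setting suggested in the text.

```python
def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def hrank_sy(M, alpha, tol=1e-10, max_iter=1000):
    """Iterate R <- alpha * R M + (1 - alpha) * E until R stops changing."""
    n = len(M)
    E = [1.0 / n] * n                 # uniform restart vector
    R = E[:]                          # assumed starting point
    for _ in range(max_iter):
        walked = vecmat(R, M)
        R_new = [alpha * w + (1 - alpha) * e for w, e in zip(walked, E)]
        if sum(abs(a - b) for a, b in zip(R_new, R)) < tol:
            return R_new
        R = R_new
    return R

# Hypothetical row-stochastic reachable probability matrix over 2 authors.
M_APA = [[0.5, 0.5],
         [0.25, 0.75]]
ranks = hrank_sy(M_APA, alpha=0.15)
```

When the matrix is row-stochastic, the scores keep summing to 1 across iterations; with a constrained matrix whose rows sum to less than 1, they would not, and a final renormalization might be desirable.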

4.2 Ranking based on asymmetric constrained meta paths

For question Q. 2, we propose the HRank-AS method based on asymmetric constrained meta paths, since the paths connecting different types of objects are asymmetric. For an asymmetric constrained meta path $\mathcal{P} = (A_1 A_2 \ldots A_l|\mathcal{C})$, $\mathcal{P}$ is not equal to $\mathcal{P}^{-1}$. Note that $A_1$ and $A_l$ may be of the same type or of different types, as in APC|P.L = “DM” and PCPLP|C = “CIKM”.

Similarly, HRank-AS is also based on a random walk process in which random walkers wander between $A_1$ and $A_l$ along the path. The ranks of $A_1$ and $A_l$ can be seen as the visiting probabilities of walkers, which are defined as follows:

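$$\begin{aligned}
R(A_l|\mathcal{P}^{-1}) &= \alpha R(A_1|\mathcal{P}) M_{\mathcal{P}} + (1 - \alpha) E_{A_l} \\
R(A_1|\mathcal{P}) &= \alpha R(A_l|\mathcal{P}^{-1}) M_{\mathcal{P}^{-1}} + (1 - \alpha) E_{A_1}
\end{aligned} \tag{3}$$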

where $M_{\mathcal{P}}$ and $M_{\mathcal{P}^{-1}}$ are the reachable probability matrices of paths $\mathcal{P}$ and $\mathcal{P}^{-1}$. $E_{A_1}$ and $E_{A_l}$ are the restart probabilities of $A_1$ and $A_l$. Obviously, HRank-SY is a special case of HRank-AS: when the path $\mathcal{P}$ is symmetric, Eq. 3 is the same as Eq. 1.

The blue broken line in Figure 2 illustrates an example which simultaneously evaluates the importance of authors and conferences. Here $\mathcal{CP}$ is APC|P.L = “DM”. The concrete calculation process is as follows.

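$$\begin{aligned}
R(\text{Conf.}|\mathcal{CP}) &= \alpha R(\text{Aut.}|\mathcal{CP}) M_{\mathcal{CP}} + (1 - \alpha) E_{\text{Conf.}} \\
R(\text{Aut.}|\mathcal{CP}) &= \alpha R(\text{Conf.}|\mathcal{CP}) M_{\mathcal{CP}^{-1}} + (1 - \alpha) E_{\text{Aut.}} \\
M_{\mathcal{CP}} &= U'_{AP} U'_{PC} = U_{AP} M_P M_P U_{PC} \\
M_{\mathcal{CP}^{-1}} &= U'_{CP} U'_{PA} = U_{CP} M_P M_P U_{PA}
\end{aligned} \tag{4}$$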

where $M_P$ is the constraint matrix on object type P (paper).
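The two coupled updates of Eq. 3 can be iterated alternately. The sketch below is one possible reading, under stated assumptions: the second update uses the freshly computed target ranks (the text does not specify the update order), the restart vectors are uniform, and the author-conference matrices are made-up row-stochastic examples. The name `hrank_as` and its parameters are hypothetical.

```python
def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def hrank_as(M_fwd, M_bwd, alpha, tol=1e-10, max_iter=1000):
    """Alternate the two updates of Eq. 3 until both rank vectors stabilize.

    M_fwd is the reachable matrix of the path (A_1 -> A_l),
    M_bwd the reachable matrix of the inverse path (A_l -> A_1).
    """
    n1, nl = len(M_fwd), len(M_bwd)
    E1, El = [1.0 / n1] * n1, [1.0 / nl] * nl   # uniform restart vectors
    R1, Rl = E1[:], El[:]
    for _ in range(max_iter):
        Rl_new = [alpha * w + (1 - alpha) * e
                  for w, e in zip(vecmat(R1, M_fwd), El)]
        # Assumption: reuse the updated target ranks in the second equation.
        R1_new = [alpha * w + (1 - alpha) * e
                  for w, e in zip(vecmat(Rl_new, M_bwd), E1)]
        change = (sum(abs(a - b) for a, b in zip(R1_new, R1))
                  + sum(abs(a - b) for a, b in zip(Rl_new, Rl)))
        R1, Rl = R1_new, Rl_new
        if change < tol:
            break
    return R1, Rl

# Hypothetical matrices for the path APC and its inverse CPA.
M_APC = [[1.0, 0.0], [0.5, 0.5]]     # authors -> conferences
M_CPA = [[0.75, 0.25], [0.0, 1.0]]   # conferences -> authors
author_rank, conf_rank = hrank_as(M_APC, M_CPA, alpha=0.15)
```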

4.3 Co-ranking for objects and relations in HIN

So far, we have presented methods to rank the same or different types of objects under a certain constrained meta path. However, there are many constrained meta paths in heterogeneous networks. Automatically determining the importance of paths is an important issue [16, 18], since it is usually hard to identify which relation is more important in real applications. To solve this problem (i.e., Q. 3), we propose HRank-CO to co-rank the importance of objects and relations. The basic idea rests on the intuition that important objects are connected to many other objects through a number of important relations, and important relations connect many important objects. So we organize the multiple relation networks as a tensor, and a random walk process is designed on this tensor. The method not only comprehensively evaluates the importance of objects by considering all constrained meta paths, but also ranks the contributions of different constrained meta paths.

In Figure 3(a), we show an example of multiple relations among objects, generated by multiple meta paths. There are three objects of type $A$, three objects of type $B$, and three types of relations among them. These relations are generated by three constrained meta paths with type $A$ as the source type and type $B$ as the target type. To describe the multiple relations among objects, we use a tensor, which is a multidimensional array. We call $\mathcal{X} = (x_{i,j,k})$ a 3rd order tensor, where $x_{i,j,k} \in \mathbb{R}$, for $i = 1, \cdots, m$, $j = 1, \cdots, l$, $k = 1, \cdots, n$. $x_{i,j,k}$ represents the number of times that object $i$ is related to object $k$ through the $j$th constrained meta path. For example, Figure 3(b) is a three-way array, where each two dimensional slice represents an adjacency matrix for a single relation. So the data can be represented as a tensor of size $3 \times 3 \times 3$. In the multi-relational network, we define the transition probability tensor to represent the transition probabilities among objects and relations.

Figure 3: An example of multi-relations of objects generated by multiple paths: (a) Multiple relations: the graph representation; (b) Tensor representation: the corresponding tensor representation.

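DEFINITION 8. Transition probability tensor. In a multi-relational network, $\mathcal{X}$ is the tensor representing the network. $\mathcal{F}$ is the normalized tensor of $\mathcal{X}$ along the column vector. $\mathcal{R}$ is the normalized tensor of $\mathcal{X}$ along the tube vector. $\mathcal{T}$ is the normalized tensor of $\mathcal{X}$ along the row vector. $\mathcal{F}$, $\mathcal{R}$, and $\mathcal{T}$ are called the transition probability tensors, which can be denoted as follows:

$$\begin{aligned}
f_{i,j,k} &= \frac{x_{i,j,k}}{\sum_{i=1}^{m} x_{i,j,k}}, \quad i = 1, 2, \ldots, m \\
r_{i,j,k} &= \frac{x_{i,j,k}}{\sum_{j=1}^{l} x_{i,j,k}}, \quad j = 1, 2, \ldots, l \\
t_{i,j,k} &= \frac{x_{i,j,k}}{\sum_{k=1}^{n} x_{i,j,k}}, \quad k = 1, 2, \ldots, n
\end{aligned} \tag{5}$$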

$f_{i,j,k}$ can be interpreted as the probability of object $i$ (of type $A$) being the visiting object when relation $j$ is used and the current object being visited is object $k$ (of type $B$); $r_{i,j,k}$ represents the probability of using relation $j$ given that object $k$ is visited from object $i$; and $t_{i,j,k}$ can be interpreted as the probability of object $k$ being visited, given that object $i$ is currently the visiting object and relation $j$ is used. The meaning of these three tensors can be defined formally as follows:

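$$\begin{aligned}
f_{i,j,k} &= \mathrm{Prob}(X_t = i \mid Y_t = j, Z_t = k) \\
r_{i,j,k} &= \mathrm{Prob}(Y_t = j \mid X_t = i, Z_t = k) \\
t_{i,j,k} &= \mathrm{Prob}(Z_t = k \mid X_t = i, Y_t = j)
\end{aligned} \tag{6}$$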

in which $X_t$, $Z_t$ and $Y_t$ are three random variables representing, at time $t$, the object of type $A$ being visited, the object of type $B$ being visited, and the relation being used, respectively.
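The three normalizations in Eq. 5 can be sketched in plain Python over a nested-list tensor `X[i][j][k]`. This is an illustrative sketch: the function name `normalize`, the toy 2 x 2 x 2 tensor, and the choice to leave an all-zero fiber as zeros are assumptions (the text does not say how empty fibers are handled).

```python
def normalize(X, axis):
    """Normalize X[i][j][k] so that fibers along `axis` sum to 1.

    axis=0 sums over i (tensor F), axis=1 over j (tensor R),
    axis=2 over k (tensor T). All-zero fibers stay zero (assumption).
    """
    m, l, n = len(X), len(X[0]), len(X[0][0])
    out = [[[0.0] * n for _ in range(l)] for _ in range(m)]
    for i in range(m):
        for j in range(l):
            for k in range(n):
                if axis == 0:
                    s = sum(X[a][j][k] for a in range(m))
                elif axis == 1:
                    s = sum(X[i][b][k] for b in range(l))
                else:
                    s = sum(X[i][j][c] for c in range(n))
                out[i][j][k] = X[i][j][k] / s if s else 0.0
    return out

# Hypothetical count tensor: 2 source objects, 2 relations, 2 target objects.
X = [[[1, 2], [3, 4]],
     [[1, 0], [1, 4]]]
F = normalize(X, 0)   # f_{i,j,k}: distribution over source objects i
R = normalize(X, 1)   # r_{i,j,k}: distribution over relations j
T = normalize(X, 2)   # t_{i,j,k}: distribution over target objects k
```

For instance, `F[0][0][0]` is 1 / (1 + 1), `R[0][0][0]` is 1 / (1 + 3), and `T[0][0][0]` is 1 / (1 + 2).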

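Now, we define the stationary distributions of objects and relations as follows:

$$\begin{aligned}
x &= (x_1, x_2, \cdots, x_m)^T \\
y &= (y_1, y_2, \cdots, y_l)^T \\
z &= (z_1, z_2, \cdots, z_n)^T
\end{aligned} \tag{7}$$

in which

$$\begin{aligned}
x_i &= \lim_{t \to \infty} \mathrm{Prob}(X_t = i) \\
y_j &= \lim_{t \to \infty} \mathrm{Prob}(Y_t = j) \\
z_k &= \lim_{t \to \infty} \mathrm{Prob}(Z_t = k)
\end{aligned} \tag{8}$$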

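From the above equations, we can get:

$$\begin{aligned}
\mathrm{Prob}(X_t = i) &= \sum_{j=1}^{l} \sum_{k=1}^{n} f_{i,j,k} \times \mathrm{Prob}(Y_t = j, Z_t = k) \\
\mathrm{Prob}(Y_t = j) &= \sum_{i=1}^{m} \sum_{k=1}^{n} r_{i,j,k} \times \mathrm{Prob}(X_t = i, Z_t = k) \\
\mathrm{Prob}(Z_t = k) &= \sum_{i=1}^{m} \sum_{j=1}^{l} t_{i,j,k} \times \mathrm{Prob}(X_t = i, Y_t = j)
\end{aligned} \tag{9}$$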

where $\mathrm{Prob}(Y_t = j, Z_t = k)$ is the joint probability distribution of $Y_t$ and $Z_t$, $\mathrm{Prob}(X_t = i, Z_t = k)$ is the joint probability distribution of $X_t$ and $Z_t$, and $\mathrm{Prob}(X_t = i, Y_t = j)$ is the joint probability distribution of $X_t$ and $Y_t$.

To obtain $x_i$, $y_j$ and $z_k$, we assume that $X_t$, $Y_t$ and $Z_t$ are all independent of each other, which can be written as:

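$$\begin{aligned}
\mathrm{Prob}(X_t = i, Y_t = j) &= \mathrm{Prob}(X_t = i)\,\mathrm{Prob}(Y_t = j) \\
\mathrm{Prob}(X_t = i, Z_t = k) &= \mathrm{Prob}(X_t = i)\,\mathrm{Prob}(Z_t = k) \\
\mathrm{Prob}(Y_t = j, Z_t = k) &= \mathrm{Prob}(Y_t = j)\,\mathrm{Prob}(Z_t = k)
\end{aligned} \tag{10}$$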

Consequently, by combining the equations with the assumptions above, we get

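$$\begin{aligned}
x_i &= \sum_{j=1}^{l} \sum_{k=1}^{n} f_{i,j,k}\, y_j z_k, \quad i = 1, 2, \ldots, m, \\
y_j &= \sum_{i=1}^{m} \sum_{k=1}^{n} r_{i,j,k}\, x_i z_k, \quad j = 1, 2, \ldots, l, \\
z_k &= \sum_{i=1}^{m} \sum_{j=1}^{l} t_{i,j,k}\, x_i y_j, \quad k = 1, 2, \ldots, n.
\end{aligned} \tag{11}$$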

The equations above can be written in a tensor format:

$$x = \mathcal{F} y z, \quad y = \mathcal{R} x z, \quad z = \mathcal{T} x y \tag{12}$$

with $\sum_{i=1}^{m} x_i = 1$, $\sum_{j=1}^{l} y_j = 1$, and $\sum_{k=1}^{n} z_k = 1$.

According to the analysis above, we can design the following algorithm to co-rank the importance of objects and relations.

Algorithm 1 HRank-CO Algorithm
Input: Three tensors $\mathcal{F}$, $\mathcal{T}$ and $\mathcal{R}$, three initial probability distributions $x_0$, $y_0$ and $z_0$, and the tolerance $\epsilon$.
Output: Three stationary probability distributions $x$, $y$ and $z$.
Procedure:
    Set $t = 1$;
    repeat
        Compute $x_t = \mathcal{F} y_{t-1} z_{t-1}$;
        Compute $y_t = \mathcal{R} x_t z_{t-1}$;
        Compute $z_t = \mathcal{T} x_t y_t$;
    until $\|x_t - x_{t-1}\| + \|y_t - y_{t-1}\| + \|z_t - z_{t-1}\| < \epsilon$
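Algorithm 1 translates fairly directly into code. The sketch below is a hedged illustration, not the authors' implementation: it uses nested-list tensors, takes the norm in the stopping rule to be the L1 norm (the text does not specify it), adds an iteration cap, and runs on a made-up strictly positive 2 x 2 x 2 tensor. The helper `normalize` and the function name `hrank_co` are hypothetical.

```python
def normalize(X, axis):
    """Normalize X[i][j][k] so fibers along `axis` sum to 1 (zero fibers stay zero)."""
    m, l, n = len(X), len(X[0]), len(X[0][0])
    out = [[[0.0] * n for _ in range(l)] for _ in range(m)]
    for i in range(m):
        for j in range(l):
            for k in range(n):
                if axis == 0:
                    s = sum(X[a][j][k] for a in range(m))
                elif axis == 1:
                    s = sum(X[i][b][k] for b in range(l))
                else:
                    s = sum(X[i][j][c] for c in range(n))
                out[i][j][k] = X[i][j][k] / s if s else 0.0
    return out

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def hrank_co(F, R, T, x0, y0, z0, eps=1e-10, max_iter=1000):
    """Iterate x = F y z, y = R x z, z = T x y (Eq. 12) as in Algorithm 1."""
    m, l, n = len(F), len(F[0]), len(F[0][0])
    x, y, z = x0[:], y0[:], z0[:]
    for _ in range(max_iter):
        x_new = [sum(F[i][j][k] * y[j] * z[k]
                     for j in range(l) for k in range(n)) for i in range(m)]
        y_new = [sum(R[i][j][k] * x_new[i] * z[k]
                     for i in range(m) for k in range(n)) for j in range(l)]
        z_new = [sum(T[i][j][k] * x_new[i] * y_new[j]
                     for i in range(m) for j in range(l)) for k in range(n)]
        diff = l1(x_new, x) + l1(y_new, y) + l1(z_new, z)
        x, y, z = x_new, y_new, z_new
        if diff < eps:
            break
    return x, y, z

# Hypothetical strictly positive count tensor (2 objects x 2 relations x 2 objects).
X = [[[1, 2], [3, 4]],
     [[2, 1], [1, 4]]]
F, R, T = normalize(X, 0), normalize(X, 1), normalize(X, 2)
x, y, z = hrank_co(F, R, T, [0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
```

Because every fiber of this toy tensor is nonzero, each update preserves the total probability, so `x`, `y`, and `z` each sum to 1 throughout the iteration.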

4.4 Discussion

First we analyze the connection among the three versions of HRank. We have stated that HRank-SY is a special version of HRank-AS when the asymmetric path degrades to a symmetric path. We can also see that HRank-AS is a special version of HRank-CO. When there is only one relation in HRank-CO, generated by path $\mathcal{P}$, $\mathcal{T}$ and $\mathcal{F}$ are the transition probability matrices between types $A$ and $B$ along paths $\mathcal{P}$ and $\mathcal{P}^{-1}$ (i.e., $M_{\mathcal{P}}$ and $M_{\mathcal{P}^{-1}}$), respectively. Moreover, $\mathcal{R}$ and $y$ become 1. In this condition, Eq. 12 turns into Eq. 3 without the restart probability.
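The reduction from Eq. 12 to Eq. 3 can be written out explicitly. The following worked step is a sketch under the assumptions stated in the paragraph above (a single relation, so $l = 1$ and $y_1 = 1$), identifying the slices of $\mathcal{T}$ and $\mathcal{F}$ with $M_{\mathcal{P}}$ and $M_{\mathcal{P}^{-1}}$ as the text does; the index convention for $M_{\mathcal{P}^{-1}}(k, i)$ is an interpretation.

```latex
% Assumption: one relation only (l = 1), hence y_1 = 1.
\begin{aligned}
x_i &= \sum_{k=1}^{n} f_{i,1,k}\, z_k, & f_{i,1,k} &= M_{\mathcal{P}^{-1}}(k, i), \\
z_k &= \sum_{i=1}^{m} t_{i,1,k}\, x_i, & t_{i,1,k} &= M_{\mathcal{P}}(i, k), \\
\Longrightarrow \quad z^{T} &= x^{T} M_{\mathcal{P}}, & x^{T} &= z^{T} M_{\mathcal{P}^{-1}}.
\end{aligned}
```

Reading $x^{T}$ as $R(A_1|\mathcal{P})$ and $z^{T}$ as $R(A_l|\mathcal{P}^{-1})$, these are the two lines of Eq. 3 with $\alpha = 1$, that is, with the restart term removed.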

Next we estimate the space and time complexity of HRank. For simplicity, we assume that there are $r$ relations, $n$ objects of each type, and $t$ iterations until convergence. For HRank-CO, the space complexity is $O(rn^2)$ to store the transition probability tensor. The time complexity of HRank-CO comes from two parts: iteratively computing rank values (see Algorithm 1) and constructing the transition probability tensor (see Def. 8). The time complexity of rank computation is $O(trn^2)$. For a path of length $l$, the complexity of constructing the transition probability tensor is $O(rln^3)$. So the overall time complexity is $O(trn^2 + rln^3)$. For HRank-AS and HRank-SY, the number of relations (i.e., $r$) is 1, so their space complexity is $O(n^2)$ and their time complexity is $O(tn^2 + ln^3)$. In real applications, the relation matrices are very sparse, so the actual running time is much smaller than this theoretical bound.

5. FAST COMPUTATION STRATEGIES

According to the time complexity analysis above, the main time-consuming component of HRank lies in constructing the reachable probability matrix, with complexity $O(rln^3)$. We design three fast computation strategies to accelerate the matrix multiplication process that constructs the reachable probability matrix.

5.1 Dynamic Programming Strategy

Since matrix multiplication obeys the associative property (i.e., $(M_1 \times M_2) \times M_3$ is equal to $M_1 \times (M_2 \times M_3)$), we can design the Dynamic Programming strategy (DynP) to change the order of the matrix multiplications. The basic idea of DynP is to give small-dimensioned matrices high computation priority. For a meta path $\mathcal{P} = R_1 \circ R_2 \circ \cdots \circ R_l$, we can calculate the expected minimal computation cost by the following equation and record the computation sequence in $i$ (the minimizing split position, i.e., the arg min).

$$C(R_1 \circ \cdots \circ R_l) =
\begin{cases}
0 & l = 1 \\
|R_1.S| \times |R_1.T| \times |R_2.T| & l = 2 \\
\min\limits_{1 \le i < l} \{ C(R_1 \cdots R_i) + C(R_{i+1} \cdots R_l) + |R_1.S| \times |R_i.T| \times |R_l.T| \} & l > 2
\end{cases} \tag{13}$$

Using dynamic programming, the above equation can be solved easily with $O(l^2)$ complexity. This running time is negligible, since $l$ is much smaller than the matrix dimension. DynP does not change the relevance results, so it is an information-lossless strategy.
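The recurrence in Eq. 13 is the classic matrix-chain ordering problem, which a memoized recursion can solve. The sketch below assumes the path is described by a list `dims` with `dims[0] = |R_1.S|` and `dims[i] = |R_i.T|`; the function name `best_order`, the parenthesized output format, and the object counts for the APCPA example are illustrative assumptions.

```python
from functools import lru_cache

def best_order(dims):
    """Return (minimal multiplication cost, parenthesization) for a relation chain.

    dims has length l + 1: dims[0] = |R_1.S| and dims[i] = |R_i.T|.
    """
    l = len(dims) - 1

    @lru_cache(maxsize=None)
    def solve(a, b):
        # Sub-chain R_a ... R_b, relations indexed from 1.
        if a == b:
            return 0, f"R{a}"
        best = None
        for i in range(a, b):                      # split after R_i
            left_cost, left_expr = solve(a, i)
            right_cost, right_expr = solve(i + 1, b)
            cost = left_cost + right_cost + dims[a - 1] * dims[i] * dims[b]
            if best is None or cost < best[0]:
                best = (cost, f"({left_expr} x {right_expr})")
        return best

    return solve(1, l)

# Hypothetical sizes for the path APCPA: 1000 authors, 5000 papers, 20 conferences.
cost, order = best_order([1000, 5000, 20, 5000, 1000])
```

With these made-up sizes, multiplying the two halves first (A-P-C and C-P-A) and then joining them through the small conference dimension is far cheaper than a left-to-right product.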

5.2 Truncation Strategy

The basic hypothesis of the Truncation strategy (Trun) is that removing the probability on less important nodes does not significantly degrade performance, which has been supported by several studies [9, 1]. By keeping the matrix sparse, the truncation strategy can greatly reduce space and time consumption. We add a truncation step after each step of the matrix multiplication. In the truncation step, the probability value is set to 0 for those nodes whose values are smaller than a threshold ε. Although a static threshold is used in many methods (e.g., ref. [9]), it has the following disadvantage: it may truncate nothing for a matrix whose elements all have high probability, and it may truncate most nodes for a matrix whose elements all have low probability. Since we usually pay close attention to the top k objects in a query task, the threshold ε can be set to the top k relevance value for each search object. The value of k is dynamically adjusted as follows.

$$
k =
\begin{cases}
L & \text{if } L \le W \\
\lfloor (L - W)\beta \rfloor + W \quad (\beta \in [0, 1]) & \text{otherwise}
\end{cases}
$$

where L is the vector length and W is the number of top objects, decided by users. W and β determine the truncation level. A larger W or β leads to a larger k, which means a denser matrix. It is expensive to determine the top k relevance value for each object, so we estimate it by the top kM value of the whole matrix (M is the number of objects). However, it is also time-consuming to calculate the top kM value for the whole matrix, so it is approximated from a sample of the raw matrix, with sample ratio γ. A larger γ leads to a more accurate approximation at the cost of longer running time. In summary, the truncation strategy is an information-loss strategy. It keeps the matrix sparse with a small sacrifice in accuracy, and it needs additional time to estimate the threshold ε.
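The dynamic choice of k and the per-vector truncation can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper names `dynamic_k` and `truncate_row` are hypothetical, L is taken to be the full length of the vector, the threshold is computed exactly per row (the sampling-based estimate with ratio γ is not modelled), and ties at the threshold are all kept.

```python
import math

import numpy as np


def dynamic_k(L, W, beta):
    """Number of entries to keep for a vector of length L."""
    if L <= W:
        return L
    return math.floor((L - W) * beta) + W


def truncate_row(row, W, beta):
    """Zero every entry smaller than the k-th largest value of the row."""
    row = np.asarray(row, dtype=float)
    k = dynamic_k(row.size, W, beta)
    if k >= row.size:
        return row.copy()          # nothing to truncate
    if k <= 0:
        return np.zeros_like(row)  # degenerate setting: keep nothing
    # After partitioning, index size - k holds the k-th largest value.
    threshold = np.partition(row, row.size - k)[row.size - k]
    return np.where(row >= threshold, row, 0.0)


if __name__ == "__main__":
    print(dynamic_k(1000, 200, 0.5))
    print(truncate_row([0.1, 0.4, 0.2, 0.3], W=1, beta=0.5))
```

Whether the truncated vector should be renormalized afterwards is not specified in the text, so the sketch leaves the surviving values unchanged.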

5.3 Monte Carlo Strategy

The basic idea of the Monte Carlo method (MonC) is that, for each node u, K independent random walkers are simulated starting from u. The distribution of u is approximated by the normalized counts of the number of times the random walkers visit a node. So the reachable probability $PM_P(a, b)$ can be approximated by the normalized count of the number of times that the walkers visit the node b from a along the path P.

$$
PM_P(a, b) = \frac{\#\text{times the walkers visit } b \text{ along } P}{\#\text{walkers from } a}
$$

The number of walkers from a (i.e., K) controls the accuracy and the amount of computation. A larger K achieves a more accurate estimate at a higher time cost. An advantage of the MonC strategy is that its running time is not affected by the dimension and sparsity of the matrix. However, a high-dimensional matrix needs a larger K for high accuracy. As a sampling method, MonC is also an information-loss strategy.
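The estimator above can be sketched as a small simulation. The code is illustrative only: the function name `monte_carlo_reach` and the adjacency-list encoding of a path (one dictionary per relation) are assumptions, walkers choose uniformly among neighbours, and a walker that reaches a node with no outgoing link is counted in the denominator but contributes to no target.

```python
import random
from collections import Counter


def monte_carlo_reach(steps, start, num_walkers, seed=0):
    """Estimate the reachable probabilities from `start` along a path.

    steps is a hypothetical encoding of the meta path: one dict per
    relation, mapping a node to the list of its neighbours under that
    relation.
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(num_walkers):
        node = start
        for relation in steps:
            neighbours = relation.get(node)
            if not neighbours:
                node = None  # dead end: this walker reaches no target
                break
            node = rng.choice(neighbours)
        if node is not None:
            hits[node] += 1
    # Normalized visit counts, as in the equation above.
    return {target: count / num_walkers for target, count in hits.items()}


if __name__ == "__main__":
    # Toy APA path: paper p1 is written by a1 and a2, paper p2 by a1 only.
    author_paper = {"a1": ["p1", "p2"], "a2": ["p1"]}
    paper_author = {"p1": ["a1", "a2"], "p2": ["a1"]}
    print(monte_carlo_reach([author_paper, paper_author], "a1", 20000))
```

In the toy graph the exact values are 0.75 for a1 and 0.25 for a2, so the printed estimates should be close to these for a large number of walkers.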

6. EXPERIMENTS

In this section, we conduct experiments to validate the effectiveness of the three versions of HRank on three real datasets.

Figure 4: The network schema of three heterogeneous datasets. (a) DBLP bibliographic dataset. (b) ACM bibliographic dataset. (c) IMDB movie dataset.

6.1 Datasets

We use three heterogeneous information networks in our experiments: the DBLP dataset, the ACM dataset, and the IMDB dataset. They are summarized as follows:

DBLP dataset [14, 16]: The DBLP dataset is a sub-network collected from the DBLP website¹ involving major conferences in two research areas: database (DB) and information retrieval (IR), which naturally form two labels. The dataset contains 9682 authors, 20 conferences and 22185 papers, which are all labeled with one of the two research areas. The network schema is shown in Figure 4(a).

ACM dataset [14]: The ACM dataset was downloaded from the ACM digital library² in June 2010. It comes from 14 representative computer science conferences: KDD, SIGMOD, WWW, SIGIR, CIKM, SODA, STOC, SOSP, SPAA, SIGCOMM, MobiCOMM, ICML, COLT and VLDB. These conferences include 196 corresponding venue proceedings (e.g., the KDD conference includes 12 proceedings, such as KDD'10, KDD'09, etc.). The dataset has 12499 papers, 17431 authors, 1903 terms and 1804 author affiliations. The network also includes 73 labels of these papers in the ACM category (e.g., H.2 is Database Management). The network schema of the ACM dataset is shown in Figure 4(b).

IMDB dataset [15]: We crawled movie information from The Internet Movie Database³ to construct the network. The IMDB movie data collects 1591 movies released before 2010. The related objects include movies, actors, directors and movie types, which are organized as a star schema shown in Figure 4(c). The movie information includes 5324 actors, 1591 movies, 551 directors and 112 movie types (e.g., comedy and romance).

6.2 Ranking of Homogeneous Objects

Since homogeneous objects are connected by symmetric constrained meta paths, the experiments validate the effectiveness of HRank-SY on symmetric constrained meta paths.

6.2.1 Experiment Study on Symmetric Constrained Meta Paths

This experiment ranks objects of the same type by designating a symmetric constrained meta path on the ACM dataset. Here we rank the importance of authors through the symmetric meta path APA, which considers the co-author relations among authors. We also employ two constrained meta paths APA|P.L = "H.2" and APA|P.L = "H.3", where the ACM categories H.2 and H.3 represent "database management" and "information storage/retrieval", respectively. That is, the two constrained meta paths subtly consider the co-author relations in the database/data mining field and the information retrieval field, respectively. We employ HRank-SY to rank the importance of authors based on these three paths. As baseline methods, we rank the importance of authors with PageRank and with the degree of authors (called the Degree method). We run PageRank directly on the whole ACM network, ignoring the heterogeneity of objects. Since the results of PageRank mix all types of objects, we select the objects of the author type from the ranking list as the final results.

Table 1: Top ten authors of different methods on ACM dataset. The number in the parenthesis of the fifth column means the rank of authors in the whole ranking list returned by PageRank.

| Rank | APA | APA\|P.L = "H.3" | APA\|P.L = "H.2" | PageRank | Degree |
|---|---|---|---|---|---|
| 1 | Jiawei Han | W. Bruce Croft | Jiawei Han | Ming Li (1522) | Jiawei Han |
| 2 | Philip Yu | ChengXiang Zhai | Christos Faloutsos | Wei Wei (2072) | Philip Yu |
| 3 | Christos Faloutsos | James Allan | Philip Yu | Jiawei Han (5385) | ChengXiang Zhai |
| 4 | Zheng Chen | Jamie Callan | Jian Pei | Tao Li (6090) | Zheng Chen |
| 5 | Wei-Ying Ma | Zheng Chen | H. Garcia-Molina | Hong-Jiang Zhang (6319) | Christos Faloutsos |
| 6 | ChengXiang Zhai | Ryen W. White | Jeffrey F. Naughton | Wei Ding (6354) | Ravi Kumar |
| 7 | W. Bruce Croft | Wei-Ying Ma | Divesh Srivastava | Jiangong Zhang (7285) | W. Bruce Croft |
| 8 | Scott Shenker | Jian-Yun Nie | Raghu Ramakrishnan | Christos Faloutsos (7895) | Wei-Ying Ma |
| 9 | H. Garcia-Molina | Gerhard Weikum | Charu C. Aggarwal | Feng Pan (8262) | Gerhard Weikum |
| 10 | Ravi Kumar | C. Lee Giles | Surajit Chaudhuri | Hongyan Liu (8440) | Divesh Srivastava |

¹ http://www.informatik.uni-trier.de/~ley/db/
² http://dl.acm.org/
³ www.imdb.com/

The top ten authors of each method are shown in Table 1. All of these ranking lists share some influential authors, except that of PageRank. The results of PageRank include some authors who are not very well known in the DB/IR field, such as Ming Li and Wei Wei, although they may be very influential in other fields. The PageRank values of objects are decided by their degrees to a large extent, so the rank values of affiliation objects are high due to their high degrees. This raises the rank values of author objects connected to multiple high-ranking affiliations. The poor results of PageRank show that ranking in heterogeneous networks should consider the heterogeneity of objects. Otherwise, it cannot distinguish the effect of different types of links. Moreover, we can also observe that the results of HRank with constrained meta paths are clearly biased towards the field the path specifies. For example, the path APA|P.L = "H.3" reveals the important authors in the information retrieval field, such as W. Bruce Croft, ChengXiang Zhai, and James Allan. In contrast, the path APA|P.L = "H.2" returns the influential authors in the database and data mining field, such as Jiawei Han and Christos Faloutsos. The meta path APA mingles well-known authors from these two fields. The results illustrate that constrained meta paths are able to capture subtle semantics by disclosing the most influential authors in a certain field.

6.2.2 Quantitative Comparison Experiments

Based on the results returned by the five methods, we obtain five candidate ranking lists of authors in the ACM dataset. To evaluate the results quantitatively, we use the author ranks from Microsoft Academic Search⁴ as ground truth. Specifically, we crawled two standard ranking lists of authors in two academic fields: DB and IR. Then we compare the difference between our candidate ranking lists and the standard ranking lists. We use the Distance criterion [12] to measure the quality of the ranking results. The criterion not only measures the number of mismatches between two lists, but also considers the position of these mismatches. A smaller Distance means a smaller difference (i.e., better performance).

⁴ http://academic.research.microsoft.com/

Figure 5: The Distances between the ranking lists obtained by different methods and the standard ranking lists on different fields on ACM dataset. (a) DB field. (b) IR field. Each panel plots Distance against Top K (50 to 200) for APA|P.L=H.2, APA|P.L=H.3, PageRank, Degree, and APA.

In this experiment, we compare the five candidate ranking lists with each of the two standard ranking lists from Microsoft Academic Search. The Distance results are shown in Figure 5. We can observe a clear pattern: the results obtained by the constrained meta paths have the smallest Distance on their corresponding field, while they have the largest Distance on the other field. For example, HRank with the path APA|P.L = "H.2" has the smallest Distance on the DB field in Figure 5(a), while it has the largest Distance on the IR field in Figure 5(b). The reason is that the path APA|P.L = "H.2" focuses on the authors in the DB field, and these authors differ from those in the IR field. The results further illustrate that the constrained meta path can disclose the influential authors in a certain field more accurately. Since the meta path APA considers the co-author relationship in all fields, it achieves mediocre performance on these two fields. In fact, HRank with the meta path APA only achieves performance close to that of the PageRank and Degree methods. This implies that the constrained meta path in HRank indeed helps to improve the ranking performance in a specific field.

6.3 Ranking of Heterogeneous Objects

Next, the experiments validate the effectiveness of HRank-AS on asymmetric constrained meta paths.

6.3.1 Experiment Study on Asymmetric Constrained Meta Paths

The experiments are done on the DBLP dataset. We evaluate the importance of authors and conferences simultaneously based on the meta path APC, which means authors publish papers in conferences. Two constrained meta paths (APC|P.L = "DB" and APC|P.L = "IR") are also included, which mean authors publish DB-field (IR-field) papers in conferences. As before, the experiments include the two baseline methods used above (i.e., PageRank and Degree), with the same experimental process.

Table 2: Top ten authors of different methods on DBLP dataset. The number in the parenthesis of the fifth column means the rank of authors in the whole ranking list returned by PageRank.

| Rank | APC | APC\|P.L = "DB" | APC\|P.L = "IR" | PageRank | Degree |
|---|---|---|---|---|---|
| 1 | Gerhard Weikum | Surajit Chaudhuri | W. Bruce Croft | W. Bruce Croft (23) | Philip S. Yu |
| 2 | Katsumi Tanaka | H. Garcia-Molina | Bert R. Boyce | Gerhard Weikum (24) | Gerhard Weikum |
| 3 | Philip S. Yu | H. V. Jagadish | Carol L. Barry | Philip S. Yu (25) | Divesh Srivastava |
| 4 | H. Garcia-Molina | Jeffrey F. Naughton | James Allan | Jiawei Han (26) | Jiawei Han |
| 5 | W. Bruce Croft | Michael Stonebraker | ChengXiang Zhai | H. Garcia-Molina (27) | H. Garcia-Molina |
| 6 | Jiawei Han | Divesh Srivastava | Mark Sanderson | Divesh Srivastava (28) | W. Bruce Croft |
| 7 | Divesh Srivastava | Gerhard Weikum | Maarten de Rijke | Surajit Chaudhuri (29) | Surajit Chaudhuri |
| 8 | Hans-Peter Kriegel | Jiawei Han | Katsumi Tanaka | H. V. Jagadish (30) | H. V. Jagadish |
| 9 | Divyakant Agrawal | Christos Faloutsos | Iadh Ounis | Jeffrey F. Naughton (31) | Jeffrey F. Naughton |
| 10 | Jeffrey Xu Yu | Philip S. Yu | Joemon M. Jose | Rakesh Agrawal (32) | Rakesh Agrawal |

The top ten authors and conferences returned by these five methods are shown in Tables 2 and 3, respectively. As shown in Table 2, the author rankings of all these methods are reasonable; however, the constrained meta paths find the most influential authors in a certain field. For example, the top three authors of APC|P.L = "DB" are Surajit Chaudhuri, Hector Garcia-Molina and H. V. Jagadish, all of whom are very influential researchers in the database field. The top three authors of APC|P.L = "IR" are W. Bruce Croft, Bert R. Boyce and Carol L. Barry, who all have a high academic reputation in the information retrieval field. Similarly, as we can see in Table 3, HRank with constrained meta paths (i.e., APC|P.L = "DB" and APC|P.L = "IR") clearly finds the important conferences in the DB and IR fields, while the other methods mingle these conferences. For example, the most important conferences in the DB field are ICDE, VLDB and SIGMOD, while the most important conferences in the IR field are SIGIR, WWW and CIKM. Observing Tables 2 and 3, we can also find the mutual effect of authors and conferences. That is, an influential author publishes many papers in the important conferences, and vice versa. For example, W. Bruce Croft published many papers in SIGIR and CIKM, while Surajit Chaudhuri has many papers in SIGMOD, ICDE and VLDB.

Table 3: Top ten conferences of different methods on DBLP dataset. The number in the parenthesis of the fifth column means the rank of conferences in the whole ranking list returned by PageRank.

| Rank | APC | APC\|P.L = "DB" | APC\|P.L = "IR" | PageRank | Degree |
|---|---|---|---|---|---|
| 1 | CIKM | ICDE | SIGIR | ICDE (3) | ICDE |
| 2 | ICDE | VLDB | WWW | SIGIR (4) | SIGIR |
| 3 | WWW | SIGMOD | CIKM | VLDB (5) | VLDB |
| 4 | VLDB | PODS | JASIST | CIKM (6) | SIGMOD |
| 5 | SIGMOD | DASFAA | WISE | SIGMOD (7) | CIKM |
| 6 | SIGIR | EDBT | ECIR | JASIST (8) | JASIST |
| 7 | DASFAA | ICDT | APWeb | WWW (9) | WWW |
| 8 | JASIST | MDM | WSDM | DASFAA (10) | PODS |
| 9 | WISE | WebDB | JCIS | PODS (11) | DASFAA |
| 10 | EDBT | SSTD | IJKM | JCIS (12) | EDBT |

6.3.2 Quantitative Comparison Experiments

To verify the effectiveness of these methods, we use the above Distance criterion to calculate the difference between their results and the standard ranking lists crawled from Microsoft Academic Search. Figure 6 shows the differences of the author ranking lists. We observe the same pattern as in the quantitative experiments above. That is, HRank with constrained meta paths achieves the best performance on the corresponding field, while it has the worst performance on the other field. In addition, compared to PageRank and Degree, the mediocre performance of HRank with the meta path APC further demonstrates the importance of the constrained meta path for capturing the subtle semantics contained in heterogeneous networks.

Figure 6: The Distances between the candidate author ranking lists and the standard ranking lists on different fields on DBLP dataset. (a) DB field. (b) IR field. Each panel plots Distance against Top K (50 to 300) for APC|P.L=DB, APC|P.L=IR, PageRank, Degree, and APC.

6.3.3 Experiments on Meta Path with Multiple Constraints

Furthermore, we validate the effectiveness of meta paths with multiple constraints. In the above experiments, we employ a constraint on the label of papers in HRank with the meta path APC. Here we add one more constraint on the conference. Specifically, in contrast to the constrained meta path APC|P.L = "DB", we employ the paths APC|P.L = "DB" && C = "VLDB", APC|P.L = "DB" && C = "SIGIR", and APC|P.L = "DB" && C = "CIKM", which mean authors publish DB-field papers in the specified conferences (i.e., VLDB, SIGIR, and CIKM). Similarly, we add the same conference constraints to the path APC|P.L = "IR". As in the above experiments, we calculate the rank accuracy of HRank with these constrained meta paths. The results are shown in Figure 7.

We know that HRank with the path APC|P.L = "DB" (APC|P.L = "IR") can reveal the influence of authors in the DB (IR) field. The ground truth ranking is based on the aggregate of many conferences related to the field. The added conference constraint in HRank further reveals the influence of authors in a specific conference of the field. So we can use the closeness to the ground truth to reveal the importance of a conference to that field. That is, if the ranking from a specific conference is close to the ground truth ranking, this suggests that the conference is a dominant conference in that field. From Figure 7(a), we can find that the VLDB conference constraint (the blue curve) is closest to the ground truth ranking, while the SIGIR conference constraint (the black curve) deviates most. So we can infer that VLDB is more important than SIGIR in the DB field and that CIKM has middle importance. Similarly, from Figure 7(b), we can infer that SIGIR is more important than VLDB in the IR field. These findings agree with common sense. Although VLDB and SIGIR are both top conferences in computer science, they are very important only in their own research fields. For example, VLDB is important in the DB field, while it is not so important in the IR field. The middle importance of the CIKM conference stems from the fact that it is a comprehensive conference including papers from both the DB and IR fields. In addition, we can find that the SIGIR curve almost overlaps with the ground truth in the IR field, while the VLDB curve still has a gap with the ground truth in the DB field. We think the reason is that SIGIR is the main conference in the IR field, while in the DB field there are other important conferences as well, such as SIGMOD and ICDE. In all, the experiments show that HRank with constrained meta paths can not only effectively find the influential authors of each research field in a specified conference, but also indirectly reveal the importance of conferences within fields. It also implies that HRank can achieve accurate and subtle ranking results by flexibly combining constraints.

Figure 7: The rank accuracy of HRank with different constrained meta paths on DBLP dataset. (a) DB field: Distance against Top K (50 to 300) for APC|P.L=DB, APC|P.L=DB&&C=VLDB, APC|P.L=DB&&C=SIGIR, and APC|P.L=DB&&C=CIKM. (b) IR field: Distance against Top K (50 to 300) for APC|P.L=IR, APC|P.L=IR&&C=VLDB, APC|P.L=IR&&C=SIGIR, and APC|P.L=IR&&C=CIKM.

Figure 8: The stationary probability distributions of authors and constrained meta paths. (a) Authors: stationary probability against author ID. (b) Paths: stationary probability against path ID.

6.4 Co-Ranking of Objects and Paths

6.4.1 Experiment Study on Co-Ranking on Symmetric Constrained Meta Paths

In this experiment, we validate the effectiveness of HRank-CO in ranking objects and symmetric constrained meta paths simultaneously. The experiment is done on the ACM dataset. First we construct a (2, 1)th order tensor X based on 73 constrained meta paths (i.e., APA|P.L = $L_j$, $j = 1 \cdots 73$). When the ith and the kth authors co-publish a paper whose label is the jth label (i.e., ACM category), we add one to the entries $x_{i,j,k}$ and $x_{k,j,i}$ of X. In this case, X is symmetric with respect to the index j. By considering all the publications, $x_{i,j,k}$ (or $x_{k,j,i}$) refers to the number of collaborations between the ith and the kth author under the jth paper label. In addition, we do not consider any self-collaboration, i.e., $x_{i,j,i} = 0$ for all $1 \le i \le 17431$ and $1 \le j \le 73$. The size of X is $17431 \times 73 \times 17431$, and there are 91520 nonzero entries in X. The percentage of nonzero entries is $4.126 \times 10^{-4}\%$. On this dataset, we evaluate the importance of authors through the co-author relations, and we also analyze the importance of paths (i.e., which paths contribute most to the importance of authors).

Table 4: Top 10 authors and constrained meta paths (note that only the constraint ($L_j$) of the paths (APA|P.L = $L_j$, $j = 1 \ldots 73$) is shown in the third column of the table).

| Rank | Authors | Constrained meta paths |
|---|---|---|
| 1 | Jiawei Han | H.3 (Information Storage and Retrieval) |
| 2 | Philip Yu | H.2 (Database Management) |
| 3 | Christos Faloutsos | C.2 (Computer-Communication Networks) |
| 4 | Ravi Kumar | I.2 (Artificial Intelligence) |
| 5 | Wei-Ying Ma | F.2 (Analysis of Algorithms and Problem Complexity) |
| 6 | Zheng Chen | D.4 (Operating Systems) |
| 7 | Hector Garcia-Molina | H.4 (Information Systems Applications) |
| 8 | Hans-Peter Kriegel | G.2 (Discrete Mathematics) |
| 9 | Gerhard Weikum | I.5 (Pattern Recognition) |
| 10 | D. R. Karger | H.5 (Information Interfaces and Presentation) |
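The tensor construction described above can be illustrated with a sparse dictionary. This is a sketch under assumptions, not the authors' code: the function names and the input format (one `(label_index, author_indices)` pair per paper) are hypothetical, and each paper is assumed to carry a single label.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_tensor(papers):
    """Build a sparse co-author tensor {(i, j, k): count}.

    papers is a hypothetical input: an iterable of
    (label_index, [author_index, ...]) pairs, one per paper.
    """
    x = defaultdict(int)
    for label, authors in papers:
        # Distinct author pairs only, so x[i, j, i] is never created.
        for i, k in combinations(sorted(set(authors)), 2):
            x[(i, label, k)] += 1
            x[(k, label, i)] += 1
    return dict(x)


def density_percent(nonzeros, n_authors, n_labels):
    """Percentage of nonzero entries in an n_authors x n_labels x n_authors tensor."""
    return 100.0 * nonzeros / (n_authors * n_labels * n_authors)


if __name__ == "__main__":
    toy = [(0, [0, 1]), (0, [0, 1, 2]), (1, [1, 2])]
    print(build_coauthor_tensor(toy))
    # Sizes reported in the text for the ACM tensor.
    print(density_percent(91520, 17431, 73))
```

Applied to the sizes given in the text, `density_percent` reproduces the reported figure of roughly $4.126 \times 10^{-4}\%$.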

Figure 8 shows the stationary probability distributions of authors and paths. Some authors and paths clearly have a higher stationary probability, which implies that these authors and paths are more important than others. Table 4 shows the top ten authors (left) and paths (right) based on their HRank values. We can find that the top ten authors are all influential researchers in the DM/IR fields, which agrees with common sense. Similarly, the most important paths are related to the DM/IR fields, such as APA|P.L = "H.3" (Information Storage and Retrieval) and APA|P.L = "H.2" (Database Management). Although the conferences in the ACM dataset are from multiple fields, such as DM/DB (e.g., KDD, SIGMOD) and computation theory (e.g., SODA, STOC), there are more papers from the DM/DB fields, which makes the authors and paths in the DM/DB fields rank higher. We can also find that the influence of authors and that of paths promote each other. The reputation of Jiawei Han and Philip Yu comes from their many papers in the influential fields (e.g., H.3 and H.2). To observe this point more clearly, we show the number of co-authors of the top ten authors on the top ten paths in Table 5. We can observe that top authors have more collaborations in the influential fields. For example, although Zheng Chen (rank 6) has more co-authors than Jiawei Han (rank 1), the collaborations of Jiawei Han focus on higher-ranked fields (i.e., H.3 and H.2), so Jiawei Han has a higher HRank score. Similarly, the top paths contain many collaborations of influential authors.

6.4.2 Experiment Study on Co-Ranking on Asymmetric Constrained Meta Paths

The experiments on the Movie dataset aim to show the effectiveness of HRank-CO in ranking heterogeneous objects and asymmetric constrained meta paths simultaneously. In this case, we construct a 3rd order tensor X based on the constrained meta paths AMD|M.T. That is, the tensor represents the actor-director collaboration relations on different types of movies. When the ith actor and the kth director cooperate in a movie of the jth type, we add one to the entry $x_{i,j,k}$ of X. By considering all the cooperations, $x_{i,j,k}$ refers to the number of collaborations between the ith actor and the kth director under the jth type of movie. The size of X is $5324 \times 112 \times 551$ and there are 36529 nonzero entries in X. The percentage of nonzero entries is $7.827 \times 10^{-4}\%$.

Table 5: The number of times that the top ten authors collaborate with others via the top ten constrained meta paths (note that only the constraint ($L_j$) of the paths (APA|P.L = $L_j$, $j = 1 \ldots 73$) is shown in the first row of the table).

| Ranked A/CP | 1 (H.3) | 2 (H.2) | 3 (C.2) | 4 (I.2) | 5 (F.2) | 6 (D.4) | 7 (H.4) | 8 (G.2) | 9 (I.5) | 10 (H.5) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 (Jiawei Han) | 51 | 176 | 0 | 0 | 0 | 0 | 9 | 2 | 2 | 0 |
| 2 (Philip Yu) | 51 | 94 | 0 | 0 | 9 | 0 | 3 | 0 | 13 | 0 |
| 3 (C. Faloutsos) | 17 | 107 | 0 | 5 | 9 | 0 | 3 | 4 | 2 | 0 |
| 4 (Ravi Kumar) | 73 | 27 | 0 | 3 | 13 | 0 | 18 | 5 | 0 | 0 |
| 5 (Wei-Ying Ma) | 132 | 26 | 0 | 9 | 0 | 0 | 2 | 0 | 30 | 10 |
| 6 (Zheng Chen) | 172 | 9 | 0 | 9 | 0 | 0 | 22 | 0 | 38 | 9 |
| 7 (H. Garcia-Molina) | 23 | 65 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 4 |
| 8 (H. Kriegel) | 19 | 28 | 5 | 0 | 0 | 0 | 6 | 0 | 7 | 4 |
| 9 (G. Weikum) | 82 | 14 | 0 | 4 | 0 | 0 | 8 | 0 | 4 | 0 |
| 10 (D. R. Karger) | 11 | 5 | 13 | 0 | 7 | 4 | 1 | 7 | 0 | 7 |

Table 6: Top 10 actors, directors and meta paths on IMDB dataset (note that only the constraint ($T_j$) of the paths (AMD|M.T = $T_j$, $j = 1 \ldots 1591$) is shown in the fourth column).

| Rank | Actor | Director | Conditional meta path |
|---|---|---|---|
| 1 | Eddie Murphy | Tim Burton | Comedy |
| 2 | Harrison Ford | Zack Snyder | Drama |
| 3 | Bruce Willis | Marc Forster | Thriller |
| 4 | Drew Barrymore | David Fincher | Action |
| 5 | Nicole Kidman | Michael Bay | Adventure |
| 6 | Nicolas Cage | Ridley Scott | Romance |
| 7 | Hugh Jackman | Richard Donner | Crime |
| 8 | Robert De Niro | Steven Spielberg | Sci-Fi |
| 9 | Brad Pitt | Robert Zemeckis | Animation |
| 10 | Christopher Walken | Stephen Sommers | Fantasy |

Table 6 shows the top ten actors, directors and constrained meta paths (i.e., movie types). We again observe the mutual enhancement of the importance of objects and meta paths. Basically, the results agree with common sense. The top ten actors are well known, such as Eddie Murphy and Harrison Ford. Similarly, these directors are famous for their works, and the movie types obtained are among the most popular movie subjects. As we know, Eddie Murphy and Drew Barrymore (ranks 1 and 4 among actors) are famous comedy and drama (ranks 1 and 2 among paths) actors. Harrison Ford and Bruce Willis (ranks 2 and 3 among actors) are popular thriller and action (ranks 3 and 4 among paths) actors. The higher-ranked directors also prefer those popular movie subjects.

6.5 Fast Computation Experiments

Based on the DBLP dataset, we select two meta paths with varying length: $(APA)^l$ and $(APCPA)^l$, where $l$ is the number of path repetitions, ranging from 1 to 5. We record the running time of the matrix multiplication along these paths with the different fast computation strategies. Direct matrix multiplication is the baseline. Meanwhile, we calculate the differences between the results obtained by the baseline and by the fast computation strategies (i.e., the F norm of the difference of the two matrices). These differences can be considered as the accuracy measure of the fast computation strategies (the smaller the better). We set the parameters of the Trun strategy as follows: W is 200, β is 0.5, and γ is 0.02. The number of walkers (i.e., K) in the MonC strategy is 500. All experiments are done on machines with Intel Xeon 8-core CPUs at 2.13 GHz and 64 GB RAM.
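The accuracy measure used here, the F (Frobenius) norm of the difference between the baseline result and an approximate result, can be computed with NumPy as in the short sketch below. The function name `frobenius_difference` is hypothetical, and dense arrays are assumed for simplicity.

```python
import numpy as np


def frobenius_difference(exact, approx):
    """Frobenius norm of (exact - approx); smaller means more accurate."""
    exact = np.asarray(exact, dtype=float)
    approx = np.asarray(approx, dtype=float)
    return float(np.linalg.norm(exact - approx, ord="fro"))


if __name__ == "__main__":
    baseline = [[0.5, 0.5], [0.2, 0.8]]
    truncated = [[0.5, 0.5], [0.0, 0.8]]
    print(frobenius_difference(baseline, truncated))
```

An information-lossless strategy such as DynP would yield a difference of zero up to floating-point rounding.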

Figure 9 shows the running time and accuracy of the three strategies on different paths. From Figure 9(a) and (b), we can find that DynP is an effective strategy to speed up matrix multiplication on both paths, while the Trun and MonC strategies only speed up matrix multiplication on the path $(APCPA)^l$. During the matrix multiplication along $(APA)^l$, the matrix is always sparse, so the baseline itself is very fast. In this condition, the Trun and MonC strategies do not help. For the path $(APCPA)^l$, the product matrix becomes dense due to the low dimension of C (the number of conferences is 20), so its running time increases greatly. In this condition, Trun and MonC are also effective strategies to speed up matrix multiplication. Then, we observe their accuracy in Figure 9(c) and (d). As an information-lossless strategy, DynP gives the same results as the baseline. The MonC strategy has the lowest accuracy. To sum up, DynP can effectively accelerate matrix multiplication without loss of accuracy, while the Trun and MonC strategies also help to speed up the multiplication of dense matrices.

In Section 4.4, we pointed out that the time complexity of the rank computation in HRank-CO is $O(trn^2)$. However, in real applications, the relation matrices are very sparse, so the actual time complexity is linear in the number of links. This point is confirmed by the following experiments. We create tensors at three different scales (size $n \times r \times n$) and record the running time of rank computation at different link densities. The results are shown in Figure 10. The running time increases slowly and nearly linearly with the link density. Moreover, a larger-scale tensor needs a longer running time.

Figure 10: The running time for rank computation on different link densities. Running time (log scale, s) is plotted against density (log scale, $10^{-7}$ to $10^{-4}$) for tensors of size 2000*500*2000, 5000*500*5000, and 10000*500*10000.

Figure 9: Running time and accuracy of matrix multiplication based on different fast computation strategies and paths. (a) Running time (s) on $(APA)^l$. (b) Running time (log scale, s) on $(APCPA)^l$. (c) Accuracy (Difference) on $(APA)^l$. (d) Accuracy (Difference) on $(APCPA)^l$. The horizontal axis of each panel is $l$ (1 to 5); the running time panels compare Baseline, Trun, DynP, and MonC, and the accuracy panels compare Trun, DynP, and MonC.

6.6 Convergence Experiments

In Figure 11, we show the convergence of HRank in the previous experiments. The results illustrate that the three versions of HRank all converge quickly, within no more than 20 iterations. In addition, we can observe that HRank has different convergence speeds in these three conditions. For symmetric meta paths, HRank-SY almost converges within 6 iterations (see Figure 11(a)). However, HRank-CO for co-ranking converges within 16 iterations (see Figure 11(c)). We think this is reasonable, since convergence is harder when more objects are involved, as in HRank-CO.

Figure 11: The difference between two successive calculated probability vectors against iterations based on the three versions of HRank. (a) HRank-SY: $\|x_t - x_{t-1}\|_1$ for APA|P.L=IR and APA|P.L=DB. (b) HRank-AS: $\|x_t - x_{t-1}\|_1 + \|z_t - z_{t-1}\|_1$ for APC|P.L=DB and APC|P.L=IR. (c) HRank-CO: $\|x_t - x_{t-1}\|_1 + \|y_t - y_{t-1}\|_1$ (log scale) for APA. The horizontal axis of each panel is the number of iterations (0 to 20).

7. CONCLUSIONS

In this paper, we study the ranking problem in heterogeneous information networks and propose the HRank framework, which is a path based random walk method. In this framework, we introduce the constrained meta path concept to capture the more subtle and refined semantics contained in HINs. In addition, we put forward a method to co-rank the paths and objects, since the paths affect the importance of objects. Extensive experiments validate the effectiveness and efficiency of HRank on three real datasets.

8. REFERENCES

[1] S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW, pages 571–580, 2007.

[2] H. Deng, M. R. Lyu, and I. King. A generalized co-hits algorithm and its application to bipartite graphs. In KDD, pages 239–248, 2009.

[3] J. Han. Mining heterogeneous information networks by exploring the power of links. In Discovery Science, pages 13–30, 2009.

[4] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD, pages 538–543, 2002.

[5] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao. Graph regularized transductive classification on heterogeneous information networks. In ECML/PKDD, pages 570–586, 2010.

[6] R. Jin, V. E. Lee, and H. Hong. Axiomatic ranking of network role similarity. In KDD, pages 922–930, 2011.

[7] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, pages 668–677, 1999.

[8] X. Kong, P. S. Yu, Y. Ding, and D. J. Wild. Meta path-based collective classification in heterogeneous information networks. In CIKM, pages 1567–1571, 2012.

[9] N. Lao and W. Cohen. Fast query execution for retrieval models based on path constrained random walks. In KDD, pages 881–888, 2010.

[10] X. Li, M. K. Ng, and Y. Ye. Har: Hub, authority and relevance scores in multi-relational data for query search. In SDM, pages 141–152, 2012.

[11] M. K. Ng, X. Li, and Y. Ye. Multirank: Co-ranking for objects and relations in multi-relational data. In KDD, pages 1217–1225, 2011.

[12] Z. Nie, Y. Zhang, J. Wen, and W. Ma. Object-level ranking: bringing order to web objects. In WWW, pages 422–433, 2005.

[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University Database Group, 1998.

[14] C. Shi, X. Kong, P. S. Yu, S. Xie, and B. Wu. Relevance search in heterogeneous networks. In EDBT, pages 180–191, 2012.

[15] C. Shi, C. Zhou, X. Kong, P. S. Yu, G. Liu, and B. Wang. Heterecom: A semantic-based recommendation system in heterogeneous networks. In KDD, pages 1552–1555, 2012.

[16] Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In VLDB, pages 992–1003, 2011.

[17] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. RankClus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT, pages 565–576, 2009.

[18] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In KDD, pages 1348–1356, 2012.

[19] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD, pages 797–806, 2009.

[20] X. Yu, Y. Sun, B. Norick, T. Mao, and J. Han. User guided entity similarity search using meta-path selection in heterogeneous information networks. In CIKM, pages 2025–2029, 2012.

[21] D. Zhou, S. A. Orshanskiy, H. Zha, and C. L. Giles. Co-ranking authors and documents in a heterogeneous network. In ICDM, pages 739–744, 2007.
