
WIGM: Discovery of Subgraph Patterns in a Large Weighted Graph

Jiong Yang∗ Wei Su† Shirong Li‡ Mehmet M. Dalkilic§

Abstract

Many research areas have begun representing massive data sets as very large graphs. Thus, graph mining has been an active research area in recent years. Most graph mining research focuses on mining unweighted graphs. However, weighted graphs are actually more common. The weight on an edge may represent the likelihood (or a logarithmic transformation of the likelihood) of the existence of the edge, or the strength of the edge, which is common in many biological networks. In this paper, a weighted subgraph pattern model is proposed to capture the importance of a subgraph pattern, and our aim is to find these patterns in a large weighted graph. Two related problems are studied in this paper: (1) discovering all patterns with respect to a given minimum weight threshold and (2) finding the k patterns with the highest weights. The weighted subgraph patterns do not possess the anti-monotonic property and, in turn, most existing subgraph mining methods cannot be applied directly. Fortunately, the 1-extension property is identified so that a bounded search can be achieved. A novel weighted graph mining algorithm, namely WIGM, is devised based on the 1-extension property. Last but not least, real and synthetic data sets are used to show the effectiveness and efficiency of our proposed models and algorithms.

1 Introduction

With the emergence of areas like social informatics and bioinformatics, where large pools of data are represented as graphs, a large amount of data mining research has been devoted to analyzing these networks, e.g., subgraph pattern mining. These graph analysis tools can discover important inherent patterns or characteristics of the graph. Many of these tools have proven very useful in various application domains. However, most current research has focused on unweighted graphs. In reality, weighted graphs are common in many applications. The following are a couple of example applications.

∗EECS, Case Western Reserve Univ.
†EECS, Case Western Reserve Univ.
‡Amazon Corporation.
§School of Informatics, Indiana Univ.

• Biological Networks. One of the most common uses of weighted graphs in biology is constructing what are called “functional networks”, weighted graphs that endeavor to capture and help explain the genomic interactions of an organism [12, 3]. Each vertex in a functional genomic network represents a gene, and an edge weight is the likelihood (or a logarithmic transformation of the likelihood) that the two genes are functionally related. The likelihood can be obtained through the integration of experimental, textual, and electronically inferred information that indicates two genes are functionally related. A motif, in this scenario, might be some fragment of, or a complete, metabolic pathway, i.e., a series of metabolic reactions [10] (a subgraph) that is so commonly observed among most organisms that the subgraph becomes identified as such. Metabolic pathways are idealized subgraphs. Questions about which genes play roles in pathways, and what roles they play, can be answered with motif discovery.

• Social Networks. Each vertex in a social network represents a person while an edge indicates the relationship between two persons. A weight on an edge may represent the degree or strength of the relationship. For instance, a parent-child relationship may be stronger than a coworker relationship. A social scientist may be interested in finding groups of people involved in strong relationships.

To solve these problems, a weighted subgraph pattern model is necessary. In many applications, the weight of an edge represents the interestingness of the edge or the probability of the existence of the edge. In general, the importance of a pattern g should be proportional to the weights of the occurrences of g. In a large graph, there are two difficulties in defining the model. (1) When a pattern has more edges, the occurrences of this pattern will have a larger number of edges and, in turn, the occurrences would carry a larger weight. This could exaggerate the importance of a pattern with more edges. There may exist applications in which patterns with more edges are more desirable. However, in this paper, we do not focus on these patterns. (2) In a large graph, the matches or occurrences of a pattern may heavily overlap with each other. For instance, in Figure 1, the pattern g has three edges. g occurs three times and all these occurrences differ at only one edge. Thus, two edges are shared by all three occurrences. For example, the edges (u2, u4) and (u4, u5) in G will be counted three times. As a result, the weights of these overlapping edge occurrences would be over-counted.

212 Copyright © SIAM. Unauthorized reproduction of this article is prohibited.

Several methods [15, 2, 4, 14, 8] have been proposed to address the problem of quantifying the importance (support) of a pattern in a single graph or a set of weighted graphs. However, since they are not designed for a single weighted graph, these models may not be directly applicable to the problem studied in this paper. Inspired by these existing models, we design a new model, called the normalized weight model. The motivations and characteristics of the new model are explained in Section 3.

Figure 1: Graph example (the weighted graph G on vertices u0 to u5, with edge weights 1.0, 1.2, 0.9, 0.1, and 2.5, and the pattern g on vertices v0 to v3)

The goal of this paper is to discover subgraph patterns with high overall weight. There are two ways to qualify a pattern. The first is to find patterns with weight above some user-given threshold. However, it is very difficult for an end user to choose the threshold. Thus, an alternative approach is to find the top k subgraph patterns with the highest overall weights. In this model, a user only needs to specify the number of patterns to find, which is much easier to specify.

Although this weighted subgraph pattern model is meaningful, it does not possess the useful anti-monotonic property. Therefore, many existing subgraph mining algorithms utilizing the anti-monotonic property cannot be directly applied to this problem. Fortunately, we are able to identify another, weaker property: the 1-extension property. We denote a pattern with weight over a given threshold as a strong pattern. In contrast to the anti-monotonic property, the 1-extension property states that a strong subgraph pattern can be partitioned into two patterns, one of which is strong and the other of which is either a strong pattern or a 1-extension subgraph pattern. A 1-extension subgraph pattern can be obtained by adding an edge to a strong pattern. This newly discovered property can be used to prune the search space and guide the mining process.

In this paper, we devise a novel algorithm, called WIGM, to mine the weighted subgraph patterns. Since two problems are studied in this paper, there are two versions of WIGM: threshold-WIGM (t-WIGM for short) for mining patterns with weights above a user-given threshold and top-k-WIGM (k-WIGM for short) to discover the k patterns with the highest weights. WIGM employs a bottom-up dynamic programming algorithm.

The remainder of this paper is organized as follows. Some related work is discussed in Section 2. Section 3 presents the preliminaries and the problem statement. We formally describe the 1-extension property in Section 4. The t-WIGM and k-WIGM algorithms are presented in Sections 5 and 6, respectively. The experimental results are analyzed in Section 7. Finally, we draw conclusions in Section 8.

2 Related Work

There are three categories of work related to the problem studied in this paper: (1) subgraph mining from a set of graphs, (2) important subgraph pattern discovery in a single graph, and (3) subgraph pattern mining in weighted graphs. In recent years, a large number of algorithms have been proposed for frequent subgraph mining (e.g., [16, 7, 1, 13, 11]). These algorithms focus on mining frequent subgraph patterns, not on finding weighted patterns. Since there are a large number of subgraph mining algorithms, we will not elaborate on the work in this category.

The second category of related work is on mining frequent subgraph patterns from a single large graph. The main challenge lies in computing the support of a pattern. Support measures that simply count the occurrences of a pattern may violate the anti-monotonic property (i.e., the Apriori property) since occurrences of the pattern may overlap with each other. In [9], a support measure called the maximum independent set (MIS) support measure is proposed. In this model, the support of a subgraph pattern g is the maximum number of non-overlapping occurrences of g. It has been proven in [15] that this support measure possesses the anti-monotonic property. The authors of [4] propose a variant of the MIS support measure. The only difference is the definition of occurrence overlapping: two occurrences of a pattern are defined as overlapped if they share any common vertex. In [15], the authors provide a tentative support measure that counts each occurrence partially, depending on how many overlaps it has with other occurrences. An overlapping graph is built based on the overlap of occurrences. Each node in the overlapping graph represents an occurrence of a pattern. If two occurrences of a pattern overlap in G, there is an edge in the overlapping graph between the respective occurrences (nodes). The weight of each node is the reciprocal of its degree. The support of a pattern is the total weight on all nodes in the overlapping graph for the pattern. In this model, the degree of overlap (the number of edges in the overlap and the weight on these edges) is not considered. In [2], for each vertex v in g, let M(v) be the number of distinct vertices that vertex v is mapped to. The support of g is the minimum M(v) over all v in g. Although several of these models have been shown to be useful and hold the anti-monotonic property, they may not work well in the domain of a single weighted graph for the following two reasons. (1) The overlaps may have different degrees. Some occurrences may overlap on one edge or vertex while others may overlap on many vertices and edges. Most of the algorithms do not take this into account. (2) Edges have different weights and overlapping may occur on edges with different weights. None of these models can be applied to that scenario.
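To make the minimum-image measure of [2] concrete, the following Python sketch computes the minimum M(v) over all pattern vertices from a list of occurrences. The function name `min_image_support` and the occurrence list (a four-vertex pattern whose three occurrences differ in a single vertex) are hypothetical illustrations, not taken from [2].

```python
from typing import Dict, List, Set


def min_image_support(occurrences: List[Dict[str, str]]) -> int:
    """Return min over pattern vertices v of M(v), the number of distinct
    graph vertices that v is mapped to across all occurrences."""
    if not occurrences:
        return 0
    images: Dict[str, Set[str]] = {}
    for occ in occurrences:
        for pattern_vertex, graph_vertex in occ.items():
            images.setdefault(pattern_vertex, set()).add(graph_vertex)
    return min(len(targets) for targets in images.values())


# Hypothetical occurrences: each dict maps pattern vertices to graph vertices.
occurrences = [
    {"v0": "u0", "v1": "u2", "v2": "u4", "v3": "u5"},
    {"v0": "u1", "v1": "u2", "v2": "u4", "v3": "u5"},
    {"v0": "u3", "v1": "u2", "v2": "u4", "v3": "u5"},
]
support = min_image_support(occurrences)
```

In this example v0 has three distinct images while every other pattern vertex has one, so the measure reports a support of 1 even though the pattern occurs three times. Neither the size of the overlap nor any edge weight enters the computation.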

Recently, researchers have been working in the area of mining subgraph patterns from a set of weighted graphs. In [14], each database graph has internal weights associated with each vertex and an external weight representing the importance of the graph itself. Accordingly, a pattern g has an external weight, defined as the accumulated external weights of the database graphs that are subgraph isomorphic to g, and an internal weight, which is generated by counting only the occurrences with the highest aggregated internal weights. The weighted support of a pattern g is either the weighted sum of the two weights or its external weight under an internal weight constraint. In [8], the weight of a pattern g is defined as the sum of the weights of the graphs containing g. These methods are proposed for computing the weighted support of a pattern in a set of weighted graphs. However, they cannot be applied directly to the context of a single large weighted graph since they do not consider overlapping occurrences.

Overall, there exists research work on mining subgraph patterns in a single graph or in a set of weighted graphs. However, not much has been done on discovering subgraph patterns in a single weighted graph, which is the focus of this paper. Due to the difficulty of this problem, we develop a new model and algorithm.

3 Preliminaries

In this section, we present some preliminaries and the formal problem statement. Without loss of generality, the graphs are assumed to be undirected, since it is easy to extend the problem setting to directed graphs. In addition, we focus on the discovery of connected patterns since, in most applications, only connected subgraph patterns are interesting to the users.

Definition 1. A labeled graph G is a five-element tuple $G = (V, E, \Sigma_V, \Sigma_E, L_G)$ where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. $\Sigma_V$ and $\Sigma_E$ are the sets of vertex and edge labels, respectively. The labeling function $L_G$ defines the mappings $V \to \Sigma_V$ and $E \to \Sigma_E$.

A weighted labeled graph is the same as a labeled graph with one addition: there is a real-valued weight w_e associated with each edge e in G. In some real applications, e.g., biological networks, the weight of an edge may be the logarithmic transformation of the likelihood of the existence of the edge. Thus, when computing the likelihood of the existence of two edges, the sum of the weights can be used instead of the product, which is a simpler operation. The weight of a weighted labeled graph G, denoted as W(G), is equal to the sum of the weights of all edges in G.
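The remark about sums replacing products can be checked numerically. The short Python sketch below uses two hypothetical edge likelihoods, `p1` and `p2`, and assumes the transformation is the natural logarithm; it shows that adding the log-weights and exponentiating recovers the product of the likelihoods.

```python
import math

# Hypothetical likelihoods that two edges exist (illustrative values only).
p1, p2 = 0.8, 0.5

# Edge weights as logarithmic transformations of the likelihoods.
w1, w2 = math.log(p1), math.log(p2)

joint_from_product = p1 * p2            # multiply the likelihoods directly
joint_from_sum = math.exp(w1 + w2)      # add the weights, then invert the log
```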

Definition 2. A labeled graph $G = (V, E, \Sigma_V, \Sigma_E, L_G)$ is isomorphic to $G' = (V', E', \Sigma'_V, \Sigma'_E, L'_G)$, denoted by $G \approx G'$, iff a bijection $f : V \to V'$ exists such that (1) $\forall u \in V$, $L_G(u) = L'_G(f(u))$; (2) $\forall u, v \in V$, $(u, v) \in E \Leftrightarrow (f(u), f(v)) \in E'$; and (3) $\forall (u, v) \in E$, $L_G(u, v) = L'_G(f(u), f(v))$. $S$ is subgraph isomorphic to $G'$ ($S \subseteq G'$) if $S$ is isomorphic to at least one subgraph $G''$ of $G'$.

In a similar way, a labeled graph can be defined to be isomorphic (or subgraph isomorphic) to a weighted labeled graph by ignoring the weights in the weighted labeled graph. For example, in Figure 1, the graph g is subgraph isomorphic to the graph G.

The most straightforward and traditionally used support measure of a subgraph pattern g is the number of occurrences of g in G. However, many occurrences of g may overlap, which causes a problem if the overlap is high. As in Figure 1, the edges (u2, u4) and (u4, u5) may be counted three times and their weights could be over-amplified. As discussed in the previous section, although several models have been proposed to quantify the support of a pattern in a single graph, none of them can be applied directly to the scenario of a single large weighted graph. Thus, in this paper, a new support model is proposed. The union of all edges in all occurrences of a pattern g forms a support set for g. Therefore, the weight of every edge e in the occurrences makes the same contribution to the overall importance of a subgraph pattern, regardless of how many occurrences e participates in.



Definition 3. Given a weighted labeled graph G and a labeled graph g, the support set of g in G, denoted as Sup(G, g), is the set of distinct subgraphs in G which are isomorphic to g. These subgraphs are called occurrences of g in G. Two subgraphs g′ and g′′ are considered distinct if they differ on at least one vertex or one edge. The support edge set of g in G, denoted as Sup_edge(G, g), is defined as the union of the edge sets of all graphs in Sup(G, g).

Notice that the subgraph pattern g is not weighted, i.e., there is no weight associated with any edge in g, while the large graph G is weighted. In the example of Figure 1, Sup(G, g) consists of three occurrences of g while Sup_edge(G, g) consists of five edges: (u0, u2), (u1, u2), (u2, u3), (u2, u4), and (u4, u5).

Definition 4. Given a weighted labeled graph G and a connected labeled graph g, the weighted support of g in G, denoted as WSup(G, g), is the sum of the weights of all edges in Sup_edge(G, g), i.e., $\mathrm{WSup}(G, g) = \sum_{e \in \mathrm{Sup\_edge}(G, g)} W(e)$.

For example, in Figure 1, the weighted support of g in G is 5.7. However, this definition of weighted support may give unreasonably high weights to patterns with more edges. For example, if a pattern with one edge occurs 100 times, then its support edge set has at most 100 edges. On the other hand, if a pattern with two edges occurs 100 times, then its support edge set may have up to 200 edges. As a result, the pattern with more edges could have a higher weighted support than patterns with fewer edges. Thus, the overall weighted support of a pattern should be normalized by the number of edges in the pattern. Looking ahead, we empirically compare the normalized weighted support model with alternative models on some real data sets in Section 7.

Definition 5. Given a weighted labeled graph G and a connected labeled graph g, the normalized weighted support of g in G, denoted as NWSup(G, g), is equal to $\mathrm{NWSup}(G, g) = \frac{\mathrm{WSup}(G, g)}{|E(g)|}$, where $|E(g)|$ is the number of edges in g.

In Figure 1, the normalized weighted support of g in G is 5.7/3 = 1.9. Intuitively, the normalized weight of a pattern can be viewed as the average aggregated weight of the distinct matches for each edge in the pattern.
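Definitions 3 to 5 can be traced on the Figure 1 example with a few lines of Python. This is a minimal sketch, not the paper's implementation: the helper names (`edge`, `support_edge_set`, `wsup`, `nwsup`) are hypothetical, occurrences are represented simply as sets of edges, and the assignment of the five weights to particular edges is an assumption, since the text fixes only the five support edges and their total of 5.7.

```python
import math
from typing import Dict, FrozenSet, Iterable, Set

Edge = FrozenSet[str]  # an undirected edge as a set of two vertex ids


def edge(u: str, v: str) -> Edge:
    return frozenset((u, v))


def support_edge_set(occurrences: Iterable[Set[Edge]]) -> Set[Edge]:
    """Sup_edge(G, g): union of the edge sets of all occurrences."""
    result: Set[Edge] = set()
    for occ in occurrences:
        result |= occ
    return result


def wsup(weights: Dict[Edge, float], occurrences: Iterable[Set[Edge]]) -> float:
    """WSup(G, g): each support edge is counted once, however often it occurs."""
    return math.fsum(weights[e] for e in support_edge_set(occurrences))


def nwsup(weights: Dict[Edge, float], occurrences: Iterable[Set[Edge]],
          pattern_edge_count: int) -> float:
    """NWSup(G, g) = WSup(G, g) / |E(g)|."""
    return wsup(weights, occurrences) / pattern_edge_count


# Assumed weight assignment (only the total, 5.7, is given in the text).
weights = {
    edge("u0", "u2"): 1.0, edge("u1", "u2"): 1.2, edge("u2", "u3"): 0.9,
    edge("u2", "u4"): 0.1, edge("u4", "u5"): 2.5,
}
# Three occurrences of g that share two edges and differ in the third.
shared = {edge("u2", "u4"), edge("u4", "u5")}
occurrences = [shared | {edge("u0", "u2")},
               shared | {edge("u1", "u2")},
               shared | {edge("u2", "u3")}]

total = wsup(weights, occurrences)
normalized = nwsup(weights, occurrences, pattern_edge_count=3)
```

Because the shared edges (u2, u4) and (u4, u5) enter the union only once, the weighted support stays at 5.7 rather than growing with the number of overlapping occurrences.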

Problem Statement: In this paper, we aim to solve the following two problems. Given a weighted labeled graph G, the first problem is to find all connected labeled subgraphs whose normalized weighted support in G is larger than or equal to some user-specified threshold t. Since it may be difficult to specify the threshold t in some applications, an alternative problem formulation is provided as follows. Given an integer k, we want to find the k connected subgraphs which have the highest normalized weighted supports in G.

Figure 2: Multiple Patterns

4 Property

The anti-monotonic property (i.e., the Apriori property) has been one of the most widely applied properties to guide data mining algorithms. It enables pruning of the search space. Unfortunately, the normalized weighted support model does not possess this property. For example, in Figure 2, g1 is a supergraph of g2. g1 occurs four times and WSup(G, g1) = 13.6, while g2 occurs twice and WSup(G, g2) = 3.6. Since g1 has two edges and g2 has one edge, NWSup(G, g1) = 6.8 and NWSup(G, g2) = 3.6. The supergraph g1 thus has a higher normalized weighted support than its subgraph g2, which violates the anti-monotonic property.

Fortunately, the weighted support model possesses another, weaker property, which is called the 1-extension property. Before presenting the property, some terminology is defined first.

Definition 6. Given a weighted labeled graph G and a normalized weighted support threshold t, a connected labeled subgraph pattern g is called strong if NWSup(G, g) ≥ t. Otherwise, g is called a weak pattern. g (with at least two edges) is called a 1-extension strong pattern (1-extension pattern for short) if (1) g is a weak pattern and (2) there exists a connected subgraph g′ of g where g′ has one less edge than g and g′ is a strong pattern. Any weak graph pattern with a single edge is defined as a 1-extension pattern.

In other words, a 1-extension strong pattern can be obtained by adding one edge to a strong pattern. For example, in Figure 2, with threshold t = 8, g3 is a strong pattern because NWSup(G, g3) = 10. g2 is a 1-extension pattern since it consists of only one edge and it is a weak pattern. g1 is also a 1-extension pattern because it is weak and can be obtained by adding the edge (b, d) to g3.
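The case analysis of Definition 6 can be written as a small Python function. This is an illustrative sketch: the function name `classify`, its parameters, and the string labels are hypothetical, and the caller is assumed to supply the normalized weighted supports of the connected sub-patterns with one less edge.

```python
from typing import Iterable


def classify(nwsup_g: float, t: float, edge_count: int,
             sub_nwsups: Iterable[float] = ()) -> str:
    """Classify a connected pattern g under threshold t.

    sub_nwsups holds NWSup(G, g') for the connected sub-patterns g' of g
    that have exactly one less edge than g.
    """
    if nwsup_g >= t:
        return "strong"
    if edge_count == 1:
        return "1-extension"      # every weak single-edge pattern
    if any(s >= t for s in sub_nwsups):
        return "1-extension"      # weak, but one edge away from a strong pattern
    return "weak, not 1-extension"


t = 8
# NWSup values as stated for Figure 2; g3 is assumed here to have one edge,
# since g1 (two edges) is obtained from it by adding a single edge.
kind_g3 = classify(10.0, t, edge_count=1)
kind_g2 = classify(3.6, t, edge_count=1)
kind_g1 = classify(6.8, t, edge_count=2, sub_nwsups=[10.0, 3.6])
```

Note that a 1-extension pattern is by definition still weak; the label only records that it sits directly next to a strong pattern in the search space.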

Let Cont(G, g, E) be the sum of the weights of all edges in E, where E ⊆ E(g) and we only count the weight of an occurrence of an edge e if the occurrence of e is in an occurrence of g. For example, in Figure 2, let e be the edge connecting the vertices with labels b and d in g1. The edge support set of g1 in G is {(u0, u2), (u1, u2), (u2, u4), (u4, u5), (u5, u6), (u5, u7)}. Thus, WSup(G, g1) = 13.6. Among these six edges in G, e accounts for two edges, (u2, u4) and (u4, u5). As a result, Cont(G, g1, {e}) = 3.6. We have the following lemma for Cont(G, g, E).

Lemma 4.1. For a given weighted labeled graph G, Cont(G, g, E) ≤ Cont(G, g′, E) if g is a supergraph of g′ and E is a subset of the edges in g′.

Proof: Let S and S′ be the edge support sets of g and g′ in G, respectively. The set of matched occurrences of E in S is a subset of those in S′ since each occurrence of g has to contain an occurrence of g′. Therefore, Cont(G, g, E) ≤ Cont(G, g′, E). □

Property 4.1. (1-Extension Property) Given a weighted labeled graph G and a normalized weighted support threshold t, let g be a connected strong subgraph pattern. There must exist two connected subgraphs g1 and g2 of g which satisfy the following set of conditions:

• there is no overlapping edge between g1 and g2,

• the set of edges in g is equal to the union of the edges in g1 and g2, and

• either g1 and g2 are both strong patterns, or one is a strong pattern and the other is a 1-extension pattern.

Proof: The proof of this property is somewhat tedious. Thus, we give a formal proof for the case where g is a tree and a proof sketch for the case where g is a general graph. Let g be a tree and v be the root of g. g can be partitioned into x = deg(v) disjoint branches (patterns) g1, g2, . . . , gx. Let g_i (1 ≤ i ≤ x) be the pattern with the lowest Cont(G, g, E(g_i))/|E(g_i)| among the x patterns. Now we partition g into two patterns: g_i and g′ = g − g_i. Figure 3 shows an example of g_i and g′ for a tree g_T. Both g_i and g′ are connected. We have Cont(G, g, E(g′))/|E(g′)| ≥ Cont(G, g, E(g_i))/|E(g_i)|. By Lemma 4.1, Cont(G, g_i, E(g_i)) ≥ Cont(G, g, E(g_i)) and Cont(G, g′, E(g′)) ≥ Cont(G, g, E(g′)). If Cont(G, g, E(g_i))/|E(g_i)| ≥ t, then both NWSup(G, g′) and NWSup(G, g_i) are at least t, and the property holds since both patterns are strong. Otherwise, g_i is a weak pattern while g′ is a strong pattern, and we travel down the g_i branch from the root and recursively divide g_i.
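The claim that g′ is strong rests on an averaging step that the proof leaves implicit. The following is a sketch of that step, written under the assumption (not stated in the text) that all edge weights are non-negative, so that every edge in the support edge set of g is counted at least once in the numerator on the left.

```latex
% The two parts together cover every pattern edge of g, and
% |E(g')| + |E(g_i)| = |E(g)|:
\[
\frac{\mathrm{Cont}(G, g, E(g')) + \mathrm{Cont}(G, g, E(g_i))}
     {|E(g')| + |E(g_i)|}
\;\ge\; \frac{\mathrm{WSup}(G, g)}{|E(g)|}
\;=\; \mathrm{NWSup}(G, g) \;\ge\; t .
\]
% The left-hand side lies between the two per-part ratios, and the ratio
% for g' is the larger one, so it is at least t. Lemma 4.1 then yields
\[
\mathrm{NWSup}(G, g')
= \frac{\mathrm{Cont}(G, g', E(g'))}{|E(g')|}
\;\ge\; \frac{\mathrm{Cont}(G, g, E(g'))}{|E(g')|}
\;\ge\; t .
\]
```

The same two inequalities apply to g_i when its own ratio reaches t, which gives the case in which both parts are strong.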

Let u be the root of g_i and let v be the first vertex reached after the root. There are three cases based on the degree of v: deg(v) = 1, deg(v) = 2, and deg(v) > 2. When deg(v) = 1, v is a leaf and there is only one edge in the branch (g_i) before reaching v. Thus, g′ is a strong pattern and g_i is a 1-extension pattern by definition. Therefore, the property holds. When deg(v) = 2, we can move the edge (u, v) from g_i to g′. If g_i is strong now, then g′ is either a strong or a 1-extension pattern since g′ was strong before adding the edge (u, v). In the case that g_i is still weak, g′ has to be strong because g is strong. Then, we traverse down the branch again from v. For deg(v) > 2, there are at least two downward branches starting at vertex v in addition to the edge (u, v). First, the edge (u, v) is moved from g_i to g′. In the case that g_i becomes strong, the property holds. Otherwise, g_i is partitioned based on its downward branches: the branch with the lowest normalized weighted support remains in g_i while the other branches are moved to g′. The procedure continues.

Figure 3: Example of partition

In this procedure, both g_i and g′ are always connected and one of two termination conditions will occur: (1) g_i becomes strong or (2) deg(v) = 1. In case (1), g′ is either strong or 1-extension strong and g_i is strong. Thus, the property holds. In case (2), g_i has one single edge, which is a 1-extension pattern by definition. Thus, the property also holds.

For the case where g is a general connected graph, the proof is more tedious and complicated. Due to space limitations, we do not present the formal proof, but rather give a sketch. Let g_T be a spanning tree of g. Then a partition process similar to the one for the tree is performed on g_T. There are three modifications to the partitioning process that keep g′ and g_i connected subgraphs. (1) g_i and g′ are graphs instead of trees. (2) When taking one branch of the pattern, we need to take both the branch in the spanning tree and the edges whose two endpoints are both in the branch. The edges between g_i and g′ are assigned to g′. (3) After moving the edge (u, v) from g_i to g′, we need to move all edges between u and vertices in g_i from the subgraph g_i to the subgraph g′, one at a time, before traveling downward.

During this procedure, when one edge is moved from g_i to g′, if g′ changes from strong to weak, then g_i has to be strong and g′ is a 1-extension pattern. Otherwise, when g_i contains only one edge, it is a 1-extension pattern. Therefore, the property holds. □

It is possible that a strong pattern S can be partitioned into two sub-patterns of which one is neither strong nor 1-extension. However, there must exist a partition of S (i.e., S is partitioned into S1 and S2) such that one of the sub-patterns is strong while the other is either strong or 1-extension. Therefore, we are able to reach S even if we only focus on strong and 1-extension patterns, since S1 and S2 would be generated first.

Based on the 1-extension property, we devise a bottom-up approach that only focuses on strong and 1-extension patterns. The 1-extension property can be used directly to find subgraph patterns with normalized weighted support of at least t. On the other hand, more is needed for top-k pattern discovery since the weight threshold t is unknown. In this case, a weight threshold is dynamically maintained. The algorithms for solving these two problems are discussed in detail in the next two sections.
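One common way to maintain a dynamic threshold of this kind is to keep the k best normalized weighted supports seen so far in a min-heap and use the smallest of them as the current threshold. The Python sketch below illustrates that general idea only; it is an assumption for illustration and not the k-WIGM procedure itself, which is described in a later section. The class name `TopKThreshold` and its members are hypothetical.

```python
import heapq
from typing import List


class TopKThreshold:
    """Track the k largest support values seen so far (illustrative sketch)."""

    def __init__(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._heap: List[float] = []  # min-heap holding the current top k

    def offer(self, nwsup: float) -> None:
        """Record the normalized weighted support of a newly found pattern."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, nwsup)
        elif nwsup > self._heap[0]:
            heapq.heapreplace(self._heap, nwsup)

    @property
    def threshold(self) -> float:
        """Smallest of the current top k, or -inf until k patterns are known."""
        if len(self._heap) < self.k:
            return float("-inf")
        return self._heap[0]


tracker = TopKThreshold(k=2)
for value in (3.6, 10.0, 6.8):  # NWSup values quoted for Figure 2
    tracker.offer(value)
current_threshold = tracker.threshold
```

With such a tracker the threshold can only rise as more patterns are examined, so a pattern that is weak with respect to the current threshold stays weak for the rest of the search.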

5 Threshold-WIGM: Mining with Weighted Support

To find patterns whose normalized weighted support is above a user-specified threshold t, one straightforward method is to start from a small pattern and grow it by adding one edge at a time. Due to the lack of the anti-monotonic property, it is possible that the normalized weighted support of a connected pattern with m edges is larger than that of any of its connected sub-patterns with m − 1 edges, so this approach does not provide any termination condition for the search. For example, in Figure 1, NWSup(G, g) = 1.9 and g has two connected subgraphs with 2 edges. The normalized weighted supports of these two sub-patterns are 1.6 and 1.3. In fact, we can only say that for a connected pattern P with m edges, there exists a connected sub-pattern P′ with [m/2] edges whose normalized weighted support is larger than or equal to that of P. In order to use the existing depth-first subgraph mining methods, the following modification has to be made. When a strong connected pattern with m edges is found, it has to grow one edge at a time, and the search on this pattern can terminate only if none of its connected super-patterns with up to 2m edges is strong. We name this method the base algorithm (t-base). Obviously, t-base is an inefficient algorithm. Looking ahead, the base algorithms are compared empirically with our proposed WIGM algorithms in a later section.

In this section, we present the algorithm for a given minimum normalized weighted support t, referred to as threshold-WIGM (t-WIGM for short). The formal description of this algorithm is presented in Algorithm 1.

5.1 Main Algorithm

Since all strong patterns can be generated by combining two strong patterns or by combining a strong pattern with a 1-extension pattern, the following procedure is employed. t-WIGM proceeds iteratively. The main data structure in this algorithm consists of four sets: S, W, S_N, and W_N. S stores all strong patterns while W stores all 1-extension patterns discovered so far. S_N and W_N store the strong and 1-extension patterns, respectively, that were newly generated in the previous round.

Algorithm 1 Threshold-WIGM
Input: Graph G, minimum normalized weighted support t
Output: A set S of patterns with normalized weighted support in G greater than or equal to t.

S ← ∅, W ← ∅, S_N ← ∅, W_N ← ∅
for each unique edge e in G do
    Calculate NWSup(G, e)
    if NWSup(G, e) ≥ t then
        Add e into S and S_N
    else
        Add e into W and W_N
    end if
end for
while either S_N or W_N is not empty do
    W′_N ← ∅, S′_N ← ∅
    for each pair of patterns (p1, p2) in (S_N, S), (S, W_N), and (S_N, W) do
        CP ← combine(p1, p2)
        for each candidate pattern g in CP do
            if g is not in S and NWSup(G, g) ≥ t then
                Add g into S′_N
            end if
        end for
    end for
    for each pattern g in S_N do
        Find a set of edges SE that can be added to g
        for each edge e in SE do
            Obtain g′ by adding e to g
            if g′ is not in S and not in W then
                if NWSup(G, g′) < t then
                    Add g′ into W′_N
                else
                    Add g′ into S′_N
                end if
            end if
        end for
    end for
    S_N ← S′_N
    W_N ← W′_N
    Add patterns in S′_N into S
    Add patterns in W′_N into W
end while
return S



Initially, the normalized weighted support of every edge is computed. If the edge is strong, then it is put into both S and S_N. Otherwise, it is put into W and W_N since every weak single-edge graph pattern is defined as a 1-extension pattern. Notice that initially, S = S_N and W = W_N. In later rounds, S is a superset of S_N while W is a superset of W_N.
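The initialization step can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the paper: a single-edge pattern is identified by the unordered pair of its endpoint labels, edge labels are ignored, and each edge of G is listed exactly once. The function name `init_single_edge_patterns` and the toy graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pattern = Tuple[str, str]  # sorted pair of endpoint labels


def init_single_edge_patterns(
    vertex_labels: Dict[str, str],
    weighted_edges: Iterable[Tuple[str, str, float]],
    t: float,
) -> Tuple[Set[Pattern], Set[Pattern]]:
    """Split all single-edge patterns into strong ones and weak ones.

    For a one-edge pattern, every matching edge of G is its own occurrence,
    so NWSup is the sum of the matching edge weights divided by 1.
    """
    totals: Dict[Pattern, float] = defaultdict(float)
    for u, v, w in weighted_edges:
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        totals[(a, b)] += w
    strong = {p for p, s in totals.items() if s >= t}
    weak = set(totals) - strong  # weak single edges are 1-extension patterns
    return strong, weak


# Hypothetical toy graph.
vertex_labels = {"u0": "a", "u1": "a", "u2": "b", "u3": "d", "u4": "b"}
weighted_edges = [("u0", "u2", 3.0), ("u1", "u2", 2.5),
                  ("u2", "u3", 1.0), ("u3", "u4", 0.5)]
S, W = init_single_edge_patterns(vertex_labels, weighted_edges, t=2.0)
S_N, W_N = set(S), set(W)  # initially S = S_N and W = W_N
```

Here the pattern with labels (a, b) collects weight 5.5 and is strong, while (b, d) collects 1.5 and becomes a 1-extension pattern.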

In each later round, we first generate new strong patterns. A strong pattern may be obtained in two ways: (1) combining two strong patterns or (2) combining a 1-extension pattern with a strong pattern. The first case is equivalent to combining a pattern in S_N with a pattern in S, while the second case is to combine a pattern in S with a pattern in W_N or to combine a pattern in S_N with a pattern in W. The combination procedure is described in a later subsection.

For each of these newly generated candidate patterns g, we first test whether g is already in S, which can be done by using the canonical form of g. There are many types of canonical forms and any canonical form would work here. Without loss of generality, the canonical form in [7] is chosen. If g does not exist in S, NWSup(G, g) is calculated. If NWSup(G, g) < t, then g is discarded. Otherwise, it is added into S′_N for the next round.

Next, new 1-extension patterns are generated. By definition, a 1-extension pattern can be obtained by adding one edge to a strong pattern. It is unnecessary to extend all patterns in S since many of these patterns have been extended in previous rounds. Thus, only patterns in S_N are extended. For every pattern g in S_N, one more edge is added to g. The new edge may connect two vertices in g or connect one vertex in g and a vertex outside g. For each newly extended pattern g′, we check whether g′ is in S or W. If not, NWSup(G, g′) is computed. If NWSup(G, g′) < t, g′ is appended to W′_N. Otherwise, g′ is added to S′_N.

The final step in each round is to replace W_N and S_N with the newly generated W′_N and S′_N, respectively. In addition, S and W are updated to include these new patterns. The process terminates when W_N = S_N = ∅.

5.2 Subgraph Pattern Combination

One of the main difficulties in t-WIGM is how to combine two subgraphs g1 and g2. Since we require the resulting patterns to be connected, at least one vertex from each subgraph should have the same label. If no vertex label in g1 appears in g2, then the result of the combination is the empty set. Otherwise, for each vertex v in g1, we find the vertices u in g2 which have the same label as v. The vertices u and v can be combined into one vertex in the new pattern. A data structure M is used to maintain the mapping of all pairs of vertices in g1 and g2 that have the same label.

Let us assume that g1 has three vertices v0, v1, and v2 with labels A, A, and C while g2 consists of three vertices u0, u1, and u2 with labels A, B, and C, as shown in Figure 4(a). The mapping includes the following pairs: (v0, u0), (v1, u0), and (v2, u2).

A new combined pattern g′ includes one or more combined vertices. We first generate new patterns with one combined vertex, then two combined vertices, and so on. The maximum number of combined vertices in the new pattern is equal to the maximum number of disjoint pairs in M, which can be determined by the maximum matching in a bipartite graph. The vertices in g1 form one side while the vertices in g2 form the other. If there is a pair (v, u) in M, there is an edge between v and u in the bipartite graph. A standard max-flow algorithm can be used to determine the maximum matching. In Figure 4(b), there are three new patterns with one combined vertex, g′1, g′2, and g′3, and two patterns with two combined vertices, g′4 and g′5. The formal description of the combination algorithm is in Algorithm 2. In the worst case, the algorithm is exponential. However, the experimental results show that in real applications the algorithm is much more efficient on average.

Figure 4: Combine Two Patterns. (a) Two patterns to be joined. (b) Five new patterns.
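The vertex-pairing part of the combination step can be illustrated in Python on the example of Figure 4(a). This is a sketch under stated assumptions rather than the paper's Algorithm 2: it enumerates only which vertices are merged (not the resulting edge sets), it does not remove isomorphic duplicates, and it computes the maximum matching with a simple augmenting-path search instead of a max-flow routine. The function names `label_pairs`, `max_matching_size`, and `vertex_merges` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (vertex of g1, vertex of g2) with equal labels


def label_pairs(labels1: Dict[str, str], labels2: Dict[str, str]) -> List[Pair]:
    """Build M: all cross pairs of vertices that carry the same label."""
    return [(v, u) for v, lv in labels1.items()
            for u, lu in labels2.items() if lv == lu]


def max_matching_size(pairs: List[Pair]) -> int:
    """Maximum bipartite matching via augmenting paths (swapped in for max-flow)."""
    adj: Dict[str, List[str]] = {}
    for v, u in pairs:
        adj.setdefault(v, []).append(u)
    match_of_u: Dict[str, str] = {}

    def try_augment(v: str, seen: Set[str]) -> bool:
        for u in adj.get(v, []):
            if u in seen:
                continue
            seen.add(u)
            if u not in match_of_u or try_augment(match_of_u[u], seen):
                match_of_u[u] = v
                return True
        return False

    return sum(try_augment(v, set()) for v in adj)


def vertex_merges(pairs: List[Pair], size: int) -> List[Tuple[Pair, ...]]:
    """All ways to merge `size` disjoint pairs (no vertex used twice)."""
    result = []
    for combo in combinations(pairs, size):
        if (len({v for v, _ in combo}) == size
                and len({u for _, u in combo}) == size):
            result.append(combo)
    return result


g1_labels = {"v0": "A", "v1": "A", "v2": "C"}
g2_labels = {"u0": "A", "u1": "B", "u2": "C"}
M = label_pairs(g1_labels, g2_labels)
limit = max_matching_size(M)
merges_by_size = {i: vertex_merges(M, i) for i in range(1, limit + 1)}
```

On this input the sketch finds three merges with one combined vertex and two merges with two combined vertices, matching the five new patterns counted in Figure 4(b). Enumerating subsets of M is exponential in the worst case, in line with the remark in the text.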

Algorithm 2 Combining Two Patterns
Input: Patterns g1 and g2.
Output: A set of new combined patterns CP.

 1: for each vertex v in g1 do
 2:   Find the set of vertices SV in g2 having the same label as v.
 3:   for each vertex u in SV do
 4:     Add (v, u) into M
 5:   end for
 6: end for
 7: l ← max flow(g1, g2, M)
 8: i ← 1
 9: while i ≤ l do
10:   for each set SU of i vertices in g1 do
11:     if there exist i distinct vertices SV in g2 that are mapped from SU then
12:       p ← pattern obtained by combining SV and SU
13:       CP ← CP ∪ {p}
14:     end if
15:   end for
16:   i ← i + 1
17: end while
18: return CP

5.3 Support Computation The computation of the normalized weighted support of a pattern g is at the heart of t-WIGM. Since it is invoked many times, it is essential that this computation be performed efficiently. The main difficulty is to locate all occurrences of a subgraph pattern. Subgraph indexing is used to find occurrences of a pattern since it has been shown to accelerate the matching process dramatically. Without loss of generality, GADDI [18] is chosen as the indexing structure for the large weighted graph. After the matches of a subgraph pattern are discovered, the weights of the matched edges are obtained and the normalized weight of the pattern is computed based on the definition.

5.4 Algorithm Analysis In this subsection, the correctness of the t-WIGM algorithm is first proven, and then its time complexity is discussed. To prove correctness, we show by induction that every strong pattern is enumerated by the algorithm. All single-edge strong and 1-extension patterns are generated in the initialization step. Assume that all 1-extension patterns and strong patterns with i or fewer edges have been enumerated. A strong pattern p with i+1 edges can be constructed by combining two connected strong patterns with i or fewer edges, or a strong pattern and a 1-extension pattern with i or fewer edges. Therefore, p will be enumerated. In addition, a 1-extension pattern with i+1 edges is generated by extending a strong pattern with i edges. Since all strong patterns with i edges are enumerated, all 1-extension patterns with i+1 edges are also enumerated. Thus, the t-WIGM algorithm finds all strong patterns.

The complexity of the basic algorithm depends heavily on the number of discovered patterns. Let n be the total number of strong patterns and l be the number of distinct edges in G. The total number of 1-extension patterns is then at most n × l. Since every strong pattern needs to be combined with every strong pattern and every 1-extension pattern, there are at most n × (n + n × l) = O(n²l) possible combinations. The complexity of combining two patterns depends heavily on the number of vertices sharing the same label in the two patterns. Due to space limitations, we do not show the theoretical complexity bound of pattern combination. Instead, we empirically characterize the efficiency of the algorithm in Section 7.

t-WIGM has two main shortcomings. First, the threshold is difficult for an end user to set. Second, the algorithm may not be efficient, since it has to keep all discovered strong patterns and 1-extension patterns in memory and there is no bound on their number. To address these problems, we propose an alternative model in this paper: top-k patterns.

6 Top-K-WIGM: Mining for Top-k Patterns

Algorithm 3 k-WIGM-Addition
Input: Graph G, the number k, pattern sets S, W, SN, WN.
Output: None.

 1: t is set to kth normalized weighted support in S
 2: Remove patterns with normalized weighted support less than t from S and SN
 3: W ← ∅, WN ← ∅
 4: for each pattern p in SN do
 5:   for each edge e in G do
 6:     Initialize a candidate pattern set CP ← ∅
 7:     CP ← combine(p, e)
 8:     for each candidate pattern g in CP do
 9:       if NWSup(G, g) < t then
10:         Add g into W and WN
11:       end if
12:     end for
13:   end for
14: end for
15: for each pattern p in S and not in SN do
16:   for each edge e adjacent to a vertex in p do
17:     CP ← combine(p, e)
18:     for each candidate pattern g in CP do
19:       if NWSup(G, g) < t then
20:         Add g into W
21:       end if
22:     end for
23:   end for
24: end for

When finding the top k patterns, the minimum support threshold t is unknown. Therefore, an iterative approach, called top-k-WIGM (k-WIGM for short), is devised. The top-k pattern discovery process is the same as t-WIGM with one modification: the minimum normalized weighted support threshold is updated at the end of each round. At the end of the ith round, S contains a set of strong patterns. The normalized weighted support of the kth pattern is computed and chosen as the minimum support threshold ti for the next round. Patterns with support less than ti are pruned from S and SN. W and WN are then rebuilt from the new S and SN. First, both W and WN are set to empty. Next, each pattern p in SN is extended by one edge, and the new extended patterns are put into WN and W if they are not strong. Each subgraph pattern q in S but not in SN is also extended by one edge, and the new extended patterns are included in W if they are not strong. Since there are at most k distinct patterns in S and SN together, the computation time to generate the patterns in WN and W is not significant. In terms of Algorithm 1, the only change is that the k-WIGM-Addition procedure is invoked between lines 36 and 37. The formal description of the k-WIGM-Addition procedure is in Algorithm 3.
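As a rough illustration of the threshold update, the following Python sketch follows the overall shape of Algorithm 3 under several simplifying assumptions. Patterns are treated as opaque hashable values, `combine` and `nwsup` are hypothetical callables standing in for pattern extension and support computation, S is assumed to be non-empty, and both loops of Algorithm 3 are folded into one (restricting the edges to those adjacent to a pattern is left to `combine`).

```python
def kth_support_threshold(supports, k):
    """Return the k-th largest support (the smallest one if fewer than k exist)."""
    ranked = sorted(supports, reverse=True)
    return ranked[min(k, len(ranked)) - 1]


def k_wigm_addition(S, SN, k, edges, combine, nwsup):
    """One threshold update in the spirit of Algorithm 3.

    S       : dict mapping strong pattern -> normalized weighted support
    SN      : set of patterns in S that were newly found in this round
    edges   : iterable of single-edge patterns of G
    combine : hypothetical callable (pattern, edge) -> iterable of candidates
    nwsup   : hypothetical callable pattern -> normalized weighted support
    """
    t = kth_support_threshold(S.values(), k)
    S = {p: s for p, s in S.items() if s >= t}   # ties at t are kept
    SN = {p for p in SN if p in S}
    W, WN = set(), set()
    for p in S:
        for e in edges:
            for g in combine(p, e):
                if nwsup(g) < t:                 # g is not strong
                    W.add(g)
                    if p in SN:
                        WN.add(g)
    return t, S, SN, W, WN


# Toy run: patterns are strings and "combining" appends an edge name.
support_table = {"ax": 8.0, "cx": 5.0}
t, S, SN, W, WN = k_wigm_addition(
    S={"a": 9.0, "b": 7.0, "c": 7.0, "d": 3.0},
    SN={"c", "d"},
    k=2,
    edges=["x", "y"],
    combine=lambda p, e: [p + e],
    nwsup=lambda g: support_table.get(g, 0.0),
)
```

In the toy run the threshold becomes 7.0, pattern "d" is pruned, and "b" and "c" are both kept because they tie at the threshold, which is the case in which S holds more than k patterns.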

In each round, with newly discovered strong and 1-extension patterns, the minimum support threshold ti increases monotonically (it never decreases from one round to the next). As a result, the number of patterns in S is kept at k. (S may have more than k patterns if multiple patterns have the same normalized weighted support.) Consequently, the memory requirement of k-WIGM is quite small. When the algorithm terminates (i.e., SN is empty), the patterns in S are returned.

k-WIGM is correct because all potential strong patterns are combined with potential strong patterns and potential 1-extension patterns. Therefore, no strong pattern will be missed. For example, for a strong subgraph pattern P with normalized weighted support greater than t, there exist two subgraph patterns P1 and P2 such that one is a strong pattern (support greater than t) and the other is either a strong pattern or a 1-extension pattern. P1 and P2 will have been discovered and put into S and/or W in previous rounds, since the threshold in previous rounds is less than or equal to t. Therefore, P will be discovered.

In each round, S has at most k patterns and W has at most k × l patterns, so the total number of combinations per round is at most k × (k + k × l) = O(k²l). Assuming that there are r rounds, the overall number of pattern combinations is O(k²lr). The complexity of combining two patterns depends heavily on the number of shared vertex labels in the two subgraphs. We show the efficiency of the algorithm empirically in the next section.

7 Experimental Results

We analyze the effectiveness and efficiency of the weighted subgraph mining models and methods in this section. The WIGM algorithms deal with weighted graphs. To the best of our knowledge, although much work has been proposed to mine patterns in a set of weighted graphs and patterns in a single unweighted graph, there is no prior work on discovering subgraph patterns in a single large weighted graph. Thus, we cannot compare our methods with existing alternative models and instead compare WIGM with a baseline algorithm. In the baseline algorithm, the 1-extension property is not employed. Instead, in each round, one more edge is added to the existing patterns as described in Section 5. The two versions of WIGM and the two versions of the baseline algorithm (t-base and k-base) are implemented in C++. All experiments are conducted on a Dell PowerEdge 2950 with two 3.33 GHz quad-core CPUs and 32 GB of main memory, running Linux 2.6.18-92.e15-smp.

7.1 Biological Networks The biological network used in this experiment, Gbio = (Vbio, Ebio, ΣVbio, ΣEbio, LGbio), is constructed from the experimental data of [3], denoted Gfly (Drosophila melanogaster); specifically, Gbio is constructed from the protein-protein interaction data, which gives us an easier way of validating motifs. Vbio is a set of fly genes. An edge in Gfly is a possible (potential) interaction between two genes. The weight on an edge is the sum of the likelihoods of the multi-model experimental data supporting a functional relationship between two genes, which represents the probability of the interaction between the two proteins. An edge exists if the sum is above a predetermined threshold. The vertex labels are GO Biological Process terms [5] (GO:BP).

In summary, there are 7496 vertices (|Vbio| = 7496) with 515 distinct labels and 25408 edges (|Ebio| = 25408). The average vertex degree is 6.78. We apply both WIGM algorithms to this data set with different thresholds. The t value is chosen so that the same number of patterns is discovered by all methods. Table 1 shows the execution time of the t-WIGM, k-WIGM, t-base, and k-base algorithms.

Table 1: Results on Biological Network (time in sec.)
k     t     k-WIGM   t-WIGM   k-base   t-base
4     176   23       23       47       44
8     167   67       65       159      149
16    126   158      149      468      451
32    115   389      359      1332     1287
50    107   723      668      3345     3192
100   100   1854     1715     11081    10454

t-WIGM saves up to about 8% of the execution time compared to k-WIGM in Table 1, since k-WIGM needs to estimate the minimum weight threshold dynamically. At each round, k-WIGM sets the minimum weight threshold t to the kth normalized weighted support in the set of strong patterns generated so far. Thus, in the earlier rounds, t is set to a small value and more weak patterns are generated by k-WIGM, which leads to the prolonged execution time. It takes about 6.5 minutes to find the 32 patterns with the highest weights. Compared to the t-base and k-base methods, the execution time of the WIGM algorithms is about 1/2 to 1/6 as long, because the base algorithms have two shortcomings: (1) the termination condition is loose, so more iterations are needed, and (2) edges are inserted one at a time, so more candidate patterns are generated. (The base algorithms do have one advantage: adding one edge to a pattern can be done more efficiently than combining two patterns.) Overall, the base algorithms need a longer execution time, which makes the pruning power of the 1-extension property evident.
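The ratios discussed above can be rechecked from the figures in Table 1. The short Python sketch below is only that arithmetic; the row values are copied from the table, while the function and variable names are illustrative.

```python
# Rows of Table 1: (k, t, k-WIGM, t-WIGM, k-base, t-base), times in seconds.
TABLE1 = [
    (4, 176, 23, 23, 47, 44),
    (8, 167, 67, 65, 159, 149),
    (16, 126, 158, 149, 468, 451),
    (32, 115, 389, 359, 1332, 1287),
    (50, 107, 723, 668, 3345, 3192),
    (100, 100, 1854, 1715, 11081, 10454),
]


def summarize(rows):
    """Per row: t-WIGM saving over k-WIGM (%) and WIGM/base time ratios."""
    summary = []
    for k, _t, k_wigm, t_wigm, k_base, t_base in rows:
        saving = 100.0 * (k_wigm - t_wigm) / k_wigm
        summary.append((k,
                        round(saving, 1),
                        round(k_wigm / k_base, 2),
                        round(t_wigm / t_base, 2)))
    return summary


summary = summarize(TABLE1)
for row in summary:
    print(row)
```

The time ratios fall from roughly one half at k = 4 to roughly one sixth at k = 100.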

Table 2: Discovered Motifs in Biological Network
k     Motif                         Terms
1     3479.37344.11598.39044        49
25    3479.11598.1219.10602.212     60
50    3479.37344.11598.11661        16
62    3479.11598.2593.1319          18
75    3479.37344.11598.1128         30
96    4595.10300                    2

The following method is used to infer (represent) the biological importance of a graph motif. A graph motif consists of a set of genes. Each gene is associated with some words or terms. The terms may include gene names, gene ids, function annotations (e.g., gene ontology terms), etc. GeneList Analyzer [6] collects a large amount of biological literature, including research articles, experiment notes, etc. An over-represented term is a term whose number of occurrences is above a level of statistical significance, e.g., p < 0.05, assuming that the terms follow a normal distribution. Thus, the biological importance of a gene can be represented by the number of over-represented terms associated with this gene. In turn, the importance of a graph pattern can be represented by the number of over-represented terms associated with all genes in the pattern.

Table 2 shows some examples of discovered motifs. The left column, k, is the rank of the motif based on the normalized weighted support; the middle column is the FBGN motif; the right column is the number of over-represented terms determined by GeneList Analyzer [6]. The motif is encoded as a regular expression where ‘.’ is concatenation. The discovered patterns are biologically interesting. Using GeneList Analyzer, an application that detects the statistical significance (p < 0.05) of gene lists and their respective terms (GO:BP, for example), we find that the top 100 motifs discovered by k-WIGM are associated with a large number of over-represented terms.

In Fig. 5, we provide a smoothed plot of the results: k-WIGM motif vs. number of significant terms. The straight line is the linear regression. We observe the intuitive relationship: as the motif becomes less important (smaller normalized weighted support), the number of significant terms decreases.

Figure 5: A smoothed curve of motif significance vs. terms and a linear regression line generated by R.

To show the usefulness of the normalized weighted support model, we compare the patterns discovered by this model with those of three other models: the frequently occurring model, the occurrence weight model, and the overall weight model. In the frequently occurring model, the weights on the edges are discarded and the goal is to find the frequently repeated patterns as in [9]. In the occurrence weight model, if an edge is in x occurrences of a pattern, then its weight is counted x times instead of once for this pattern. In this model, an edge that is in more occurrences contributes more to the overall importance of the pattern. In the overall weight model, we do not normalize the weight according to the number of edges in the pattern. In this case, larger patterns may have a higher weight. The number of over-represented terms contained in the top 100 patterns according to these three models is much smaller, between 0 and 10. The average number of over-represented terms for each of these models is shown in Table 3.
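The differences among the four models can be made concrete with a small Python sketch. It rests on assumptions that the passage does not fully pin down: the normalized weighted support is taken to be the total weight of the distinct graph edges covered by any occurrence divided by the number of pattern edges, the occurrence weight model is assumed to use the same normalization, and a raw occurrence count stands in for the frequency-based support of [9]. All names are hypothetical.

```python
def pattern_scores(occurrences, weight, pattern_edge_count):
    """Score one pattern under the four models compared in the text.

    occurrences        : list of occurrences, each an iterable of graph edges
    weight             : dict mapping graph edge -> weight
    pattern_edge_count : number of edges in the pattern
    """
    distinct = set()
    for occ in occurrences:
        distinct.update(occ)
    # Each covered graph edge contributes its weight exactly once.
    overall = sum(weight[e] for e in distinct)
    # Each covered graph edge contributes once per occurrence containing it.
    repeated = sum(weight[e] for occ in occurrences for e in set(occ))
    return {
        "frequency": len(occurrences),                 # weights ignored
        "overall_weight": overall,                     # no normalization
        "normalized_weight": overall / pattern_edge_count,
        "occurrence_weight": repeated / pattern_edge_count,  # assumed normalized
    }


# A two-edge pattern with two occurrences that share graph edge e2.
scores = pattern_scores(
    occurrences=[("e1", "e2"), ("e2", "e3")],
    weight={"e1": 4.0, "e2": 6.0, "e3": 2.0},
    pattern_edge_count=2,
)
```

In this toy input the shared edge e2 is counted once by the normalized and overall models and twice by the occurrence weight model, which is the distinction the text draws.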

Table 3: Models and Average Number of Terms
Model                         Terms
Normalized Weight Model       31.5
Frequently Occurring Model    3.8
Occurrence Weight Model       4.2
Overall Weight Model          4.1

7.2 Synthetic Graphs To better analyze the performance of the WIGM algorithms with respect to different aspects of the input data, a set of synthetic graphs is employed. The input graphs are generated by a tool called gengraph win [19] based on four parameters: the number of vertices, the number of vertex labels, the average degree of a vertex, and the standard deviation of the weights on the edges. In all experiments, we assume that the weight on an edge follows a normal distribution and that the average edge weight is 10. Table 4 shows the default values of these parameters. The degree of a vertex in the input graph G follows an exponential distribution with the rate parameter λ set to 1/d, where d is the average degree.
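To make the roles of the four parameters concrete, here is a minimal Python sketch of a generator with the stated distributions. It is not the tool cited as [19]: the function name, the random stub-pairing scheme, the uniform label assignment, and the choice to drop self-loops and parallel edges are all assumptions made for illustration, and dropping those edges makes the realized average degree slightly lower than the target.

```python
import random


def synthetic_weighted_graph(n, num_labels, avg_degree, weight_sd,
                             mean_weight=10.0, seed=0):
    """Return (labels, edges) for a random labeled, edge-weighted graph.

    labels : list with one integer label per vertex
    edges  : dict mapping (u, v) with u < v to a normally distributed weight
    """
    rng = random.Random(seed)
    labels = [rng.randrange(num_labels) for _ in range(n)]
    # Target degrees: exponential with rate 1/d, so the mean is about d.
    degrees = [int(round(rng.expovariate(1.0 / avg_degree))) for _ in range(n)]
    # One "stub" per unit of degree; pairing shuffled stubs yields edges.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = {}
    for a, b in zip(stubs[0::2], stubs[1::2]):
        if a == b:
            continue                      # drop self-loops
        key = (min(a, b), max(a, b))
        if key not in edges:              # drop parallel edges
            edges[key] = rng.gauss(mean_weight, weight_sd)
    return labels, edges


labels, edges = synthetic_weighted_graph(n=200, num_labels=20,
                                         avg_degree=10, weight_sd=2.0)
realized_avg_degree = 2 * len(edges) / len(labels)
```

The standard deviation of 2.0 used in the call is an arbitrary illustrative value; the text does not state a default for it.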

There are two extra parameters in this set of experiments: the number of patterns (k) in k-WIGM and the minimum normalized weighted support (t) in t-WIGM. To make fair comparisons, the value of t is set according to k such that all methods discover the same number of patterns. The default value of k is also shown in Table 4. In this section, these parameters are varied one at a time to show their effects on all methods.

Table 4: Default Parameter Value
Parameter                  Default Value
Number of Vertices in G    5000
Number of Labels           500
Average Degree of G        10
k                          100

In all experiments, we find that t-WIGM takes 5% to 10% less time than k-WIGM to find the same set of patterns. At the beginning, the threshold used in k-WIGM may be much lower than the true threshold. As a result, many "useless" patterns are discovered by k-WIGM in early rounds and then discarded by the higher thresholds of later rounds. In addition, the WIGM algorithms outperform the base algorithms due to the pruning power of the 1-extension property, which reduces the number of candidate patterns and the number of iterations. When the number of vertices varies from 1000 to 10000, as shown in Figure 6(a), the execution time of both versions of WIGM increases linearly with the number of vertices, while the execution time of the base methods increases exponentially. The reason is that the number of potential patterns grows with the number of vertices, so the pruning power of the 1-extension property becomes more evident.

Figure 6: Execution Time. (a) Number of Vertices in G. (b) Average Vertex Degree of G. (c) Distinct labels. (d) K.

When the average degree of G increases, the execution time of WIGM increases exponentially, as shown in Figure 6(b). The main reason is that with a higher degree in G, the discovered patterns also have a higher average degree. In such a case, the cost of combining patterns is higher, which leads to a higher execution time. As in the previous figure, when the degree is higher, the number of potential candidate patterns is larger. Thus, the 1-extension property can prune more patterns, which leads to a larger saving in execution time over the base algorithms.

With more distinct label types, the execution time also increases, as illustrated in Figure 6(c). In each round, more candidate patterns are generated. As a result, the overall time for discovering patterns increases. For the base methods, on the other hand, although the number of patterns increases, the pace of the increase is not high. Therefore, the improvement of WIGM over the base algorithms remains more or less constant across different numbers of label types.

In Figure 6(d), the execution time increases when more patterns are needed. In this case, all four methods take more rounds and the mining process takes more time. With more iterations, the pruning power of the 1-extension property is more significant and the disparity between WIGM and the base methods is larger.

On average, the discovered patterns in the synthetic graphs consist of around seven vertices and 15 edges. Overall, our proposed WIGM algorithms can efficiently discover strong patterns in graphs with hundreds of thousands of edges. The average degree of the graph can be in the range of 30 or 40. Some very large social networks have millions of vertices and edges, and the WIGM algorithms may not be able to handle these graphs. On the other hand, many real data sets, e.g., biological networks and small social networks, fall into this range. Therefore, our WIGM algorithms can be used to find important patterns in these graphs with a manageable execution time.

8 Conclusion

In this paper, we study the problem of mining important patterns from a large weighted graph. First, a weighted subgraph pattern model is proposed. Although the anti-monotonic property no longer holds for the model, we are able to identify a weaker property, namely the 1-extension property. Based on this property, two algorithms (t-WIGM and k-WIGM) are proposed to find the patterns with respect to a threshold t and the top k patterns, respectively. With real and synthetic data sets, we show that WIGM not only finds important patterns but also does so efficiently.

References

[1] C. Borgelt and M. Fiedler. Graph Mining: Repository vs. Canonical Form. Proc. of Annual Conference of the German Classification Society, 2007.

[2] B. Bringmann and S. Nijssen. What Is Frequent in a Single Graph? PAKDD, 2008.

[3] J. Costello, M. Dalkilic, S. Beason, R. Patwardhan, S. Middha, B. Eads, and J. Andrews. Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biology, vol. 10, no. 9, 2009.

[4] M. Fiedler and C. Borgelt. Support computation for mining frequent subgraphs in a single graph. MLG, 2007.

[5] The Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Research, vol. 36, 2008.

[6] X. He, M. Sarma, X. Ling, B. Chee, C. Zhai, and B. Schatz. Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model. BMC Bioinformatics, vol. 11, 2010.

[7] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: mining maximal frequent subgraphs from graph databases. Proc. of KDD, 2004.

[8] C. Jiang, F. Coenen, and M. Zito. Frequent Sub-graph Mining on Edge Weighted Graphs. DAWAK, 2010.

[9] M. Kuramochi and G. Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 2005.

[10] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., vol. 28, 2000.

[11] M. Kuramochi and G. Karypis. Finding Frequent Patterns in a Large Sparse Graph. DMKD, 2005.

[12] E. Marcotte, M. Pellegrini, M. Thompson, T. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, vol. 402, 1999.

[13] S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. Proc. of KDD, 2004.

[14] M. Shinoda, T. Ozaki, and T. Ohkawa. Weighted Frequent Subgraph Mining in Weighted Graph Databases. ICDM Workshop on Domain Driven Data Mining, 2009.

[15] N. Vanetik, S. E. Shimony, and E. Gudes. Support measures for graph data. Data Min. Knowl. Discov., 2006.

[16] X. Yan and J. Han. gSpan: graph-based substructure pattern mining. Proc. of ICDM, 2002.

[17] J. Yang, W. Su, S. Li, and M. M. Dalkilic. WIGM: Discovery of Subgraph Patterns in a Large Weighted Graph. Technical Report, Case Western Reserve University. Available at http://beijing.case.edu/wigm.pdf

[18] S. Zhang, S. Li, and J. Yang. GADDI: distance index based subgraph matching in biological networks. Proc. of EDBT, 2009.

[19] gengraph win. Available at http://www.cs.sunysb.edu/algorith/implement/viger/distrib/.


