
Journal of Machine Learning Research 1 (2007) 1-35    Submitted 9/07; Published 10/07

    Structured Prediction of Degree Constrained Subgraphs

Stuart J. Andrews    ANDREWS@CS.COLUMBIA.EDU
Tony Jebara    JEBARA@CS.COLUMBIA.EDU
Department of Computer Science
Columbia University
New York, NY 10027, USA

    Editor: Leslie Pack Kaelbling

Abstract

Recent study of complex graphs has revealed that not all of them are made equal. Man-made and naturally-formed graphs exhibit structural regularity and have common degree distributions that differ a great deal from the random graphs of Erdős-Rényi. A coincident discovery, made in the machine learning community, is that several multivariate prediction problems can be solved with greater accuracy via structural dependencies imposed on the outputs: structured prediction. In other words, not all outputs are made equal. In this paper, we develop machine learning algorithms to enable the prediction of naturally structured graphs, a problem we call structured graph inference. Given partially observed connectivity over n observed objects, return a complementary set of directed edges connecting the objects together into a graph while satisfying topological constraints. Our model learns to predict connectivity of ordered pairs of objects within a structured prediction framework to enforce in-degree and out-degree conditions, while current methods for graph inference largely ignore global characteristics of graphs. If no connectivity constraints can be specified a priori, the model learns to predict these from the observed component of the graph. This specializes independent link prediction and relational learning to interdependent prediction of edges within a combinatorial family of graphs. We demonstrate the characteristics of the model as it learns to predict 2-dimensional geometric graphs, and subsequently analyze its performance on several structured graphs including protein-protein interaction, social network and citation network graphs.

Keywords: Graph Inference, Graph Reconstruction, Link Prediction, Maximum Margin, Transduction

    1. Introduction

This paper considers learning and inference of structured graphs. Graphs are abstract models used to represent collections of interrelated objects. Nodes in a graph represent individual objects, while edges in these graphs encode relationships between pairs.

We consider graphs where both nodes and edges may have attributes associated with them. For example, a representation of a sentence as a graph includes nodes for each word, and edges between words that are adjacent. The attributes of each node include the word it represents and any additional properties associated with that word, while the edges may include attributes relating to word pairs. In another example, we use a graph to represent the folding of an amino acid chain. The graph includes nodes for each amino acid, and edges connecting pairs of amino acids that are either adjacent in the polypeptide chain, or have formed a bond. Attributes of the


nodes and edges are derived from the individual amino acids and the bonds that are formed between pairs. Lastly, we consider a citation network: an abstract graph comprised of research papers as nodes and citations from one paper to another as directed edges. Attributes are derived from the full text of the paper and the text surrounding the citation.

A common belief held in various domains is that the edge attributes tell us something about the nodes and, likewise, the node attributes tell us something about the edges. In some sense, this belief encodes the idea that the information associated with nodes and edges is not entirely mutually exclusive. Therefore, given partial information about the nodes and edges, a recurring goal is to predict what is missing. For instance, in natural language processing, goals include labeling the words with part-of-speech tags, and mapping out the grammatical construction of the sentence using a directed acyclic graph. A goal in computational molecular biology is to predict the 3-dimensional (tertiary) structure of a protein by way of predicting bonds and their characteristics given a sequence of amino acids. The discovery of new or missing connections between distant or related research areas is one reason for the construction and analysis of citation networks.

We refer to these as graph inference problems. The literature is replete with further examples of graph inference problems including, for example, gap-filling of metabolic networks using gene expression information (Herrgaard et al., 2004, Vitkup et al., 2006, Chechik et al., 2007), prediction of structural properties of proteins (Baldi et al., 1999, Punta and Rost, 2005, Gassend et al., 2007), prediction of binding sites and genetic response in a regulatory network (Middendorf et al., 2004), the reconstruction of causal protein-signaling graphs from cellular response data (Sachs et al., 2005), and the graph-based prediction of protein function (Sharan et al., 2007).

A popular approach for learning to predict attributes of a graph is to build classifiers that specialize in predicting labels for individual edges and nodes. For example, metric learning has recently seen considerable progress in the machine learning community (Xing et al., 2003, Kondor and Jebara, 2006, Shalev-Shwartz et al., 2004, Globerson and Roweis, 2006). Typically, an affinity or metric is learned between pairs of examples such that pairs of objects in an equivalence class are near each other and pairs that are in different classes are far apart. When applied to graph inference, pairwise distances below a threshold indicate the presence of an edge between a pair of objects. However, using a learned affinity metric in a strictly pairwise manner for graph inference makes connectivity decisions independently in the graph. Clearly, in realistic scenarios, the presence or absence of one edge in a graph influences and depends on the presence or absence of other edges in the graph. Thus, an affinity metric alone may produce graphs whose structure is unlike that of realistic graphs.

Going one step further by taking into account the interdependence of node and edge attributes improves predictions. This is evident from the success of techniques such as diffusion kernels and statistical relational models for graphs. However, these and other techniques do not explicitly model the global topology or connectivity of those predictions.

We introduce the problem of structured graph completion: given n objects with partially observed connectivity, return a set of edges completing the graph while satisfying topological structural regularity constraints. The topological constraints we focus on concern the number of edges connected to each object. If a graph satisfies the constraints, we say that it is structured.

The first motivation for this work comes from the observation that not all graphs are alike. Many real-world graphs exhibit a high degree of structural regularity. For instance, social network graphs where edges represent friendships, collaborations or affiliations have fairly consistent properties such as small diameters and interesting degree distributions. Complex biological system


graphs, such as regulatory networks, metabolic networks, and signal transduction pathways, are composed of frequently occurring motifs (Kashtan et al., 2004) and local clustering of edges. Many models have been proposed to describe such networks, their formation (Even-Dar and Kearns, 2006) and the constraints they frequently satisfy. Therefore, it is worthwhile to augment existing learning methods to predict graphs that match structural regularities seen in real-world graphs.

A second motivation for this formulation comes from the observation that complete graphs are difficult and/or expensive to collect, while partial labelings are typically less expensive. Partial labelings are often easier to annotate by hand, or can even be extracted from other sources. This form of learning has received attention for univariate predictions, for example in multiple-instance learning (Andrews, 2007), but apparently not for structured outputs.

Since structured graph completion involves making inter-dependent predictions, we follow the structured-outputs framework (Altun et al., 2003, Taskar et al., 2003, Altun et al., 2004, Tsochantaridis et al., 2004, Altun et al., 2005, Taskar et al., 2006). However, we view structured graph completion as a transductive inference problem because we are only concerned with the completion of a given partially observed graph. Instead of attempting to learn a general-purpose predictor for an arbitrary graph, we learn to predict missing edges for the graph that is given as input. The attributes of all potential edges are available during training, and it is sufficient to classify these edges alone. Moreover, since the topological structure of the inferred graph depends on all predicted edges, we can design a better predictor for this graph by analyzing all edges concurrently. To this end, we propose a transductive maximum margin framework for structured variables and apply it to the structured graph completion problem. Our approach, which applies the principle of transduction within individual structures, is complementary to that of (Altun et al., 2005, Zien et al., 2007) for dealing with partially labeled data sets.

    The contributions of our work include:

    1. A taxonomy of graph inference problems

2. A definition of regular graph structures in terms of a versatile combinatorial family of graphs known as degree-constrained subgraphs

3. A novel, exact and polynomial-time solution for graph inference that accounts for our definition of structural regularity

4. The marriage of degree-constrained subgraphs to the structured-outputs framework

5. A transductive extension of the structured-outputs framework, with application to graph inference

    6. A demonstration of our transductive learning framework on synthetic and real-world data

This paper is organized as follows. Section 2 outlines a variety of graph inference problems. Section 3 describes inference. Section 4 describes learning within a maximum margin structured-outputs framework, while Section 5 shows how to extend this framework to a transductive setting. Section 6 presents experiments and reconstructs several real-world graphs from partially labeled connectivity information, and Section 7 reviews related work. Finally, we conclude with a brief discussion in Section 8.


    2. Graph Completion Problems

In general, we are given a fixed set of nodes $V = \{1, \dots, n\}$ which form a complete graph $G = (V, E)$ with $n^2$ directed edges $E = \{(j,k) \mid 1 \le j,k \le n\}$ (including self-loops). A feature vector $x_j \in \mathbb{R}^{d_V}$ is associated with each node $j$ and, likewise, a feature vector $x_{j,k} \in \mathbb{R}^{d_E}$ is associated with each directed edge $(j,k)$. The representation that is chosen depends on the application. Topological attributes of the graph were used for the edges in (Liben-Nowell and Kleinberg, 2003). It is also possible to derive attributes for an edge using the node attributes at its head and tail, for example using some multivariate mapping $x_{j,k} = f(x_j, x_k)$. If, on the other hand, the input consists of kernel values $k_{j,k}$ on edges, one can decompose the kernel $K = (k_{j,k})$ via eigendecomposition into a set of attributes on the nodes. The resulting features $x_j$ measure the similarity of the j-th node to a set of prototypes, like the empirical feature map of Tsuda (1999), except in this case the prototypes are formed from the eigenbasis.
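To make the two feature constructions above concrete, the following Python sketch (assuming numpy is available; the function names and the particular choice of the pairwise mapping f are ours, not the paper's) builds edge features from node features and recovers node features from a kernel matrix via its eigendecomposition:

    import numpy as np

    def pairwise_edge_features(X):
        """Edge features x_{j,k} = f(x_j, x_k); here f concatenates the two
        node vectors with their absolute difference (one possible choice of f)."""
        n, d = X.shape
        E = np.zeros((n, n, 3 * d))
        for j in range(n):
            for k in range(n):
                E[j, k] = np.concatenate([X[j], X[k], np.abs(X[j] - X[k])])
        return E

    def kernel_to_node_features(K, d):
        """Node features from a kernel matrix K = (k_{j,k}) via eigendecomposition,
        keeping the top-d eigenvectors scaled by the square roots of the eigenvalues."""
        vals, vecs = np.linalg.eigh(K)          # eigenvalues in ascending order
        idx = np.argsort(vals)[::-1][:d]        # top-d components
        return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

    # Example: 5 nodes with 2-dimensional attributes
    X = np.random.randn(5, 2)
    edge_feats = pairwise_edge_features(X)            # shape (5, 5, 6)
    node_feats = kernel_to_node_features(X @ X.T, d=2)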

The problem of graph inference arises when the attribute information of a graph is incomplete. Typically, it is assumed that designated attributes $Y_j$ and $Y_{j,k}$, which we call labels, are only partially observed. Labels can be continuous or categorical, and their semantics vary with the application. For instance, in the complete graph $G$ where there is an edge between every pair of nodes, binary edge labels $Y_{j,k} \in \{0,1\}$ may indicate the presence or absence of edges $(j,k)$ in another sparsely connected abstract graph $G'$. Or they may signify the existence of another distinct relationship between the node pair; for instance, molecule j interacts with molecule k, or person j is friends with person k. In the sequel, we refer to edges from the complete graph $G$ with binary labels $Y_{j,k} = 1$ as positive edges, and those with labels $Y_{j,k} \neq 1$ as negative edges. We denote by $X = (X_V, X_E)$ the set of node and edge feature vectors, and by $Y = (Y_V, Y_E)$ the set of labels.

Graph inference concerns the prediction of the unobserved node and edge labels of $Y$ (which we denote $Y_U$), assuming we have observed only some of the entries (which we denote $Y_O$) together with the information encoded in the independent attributes or features $X$ of the nodes and edges. Again, the basis for this prediction is the underlying assumption that the node and edge attributes contain information about each other. The partition of labels into observed O and unobserved U sets satisfies $Y_O \cup Y_U = Y$. For example, suppose we have $V = \{1, 2, 3\}$ and we observe $Y_O = \{Y_{1,1}, Y_{1,2}, Y_{2,1}, Y_{2,2}\}$; then we would like to complete the network by predicting values for $Y_U = \{Y_{1,3}, Y_{2,3}, Y_{3,1}, Y_{3,2}, Y_{3,3}\}$. Because some of the labels are observed, and the goal is to predict the remaining labels, this process is also called completion. Special cases of graph inference deal with sets of labels that are restricted to nodes or edges, or that include both node and edge labels. Moreover, we will also consider problems in which $O = \emptyset$ and predictions for $Y_U$ must rely on the features $X$ alone. Previous work (Yamanishi et al., 2004, Vert and Yamanishi, 2004) has focused on the prediction of binary edge labels alone. Even when the set of labels is restricted to edges, there are several common problem settings to consider, each of them distinct and useful.

Using a simple geometric example, we outline the graph inference problem in its common bipartite form (Yamanishi et al., 2004, Vert and Yamanishi, 2004). The inference procedure is concerned with extending a graph over 24 nodes, starting from an observed subset of only 12 nodes and their connectivity. Graph extensions such as these are useful for adding new nodes to an existing graph. Formally, we are given a train-test partition of the nodes $V = V_O \cup V_U$ that generates a block partition of the edges. The observed edges $(j,k) \in E_O = V_O \times V_O$ form the training set and the unobserved edges $E_U = E \setminus E_O$ form the testing set. In the terminology of (Bleakley et al., 2007), the training set consists of edges between pairs of training nodes, called train-train edges, while the


Figure 1: Graph inference on a structured graph. (Top-left) 24 equally spaced nodes on a circle are connected by positive edges running in a counterclockwise direction. As depicted in (top-right) using dashed lines, there are many more negative edges exhibiting a complementary structure. In a typical graph inference setting, a subset of the nodes and their connectivity are observed. In this case, one half of the graph nodes, numbered 1-12, are observed, resulting in (middle-left) positive and (middle-right) negative edges available for training. The goal is to correctly classify the remaining edges into positive and negative categories, as shown in (bottom-left) and (bottom-right) respectively.


Figure 2: Adjacency matrix representations showing the training and testing edge partition for various graph completion tasks. Black and white entries indicate edges that are positive and negative respectively, while gray entries indicate edges that are neither members of, nor classified by, the respective partition. Partitions for: (a) graph extension, (b) graph extension excluding test-test predictions, (c) graph merge, and (d) random graph completion tasks.

testing set consists of edges between training nodes and testing nodes, called train-test edges, as well as between pairs of testing nodes, called test-test edges.

Figure 1 displays the positive and negative edges of the graph showing a clear geometric structure (top row). Self-loops are not displayed. In this case, one half of the graph nodes, numbered 1-12, are observed, yielding a partition of positive and negative training edges (middle row) and positive and negative testing edges (bottom row). Notice that while there is an imbalance in the ratio of training to testing edges (144 : 432), we observe that the ratio of positive to negative edges in the training set is the same as that for the testing set and the entire graph (1 : 23). The information that is present in the subfigures of Figure 1 can also be conveniently displayed using an adjacency matrix representation of the graph, as is shown in Figure 2 column (a). In this representation, the entry in row j and column k describes edge (j, k) from node j to node k.
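The block partition of the edges induced by a node split is simple enough to spell out in code; a minimal sketch (Python with numpy assumed; the helper name is ours) that reproduces the 144 training and 432 testing edges quoted above:

    import numpy as np

    def block_edge_partition(n, test_nodes):
        """Split the n^2 directed edges (including self-loops) into the
        train-train, train-test/test-train, and test-test blocks induced
        by a train/test split of the nodes."""
        test = np.zeros(n, dtype=bool)
        test[list(test_nodes)] = True
        train_train, cross, test_test = [], [], []
        for j in range(n):
            for k in range(n):
                if not test[j] and not test[k]:
                    train_train.append((j, k))      # observed (training) edges
                elif test[j] and test[k]:
                    test_test.append((j, k))        # test-test edges
                else:
                    cross.append((j, k))            # train-test edges
        return train_train, cross, test_test

    # Example from the text: 24 nodes, the second half of the nodes held out
    tt, cross, uu = block_edge_partition(24, range(12, 24))
    print(len(tt), len(cross) + len(uu))            # 144 training, 432 testing edges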

Figure 2 column (b) shows a variation of the bipartite graph extension problem that was proposed by (Bleakley et al., 2007). This variation extends a graph by connecting testing nodes with an existing graph over the training nodes, but does not attempt to connect pairs of testing nodes. Notice that the test-test edges from the lower-right quadrant (edges between pairs of nodes 13-24) are not classified by, nor are they included in, either the training or testing sets of the partition. A third common graph inference problem concerns the merging of two graphs, as shown in Figure 2 column (c).

The three partitions discussed thus far are distinguished by their bipartite nature, having been generated by a partition of the nodes into two sets. Notice, however, that the underlying graphs are not bipartite, as they include edges between pairs of nodes of each partition. One can also consider


graph inference problems where the underlying graph is bipartite, for example in the modeling of user preferences.

When performing N-fold cross-validation in bipartite graph inference settings, the common procedure is to first partition the nodes into N subsets of size n/N (in Figure 2, N = 2). Then, for the k-th fold, a bipartite partition of the graph nodes is constructed by taking the k-th subset of nodes to be the testing nodes, and all the remaining subsets of nodes to be the training nodes. The edges are subsequently partitioned according to one of the schemes (a) through (c) of Figure 2. The cross-validated performance of a graph inference algorithm, whether it be accuracy, area under the curve, recall or some other metric, is taken to be the average performance measured on the testing edges across folds. It is important to note that the performance levels on each fold are not independent, as they are designed to be in standard N-fold cross-validation. This can be seen by considering two categories of edges defined implicitly by the N original subsets of nodes: inter-subset and intra-subset. Inter-subset edges span two distinct subsets, while intra-subset edges are within one. When intra-subset edges appear at all as test edges (i.e., scheme (a)), they do so only once; however, inter-subset edges appear two times across folds (schemes (a) through (c)).

In contrast to the bipartite partitions described above, one can consider a setting where the train-test partition is selected randomly, as shown in Figure 2 column (d). We call this a unipartite partition, motivated by the terminology of (Airoldi et al., 2006). While the unipartite partition is required for a separate purpose in our algorithms defined below, it is also a realistic setting for certain applications of graph inference, for example, when the observations themselves are selected randomly.

When performing unipartite partitions, independence of test sets across folds is no longer a concern. One can simply partition the $n^2$ edges into N folds of size $n^2/N$ and construct independent test sets as in the i.i.d. case.
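A minimal sketch of this unipartite protocol (Python with numpy assumed; the helper name is ours): the n^2 edges are shuffled once and split into N disjoint test folds, independently of any node split.

    import numpy as np

    def unipartite_edge_folds(n, n_folds, seed=0):
        """Randomly partition all n^2 directed edges into n_folds disjoint
        test sets (Figure 2, column (d))."""
        rng = np.random.default_rng(seed)
        edges = [(j, k) for j in range(n) for k in range(n)]
        order = rng.permutation(len(edges))
        folds = np.array_split(order, n_folds)
        return [[edges[i] for i in fold] for fold in folds]

    folds = unipartite_edge_folds(n=24, n_folds=5)
    print([len(f) for f in folds])   # five nearly equal test sets covering all 576 edges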

    MORE HERE (examples in literature)

    MORE HERE (sparsity, low-degree nodes, class imbalance, ill-defined topological features)

    MORE HERE (summary and extensions)

    3. DC Subgraphs

In (Taskar et al., 2006), structured output models are characterized as having a compact (decomposable) scoring scheme over a family of combinatorial structures, and an efficient algorithm for finding the highest scoring structure from this family. In this section we describe a combinatorial family of graphs called degree-constrained subgraphs. Assuming potential functions for nodes and edges are given, we describe a 0-1 combinatorial algorithm that performs the decoding. In the sequel, Section 4, we describe how these potentials can be learned.

    3.1 Degree Constrained Subgraphs

In several domains, the degree of each node is known and the goal is to predict the edges. In computational molecular biology, there are constraints on the number and types of bonds that each molecular unit can make, and the goal is to predict binding partners. In social network graph modeling, one typically has access to the number of connections a user has made without being able to see the identities of their connections, and the goal is to predict the network.

Let $G = (V, E)$ be an arbitrary graph. A degree-constrained subgraph of G is comprised of a subset $E' \subseteq E$ of its edges where the adjacency $Y \in \mathcal{B}$ satisfies in-degree and out-degree constraints

$$\mathcal{B} \triangleq \Big\{ Y \in \{0,1\}^{n^2} \ \Big|\ Y_{j,k} = 0 \text{ if } (j,k) \notin E,\ \ \sum_j Y_{j,k} \le \delta^{in}_k,\ \ \sum_k Y_{j,k} \le \delta^{out}_j,\ \ j,k = 1,\dots,n \Big\} \qquad (1)$$

for given constants $\delta^{in}_k, \delta^{out}_j \in \{0, \dots, n\}$.

We assume that we are given a score function that decomposes over edges, $S(Y) = \sum_{j,k} S_{j,k} Y_{j,k}$, where $S_{j,k}$ measures the strength of each edge being chosen. The problem of finding the highest scoring completion $\{Y_U, Y_O\}$ of a partially observed graph $Y_O$ amounts to solving

$$\mathrm{DCS\ fixed}_{\{0,1\}}: \quad \operatorname*{argmax}_{Y_U}\ \sum_{j,k} S_{j,k} Y_{j,k} \quad \text{s.t.} \quad Y \in \mathcal{B}, \qquad (2)$$

restricting our search to $\{Y_O, Y_U\} \in \mathcal{B}$. Problem DCS fixed{0,1} is an instance of a maximum weight degree-constrained subgraph problem.

If the degrees are $\delta^{in}_k = \delta^{out}_j = 1$, in addition to the graph being bipartite and symmetric ($V = V_1 \cup V_2$, $E = V_1 \times V_2 \cup V_2 \times V_1$ and $Y_{j,k} = Y_{k,j}$), then the set of degree-constrained subgraphs are exactly the 1-matchings of G. 1-matching (alignment) structures are used in (Chatalbashev et al., 2005, Taskar et al., 2005). The iterative Hungarian or Kuhn-Munkres algorithms provide an efficient solution, both having $O(n^3)$ complexity. However, it can be shown that the optimal 0-1 solution is obtained by the linear program (LP) relaxation of DCS fixed{0,1}, wherein the condition of integrality of $Y_{j,k}$ in Equation (1) is relaxed by using real-valued variables $Y'_{j,k} \in [0,1]$.

While the general case in Equation (2) bears some resemblance to a matching problem, an efficient solution is significantly more complex. The LP relaxation in this case may return fractional solutions, because the relaxed polytope described by the linear inequalities in Equation (1) can have vertices with fractional coordinates. Nor have we found a simple iterative algorithm generalizing the approach of Kuhn-Munkres. Yet, surprisingly, this combinatorial problem can be reduced to a large maximum network flow problem (a linear program) and solved using an efficient implementation of the Edmonds-Karp algorithm with $O(n^3)$ complexity (Gabow, 1976). We use an implementation from the Goblin package (Fremuth-Paeger and Jungnickel, 1998).
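For prototyping, the LP relaxation just mentioned (not the exact network-flow reduction used here via the Goblin package) can be written down directly; a sketch assuming scipy and numpy, with our own function name, maximizing the decomposed score under the degree constraints of Equation (1) with integrality relaxed to $Y_{j,k} \in [0,1]$ (edges absent from E could additionally be forced to zero through per-variable bounds):

    import numpy as np
    from scipy.optimize import linprog

    def dcs_lp_relaxation(S, d_in, d_out):
        """LP relaxation of the maximum weight degree-constrained subgraph:
        maximize sum_{j,k} S[j,k] * Y[j,k] subject to column sums <= d_in,
        row sums <= d_out, and 0 <= Y <= 1."""
        n = S.shape[0]
        A_ub, b_ub = [], []
        for k in range(n):                      # in-degree of node k
            row = np.zeros(n * n)
            row[k::n] = 1.0                     # selects Y[:, k] in row-major flattening
            A_ub.append(row); b_ub.append(d_in[k])
        for j in range(n):                      # out-degree of node j
            row = np.zeros(n * n)
            row[j * n:(j + 1) * n] = 1.0        # selects Y[j, :]
            A_ub.append(row); b_ub.append(d_out[j])
        res = linprog(c=-S.ravel(), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=(0.0, 1.0), method="highs")
        return res.x.reshape(n, n)

    # Toy example: 4 nodes, random scores, in-/out-degrees at most 1 (a 1-matching,
    # for which the LP relaxation is integral, as noted in the text)
    rng = np.random.default_rng(0)
    Y = dcs_lp_relaxation(rng.standard_normal((4, 4)), d_in=[1] * 4, d_out=[1] * 4)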

    3.2 Extensions

    3.2.1 SOFT DEGREE CONSTRAINTS

When the node degrees are unavailable, or when they are uncertain, we introduce variables to represent these quantities. Let $y^{in}_{k,\delta} \in \{0,1\}$ be an indicator variable that takes the value 1 if and only if the in-degree of node k is $\delta$ (that is, $\delta^{in}_k = \delta$), and similarly let $y^{out}_{j,\delta} \in \{0,1\}$ be an indicator that is 1 if and only if $\delta^{out}_j = \delta$. We then redefine the set of valid structures to be


    Figure 3: A reduction of DCS with variable degrees to a DCS with fixed degrees.

$$\mathcal{B} \triangleq \Big\{ Y \in \{0,1\}^{n^2},\ y^{in} \in \{0,1\}^{n^2},\ y^{out} \in \{0,1\}^{n^2} \ \Big|\ Y_{j,k} = 0 \text{ if } (j,k) \notin E,$$
$$\textstyle\sum_j Y_{j,k} \le \sum_\delta \delta\, y^{in}_{k,\delta} \ \text{ and } \ \sum_\delta y^{in}_{k,\delta} = 1, \quad k = 1,\dots,n,$$
$$\textstyle\sum_k Y_{j,k} \le \sum_\delta \delta\, y^{out}_{j,\delta} \ \text{ and } \ \sum_\delta y^{out}_{j,\delta} = 1, \quad j = 1,\dots,n \Big\}. \qquad (3)$$

As with the edges, we assume that there are score values $s^{in}_{k,\delta}$ and $s^{out}_{j,\delta}$ associated with each variable, which measure the strength of each degree being chosen. The problem of finding the highest scoring completion $\{Y_U, Y_O, y\}$ of a partially observed graph $Y_O$ amounts to solving

$$\mathrm{DCS\ free}_{\{0,1\}}: \quad \operatorname*{argmax}_{Y_U,\ y^{in},\ y^{out}}\ \sum_{j,k} S_{j,k} Y_{j,k} + \sum_{k,\delta} s^{in}_{k,\delta}\, y^{in}_{k,\delta} + \sum_{j,\delta} s^{out}_{j,\delta}\, y^{out}_{j,\delta} \quad \text{s.t.} \quad (Y, y) \in \mathcal{B}, \qquad (4)$$

restricting our search to $(\{Y_O, Y_U\}, y) \in \mathcal{B}$. Given that the scores obey certain constraints, we can solve problem DCS free{0,1} by reduction to a maximum weight degree-constrained subgraph problem on a larger graph.

    3.2.2 TOTAL CONNECTIVITY CONSTRAINT

Another modification that we make is to constrain the total number of edges in the graph. This involves a slight modification of the reduction given above.

    3.3 Approximate Inference

This section describes methods for approximate inference. We show two approximations of DCS fixed{0,1} which can be generalized easily to approximations of DCS free{0,1}. We denote by $S$ the concatenation of the scores into a vector, which we call the score vector. The problem of finding the highest scoring completion of a graph can be viewed as a projection problem. The highest scoring completion is the exact 0-1 degree-constrained connectivity from $\mathcal{B}$ matching $Y_O$ on $E_O$ that is closest to the score vector $S$ under the cosine similarity metric. If there are ties between structures, they are broken arbitrarily.


For comparison, we introduce the Euclidean projection and an approximation to it, both of which are used by prediction-correction algorithms. Consider the convex hull

$$\mathrm{conv}(\mathcal{B}) \triangleq \{ Y \mid Y = \alpha Z + (1-\alpha) V,\ \alpha \in [0,1],\ Z, V \in \mathcal{B} \} \qquad (5)$$

formed by convex combination of points in $\mathcal{B}$. We define the Euclidean projection of the score vector onto the convex hull of $\mathcal{B}$ that also matches $Y_O$ on $E_O$ as follows

$$\mathrm{DCS\ fixed}_{[0,1]}: \quad T_{\mathcal{B}}(S, Y_O) = \operatorname*{argmin}_{Y_U}\ \sum_{j,k} (Y_{j,k} - S_{j,k})^2 \quad \text{s.t.} \quad Y \in \mathrm{conv}(\mathcal{B}) \qquad (6)$$

Given a finite linear description of the convex hull, problem DCS fixed[0,1] is a quadratic program with $n^2$ variables. While the Euclidean projection does not guarantee 0-1 outputs, it has the advantage that the fractional outputs indicate areas of uncertainty in the predicted topology.

Unfortunately, $\mathrm{conv}(\mathcal{B})$ does not have a polynomial-size linear representation. To approximate the Euclidean projection and implement prediction-correction algorithms efficiently, consider the LP relaxation of $\mathcal{B}$ that is obtained by relaxing integrality on the variables $Y_{j,k}$:

$$\mathcal{R} \triangleq \Big\{ Y \in [0,1]^{n^2} \ \Big|\ Y_{j,k} = 0 \text{ if } (j,k) \notin E,\ \ \sum_j Y_{j,k} \le \delta^{in}_k,\ \ \sum_k Y_{j,k} \le \delta^{out}_j,\ \ j,k = 1,\dots,n \Big\}. \qquad (7)$$

The corresponding projection problem is a quadratic programming problem

$$\mathrm{LP\ DCS\ fixed}_{[0,1]}: \quad T_{\mathcal{R}}(S, Y_O) = \operatorname*{argmin}_{Y_U}\ \sum_{j,k} (Y_{j,k} - S_{j,k})^2 \quad \text{s.t.} \quad Y \in \mathcal{R} \qquad (8)$$

that has $O(n)$ constraints and can be solved in $O(n^3)$ time in dual form. In practice, this computation is much faster than DCS fixed{0,1}. The LP relaxation is simple and works quite effectively in our setting. Although $\mathrm{conv}(\mathcal{B}) \subsetneq \mathcal{R}$, the solution variables are often integral.

The extension to LP DCS free[0,1], when the degrees are not known a priori, follows by modifying the constraints as in Equation (3). Notice that the graph need not be augmented as in Figure 3. Similarly, a constraint on the total number of edges requires only one additional linear constraint.
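A compact way to prototype the relaxed projection of Equation (8) is as a small quadratic program; the sketch below assumes the cvxpy package (not used in the paper, which solves the problem in dual form) and uses our own variable names:

    import numpy as np
    import cvxpy as cp

    def project_onto_relaxation(S, d_in, d_out, Y_obs=None, observed_mask=None):
        """Euclidean projection of the score matrix S onto the relaxed polytope R:
        minimize sum_{j,k} (Y[j,k] - S[j,k])^2 subject to the degree constraints,
        optionally clamping the observed entries Y_O."""
        n = S.shape[0]
        Y = cp.Variable((n, n))
        constraints = [Y >= 0, Y <= 1,
                       cp.sum(Y, axis=0) <= d_in,    # column sums: in-degrees
                       cp.sum(Y, axis=1) <= d_out]   # row sums: out-degrees
        if observed_mask is not None:
            M = observed_mask.astype(float)
            constraints.append(cp.multiply(M, Y) == M * Y_obs)
        problem = cp.Problem(cp.Minimize(cp.sum_squares(Y - S)), constraints)
        problem.solve()
        return Y.value

    # Toy example: project random scores with all in-/out-degrees bounded by 2
    rng = np.random.default_rng(1)
    Y_hat = project_onto_relaxation(rng.standard_normal((5, 5)),
                                    d_in=np.full(5, 2), d_out=np.full(5, 2))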

    4. Learning to Predict DC Subgraphs

We have finished outlining how inference selects the graph that is most similar, or closest, to the scores. In order to completely specify our structured predictor, we must describe how we will compute the node and edge scores.

The scores, and inevitably the labels, depend on the independent attributes of the graph. For the edge score, previous research has experimented with the use of various node similarity or correlation scores. These measurements depend on the node features at the head and tail of an edge. The merits of various score functions are often debated. In our work, instead of relying on our ability to design such scoring functions by hand, we propose to select the scoring function from a large family of candidates, in a data-driven manner. In particular, we assume a scoring function $S = w^T x$ that depends linearly on the features via a hyperplane parameter $w$. We adopt a discriminative structured-outputs framework (Taskar et al., 2003, Tsochantaridis et al., 2004) in order to learn the weight vector $w$ from data.

We will start by describing the structured-outputs framework using the degree-constrained subgraph structures considered by DCS fixed{0,1}. Unlike the graph inference setting described above, the structured-outputs framework assumes that training and testing data consist of multiple, disjoint structured objects. Training graphs are completely observed, $O = V \times V$, and testing graphs are completely unobserved, $O = \emptyset$. With this simplifying assumption, the structured-outputs framework is essentially the same as (Tsochantaridis et al., 2004) adapted to degree-constrained subgraphs.

Consider a fixed but unknown distribution $P(X, Y)$ over labeled graphs that satisfy certain degree constraints. Intuitively, our goal is to select a score function so that, for a graph $(X, Y)$ drawn from $P(X, Y)$, the true connectivity $Y$ achieves a larger score than any alternative set of labels $Z \in \mathcal{B}$ with $Z \neq Y$:

$$\sum_{j,k} S_{j,k} Y_{j,k} \ \ge\ \max_{Z \in \mathcal{B},\, Z \neq Y}\ \sum_{j,k} S_{j,k} Z_{j,k}. \qquad (9)$$

Substituting the definition for the edge score, we have an equivalent constraint on $w$:

$$\sum_{j,k} w^T x_{j,k} Y_{j,k} \ \ge\ \max_{Z \in \mathcal{B},\, Z \neq Y}\ \sum_{j,k} w^T x_{j,k} Z_{j,k}. \qquad (10)$$

Clearly, if Equation (10) is satisfied, then the inference procedure DCS fixed{0,1} will predict the correct graph $Y$. In the structured-outputs framework, we attempt to minimize the probability of error

$$\mathrm{err}^{\Delta_{0/1}}_{P}(w) = \int \Delta_{0/1}\Big( Y,\ \operatorname*{argmax}_{Z \in \mathcal{B},\, Z \neq Y}\ \sum_{j,k} w^T x_{j,k} Z_{j,k} \Big)\, dP(X, Y) \qquad (11)$$

where the 0/1 loss function $\Delta_{0/1}(Y, Z)$ is 0 if its arguments are identical, and 1 otherwise. The integral in Equation (11), which involves an expectation over labeled and structured graphs, cannot be computed because the distribution $P(X, Y)$ is unknown. Hence, a maximum margin approach is used.

    4.1 Maximum Margin Learning to Predict DC Subgraphs

Given a data set $\mathcal{D}$ consisting of m examples $(X^i, Y^i) \overset{i.i.d.}{\sim} P(X, Y)$ that are drawn independently and identically from the stationary but unknown distribution, the maximum margin approach to learning to predict structured outputs uses the empirical error as a surrogate for the error probability in Equation (11):

$$\mathrm{err}^{\Delta_{0/1}}_{\mathcal{D}}(w) = \sum_{i=1}^{m} \Delta_{0/1}\Big( Y^i,\ \operatorname*{argmax}_{Z \in \mathcal{B},\, Z \neq Y^i}\ \sum_{j,k} w^T x^i_{j,k} Z_{j,k} \Big). \qquad (12)$$

The empirical error is minimized when $w$ satisfies Equation (10) for each training graph $(X^i, Y^i)$.

As there may be more than one weight vector $w$ that satisfies this constraint, the maximum margin principle (Vapnik, 1998) is used to select a unique solution. Generalizing the multi-class case


of (Crammer and Singer, 2001), the margin quantity $\gamma_w(X, Y)$ is defined as the difference between the score of the true connectivity and the largest score obtained by an alternative connectivity

$$\gamma_w(X, Y) = \sum_{j,k} w^T x_{j,k} Y_{j,k} \ -\ \max_{Z \in \mathcal{B},\, Z \neq Y}\ \sum_{j,k} w^T x_{j,k} Z_{j,k}. \qquad (13)$$

Then, the goal is to maximize the margin $\gamma_w(X, Y)$ uniformly across the training data. We therefore focus on the minimum margin $\gamma = \min_i \gamma_w(X^i, Y^i)$, which satisfies the following constraints

$$\sum_{j,k} w^T x^i_{j,k} Y^i_{j,k} \ \ge\ \max_{Z^i \in \mathcal{B}} \Big[ \sum_{j,k} w^T x^i_{j,k} Z^i_{j,k} + \gamma\, \Delta_{0/1}(Y^i, Z^i) \Big], \quad 1 \le i \le m, \qquad (14)$$

where we have introduced the 0/1 loss term $\Delta_{0/1}(Y^i, Z^i)$ and eliminated the restriction on $Z^i$. The i-th constraint in Equation (14) ensures that there is a margin of $\gamma$ between the score of $Y^i$ and the score of each alternate connectivity $Z^i \in \mathcal{B}$. To see this, recall that the maximum value of a function over the set $\mathcal{B}$ is bounded above if and only if the function value for each member $Z^i \in \mathcal{B}$ is bounded above.

The problem of maximizing the margin $\gamma$ is equivalent, by a standard transformation, to the following optimization problem

$$\min_{w}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad \sum_{j,k} w^T x^i_{j,k} Y^i_{j,k} \ \ge\ \max_{Z^i \in \mathcal{B}} \Big[ \sum_{j,k} w^T x^i_{j,k} Z^i_{j,k} + \Delta_{0/1}(Y^i, Z^i) \Big], \quad 1 \le i \le m. \qquad (15)$$

As is common for structured-outputs learning, two further modifications are made to this maximum margin formulation. First, the Hamming loss $\Delta_H(Y, Z)$ is substituted for the 0/1 loss $\Delta_{0/1}(Y, Z)$ with the goal of improving generalization. The Hamming loss $\Delta_H(Y, Z)$, which gradually increases as $Z$ deviates from $Y$, counts the number of edge label prediction errors. Finally, slack variables $\xi_i$ are introduced to allow for potential violations of the margin constraints, with a scaled penalty in the objective. The resulting learning problem is

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad \sum_{j,k} w^T x^i_{j,k} Y^i_{j,k} + \xi_i \ \ge\ \max_{Z^i \in \mathcal{B}} \Big[ \sum_{j,k} w^T x^i_{j,k} Z^i_{j,k} + \Delta_H(Y^i, Z^i) \Big], \quad 1 \le i \le m. \qquad (16)$$

Notice that the slack variables are effectively constrained to be non-negative, $\xi_i \ge 0$, because we include the possibility of $Z^i = Y^i$ in these constraints.

The optimization problem in Equation (16) can be transformed into a convex quadratic program, because each margin constraint can be expressed equivalently as a conjunction of $|\mathcal{B}|$ linear constraints on $w$, one for each $Z \in \mathcal{B}$. For comparison with subsequent derivations, we prefer this notation. The next section outlines the common strategies for solving this problem.
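The maximization on the right-hand side of Equation (16), often called loss-augmented inference, can itself be carried out with the decoding machinery of Section 3; this is a standard observation for decomposable losses rather than a step spelled out in the text. Since the Hamming loss decomposes over edges,

$$\Delta_H(Y, Z) = \sum_{j,k} \big[ Y_{j,k}(1 - Z_{j,k}) + (1 - Y_{j,k}) Z_{j,k} \big] = \sum_{j,k} Y_{j,k} + \sum_{j,k} (1 - 2 Y_{j,k}) Z_{j,k},$$

the maximization over $Z \in \mathcal{B}$ is, up to the constant $\sum_{j,k} Y_{j,k}$, another maximum weight degree-constrained subgraph problem with shifted edge scores $S'_{j,k} = w^T x_{j,k} + (1 - 2 Y_{j,k})$.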

    4.2 Optimization Details

Clearly, there are too many constraints for direct implementation. In this section, we focus on deriving a tractable optimization strategy for the convex problem in Equation (16). We modify several algorithms.


    4.2.1 i.i.d. SUPPORT VECTOR MACHINES

    MORE HERE

    4.2.2 AVERAGED PERCEPTRON

    MORE HERE

    4.2.3 CUTTING PLANES

    MORE HERE (TSOCHANTARIDIS / JOACHIMS)

    4.2.4 DUAL EXTRAGRADIENT

In this section, we outline the steps of the dual extragradient algorithm following (Taskar et al., 2006). The dual extragradient algorithm is a method for solving problems possessing a convex structure, including convex optimization problems, convex-concave saddle point problems, and various equilibrium models. The formulation in Equation (16) is a saddle point problem.

    MORE HERE (Motivate with extragradient)

The algorithm alternates gradient and projection steps in the primal and dual space. The "loss-augmented inference" problem involves $Z$. An important aspect of the algorithm is projecting points onto the respective convex hulls $\mathcal{W}_\gamma$ and $\mathcal{R}$ as follows

$$T_{\mathcal{W}_\gamma}(\tilde{w}) = \operatorname*{argmin}_{w \in \mathcal{W}_\gamma}\ \sum_d (w_d - \tilde{w}_d)^2 \qquad\qquad T_{\mathcal{R}}(\tilde{Z}) = \operatorname*{argmin}_{Z \in \mathcal{R}}\ \sum_{j,k} (Z_{j,k} - \tilde{Z}_{j,k})^2$$

These projections can be solved using standard quadratic programming (QP) software. In order to determine the stepsize of the algorithm, the Lipschitz constant is computed.

Algorithm 1 Dual extragradient algorithm for learning to predict structured graphs.
 1: Initialize: Choose u̇ ∈ U, s_w = 0, û_w = 0, η = 1/L
 2: Initialize: û_Z = T_R(1)
 3: for iteration t, 1 ≤ t ≤ τ do
 4:   v_w = T_W(u̇_w + η s_w)
 5:   v_Z = T_R(u̇_Z + η t [F^T û_w + C+1])
 6:   u_w = T_W(v_w − η F v_Z)
 7:   u_Z = T_R(v_Z + η [F^T v_w + C+1])
 8:   s_w = s_w − F u_Z
 9:   û = (t û + u) / (t + 1)
10: end for


    5. Learning to Complete DC Subgraphs

Consider a fixed but unknown distribution $P(X, Y, O, U)$. This is a distribution over graphs as well as their observed and unobserved components. In graph completion, we observe $(X, Y_O)$ corresponding to an example $(X, Y, O, U)$ drawn from this distribution, and the goal is to predict $Y_U$. As described above, we view structured graph completion as a transductive inference problem. Instead of attempting to learn a general-purpose classifier for an arbitrary graph, we learn to predict missing edges for the partially observed graph that is given as input.

In transductive inference (Vapnik, 1998), a training set is given that consists of a set of labeled examples $\{(x_i, y_i) \mid 1 \le i \le m\}$ as well as a set of unlabeled examples $\{(x_i, \cdot) \mid m + 1 \le i \le m + m'\}$, and the goal is to classify the latter set. According to a risk minimization framework, a margin quantity for an unlabeled example $(x_i, \cdot)$ is defined effectively as follows (see Collobert et al. (2005))

$$\gamma_w(x_i) = \max_{y = \pm 1}\ y\, (w^T x_i + b), \qquad (17)$$

and the goal is to maximize the minimum margin for both labeled and unlabeled examples. This amounts to finding a hyperplane $(w, b)$ and a set of labels that are self-reinforcing

$$y_i = \operatorname*{argmax}_{y = \pm 1}\ y\, (w^T x_i + b), \quad m + 1 \le i \le m + m'. \qquad (18)$$

The labels and hyperplane are self-reinforcing because they are chosen to create the largest possible margin. The resulting optimization problem is more difficult to solve than the supervised setting due to the non-linearity in the margin constraints. Various techniques have been proposed to tackle this optimization problem (Gammerman et al., 1998, Joachims, 1999, Bennett and Demiriz, 1998, Collobert et al., 2005).
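A one-line consequence of Equation (17), not spelled out in the text but useful for intuition: the maximization over the two candidate labels simply selects the sign that agrees with the classifier output, so the unlabeled-example margin reduces to an absolute value,

$$\gamma_w(x_i) = \max_{y = \pm 1}\ y\, (w^T x_i + b) = \big| w^T x_i + b \big|,$$

which is what makes the transductive margin constraints non-linear (and non-convex) in $w$ and $b$.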

    5.1 Transductive Inference for Completing DC Subgraphs

We describe our approach using the degree-constrained subgraph structures considered by DCS fixed{0,1}. To apply the principles of transductive inference to the problem of graph completion, we generalize the definition of the margin from Equation (13) to account for contributions from both the observed O and unobserved U components of the partially observed input graph.

The new quantity is a difference of two scores

$$\gamma_w(X, Y_O) = \max_{\substack{V \in \mathcal{B} \\ V_O = Y_O}}\ \sum_{j,k} w^T x_{j,k} V_{j,k} \ -\ \max_{\substack{Z \in \mathcal{B} \\ Z_O \neq Y_O}}\ \sum_{j,k} w^T x_{j,k} Z_{j,k}. \qquad (19)$$

The first term corresponds to the highest scoring connectivity $V = (V_O, V_U)$ that matches the observed labels $V_O = Y_O$. The highest score is obtained by maximizing over the unobserved component $V_U$. As in Equation (17), the $V_U$ labels are self-reinforcing in that they create the largest margin over the entire graph. The second term corresponds to the highest scoring alternative connectivity $Z = (Z_O, Z_U)$ obtained by maximizing over all edges $V \times V$ subject to the constraint that $Z_O \neq Y_O$. We think of $Z$ as an adversary that attempts to reduce the margin.

Deriving the maximum margin learning problem as in Section 4.1 and restricting to m = 1, because we are concerned with a single partially observed graph, we have

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \xi \ \ge\ \min_{\substack{V \in \mathcal{B} \\ V_O = Y_O}}\ \max_{Z \in \mathcal{B}}\ \Big[ \sum_{j,k} w^T x_{j,k} Z_{j,k} + \Delta_H(V, Z) - \sum_{j,k} w^T x_{j,k} V_{j,k} \Big]. \qquad (20)$$


Due to the non-convex form of the constraints, this problem can no longer be transformed into a quadratic program. We propose several approximation algorithms for this problem in Section 5.2.

One might consider an alternative definition of the margin using only the observed component of the graph. For instance, one can define the margin as follows

$$\gamma_w(X, Y_O) = \sum_{(j,k) \in O} w^T x_{j,k} Y_{j,k} \ -\ \max_{\substack{Z \in \mathcal{B} \\ Z_O \neq Y_O}}\ \sum_{(j,k) \in O} w^T x_{j,k} Z_{j,k}. \qquad (21)$$

In this case, the maximum margin learning problem becomes

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \xi \ \ge\ \max_{Z \in \mathcal{B}}\ \Big[ \sum_{(j,k) \in O} w^T x_{j,k} Z_{j,k} + \Delta_H(Y_O, Z_O) - \sum_{(j,k) \in O} w^T x_{j,k} Y_{j,k} \Big]. \qquad (22)$$

This approach is essentially the same as Equation (16) with m = 1 and $O \subsetneq V \times V$. On the one hand, the variable $Z$ is forced to assume an alternate connectivity $Z_O \neq Y_O$ over the observed edges, while on the other hand, it is constrained according to degree constraints $Z \in \mathcal{B}$ over the entire graph. Unfortunately, the difference in the support of these constraints is a weakness; there is no penalty or reward for assigning the correct connectivity $Z_U$ over the unobserved component. For example, the cutting-plane algorithm will select many constraints where the connectivity variable $Z^t$ has either: 1) a disproportionately large number of edges over the observed component O when the scores $w^T x_{j,k}$ are mostly positive; or otherwise, 2) a disproportionately small number of edges over O. As there is a super-exponential number of graphs, the algorithm will continue generating useless cuts until memory has been exhausted. We explore alternate methods for generating useful cutting planes in Section 5.2.

We can interpolate between the formulations of Equation (22) and Equation (20) using a parameter $\alpha \in [0,1]$ that scales the contribution of the unobserved component U to the terms of the margin constraint in Equation (20). This results in the following modified constraint in the formulation of Equation (20):

$$\xi \ \ge\ \min_{\substack{V \in \mathcal{B} \\ V_O = Y_O}}\ \max_{Z \in \mathcal{B}}\ \Big[ \sum_{(j,k) \in O} w^T x_{j,k} Z_{j,k} + \Delta_H(V_O, Z_O) - \sum_{(j,k) \in O} w^T x_{j,k} V_{j,k}$$
$$\qquad\qquad +\ \alpha \Big( \sum_{(j,k) \in U} w^T x_{j,k} Z_{j,k} + \Delta_H(V_U, Z_U) - \sum_{(j,k) \in U} w^T x_{j,k} V_{j,k} \Big) \Big].$$

    5.2 Optimization details

    5.2.1 PERCEPTRON

    MORE HERE

    5.2.2 CUTTING PLANES

    MORE HERE


    5.2.3 DUAL EXTRAGRADIENT

To apply the dual extragradient algorithm to the problem in Equation (20), we write

$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \xi \ \ge\ \min_{\substack{V \in \mathcal{B} \\ V_O = Y_O}}\ \max_{Z \in \mathcal{B}}\ \Big[ \sum_{j,k} w^T x_{j,k} Z_{j,k} + \Delta_H(V, Z) - \sum_{j,k} w^T x_{j,k} V_{j,k} \Big].$$

We modify the dual extragradient approach of (Taskar et al., 2006). The max-margin convex quadratic program (problem M) in Equation (20) is difficult because of the form of the constraint, which involves a minimax optimization over valid structures defined by the set $\mathcal{B}$.

We can then express the problem from Equation (20) in saddle-point form

$$\mathrm{S:} \quad \min_{\substack{w \in \mathcal{W},\ V \in \mathcal{B} \\ V_O = Y_O}}\ \max_{Z}\ \Delta_H(V, Z) - h_w(V) + h_w(Z), \qquad (23)$$

where the parameter $w$ is restricted to some norm ball $\mathcal{W}_\gamma = \{ w \mid \|w\|_2 \le \gamma \}$. These problems are equivalent in the following sense. For a given C, the solution to M has some norm, which we denote $\gamma(C)$. Then, by setting $\gamma = \gamma(C)$ in S, we will obtain the same solution. Although $\gamma(C)$ may not be invertible, due to there being multiple C values with the same $\gamma(C)$, there will be at least one value of C for which a solution to M matches a solution to S.

In this section, we highlight the important differences between the standard version and our extension. Most importantly, we maintain an additional structural variable $V$. The update steps for $V$ are closely related to those of $Z$ in (Taskar et al., 2006). In a nutshell, the "transductive loss-augmented inference" problem now involves both $V$ and $Z$. The gradient-based update steps for $V$ and $Z$ are similar, each involving an additional term that was not present above. The projection step is augmented with a constrained projection of $V$ onto its convex hull $\mathcal{B}_O$.

$$T_{\mathcal{B}_O}(\tilde{V}, Y_O) = \operatorname*{argmin}_{V \in \mathcal{B},\ V_O = Y_O}\ \sum_{j,k} (V_{j,k} - \tilde{V}_{j,k})^2$$

The computation of the Lipschitz constant, needed to determine the stepsize of the algorithm, is also modified due to the interdependence of $Z$ and $V$ in the gradient.

We call the algorithm NetSVM. The complete algorithm is presented in Algorithm 2. Let the variables $v_Z, v_V, u_Z$ and $u_V$ denote vectorized versions of $Z$ and $V$. We denote by $\mathbf{1}$ a column vector of ones of the appropriate dimension in the given context, and denote by $F$ the matrix having $\mathrm{vec}(x_j x_k^T)$ as columns.

    6. Evaluation

We compared a total of seven algorithms on three network data sets. Once a metric is selected by an algorithm, it evaluates $w^T x_{j,k}$ for all pairs of nodes. Subsequently, a global connectivity prediction is made by either: a) thresholding the affinities; b) solving for the optimal degree-constrained network as in Equation (2); or c) solving the approximate degree-constrained problem where $\mathcal{B}$ is replaced by $\mathcal{R}$. The algorithms are:


Algorithm 2 NetSVM: an algorithm for transductive structured network learning.
 1: Initialize: Choose u̇ ∈ U, s_w = 0, û_w = 0, η = 1/L
 2: Initialize: û_Z = T_R(1), û_V = T_{B_O}(1)
 3: for iteration t, 1 ≤ t ≤ τ do
 4:   v_w = T_W(u̇_w + η s_w)
 5:   v_Z = T_R(u̇_Z + η t [F^T û_w − û_V + C+1])
 6:   v_V = T_{B_O}(u̇_V + η t [F^T û_w − û_Z + C+1], Y_O)
 7:   u_w = T_W(v_w + η [F v_V − F v_Z])
 8:   u_Z = T_R(v_Z + η [F^T v_w − v_V + C+1])
 9:   u_V = T_{B_O}(v_V + η [F^T v_w − v_Z + C+1], Y_O)
10:   s_w = s_w + [F u_V − F u_Z]
11:   û = (t û + u) / (t + 1)
12: end for

1. SVM: Thresholded predictions derived from the i.i.d. SVM-learned metric.

2. Rnd: Thresholded predictions derived from a randomly selected metric.

3. Rnd-R: Approximate degree-constrained predictions derived from a randomly selected metric.

4. Raw: Thresholded predictions derived from the Euclidean metric.

5. Raw-R: Approximate degree-constrained predictions derived from the Euclidean metric.

6. NetSVM-R: Approximate degree-constrained predictions derived from the structurally-learned metric.

7. NetSVM-B: 0/1 degree-constrained predictions derived from the structurally-learned metric.

We used 5-fold cross-validation to measure the performance of all algorithms. The network completion tasks were defined as follows. First, we randomly selected a subset of n = 100 nodes. For each fold, we partitioned V into subsets $V_1, V_2$ with proportions 4 : 1. We used these node subsets to define subsets of edges $E_{train} = V_1 \times V_1 \cup V_2 \times V_2$ and $E_{test} = V_1 \times V_2 \cup V_2 \times V_1$ for training and testing. Following the protocol of (Liben-Nowell and Kleinberg, 2003), we remove low-degree nodes ($\delta < 2$).

For training the i.i.d. SVM, we used the edge features and labels from $E_{train}$. The problem was solved using the Spider optimization package. On each fold, we allowed the SVM solver to cross-validate over the slack parameter values C ∈ {0.005, 0.025, 0.125, 0.625, 3.125, 15.625, 78.125}. We used a single setting of γ = 0.1 in NetSVM for all experiments.

Receiver operating characteristic (ROC) curves are shown in Figure 4. In addition, Table 1 contains the following numeric performance metrics: accuracy, area under the ROC curve, and recall.

    6.0.4 CITATION NETWORK

Our first experiment was performed using the CoRA citation database (McCallum et al., 2000). The database contains document attributes (e.g., abstract, authors' names, date of publication, topic classification), in addition to a citation network (edges from one paper to another) for several thousand


Figure 4: Receiver operating characteristic curves for the proposed algorithm and several baselines on the CoRA citation network (left), online social network (middle), and Olivetti image equivalence graph (right). The algorithms are: (SVM) thresholded predictions derived from the i.i.d. SVM-learned metric; (Raw) thresholded predictions derived from the Euclidean metric; (Raw-R) approximate degree-constrained predictions derived from the Euclidean metric; (NetSVM-R) approximate degree-constrained predictions derived from the structurally-learned metric; (NetSVM-B) 0/1 degree-constrained predictions derived from the structurally-learned metric.

papers on machine learning. The empirical degree distribution was observed to exhibit scale-free behaviour. Assuming that the network formation is well described by a rich-get-richer model (Albert and Barabási, 2002), one can infer the expected degree of a node after a period of growth. For our experiments, we assumed that the degrees δ were known.

We constructed features $x_j$ based on the word counts in the respective fields for each document using latent Dirichlet allocation (LDA) (Griffiths and Steyvers, 2004). This resulted in a vector of 111 dimensions describing each document. We symmetrized the citation network to be compatible with our implementation.

    6.0.5 SOCIAL NETWORK

For our second experiment, we considered a data set derived from an online social network. Web pages and the accompanying "friendship" structure were manually downloaded in a neighbourhood of a seed page. We observed that the degree distribution of the graph exhibited a scale-free nature except at very low degrees, where the empirical distribution dropped below its expected value under this model. This behaviour has also been documented for certain models of network formation (Albert and Barabási, 2002), supporting our assumption that the degrees δ are known in advance. As before, we used LDA features based on the word counts within individual fields on a user's page, resulting in a 171-dimensional feature vector $x_j$. The friendship graph is undirected in the network we studied.

    6.0.6 OLIVETTI FACE IMAGES

The third network studied is one derived from an equivalence relation over n objects. An equivalence graph is the union of vertex-disjoint complete subgraphs induced by connecting pairs of objects that


                 Rand          Rand-R        Raw           Raw-R         SVM i.i.d.    NetSVM-R      NetSVM-B
                 mean  std     mean  std     mean  std     mean  std     mean  std     mean  std     mean  std
CoRA citation network
    acc          92.3  1.1     93.4  1.4     92.5  1.2     93.5  1.5     92.9  0.7     94.6  1.2     94.0  1.2
    auc          0.64  0.04    0.66  0.03    0.65  0.05    0.66  0.04    0.68  0.06    0.85  0.02    0.72  0.05
    recall       9.8   3.7     18.6  5.6     10.9  4.3     20.3  6.8     17.4  9.8     29.4  13.1    26.0  8.9
Online social network
    acc          92.5  0.8     92.2  0.5     92.6  0.9     92.2  0.5     92.3  0.7     93.0  0.7     92.5  0.9
    auc          0.62  0.05    0.63  0.05    0.63  0.06    0.64  0.06    0.62  0.05    0.80  0.03    0.66  0.05
    recall       8.5   2.7     15.5  4.7     9.3   2.6     15.8  4.1     8.6   3.3     26.0  5.8     19.2  7.9
Olivetti image equivalence graph
    acc          92.2  0.9     95.3  0.9     97.5  0.7     98.2  0.5     98.9  0.4     97.3  0.2     99.3  0.5
    auc          0.84  0.01    0.89  0.02    0.97  0.01    0.98  0.01    0.99  0.01    0.96  0.02    0.99  0.01
    recall       24.5  4.0     55.4  6.9     73.9  4.1     82.3  3.6     88.4  7.8     74.3  6.0     92.3  5.7

Table 1: Performance on various networks: (rows 1-3) CoRA citation network; (rows 4-6) online social network; (rows 7-9) Olivetti image equivalence graph. Bold values indicate the best performance on each metric and network.

belong to the same equivalence class. We constructed an equivalence graph using a collection of images from the Olivetti Research Laboratory.1 The collection contains 300 face images of 30 individuals, each observed under 10 different viewing conditions that included pose, eyewear, expression and illumination variations. The images are 92 x 112 gray-value images. We applied PCA to reduce the feature dimensionality to 30. The degree constraints were δj = 10.
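A rough sketch of how such an equivalence graph and its features might be assembled (Python, assuming numpy and scikit-learn are available; this is our illustration of the setup described above, not the authors' code):

    import numpy as np
    from sklearn.decomposition import PCA

    def build_equivalence_graph(images, labels, n_components=30):
        """Node features: PCA-reduced image vectors. Positive edges: pairs of
        images sharing a class label (same individual), giving a union of
        vertex-disjoint complete subgraphs; with self-loops counted, each node
        then has delta_j = 10 when there are 10 images per person."""
        X = PCA(n_components=n_components).fit_transform(images.reshape(len(images), -1))
        labels = np.asarray(labels)
        Y = (labels[:, None] == labels[None, :]).astype(int)   # equivalence-graph adjacency
        return X, Y

    # Toy stand-in for the Olivetti data: 300 random "images", 30 classes of 10
    rng = np.random.default_rng(0)
    images = rng.standard_normal((300, 92, 112))
    labels = np.repeat(np.arange(30), 10)
    X, Y = build_equivalence_graph(images, labels)
    print(X.shape, int(Y[0].sum()))   # (300, 30) node features; each row sums to 10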

    7. Related Work

A graph is a simple and generic model for representing pairwise relationships amongst a discrete set of objects; there are innumerable examples to be found in the scientific literature. Extensive prior work exists to formally and empirically characterize real-world network graphs. One insight is that both the topological properties of edges in a graph and the attribute properties of the vertices in a network are useful in working with real-world data. For instance, in a high school social network, two students with similar connectivities and many common friends are likely to eventually form a friendship. Similarly, two other students with similar tastes and personalities are likely to form a friendship.

What came as somewhat of a surprise was the fact that many naturally occurring graphs share a number of non-trivial global properties that distinguish them from random graphs. Empirical topological analyses of real networks include (Albert and Barabási, 2002), where networks are characterized by their degree distributions and the power laws they often obey. The scale-free behaviour of network topology has led to a greater understanding of the dynamics of disease transmission, computer viruses, internet-scale data mining and the functional role of individual genes. Several additional characteristics of networks have been developed to further classify and understand networks, including clustering coefficients, centrality, modularity, etc. These topological observations helped to spark the explosion of interest in networks.

Other works identify topological properties like the presence of frequently recurring subgraphs in community networks (Derenyi et al., 2005, Kashtan et al., 2004).

    1. http://www.uk.research.att.com/facedatabase.html


In (Kleinberg, 1998, Page et al., 1999, Ng et al., 2001), hubs and authorities and their self-reinforcing nature are described and used to improve web search. Dynamic network formation models describe topological properties over time (Albert and Barabási, 2002, Even-Dar and Kearns, 2006). These note that network formation schemes follow a rich-get-richer property as well as a graph diameter minimization property over time. Another topological model is that of (Liben-Nowell and Kleinberg, 2003), which performs dynamic edge prediction based on the current topological properties of an existing network.

    Additional references HERE.

At the other extreme is relational modeling from the attributes of objects. The attributes of pairs of objects can be learned using kernels and distance metrics (Xing et al., 2003, Goldberger et al., 2004, Shalev-Shwartz et al., 2004, Kondor and Jebara, 2006, Lanckriet et al., 2002, Alfakih et al., 1999). Applications of attribute-driven approaches include reconstruction of social and biological networks (Vert and Yamanishi, 2004, Ben-Hur and Noble, 2005, Kato et al., 2005, Yu et al., 2006, Rabbat et al., 2006). However, many of these do not explicitly enforce topological structure on the resulting networks and might make i.i.d. assumptions by basing connectivity only on distance measurements between object attributes.

Clearly some work exists between these two extremes of topology-driven and node-attribute-driven modeling. Some probabilistic approaches include (Getoor et al., 2001, Friedman and Koller, 2003, Taskar et al., 2004b, Pe’er et al., 2006). Probabilistic relational models use edges to predict classes/attributes of entities, and Markov networks bring in topological considerations and interdependency structure beyond i.i.d. Recently, discriminative and margin-based methods have been emerging as an alternative to probabilistic approaches; these focus learning resources on the prediction task and are generally referred to as structured-output models (Altun et al., 2003, Taskar et al., 2003, Altun et al., 2004, Tsochantaridis et al., 2004, Altun et al., 2005, Taskar et al., 2006). They allow interdependence between predictions in a non-i.i.d. setting, which makes them useful for incorporating topological network properties. Furthermore, some promising transduction extensions are emerging that allow structured output learning algorithms to use unlabeled examples (Altun et al., 2005).

These prior works motivate the approach in this paper, which explicitly incorporates attributes and topology into network reconstruction. The framework of structured-output modeling provides a path for handling both sources of information jointly while promising prediction accuracy via large margins and transductive arguments.

7.1 Structured Output Models

The task of learning to predict structured output variables has received a great deal of attention recently (see (Bakir et al., 2007) for a broad overview). Applications of structured output modeling can be categorized according to the structure of the problem. Application areas include natural language processing, computational molecular biology, and image processing. Structures of interest include:

    1. chains (label-sequences)

    2. trees (parse trees)

    3. matchings (word and sequence alignments)


    4. partitions of graphs (image segmentations)

The structure of the problem affects the algorithms used for inference and learning.

In natural language processing, structured learning methods have been developed for various forms of grammar learning, such as PCFG parsing, non-projective dependency parsing (directed spanning trees), and HMM (label-sequence) learning. These are used for parsing and part-of-speech tagging, where the structures of interest are trees and chains respectively. Methods typically employ dynamic programming (Viterbi-style decoding) along the input sequences to produce valid structures (Taskar et al., 2004a, Koo et al., 2007, Altun et al., 2003). For models that involve the partition function, the complexity of these techniques increases: PCFGs use the inside-outside algorithm, HMMs use the forward-backward algorithm, and directed spanning trees (non-projective dependency parsing) use the inside-outside algorithm with a reduction via the matrix-tree theorem. In multi-lingual settings, structured learning methods have been applied to align or match words in a parallel corpus, using polynomial-time combinatorial optimization techniques to obtain bipartite matching graphs (Taskar et al., 2005). There has also been work on structured prediction of higher-order matchings in machine translation, which allow a single word in one language to be matched to two words in the translated text (Lacoste-Julien et al., 2006).
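
To make the chain case concrete, here is a minimal sketch of Viterbi-style decoding over a label sequence; the unary and pairwise score arrays are placeholders for a learned compatibility function, not any particular model from the papers cited above.

    import numpy as np

    def viterbi(unary, pairwise):
        """unary: (T, K) per-position label scores; pairwise: (K, K) transition scores."""
        T, K = unary.shape
        score = unary[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + pairwise + unary[t][None, :]  # rows: previous label, cols: next label
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        labels = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            labels.append(int(back[t, labels[-1]]))
        return labels[::-1]

    rng = np.random.default_rng(0)
    print(viterbi(rng.random((6, 3)), rng.random((3, 3))))  # highest-scoring label sequence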

Structured models have also made forays into molecular biology. For example, one type of structural annotation that has received attention is the secondary structure of the amino acid chains that form proteins. These annotations are sequences of labels (α-helix, β-strand, coil) that describe the local bonding state of the amino acids in the chain. Secondary structures play an important role in determining the 3-dimensional (tertiary) structure, or fold, of the amino acid chain and the function of the protein. Using structured learning methods that incorporate dynamic programming techniques, analogous to those applied to languages, secondary structure annotations are predicted for amino acid sequences (Tsochantaridis et al., 2002, Gassend et al., 2007). Biological sequence alignment and protein homology prediction are considered in (Joachims, 2005) and (Yu et al., 2007), again using sequence-based techniques (Smith-Waterman). Another application that employs dynamic programming along a biological sequence appears in the work of (Rätsch et al., 2007), where exons and introns are identified along unspliced mRNA.

The applications in molecular biology go beyond the prediction of linear chains. In work by (Chatalbashev et al., 2005), structured learning is used to predict the disulphide bridges of a protein. These are bonds between distant cysteine residues of the linear amino acid chain that are responsible for the stability of the overall tertiary structure of the protein. This method characterizes the topology of the set of disulphide bridges as a perfect matching.
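
The decoding step in such a model amounts to a maximum-weight perfect matching over cysteine residues. A minimal sketch of that combinatorial step (not Chatalbashev et al.'s learned model; the residue positions and pairwise scores are placeholders) might look as follows:

    import itertools
    import numpy as np
    import networkx as nx

    cysteines = [3, 17, 42, 60]                        # hypothetical residue positions (even count)
    rng = np.random.default_rng(0)

    G = nx.Graph()
    for i, j in itertools.combinations(range(len(cysteines)), 2):
        G.add_edge(cysteines[i], cysteines[j], weight=float(rng.random()))

    # maxcardinality=True forces a perfect matching whenever one exists.
    bridges = nx.max_weight_matching(G, maxcardinality=True)
    print(sorted(tuple(sorted(e)) for e in bridges))   # predicted disulphide bridges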

For many of the above-mentioned applications, structured output models have proven to be more effective and are superseding the corresponding single- and multiple-stage techniques previously considered. Single-stage techniques ignore the structure (examples: PROFcon (Punta and Rost, 2005); alternatively-spliced exons (Ratsch et al., 2005)). Multi-stage techniques apply some form of structure projection after the i.i.d. learning. The multiple-stage approach is less advantageous because it decouples the final prediction from the original data. However, it remains to be seen whether structured output models can capture the complexities of such models while maintaining tractability.

One multi-stage model for predicting disulphide bridges uses four stages (Ceroni et al., 2006). A position- and sequence-independent support vector machine is trained to predict confidence values for the individual cysteine residue bonding states (bonded / non-bonded). A neural network is used to refine the i.i.d. predictions, with the objective of capturing correlations in bonding states along the sequence. A dynamic programming algorithm is then used to convert the sequence of posterior probabilities into a valid sequence of n = 2m bonded cysteine residues. Finally, a neural network is trained to predict the fraction of correctly assigned disulphide bridges from inputs consisting of the sequence of bonded / non-bonded cysteine residues and an actual matching consistent with the bonding states. Once trained, a brute-force algorithm is used for decoding, to find the matching with the highest predicted fraction of bridges.

Another interesting example of a 3-stage approach is employed in the prediction of a contact graph summarizing the tertiary structure of a folded amino acid chain (Cheng and Baldi, 2005): three-stage prediction of protein beta-sheets by neural networks, alignments and graph algorithms. First, a neural network is trained to predict beta-residue pairs (1’s in the adjacency matrix). Then dynamic programming is used to find the best alignment and corresponding scores for each pair of beta-strands. Finally, a greedy heuristic is proposed to find a graph matching that satisfies additional strand pairing constraints.

Multi-stage prediction models combine i.i.d. predictions with one or more post-processing steps to ensure valid structures are returned. For a growing number of structures, tractable algorithms for learning and inference have been developed in the structured-outputs framework, and the performance of these models supersedes that of multi-stage prediction models. However, it remains to be seen whether the structured output modeling framework can capture increasingly complex structures while maintaining tractability.

    The main algorithms that have been proposed for training structured output models are:

    HERE

    7.2 Graphs in Machine Learning

The abundance of relational data that has become available recently has caught the attention of the machine learning community. Graphs appear in a variety of supervised and unsupervised machine learning techniques. A large number of techniques have been designed that use a combination of node and edge attributes for enhanced predictive modeling. These can be categorized into several main groups:

7.2.1 METHODS THAT PREDICT GLOBAL PROPERTIES (CLASSIFY, REGRESS, CLUSTER MEMBERSHIP) OF, OR COMPARE, ENTIRE GRAPHS

    1. Graph summarization

    2. Graph kernels

    3. (Baldi, 2005)

4. (Ralaivola et al., 2005) Graph kernels for chemical informatics.

5. (Borgwardt et al., 2005) Protein function prediction via graph kernels - represent proteins as graphs, and use a random walk graph kernel to measure similarity (a minimal sketch of this kernel appears after this list)

    6. (Kudo et al., 2005) (Tsuda and Kudo, 2006) gSpan-based techniques, gBoost
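
The following sketch illustrates the random-walk graph kernel idea mentioned in item 5, in its simplest unlabeled form with a geometric weighting of walk lengths; it is illustrative only and not the protein kernel of Borgwardt et al.

    import numpy as np
    import networkx as nx

    def random_walk_kernel(G1, G2, lam=0.01):
        A1 = nx.to_numpy_array(G1)
        A2 = nx.to_numpy_array(G2)
        Ax = np.kron(A1, A2)                 # adjacency matrix of the direct product graph
        n = Ax.shape[0]
        # sum_k lam^k (# common walks of length k) = 1' (I - lam * Ax)^{-1} 1, for lam small enough
        return float(np.ones(n) @ np.linalg.solve(np.eye(n) - lam * Ax, np.ones(n)))

    print(random_walk_kernel(nx.cycle_graph(4), nx.path_graph(4)))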


    7.2.2 METHODS THAT PREDICT NODE LABELS.

1. Ranking and collective classification (using edges to introduce dependencies between predictions at nodes)

    (a) (Kleinberg, 1998)

    (b) (Page et al., 1999)

(c) (Baldi et al., 1999) using the past for secondary structure prediction. Bidirectional recurrent neural networks are used to model long-range interactions by way of forward and backward contextual information about the entire sequence that is available for the prediction of the individual residue labels.

    (d) [COHN, HOFMANN 2001]

(e) [JENSEN 2002] Linkage and autocorrelation cause feature selection bias

(f) (Dunn et al., 2005) edge-betweenness - cluster and assign cluster labels based on significant correlations with known labels

    (g) (Segal et al., 2001) Probabilistic models for gene expression.

    (h) (Segal et al., 2002) From promoter sequence to expression: a probabilistic framework.

    (i) (Sharan et al., 2007) network-based prediction of protein function - review paper

    2. Topology sensitive (link-based) clustering

(a) (Newman and Girvan, 2004) “Finding and evaluating community structure in networks”. Novel clustering algorithms have been proposed in the literature on social network analysis for elucidating community structure and aiding visualization.

(b) [MCCALLUM] Group analysis in relational data. Assign individuals to groups probabilistically.

(c) (Airoldi et al., 2006) stochastic block models of mixed membership - Bayesian - keyword MEMBERSHIP (i.e. groups) - they consider clustering or situating objects in a low-dimensional space (identifying groups). Furthermore, they consider the estimation of relational structures among the clusters (group-group communication). Examples of relational data are 1) hand-curated protein complex graphs, 2) experimentally inferred probabilistic protein complex graphs, 3) email communications between members of an organization, 4) sociometric relations among monks (e.g. equivalence relations via surveys). Asymmetric. Represent data as a collection of graphs.

(d) (Kranz et al., 2007) analysis of gene expression data on metabolic networks - uses a consecutive-ones algorithm to “simplify” the adjacency matrix - due to network modularity, “significant expression patterns of topologically associated genes enables the identification of functionally relevant central components in the network with respect to different conditions of interest”

    3. Manifold learning

    (a) (Weston et al., 2004) local-global protein similarity

    (b) (Zhou et al., 2005) Regularization framework for learning from graph data


    7.2.3 METHODS THAT PREDICT EDGE LABELS.

    1. Link analysis, and link prediction

(a) (Hasan et al., 2007) Link prediction by supervised learning. They use equal train/test splits, and acknowledge that the train/test splits are unrealistic.

(b) (Zhu et al., 2002) In link analysis, a user’s web browsing history is used to dynamically generate hyperlink recommendations. Their method is based on a Markov chain model that is parameterized by a probabilistic edge transition matrix. This matrix is learned from the browsing histories of a large collection of users. Pruning is used to select the most pertinent links during learning. Efficiency is obtained by compressing the edge transition matrix using a method proposed by [SPEARS].

2. Metric learning (kernels, similarities, distances). Metric learning methods have been proposed to learn pairwise relationships between objects. These methods parameterize and learn distance functions, kernels or similarity functions. Because they do not take the overall structure of all n^2 relations into account, these methods are less restricted than structured graph inference.

    (a) (Weston et al., 2006) Rankprop - diffusion on protein similarity graph (PSI-BLAST)

    (b) [MANY OTHERS] .....

3. Bayesian network structure learning. Several Bayesian methods have been proposed to reconstruct interdependencies among collections of random variables associated with objects. The goal of learning the structure is to recover the local interactions that govern the overall behaviour. The methods developed in this framework use a Bayesian scoring scheme to rank the alternative structures, and typically rely on heuristic algorithms to perform hill-climbing. The Markov assumption made by these models forces the underlying graphs to be directed acyclic graphs.

    (a) (Pe’er et al., 2001) Inferring subnetworks from perturbed expression profiles.

(b) (Friedman and Koller, 2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks.

    (c) (Segal et al., 2005) Learning Module Networks.

(d) (Sachs et al., 2005) Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data.

(e) (Lee et al., 2006) Efficient structure learning in Markov networks using L1 regularization.

(f) (Pe’er et al., 2006) MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals.

(g) (Jaimovich et al., 2006) Towards an integrated protein-protein interaction network: A relational Markov network approach.

    4. Graph inference / network reconstruction or completion


(a) (Quach et al., 2006) elucidating the structure of genetic regulatory networks - dynamic generative network-based model for gene expression profiles - learning with EM - avoids the NP-hard problem of recovering 0-1 network structure by learning real-valued weights over the complete graph

(b) (Yamanishi et al., 2004, Vert and Yamanishi, 2004) Data is first converted into kernels, normalized on the diagonal, centered in feature space. Expression data (RBF kernel sigma=5), Y2H protein interaction data (diffusion kernel beta=1), localization data (23 bits like GO, see Huh 2003) and phylogenetic profile (i.e. protein has an ortholog in each of 145 organisms) both use linear kernels. The gold standard (symmetric) interaction graph derived from KEGG (proteins are enzymes that catalyze successive reactions in a known pathway, direct physical protein-protein interactions, and gene expression regulation between a transcription factor and its target gene products - 769 nodes and 3702 edges) uses Kondor’s diffusion kernel (beta=1). Methods: 1) threshold kernel values, 2) spectral projection (dim=50), then threshold, 3) based on kernel CCA, maximize correlation with observed connectivity (gold standard) while regularizing entries, project onto the obtained eigenbasis, then threshold similarities. Results: 1) AUC close to 0.5 for separate kernels and for the kernel sum, with the thresholding and spectral approaches, 2) KCCA provides a boost for all methods, again the integrated kernel works best, 3) peak AUC when using dim=40 projections, declining up to dim=400.

(c) (Kato et al., 2005) Selective integration of multiple biological data for network inference

(d) (Geurts et al., 2006) completion of biological networks by the output kernel tree approach

    7.2.4 METHODS THAT JOINTLY MODEL NODE AND EDGE ATTRIBUTES AND LABELS

1. (Middendorf et al., 2004, 2005) Predicting binding sites and genetic regulatory response using classification

2. Relational learning - There is a collection of formulations for relational learning: probabilistic relational models, statistical relational learning and relational Markov networks. Although statistical relational learning can be applied to graph inference (Popescul and Ungar, 2003, Taskar et al., 2002), our method differs in that we incorporate global structural constraints.

3. [PRM - GETOOR 2001] Probabilistic relational models are joint probability models of entities and attributes in relational domains. Bayesian methods are typically used for training PRMs. Being quite general, they can be used to pose a variety of questions about a relational domain. However, due to their generality and large numbers of parameters, these models are typically hard to train.

4. (Popescul and Ungar, 2003) SRL for edge prediction. They search for discriminative features in a relational database, and train a logistic regression model for predicting edges. Relational feature generation is performed as a search within a refinement graph.

5. The statistical relational model of [SRL - GETOOR 2002] and the relational Markov network (Taskar et al., 2002) are also quite general. They specify templates to model the data in relational domains. Yet, they are more restrictive than the PRMs and the SRL model of (Popescul and Ungar, 2003).


    8. Conclusions

We have developed a novel framework and presented an efficient learning algorithm for solving structured network completion problems. The results on three networks demonstrate that it is possible to learn a metric that facilitates prediction of structured networks. These network predictions are useful because they not only predict edges accurately, but do so with high recall. An interesting application of our model would be to predict citations for an unfinished manuscript, as an aid to the author. Similarly, by predicting friendships, a social network can quickly connect new members.

We would like to extend the model in several ways. First, we would like to experiment with a directed version. Second, we would like to take advantage of specialized QP solvers to handle larger networks. Third, we would like to formalize the problem of “network filtering” by incorporating a noise model. Finally, we would like to remove our assumption that the in/out-degrees are known. For instance, we can refine the probabilistic model to allow networks with known upper and lower bounds δj,L, δj,U on their degrees. Assuming a particular model of network formation, the degree bounds can be chosen using a confidence interval around the expected degree of each node after some interval of network growth.

    References

E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing. Stochastic Block Models of Mixed Membership. In Statistical Network Analysis Workshop, ICML, 2006.

R. Albert and A.L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, 2002.

A.Y. Alfakih, A. Khandani, and H. Wolkowicz. Solving Euclidean Distance Matrix Completion Problems Via Semidefinite Programming. Computational Optimization and Applications, 12(1):13–30, 1999.

Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. Proc. ICML, 2003.

Y. Altun, T. Hofmann, and A.J. Smola. Gaussian process classification for segmenting and annotating sequences. ACM International Conference Proceeding Series, 2004.

Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured variables. Advances in Neural Information Processing Systems, 18, 2005.

S. Andrews. Learning from Ambiguous Examples. PhD thesis, Brown University, 2007.

G. Bakir, T. Hofmann, B. Schölkopf, and S. V. N. Vishwanathan. Predicting Structured Data. MIT Press, Cambridge, Massachusetts, 2007.

P. Baldi. Chemoinformatics, Drug Design, and Systems Biology. Genome Informatics Series, 16(2):281, 2005.

P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11):937–946, 1999.


A. Ben-Hur and W.S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl 1):i38–i46, 2005.

K. Bennett and A. Demiriz. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 11:368–374, 1998.

K. Bleakley, G. Biau, and J.P. Vert. Supervised Reconstruction of Biological Networks with Local Models. Bioinformatics, 23(13):57–65, 2007.

K.M. Borgwardt, C.S. Ong, S. Schoenauer, S.V.N. Vishwanathan, A. Smola, and H.P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, 2005.

A. Ceroni, A. Passerini, A. Vullo, and P. Frasconi. DISULFIND: a disulfide bonding state and cysteine connectivity prediction server. Nucleic Acids Research, 34(Web Server issue):W177, 2006.

V. Chatalbashev, B. Taskar, and D. Koller. Disulfide connectivity prediction via kernelized matching. In RECOMB, 2005.

G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller. Max-margin classification of incomplete data. In Advances in Neural Information Processing Systems, 2007.

J. Cheng and P. Baldi. Three-stage prediction of protein beta-sheets by neural networks, alignments and graph algorithms. Bioinformatics, 2005.

R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 2005.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, pages 265–292, December 2001.

I. Derenyi, G. Palla, and T. Vicsek. Clique percolation in random networks. Physical Review Letters, 2005.

R. Dunn, F. Dudbridge, and C.M. Sanderson. The Use of Edge-Betweenness Clustering to Investigate Biological Function in Protein Interaction Networks. BMC Bioinformatics, 2005.

E. Even-Dar and M. Kearns. A small world threshold for economic network formation. In NIPS, 2006.

C. Fremuth-Paeger and D. Jungnickel. Balanced network flows. A unifying framework for design and analysis of matching algorithms. Networks, 1998.

N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2003.

H. Gabow. An efficient implementation of Edmonds' algorithm for maximum matching on graphs. ACM, 1976.

A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In UAI, pages 148–155, 1998. URL citeseer.ist.psu.edu/gammerman98learning.html.


B. Gassend, C. W. O'Donnell, W. Thies, A. Lee, M. van Dijk, and S. Devadas. Learning biophysically-motivated parameters for alpha helix prediction. BMC Bioinformatics, 2007.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In ICML, 2001.

P. Geurts, N. Touleimat, M. Dutreix, and F. d'Alché-Buc. Completion of biological networks: the output kernel tree approach. Probabilistic Modeling and Machine Learning in Structural and Systems Biology, 2006.

A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2006.

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.

T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.

M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link Prediction using Supervised Learning. In SIAM International Data Mining Conference Workshop on Link Analysis, Counterterrorism and Security, 2007.

M.J. Herrgaard, M.W. Covert, and B.O. Palsson. Reconstruction of microbial transcriptional regulatory networks. Current Opinion in Biotechnology, 15(1):70–77, 2004.

A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman. Towards an integrated protein-protein interaction network: A relational Markov network approach. Journal of Computational Biology, 2006.

T. Joachims. Learning to Align Sequences: A Maximum-Margin Approach. Lecture Notes in Computational Science and Engineering, 49:57, 2005.

T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999. URL citeseer.nj.nec.com/joachims99transductive.html.

N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 2004.

T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 2005.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

R. Kondor and T. Jebara. Gaussian and Wishart hyperkernels. In NIPS, 2006.

T. Koo, A. Globerson, X. Carreras, and M. Collins. Structured Prediction Models via the Matrix-Tree Theorem. Proc. EMNLP, 2007.

A. L. Kranz, M. Oswald, T. Bonato, H. Seitz, G. Reinelt, H. Runz, J. Zschocke, R. Eils, and R. Konig. Analysis of gene expression data on metabolic networks. Abstract, 2007.


T. Kudo, E. Maeda, and Y. Matsumoto. An application of boosting to graph classification. Advances in Neural Information Processing Systems, 17, 2005.

S. Lacoste-Julien, B. Taskar, D. Klein, and M.I. Jordan. Word alignment via quadratic assignment. Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 112–119, 2006.

Gert R.G. Lanckriet, N. Cristianini, and P. Bartlett. Learning the kernel matrix with semi-definite programming. In ICML, 2002.

S. I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using L1-regularization. In NIPS, 2006.

D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.

A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. URL www.research.whizbang.com/data.

M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, and C. Leslie. Predicting genetic regulatory response using classification. Bioinformatics, 20(1):232–240, 2004.

M. Middendorf, A. Kundaje, M. Shah, Y. Freund, C.H. Wiggins, and C. Leslie. Motif Discovery through Predictive Modeling of Gene Regulation. Proc RECOMB, pages 538–552, 2005.

M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):26113, 2004.

A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In SIGIR, 2001.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1999.

D. Pe’er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17(Suppl 1):S215–24, 2001.

D. Pe’er, A. Tanay, and A. Regev. MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals. The Journal of Machine Learning Research, 7:167–189, 2006.

A. Popescul and L.H. Ungar. Statistical relational learning for link prediction. IJCAI Workshop on Learning Statistical Models from Relational Data, 2003.

M. Punta and B. Rost. PROFcon: novel prediction of long-range contacts. Bioinformatics, 2005.

M. Quach, P. Geurts, and F.A. Buc. Elucidating the structure of genetic regulatory networks: a study of a second order dynamical model on artificial data. In Proceedings of the XIVth European Symposium on Artificial Neural Networks (ESANN), 2006.


M. Rabbat, M. Figueiredo, and R. Nowak. Network inference from co-occurrences. Technical report, University of Wisconsin, 2006.

L. Ralaivola, S.J. Swamidass, H. Saigo, and P. Baldi. Graph kernels for chemical informatics. Neural Networks, 18(8):1093–1110, 2005.

G. Ratsch, S. Sonnenburg, and B. Scholkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, 2005.

G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.R. Müller, R.J. Sommer, and B. Schölkopf. Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning. PLoS Computational Biology, 3(2):313–322, 2007.

K. Sachs, O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308(5721):523–529, 2005.

E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Probabilistic models for gene expression. Bioinformatics, 1(1):1–9, 2001.

E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. From promoter sequence to expression: a probabilistic framework. Proceedings of the sixth annual international conference on Computational biology, pages 263–272, 2002.

E. Segal, D. Pe’er, A. Regev, D. Koller, and N. Friedman. Learning Module Networks. The Journal of Machine Learning Research, 6:557–588, 2005.

Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In ICML, 2004.

R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function. Molecular Systems Biology, page 1, 2007.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), pages 895–902, 2002.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. D. Manning. Max-margin parsing. In EMNLP, 2004a.

B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In NIPS, 2004b.

B. Taskar, S. Lacoste-Julien, and D. Klein. A discriminative matching approach to word alignment. In EMNLP, 2005.

B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. JMLR, 2006.

I. Tsochantaridis, Y. Altun, and T. Hofmann. A crossover between SVMs and HMMs for protein structure prediction. NIPS Workshop on Machine Learning Techniques for Bioinformatics, 2002.


I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

K. Tsuda. Support Vector Classifier with Asymmetric Kernel Functions. European Symposium on Artificial Ne