
On Learning Discrete Graphical Models using Group-Sparse Regularization

Ali Jalali, ECE, UT Austin, [email protected]
Pradeep Ravikumar, CS, UT Austin, [email protected]
Vishvas Vasuki, CS, UT Austin, [email protected]
Sujay Sanghavi, ECE, UT Austin, [email protected]

Abstract

We study the problem of learning the graph structure associated with a general discrete graphical model (each variable can take any of $m > 1$ values, and the clique factors have maximum size $c \geq 2$) from samples, under high-dimensional scaling where the number of variables $p$ could be larger than the number of samples $n$. We provide a quantitative consistency analysis of a procedure based on node-wise multi-class logistic regression with group-sparse regularization.

We first consider general $m$-ary pairwise models, where each factor depends on at most two variables. We show that when the number of samples scales as $n > K(m-1)^2 d^2 \log\bigl((m-1)^2(p-1)\bigr)$, where $d$ is the maximum degree and $K$ a fixed constant, the procedure succeeds in recovering the graph with high probability. For general models with $c$-way factors, the natural multi-way extension of the pairwise method quickly becomes computationally intractable, so we study the effectiveness of using the pairwise method even when the true model has higher-order factors. Surprisingly, we show that under slightly more stringent conditions, the pairwise procedure still recovers the graph structure when the number of samples scales as $n > K(m-1)^2 d^{\frac{3}{2}c-1} \log\bigl((m-1)^c(p-1)^{c-1}\bigr)$.

1 Introduction

Markov Random Fields and Structure Learning. Undirected graphical models, also known as Markov random fields, are used in a variety of domains, including statistical physics [14], natural language processing [19], image analysis [35, 13, 6], and spatial statistics [27], among others. A Markov random field (MRF) over a $p$-dimensional discrete random vector $X = (X_1, X_2, \ldots, X_p)$ is specified by an undirected graph $G = (V, E)$, with vertex set $V = \{1, 2, \ldots, p\}$, one vertex for each variable, and edge set $E \subseteq V \times V$. The structure of this graph encodes certain conditional independence assumptions among subsets of the variables. In this paper, we consider the task of structure learning, i.e. estimating the underlying graph structure associated with a general discrete Markov random field from $n$ independent and identically distributed samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.)

High-dimensional setting and Group sparsity. We are interested in structure learning in the setting where the dimensionality $p$ of the data is larger than the number of samples $n$. While classical procedures typically break down under such high-dimensional scaling, an active line of recent research has shown that it is still possible to obtain practical consistent procedures by leveraging low-dimensional structure. The most popular example is that of leveraging sparsity using $\ell_1$-regularization (e.g., [4, 12, 21, 23, 31, 34, 37]). For MRF structure learning, such $\ell_1$-regularization has been successfully used for Gaussian [21] and discrete binary pairwise (i.e. Ising) models [26, 17]. In these instances, there is effectively only one parameter per edge, so that a sparse graph corresponds to a sparse set of parameters. In this paper, we are interested in more general discrete graphical models, where each variable can take $m$ possible values and factors can be of order higher than two. We now have multiple parameters per edge, and thus the relevant low-dimensional structure is that of group sparsity: all parameters of an edge form a group, and a sparse graph corresponds to only certain groups of parameters being non-zero. The counterpart of $\ell_1$ regularization for such group-sparse structure is $\ell_1/\ell_q$ regularization for $q > 1$, where we collect the $\ell_q$ norms of the groups and compute their overall $\ell_1$ norm.

Recent work on group and block-sparse linear regression [32, 8, 22, 18, 24, 25, 2] shows that in such group-sparse settings, group-sparse regularization outperforms $\ell_1$ penalization.

Our Results: Pairwise m-ary models. In this paper, we provide a quantitative consistency analysis of group-sparse regularized structure recovery for general discrete graphical models. We first consider the case of pairwise but otherwise $m$-ary discrete graphical models, and analyze a group-sparse variant of the procedures in [26, 21]: for each vertex $r \in V$, we estimate its neighborhood set using $\ell_1/\ell_2$-regularized maximum conditional likelihood. This reduces to multi-class logistic regression, for which we characterize the number of samples needed for sparsistency, i.e. consistent recovery of the group support set with high probability. This analysis extends recent high-dimensional analyses for linear models to logistic models, and is of independent interest even outside the context of graphical models. We then combine the neighborhood sets across vertices to form the graph estimate. There has been a strong line of work on developing fast algorithms to solve these group-sparse multiclass logistic regression programs, including Meier et al. [20] and Krishnapuram et al. [15]. Indeed, [9, 10] show good empirical performance using such $\ell_1/\ell_q$ regularization even with the joint likelihood over all variables.

Our Results: General m-ary models. One natural, but expensive, extension to graphical models with higher-order factors is to again use group-sparse regularization, but with higher-order factors as groups. However, this leads to prohibitive computational complexity; e.g. there are $O(p^c)$ possible factors of order $c$. Indeed, in their empirical study of such regularizations, Dahinden et al. [9, 10] could only scale up to small graph sizes, even while using some intelligent heuristics. This motivates our second main result. Suppose we solve the pairwise graphical model estimation problem even when the true model has higher-order factors. What is the relationship of this estimate to the true underlying graph? We investigate this for hierarchical graphical models, in which the absence of any lower-order factor also implies the absence of factors over supersets of the lower-order factor variables. Higher-order factors could, in principle, cause our pairwise estimator to include spurious edges. Surprisingly, we obtain the result that, under slightly more stringent assumptions on the scaling of the sample size (dependent on the size of the higher-order factors), the pairwise estimator excludes the irrelevant edges and includes all "dominant" pairwise edges whose parameters are larger than a certain threshold that depends on the size of the parameter values of the higher-order factors. As a consequence, if all pairwise effects are dominant enough, we recover the graph exactly even while using a simple pairwise estimator. Even otherwise, the guaranteed exclusion of false edges could be used as the basis for further greedy procedures, though we defer further discussion to the sequel.

Existing approaches. Methods for estimating such graph structure include those based on constraint and hypothesis testing [29], and those that estimate restricted classes of graph structures such as trees [5], polytrees [11], and hypertrees [30]. Another class of approaches estimates the local neighborhood of each node via exhaustive search for the special case of bounded-degree graphs. Abbeel et al. [1] propose a method for learning factor graphs based on local conditional entropies and thresholding, but its computational complexity grows at least as quickly as $O(p^{d+1})$, where $d$ is the maximum neighborhood size in the graphical model. Bresler et al. [3] describe a related local search-based method, and prove under relatively mild assumptions that it can recover the graph structure with $\Omega(\log p)$ samples. However, in the absence of additional restrictions, the computational complexity of the method is $O(p^{d+1})$. Csiszar and Talata [7] show consistency of a method that uses pseudo-likelihood and a modification of the BIC criterion, but this also involves a prohibitively expensive search.

2 Problem Setup and Notation

MRFs and their Parameterization. We consider the task of estimating the graph structure associated with a general discrete Markov random field. Let $X = (X_1, \ldots, X_p)$ be a random vector, each variable $X_i$ taking values in a discrete set $\mathcal{X} = \{1, 2, \ldots, m\}$ of cardinality $m$. Let $G = (V, E)$ denote a graph with $p$ nodes, corresponding to the $p$ variables $\{X_1, \ldots, X_p\}$. Let $\mathcal{C}$ be a set of cliques (fully-connected subgraphs) of the graph $G$, and let $\{\phi_C : \mathcal{X}^{|C|} \to \mathbb{R}, \; C \in \mathcal{C}\}$ be a set of "clique potential" functions. With this notation, the distribution of $X$ takes the form
$$\mathbb{P}(x) \propto \exp\Bigl\{\sum_{C \in \mathcal{C}} \phi_C(x_C)\Bigr\}. \qquad (1)$$

Since $\mathcal{X}$ is discrete, each potential function $\phi_C$ can be parameterized as a linear combination of $\{0,1\}$-valued indicator functions, one for each configuration of $x_C$. For each $s \in V$ and $j \in \{1, \ldots, m-1\}$, we can define the node-wise indicators
$$\mathbb{I}[x_s = j] = \begin{cases} 1 & \text{if } x_s = j, \\ 0 & \text{otherwise.} \end{cases}$$
Note that we omit an indicator for $x_s = m$ from the list, since it is redundant given the indicators for $j = 1, \ldots, m-1$. In a similar fashion, we can define the $|C|$-way clique-wise indicator functions $\mathbb{I}[x_C = v]$, for $v \in \{1, 2, \ldots, m-1\}^{|C|}$.

With this notation, any set of potential functions can then be written as
$$\phi_C(x_C) = \sum_{v \in \{1, \ldots, m-1\}^{|C|}} \theta^*_{C;v} \, \mathbb{I}[x_C = v] \quad \text{for } C \in \mathcal{C}.$$
Thus, (1) can be rewritten as
$$\mathbb{P}_{\theta^*}(x) \propto \exp\Bigl\{\sum_{C \in \mathcal{C};\, v \in \{1, \ldots, m-1\}^{|C|}} \theta^*_{C;v} \, \mathbb{I}[x_C = v]\Bigr\}. \qquad (2)$$

Thus, the Markov random field can be parameterized in terms of the collection of tensors $\theta^* := \{\theta^*_{C;v} : C \in \mathcal{C}, \; v \in \{1, \ldots, m-1\}^{|C|}\}$. In the sequel, it will be useful to collate these into vectors $\theta^*_C \in \mathbb{R}^{(m-1)^{|C|}}$ associated with the cliques $C \in \mathcal{C}$.

Pairwise Markov Random Fields. Here the set of cliques consists of the set of nodes $V$ and the set of edges $E$. Thus, using node-wise and pairwise indicator functions as before, any pairwise MRF over $(X_1, \ldots, X_p)$ can be expressed as
$$\mathbb{P}(x) \propto \exp\Bigl\{\sum_{s \in V;\, j \in \{1, \ldots, m-1\}} \theta_{s;j} \, \mathbb{I}[x_s = j] + \sum_{(s,t) \in E;\, j,k \in \{1, \ldots, m-1\}} \theta_{st;jk} \, \mathbb{I}[x_s = j, x_t = k]\Bigr\}, \qquad (3)$$
for a set of parameters $\theta^* := \{\theta^*_{s;j}, \theta^*_{st;jk} : s, t \in V; (s,t) \in E; j, k \in \{1, \ldots, m-1\}\}$. It will be useful to collate these into vectors $\theta^*_s \in \mathbb{R}^{m-1}$ for each $s \in V$, and vectors $\theta^*_{st} \in \mathbb{R}^{(m-1)^2}$ associated with each edge.
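To make the parameterization concrete, here is a minimal numerical sketch (ours, not taken from the paper) of evaluating the exponent in (3) for a single configuration; the names `theta_node` and `theta_edge` and the data layout are illustrative assumptions, and each edge carries its own $(m-1) \times (m-1)$ block of parameters.

```python
import numpy as np

def unnormalized_log_prob(x, theta_node, theta_edge):
    """Evaluate the exponent of (3) for one configuration x in {1,...,m}^p.

    theta_node: array of shape (p, m-1), entry [s, j-1] = theta_{s;j}.
    theta_edge: dict mapping an edge (s, t) with s < t to an
                (m-1, m-1) array whose [j-1, k-1] entry is theta_{st;jk}.
    The state m carries no parameter (its indicators are omitted).
    """
    val = 0.0
    for s, xs in enumerate(x):
        if xs < len(theta_node[s]) + 1:          # i.e. xs != m
            val += theta_node[s][xs - 1]
    for (s, t), block in theta_edge.items():
        if x[s] != block.shape[0] + 1 and x[t] != block.shape[1] + 1:
            val += block[x[s] - 1, x[t] - 1]
    return val

# Example: p = 3 variables, m = 3 states, a single edge (0, 1).
theta_node = np.zeros((3, 2))
theta_edge = {(0, 1): np.array([[1.0, -0.5], [0.2, 0.0]])}
print(unnormalized_log_prob([1, 2, 3], theta_node, theta_edge))  # prints -0.5
```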

Graphical Model Selection. Suppose that we are given a collection $\mathcal{D} := \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ samples, where each $p$-dimensional vector $x^{(i)} \in \{1, \ldots, m\}^p$ is drawn i.i.d. from a distribution $\mathbb{P}_{\theta^*}$ of the form (2), for parameters $\theta^*$ and graph $G = (V, E^*)$ over the $p$ variables. The goal of graphical model selection is to infer the edge set $E^*$ of the graphical model defining the probability distribution that generates the samples. Note that the true edge set $E^*$ can also be expressed as a function of the parameters as
$$E^* = \{(s,t) \in V \times V : \exists C \in \mathcal{C} \text{ with } \{s,t\} \subseteq C \text{ and } \theta^*_C \neq 0\}. \qquad (4)$$
In this paper, we focus largely on the special case of pairwise Markov random fields.

2.1 Pairwise Model Selection

We now describe the graph selection procedure we study for the $m$-ary pairwise model. It is the natural generalization of the procedures for binary graphical models [26] and Gaussian graphical models [21]. Specifically, we first focus on recovering the neighborhood of a fixed vertex $r \in V$, and then combine the neighborhood sets across vertices to form the graph estimate.

Let us define the vector $\theta^*_{\setminus r} \in \mathbb{R}^{(m-1)^2(p-1)}$, which is the concatenation of $(p-1)$ groups, i.e. one (short) vector $\theta^*_{rt} \in \mathbb{R}^{(m-1)^2}$ for each $t \in V \setminus \{r\}$. Note that $r$ having a small neighborhood is equivalent to many of these vectors $\theta^*_{rt}$ being zero; in particular, the problem of neighborhood estimation for vertex $r$ corresponds to the recovery of the set
$$\mathcal{N}(r) = \bigl\{u \in V \setminus \{r\} : \|\theta^*_{ru}\|_2 \neq 0\bigr\}.$$
This is precisely the structure captured by group sparsity. In particular, each $\theta^*_{rt}$, with $t \in V \setminus \{r\}$, corresponds to a group; if $r$ has a small neighborhood, only a few of these groups will be non-zero.

In order to estimate the neighborhood $\mathcal{N}(r)$, we thus perform a regression of $X_r$ on the rest of the variables $X_{\setminus r}$, using the group-sparse regularizer $\|\theta_{\setminus r}\|_{1,2} := \sum_{u \in V \setminus \{r\}} \|\theta_{ru}\|_2$. The conditional distribution of $X_r$ given the other variables $X_{\setminus r} = \{X_t : t \in V \setminus \{r\}\}$ takes the form
$$\mathbb{P}_{\theta^*}(X_r = j \mid X_{\setminus r} = x_{\setminus r}) = \frac{\exp\bigl(\theta^*_{r;j} + \sum_{t \in V \setminus \{r\}} \sum_k \theta^*_{rt;jk} \mathbb{I}[x_t = k]\bigr)}{1 + \sum_{\ell} \exp\bigl(\theta^*_{r;\ell} + \sum_{t \in V \setminus \{r\}} \sum_k \theta^*_{rt;\ell k} \mathbb{I}[x_t = k]\bigr)}, \qquad (5)$$
for all $j \in \{1, \ldots, m-1\}$. Thus, $X_r$ can be viewed as the response variable in a multiclass logistic regression, in which the indicator functions associated with the other variables,
$$\bigl\{\mathbb{I}[x_t = k] : t \in V \setminus \{r\}, \; k \in \{1, 2, \ldots, m-1\}\bigr\},$$
play the role of the covariates.

Thus, we study the following convex program as an estimate for $\theta^*_{\setminus r}$:
$$\widehat{\theta}_{\setminus r} \in \arg\min_{\theta_{\setminus r} \in \mathbb{R}^{(m-1)^2(p-1)}} \bigl\{\ell(\theta_{\setminus r}; \mathcal{D}) + \lambda_n \|\theta_{\setminus r}\|_{1,2}\bigr\}, \qquad (6)$$
where $\ell(\theta_{\setminus r}; \mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell^{(i)}(\theta_{\setminus r}; \mathcal{D}) := -\frac{1}{n}\sum_{i=1}^n \log \mathbb{P}_{\theta}\bigl(X_r = x^{(i)}_r \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr)$ is the rescaled (negative) multiclass logistic log-likelihood defined by the conditional distribution (5), and $\lambda_n > 0$ is a regularization parameter. The convex program (6) is an $\ell_1/\ell_2$-regularized multiclass logistic regression problem, and is thus the multiclass logistic analog of the group Lasso [36].

The solution of the program (6) yields an estimate $\widehat{\mathcal{N}}(r)$ of the neighborhood of node $r$ via
$$\widehat{\mathcal{N}}(r) = \{t \in V : t \neq r, \; \|\widehat{\theta}_{rt}\|_2 \neq 0\}.$$
We are interested in the event that all the node neighborhoods are estimated exactly, $\{\widehat{\mathcal{N}}(r) = \mathcal{N}(r) \;\; \forall r \in V\}$, which we also write as $\{\widehat{E} = E^*\}$ since it entails that the full graph is estimated exactly.
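The following is a minimal, self-contained sketch (ours, not the authors' implementation) of the node-wise estimator (6), using proximal gradient descent with group soft-thresholding to handle the $\ell_1/\ell_2$ penalty; the function name `fit_node`, the step size, the iteration count, and the zero-threshold are illustrative assumptions, and `lam` plays the role of $\lambda_n$.

```python
import numpy as np

def fit_node(X, r, m, lam, step=0.1, iters=500):
    """Group-sparse multiclass logistic regression of X[:, r] on X[:, t], t != r.

    X: (n, p) integer array with states in {1, ..., m}; state m is the
    reference state whose indicators/parameters are omitted, as in (5)-(6).
    Returns W of shape (m-1, (p-1)*(m-1)), intercepts b of shape (m-1,), and
    the estimated neighborhood {t : ||W_{rt}||_2 != 0}.
    """
    n, p = X.shape
    others = [t for t in range(p) if t != r]
    # Indicator covariates I[x_t = k] for k = 1, ..., m-1.
    Z = np.zeros((n, (p - 1) * (m - 1)))
    for j, t in enumerate(others):
        for k in range(1, m):
            Z[:, j * (m - 1) + k - 1] = (X[:, t] == k)
    y = X[:, r] - 1                       # classes 0, ..., m-1; class m-1 is the reference
    W = np.zeros((m - 1, Z.shape[1]))
    b = np.zeros(m - 1)
    for _ in range(iters):
        logits = np.hstack([Z @ W.T + b, np.zeros((n, 1))])   # reference logit = 0
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        Y = np.eye(m)[y]
        G = (P - Y)[:, : m - 1]                                # residuals for non-reference classes
        W -= step * (G.T @ Z) / n                              # gradient step on the smooth loss
        b -= step * G.mean(axis=0)
        for j in range(p - 1):                                 # group soft-thresholding: prox of lam*||.||_{1,2}
            blk = W[:, j * (m - 1):(j + 1) * (m - 1)]
            nrm = np.linalg.norm(blk)
            blk *= 0.0 if nrm == 0 else max(0.0, 1 - step * lam / nrm)
    nbhd = {others[j] for j in range(p - 1)
            if np.linalg.norm(W[:, j * (m - 1):(j + 1) * (m - 1)]) > 1e-6}
    return W, b, nbhd
```

Combining the estimated neighborhoods across all nodes $r$ (for instance by taking the union or the intersection of the node-wise estimates) then yields the graph estimate.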

Sparsistency. Our main result is a high-dimensional analysis of the estimator (6), where we allow the problem dimensions, such as the number of nodes $p$, the maximum node degree $d$, the size of the state space $m$ (and, in the case of higher-order MRFs, the maximum clique size $c$), to vary with the number of observations $n$. Our goal is to establish sufficient conditions on the scaling of $(n, p, d, m, c)$ such that our proposed estimator is consistent in the sense that
$$\mathbb{P}\bigl[\widehat{E}_n = E^*\bigr] \to 1 \quad \text{as } n \to +\infty.$$
We sometimes call this property sparsistency, as a shorthand for consistency of the sparsity pattern of the parameters.

2.2 Higher-order Model Selection

Natural, high-complexity extension. Let us first see what this model-selection recipe of node-wise regression with group-sparse regularization would entail when extended to general higher-order Markov random fields of the form (2). Recall that such a higher-order MRF is parameterized by vectors $\theta^*_C \in \mathbb{R}^{(m-1)^{|C|}}$ for $C \in \mathcal{C}$. Let $c$ be the maximum clique size. It is convenient to view the parameters as a collection of $\sum_{j=1}^{c} \binom{p}{j}$ vectors indexed by the cliques $C$ of size at most $c$, but non-zero if and only if the clique $C \in \mathcal{C}$.

Again, we fix a node $r$, and define the long vector $\theta^*_{\setminus r} \in \mathbb{R}^{\sum_{j=1}^{c-1} \binom{p-1}{j}(m-1)^{j+1}}$ as the concatenation of the parameter vectors $\theta^*_{rC}$ for all $C \subseteq V \setminus \{r\}$ with $|C| < c$. Note that recovery of the neighborhood of a vertex $r$ corresponds to the recovery of the set
$$\mathcal{N}(r) = \bigl\{u \in V \setminus \{r\} : \exists C \subseteq V \setminus \{r, u\} \text{ with } \|\theta^*_{ruC}\|_2 \neq 0\bigr\}.$$
Thus, we could again make use of group sparsity, where in this case the groups of parameters are the parameter vectors $\theta^*_{rC}$ for the different $C \subseteq V \setminus \{r\}$ with $|C| < c$. We can then see that a small neighborhood $\mathcal{N}(r)$ for node $r$ entails that $\theta^*_{\setminus r}$ will have many of these groups be zero. The group-structured penalty would then take the form $\|\theta_{\setminus r}\|_{1,2} := \sum_{C \subseteq V \setminus \{r\},\, |C| < c} \|\theta_{rC}\|_2$.

Thus we would solve
$$\min_{\theta_{\setminus r} \in \mathbb{R}^{\sum_{j=1}^{c-1}\binom{p-1}{j}(m-1)^{j+1}}} \bigl\{\ell(\theta_{\setminus r}; \mathcal{D}) + \lambda_n \|\theta_{\setminus r}\|_{1,2}\bigr\}, \qquad (7)$$
where $\ell(\theta_{\setminus r}; \mathcal{D})$ is the (negative) log-likelihood of the data as before. Dahinden et al. [9, 10] studied the related program of $\ell_1/\ell_2$-regularized maximum likelihood over the complete graph (instead of node-wise regressions), and showed good empirical performance for discrete graphical model structure recovery. The caveat with the higher-order group-sparse approach is its prohibitive computational complexity, illustrated in the sketch below. Note that the number of parameters is $\sum_{j=1}^{c-1} \binom{p-1}{j}(m-1)^{j+1}$, which scales prohibitively even for moderate $c$. Indeed, even the computations in the pairwise case are not inexpensive.
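As a rough numerical illustration (a sketch under the stated parameterization, not code from the paper), the snippet below compares the number of parameters in the higher-order node-wise regression (7) with that of the pairwise regression (6):

```python
from math import comb

def num_params_higher_order(p, m, c):
    # number of parameters in the node-wise regression (7):
    # sum_{j=1}^{c-1} C(p-1, j) (m-1)^{j+1}
    return sum(comb(p - 1, j) * (m - 1) ** (j + 1) for j in range(1, c))

def num_params_pairwise(p, m):
    # number of parameters in the pairwise regression (6): (p-1)(m-1)^2
    return (p - 1) * (m - 1) ** 2

p, m = 100, 3
for c in (2, 3, 4, 5):
    print(c, num_params_higher_order(p, m, c), num_params_pairwise(p, m))
# For p = 100, m = 3 the count grows from 396 at c = 2 into the millions already at c = 4.
```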

Sparsistency of a Simpler Estimate. But as we show in Section 4, even when the underlying model is a higher-order MRF, surprisingly, just solving the pairwise program (6) suffices to recover the true edges under certain conditions. Thus, in our second main result, we again analyze the sparsistency of the estimator (6), but now for the case where the underlying model is a higher-order MRF.

2.3 Notation

We use the following notation for group-structured norms. For any vector $u \in \mathbb{R}^p$ whose index set $\{1, \ldots, p\}$ is partitioned into a set of $T$ disjoint groups $\mathcal{G} = \{G_1, \ldots, G_T\}$, we define $\|u\|_{\mathcal{G},a,b} = \bigl\|\bigl(\|u_{G_1}\|_b, \ldots, \|u_{G_T}\|_b\bigr)\bigr\|_a$, i.e. the $\ell_b$ norm is taken within each group and the $\ell_a$ norm across groups. In our case, for the pairwise model, the node-wise regression has the parameter vector $\theta^*_{\setminus r} \in \mathbb{R}^{(m-1)^2(p-1)}$. Its groups are collated on the edges: $\mathcal{G} = \{G_{rt} : t \in V \setminus \{r\}\}$, where $G_{rt}$ is the index set of the parameters on the $(r,t)$ edge, $\{\theta_{rt;jk} : j, k \in \{1, \ldots, m-1\}\}$.

Similarly, suppose $\theta^*_{\setminus r}$ is the node-wise regression parameter for the higher-order model case. Then its groups are collated on the cliques: $\mathcal{G} = \{G_{rC} : C \subseteq V \setminus \{r\}, \; |C| < c\}$, where $G_{rC}$ is the index set of the parameters on the clique $\{r\} \cup C$, namely $\{\theta_{rC;jv} : j \in \{1, \ldots, m-1\}, \; v \in \{1, \ldots, m-1\}^{|C|}\}$. In the sequel, we suppress the dependence of the group norms on these group partitions $\mathcal{G}$ when it is clear from context, and simply write $\|\theta^*_{\setminus r}\|_{a,b}$ for $\|\theta^*_{\setminus r}\|_{\mathcal{G},a,b}$.

We focus on the choice $a = 1$, $b = 2$, which yields the group-lasso penalty [36]. For a matrix $M \in \mathbb{R}^{p \times p}$, denoting the $i$-th row of $M$ by $M^i$, we can define the analogous group-structured norms on matrices: $\|M\|_{(a,b),(c,d)} := \bigl\|\bigl(\|M^1\|_{c,d}, \ldots, \|M^p\|_{c,d}\bigr)\bigr\|_{a,b}$. In our analysis we always use $b = d = 2$, so we use the abbreviated notation $\|M\|_{a,c}$ to denote $\|M\|_{(a,2),(c,2)}$.
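As a small illustrative sketch (with hypothetical group indices), the group-structured norm $\|u\|_{\mathcal{G},a,b}$ defined above can be computed as follows; the default arguments give the $\ell_1/\ell_2$ group-lasso penalty.

```python
import numpy as np

def group_norm(u, groups, outer=1, inner=2):
    """||u||_{G,outer,inner}: the l_inner norm within each group, then the l_outer norm across groups."""
    per_group = np.array([np.linalg.norm(u[g], ord=inner) for g in groups])
    return np.linalg.norm(per_group, ord=outer)

u = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(group_norm(u, groups))                 # ||u||_{1,2} = 5 + 0 + 1 = 6 (group-lasso penalty)
print(group_norm(u, groups, outer=np.inf))   # ||u||_{inf,2} = max(5, 0, 1) = 5
```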

3 Pairwise Discrete Graphical Models

Let $S_r = \{u \in V : (r,u) \in E\}$ be the set of all neighbors of the node $r$ in the graph, and $S^c_r = V \setminus S_r$. Notice that $\|\theta^*_{ru}\|_2 = 0$ for all $u \in S^c_r$. Fixing $r \in V$, and defining $\theta^*_{\setminus r}$ as before, let $S^{(\mathrm{ex})}_r$ be the index set of the non-zero parameters $\{\theta^*_{rt;jk} \neq 0\}$ in $\theta^*_{\setminus r}$. When clear from context, we will overload notation and again use $S_r$ for this index set.

Let $Q^* = \mathbb{E}\bigl[-\nabla^2 \log \mathbb{P}_{\theta^*_{\setminus r}}(X_r \mid X_{\setminus r})\bigr]$ be the population Fisher information matrix; note that $Q^* \in \mathbb{R}^{(m-1)^2(p-1) \times (m-1)^2(p-1)}$. Similarly, let $Q^n = \frac{1}{n}\sum_{i=1}^n \nabla^2 \ell^{(i)}(\theta^*_{\setminus r}; \mathcal{D})$ be the sample Fisher information matrix.

Define $J^* = \mathbb{E}\bigl[\mathbb{I}[x_{t_1} = k_1]\, \mathbb{I}[x_{t_2} = k_2]^T\bigr] \in \mathbb{R}^{(m-1)(p-1) \times (m-1)(p-1)}$, the second-moment matrix of the indicator covariates, and define $J^n$ to be the empirical mean of the same quantity over the $n$ drawn samples. In the proofs (specifically, in analyzing the derivative of the Hessian of the log-likelihood function), we will actually need control over $\bar{J}^* := J^* \otimes \mathbf{1}_{(m-1) \times (m-1)}$, the Kronecker product of $J^*$ and the all-ones matrix (which is of the same size as $Q^*$). But by properties of Kronecker products we have $\lambda_{\max}(\bar{J}^*) = \lambda_{\max}(J^*)$, so that it suffices to impose assumptions on the maximum eigenvalues of $J^*$ and $J^n$.
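The following sketch (our own illustration, with assumed variable names and node terms omitted) shows how the sample Fisher information $Q^n$ and the covariate second-moment matrix $J^n$ can be formed directly from their definitions, using the fact that each per-sample Hessian factorizes into an indicator outer product and an $(m-1) \times (m-1)$ multinomial-logistic curvature block; the chosen parameter ordering is an assumption of this sketch.

```python
import numpy as np

def empirical_fisher(X, r, m, theta):
    """Q^n and J^n for the node-wise regression at node r.

    X: (n, p) samples with states in {1, ..., m}; theta[t] is the
    (m-1, m-1) block theta_{rt;jk} for t != r (node terms omitted for brevity).
    Parameters are ordered with the covariate index (t, k) outer and the
    class index l inner, so Q^n = (1/n) sum_i kron(z_i z_i^T, H_i).
    """
    n, p = X.shape
    others = [t for t in range(p) if t != r]
    d = len(others) * (m - 1)
    Qn = np.zeros((d * (m - 1), d * (m - 1)))
    Jn = np.zeros((d, d))
    for i in range(n):
        z = np.zeros(d)                       # indicator covariates I[x_t = k]
        for j, t in enumerate(others):
            if X[i, t] != m:
                z[j * (m - 1) + X[i, t] - 1] = 1.0
        logits = np.zeros(m)                  # conditional logits; state m is the reference
        for j, t in enumerate(others):
            if X[i, t] != m:
                logits[: m - 1] += theta[t][:, X[i, t] - 1]
        P = np.exp(logits - logits.max()); P /= P.sum()
        H = np.diag(P[: m - 1]) - np.outer(P[: m - 1], P[: m - 1])
        Qn += np.kron(np.outer(z, z), H) / n  # per-sample Hessian of the negative log-likelihood
        Jn += np.outer(z, z) / n
    return Qn, Jn
```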

3.1 Assumptions

We begin by stating the assumptions imposed on the true model. We note that similar sufficient conditions have been imposed in papers analyzing the Lasso [33] and block-regularization methods [22, 24].

(A1) Invertibility: $\lambda_{\min}\bigl(Q^*_{S_rS_r}\bigr) \geq C_{\min} > 0$.

(A2) Incoherence: $\bigl\|Q^*_{S^c_rS_r}\bigl(Q^*_{S_rS_r}\bigr)^{-1}\bigr\|_{\infty,2} \leq \frac{1-2\alpha}{\sqrt{d_r}}$ for some $\alpha \in (0, \tfrac{1}{2}]$.

(A3) Boundedness: $\lambda_{\max}(J^*) \leq D_{\max} < \infty$.

The next lemma states that imposing these assumptions on the population quantities implies analogous conditions on the sample statistics with high probability.

Lemma 1. Assumptions (A1)-(A3) on the population Fisher information matrix yield the following (analogous) properties of the empirical Fisher information matrix:

(B1) $\mathbb{P}\bigl[\lambda_{\min}\bigl(Q^n_{S_rS_r}\bigr) < C_{\min} - \epsilon\bigr] \leq 2\exp\bigl(-\tfrac{1}{8}(\epsilon\sqrt{n} - \sqrt{d_r})^2 + \log((m-1)^2 d_r)\bigr)$.

(B2) $\mathbb{P}\Bigl[\bigl\|Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1}\bigr\|_{\infty,2} > \frac{1-\alpha}{\sqrt{d_r}} + \epsilon\Bigr] \leq 6\exp\Bigl(-\tfrac{1}{8}\bigl(C_{\min}\bigl(\tfrac{\alpha}{3\sqrt{d_r}} + \epsilon\bigr)\sqrt{n} - \bigl(1 + \tfrac{\sqrt{d_r}}{C^2_{\min}\sqrt{n}}\bigr)\sqrt{d_r}\bigr)^2 + \log((m-1)^2(p-1))\Bigr)$.

(B3) $\mathbb{P}\bigl[\lambda_{\max}(J^n) > D_{\max} + \epsilon\bigr] \leq 2\exp\bigl(-\tfrac{1}{8}(\epsilon\sqrt{n} - \sqrt{d_r})^2 + \log((m-1)^2 d_r)\bigr)$.

3.2 Main Theorem

We can now state our main result on the sparsistency of the group-sparse regularized estimator.

Theorem 1. Consider a pairwise discrete graphical model of the form (3) with parameters $\theta^*$ and associated edge set $E^*$ such that conditions (A1)-(A3) are satisfied. Suppose the regularization parameter satisfies
$$\lambda_n \geq \frac{8(2-\alpha)}{\alpha}\Bigl(\sqrt{\frac{\log(p-1)}{n}} + \frac{m-1}{4\sqrt{n}}\Bigr). \qquad (8)$$
Then there exist positive constants $K$, $c_1$ and $c_2$ such that if the number of samples scales as
$$n \geq K(m-1)^2 d_r^{\,2} \log\bigl((m-1)^2(p-1)\bigr), \qquad (9)$$
then with probability at least $1 - c_1\exp(-c_2\lambda_n^2 n)$ we are guaranteed that:

(a) For each node $r \in V$, the $\ell_1/\ell_2$-regularized logistic regression (6) has a unique solution, and hence specifies a neighborhood $\widehat{\mathcal{N}}(r)$.

(b) For each node $r \in V$, the estimate correctly excludes all edges not in the true neighborhood $\mathcal{N}(r)$. Moreover, it includes all edges $(r,t)$ such that $\|\theta^*_{rt}\|_2 \geq \frac{10}{C_{\min}}\lambda_n$.

Before sketching the proof outline, we first state some lemmas characterizing the solutions of (6).

Lemma 2 (Optimality Conditions). Any optimal primal-dual pair $(\widehat{\theta}_{\setminus r}, \widehat{Z}_{\setminus r})$ of (6) satisfies:

1. (Stationarity) $\nabla\ell(\widehat{\theta}_{\setminus r}) + \lambda_n \widehat{Z}_{\setminus r} = 0$.  (10)

2. (Dual feasibility) $\widehat{Z}_{\setminus r}$ is an element of the subdifferential $\partial\|\widehat{\theta}_{\setminus r}\|_{1,2}$, so that for any $u \in V \setminus \{r\}$:
(a) if $(\widehat{\theta}_{\setminus r})_{u;jk} \neq 0$ for some $j, k$, then $(\widehat{Z}_{\setminus r})_u = \frac{(\widehat{\theta}_{\setminus r})_u}{\|(\widehat{\theta}_{\setminus r})_u\|_2}$;
(b) if the entire group $(\widehat{\theta}_{\setminus r})_u = 0$, then $\|(\widehat{Z}_{\setminus r})_u\|_2 \leq 1$.

The next lemma states that structure recovery is guaranteed if the dual is strictly feasible.

Lemma 3 (Strict Dual Feasibility). Suppose that there exists an optimal primal-dual pair $(\widehat{\theta}_{\setminus r}, \widehat{Z}_{\setminus r})$ for (6) such that $\|(\widehat{Z}_{\setminus r})_{S^c_r}\|_{\infty,2} < 1$. Then any optimal primal solution $\widetilde{\theta}_{\setminus r}$ satisfies $(\widetilde{\theta}_{\setminus r})_{S^c_r} = 0$. Moreover, if the Hessian sub-matrix $\bigl[\nabla^2\ell(\widehat{\theta}_{\setminus r})\bigr]_{S_rS_r} \succ 0$, then $\widehat{\theta}_{\setminus r}$ is the unique optimal solution.

We are now ready to sketch the proof of Theorem 1.

Proof. Part (a). The proof proceeds by a primal-dual witness technique, and consists of the construction of a feasible primal-dual pair in the following two steps:

(i) Primal candidate via an oracle subproblem: let $\widehat{\theta}_{\setminus r}$ be the optimal solution of the restricted problem
$$\widehat{\theta}_{\setminus r} = \arg\min_{(\theta_{\setminus r})_{S^c_r} = 0} \bigl\{\ell(\theta_{\setminus r}; \mathcal{D}) + \lambda_n\|\theta_{\setminus r}\|_{1,2}\bigr\}. \qquad (11)$$

(ii) Dual candidate from the stationarity condition: for any group $u \in S_r$, set $(\widetilde{Z}_{\setminus r})_u = \frac{(\widehat{\theta}_{\setminus r})_u}{\|(\widehat{\theta}_{\setminus r})_u\|_2}$, and set $(\widetilde{Z}_{\setminus r})_{S^c_r}$ from the stationarity condition (10).

Showing strict dual feasibility. By construction, the pair $(\widehat{\theta}_{\setminus r}, \widetilde{Z}_{\setminus r})$ satisfies the stationarity condition (10). It remains to show that the dual $\widetilde{Z}_{\setminus r}$ is strictly feasible. We show that this holds, and also that the solution is unique, with high probability in Lemma 5.

Part (b). By the uniqueness of the solution shown in part (a), the method excludes all edges that are not in the true edge set. To show that all dominant edges are included, i.e. to show correct sign recovery, it suffices to show that
$$\bigl\|\widehat{\theta}_{S_r} - \theta^*_{S_r}\bigr\|_{\infty,2} \leq \frac{\theta_{\min}}{2}, \quad \text{where } \theta_{\min} = \min_{t \in \mathcal{N}(r)} \|\theta^*_{rt}\|_2.$$
We provide an $\|\cdot\|_{\infty,2}$ bound on the error in (21), from which
$$\frac{2}{\theta_{\min}}\bigl\|\widehat{\theta}_{S_r} - \theta^*_{S_r}\bigr\|_{\infty,2} \leq \frac{2}{\theta_{\min}}\cdot\frac{5}{C_{\min}}\lambda_n \leq 1,$$
provided that $\theta_{\min} \geq \frac{10}{C_{\min}}\lambda_n$.

4 Higher-Order Discrete Graphical Models

Consider the general higher-order MRF from (2),
$$\mathbb{P}(x) \propto \exp\Bigl\{\sum_{C \in \mathcal{C};\, v \in \{1, \ldots, m-1\}^{|C|}} \theta^*_{C;v}\, \mathbb{I}[x_C = v]\Bigr\},$$
parameterized by the collection of vectors $\theta^*_C \in \mathbb{R}^{(m-1)^{|C|}}$ associated with the cliques $C \in \mathcal{C}$.

As before, we fix a node $r$, and define the long vector $\theta^*_{\setminus r} \in \mathbb{R}^{\sum_{j=1}^{c-1}\binom{p-1}{j}(m-1)^{j+1}}$ as the concatenation of the parameter vectors $\theta^*_{rC}$ for all $C \subseteq V \setminus \{r\}$ with $|C| < c$. Let $\theta^*_P \in \mathbb{R}^{(m-1)^2 d_r}$ be the vector containing only the neighbor-pairwise parameters $\theta^*_{rt;jk}$ for $t \in \mathcal{N}(r)$. Accordingly, let $\theta^*_{P^c}$ denote all of the non-zero non-pairwise entries.

Hierarchical Models. A common assumption imposed on such higher-order MRFs is that they be hierarchical models [16]. Specifically, any MRF of the form (2) is hierarchical if, for any clique $C$, $\theta^*_C = 0$ implies that $\theta^*_B = 0$ for every clique $B \supseteq C$ containing $C$. For example, a non-zero three-way factor $\theta^*_{stu}$ then requires that the pairwise factors $\theta^*_{st}$, $\theta^*_{su}$ and $\theta^*_{tu}$ all be non-zero. This has an important consequence: the set of pairwise effects,
$$\mathcal{N}(r) = \bigl\{u \in V \setminus \{r\} : \|\theta^*_{ru}\|_2 \neq 0\bigr\},$$
completely characterizes the set of edges.

Thus, if we are able to estimate just the pairwise parameters of the entire higher-order model, we would still be able to recover the edge set. We therefore study the estimator in (6), but now when the observations are actually drawn from the higher-order model with parameters $\theta^*_{\setminus r}$. The hope is that this solution still estimates the pairwise parameters of the underlying higher-order model well.

4.1 Assumptions

For fixed positive values $C_{\min}$, $D_{\max}$ and $\alpha \in (0, \tfrac{1}{2}]$, let $\delta := \frac{D_{\max}}{C_{\min}}\|\theta^*_{P^c}\|_1$ and $\tilde{\alpha} = \frac{\alpha + \delta(\sqrt{d_r}+1)}{1+\delta}$. We impose the following assumptions on the true model:

(C0) Mismatch factor: $\delta \leq \frac{\alpha}{2-\alpha}\cdot\frac{C_{\min}}{100\sqrt{2}\,(m-1)\,d_r}$.
This condition is required because of the mismatch between the true underlying model and our pairwise model. In other words, we have a non-zero-mean noise term, caused by model mismatch, that needs to be small. Moreover, since $C_{\min} \leq (m-1)\sqrt{d_r}$ (see Section 7.2), this condition ensures that $\tilde{\alpha} \in (0, \tfrac{1}{2}]$ for a suitable choice of $\alpha$.

(C1) Invertibility: $\lambda_{\min}\Bigl(\bigl[\mathbb{E}\bigl(-\nabla^2\log\mathbb{P}_{\theta^*_{\setminus r}}(X_r \mid X_{\setminus r})\bigr)\bigr]_{S_rS_r}\Bigr) \geq C_{\min}(1+\delta)$.

(C2) Incoherence: let $Q^* := \mathbb{E}\bigl[-\nabla^2\log\mathbb{P}_{\theta^*_{\setminus r}}(X_r \mid X_{\setminus r})\bigr]$. Then $\bigl\|Q^*_{S^c_rS_r}(Q^*_{S_rS_r})^{-1}\bigr\|_{\infty,2} \leq \frac{1-2\tilde{\alpha}}{\sqrt{d_r}}$.

(C3) Boundedness: $\lambda_{\max}(J^*) \leq D_{\max} < \infty$, where $J^*$ collects the second moments $\mathbb{E}\bigl[\mathbb{I}[X_{S_1} = x_{S_1}]\,\mathbb{I}[X_{S_2} = x_{S_2}]^T\bigr]$ over subsets of nodes $S_1, S_2$ of size less than $c$, with $c$ the size of the maximum clique in the true graphical model. Note that $J^* \in \mathbb{R}^{\sum_{j=1}^{c-1}(m-1)^j(p-1)^j \times \sum_{j=1}^{c-1}(m-1)^j(p-1)^j}$.

As in the pairwise case, in the proofs (to control the derivative of the Hessian of the log-likelihood function) we need to bound the maximum eigenvalue of the matrix $\bar{J}^* = J^* \otimes \mathbf{1}_{\sum_{j=1}^{c-1}(m-1)^j \times \sum_{j=1}^{c-1}(m-1)^j}$. But again, by properties of Kronecker products, $\lambda_{\max}(\bar{J}^*) = \lambda_{\max}(J^*)$, so that it suffices to impose assumptions on $J^*$.

The next lemma states that imposing these assumptions on the population quantities implies analogous conditions on the sample statistics with high probability. Define $D = \sum_{j=1}^{c-1} d_r^{\,j}$.

Lemma 4. Assumptions (C0)-(C3) imply the following bounds on the pairwise quantities:

(D1) $\mathbb{P}\Bigl[\lambda_{\min}\bigl(\bigl[\nabla^2\ell(\theta^*_P; \mathcal{D})\bigr]_{S_rS_r}\bigr) \leq C_{\min} - \epsilon\Bigr] \leq 2\exp\Bigl(-\tfrac{1}{8}\bigl(\epsilon\sqrt{n} - \sqrt{D}\bigr)^2 + \log\bigl(\sum_{j=2}^{c}(m-1)^j d_r^{\,j-1}\bigr)\Bigr)$.

(D2) $\mathbb{P}\Bigl[\bigl\|\bigl[\nabla^2\ell(\theta^*_P;\mathcal{D})\bigr]_{S^c_rS_r}\bigl[\nabla^2\ell(\theta^*_P;\mathcal{D})\bigr]^{-1}_{S_rS_r}\bigr\|_{\infty,2} > \frac{1-\alpha}{\sqrt{d_r}} + \epsilon\Bigr] \leq 6\exp\Bigl(-\tfrac{1}{8}\bigl(C_{\min}\bigl(\tfrac{\alpha}{3\sqrt{D}} + \epsilon\bigr)\sqrt{n} - \bigl(1 + \tfrac{\sqrt{D}}{C^2_{\min}\sqrt{n}}\bigr)\sqrt{D}\bigr)^2 + \log\bigl(\sum_{j=2}^{c}(m-1)^j(p-1)^{j-1}\bigr)\Bigr)$.

(D3) $\mathbb{P}\bigl[\lambda_{\max}(J^n) \geq D_{\max} + \epsilon\bigr] \leq 2\exp\Bigl(-\tfrac{1}{8}\bigl(\epsilon\sqrt{n} - \sqrt{D}\bigr)^2 + \log\bigl(\sum_{j=2}^{c}(m-1)^j d_r^{\,j-1}\bigr)\Bigr)$, where $J^n$ is the empirical counterpart of $J^*$.

4.2 Main Theorem

The following theorem shows that if the graphical model satisfies the hierarchical assumption, then pairwise estimation exactly recovers the underlying graph provided that the higher-order dependency parameters are not too large.

Theorem 2. Consider an $m$-ary graphical model with parameters $\theta^*$ and associated edge set $E^*$ such that conditions (C0)-(C3) and the hierarchical assumption are satisfied. Suppose the size of the largest clique in the graph is $c$ and the regularization parameter satisfies
$$\lambda_n \geq \frac{8(2-\alpha)}{\alpha}\Bigl(\sqrt{\frac{\log(p-1)}{n}} + \frac{m-1}{4\sqrt{n}} + \frac{1}{4}\|\theta^*_{P^c}\|_1\Bigr).$$
Then there exist positive constants $K$, $c_1$ and $c_2$ such that for
$$n \geq K(m-1)^2 d_r^{\,\frac{3}{2}c-1} \log\bigl((m-1)^c(p-1)^{c-1}\bigr),$$
with probability at least $1 - c_1\exp\bigl(-c_2(\lambda_n - 2\|\theta^*_{P^c}\|_1)^2 n\bigr)$ we are guaranteed that:

(a) For each node $r \in V$, the $\ell_1/\ell_2$-regularized logistic regression (6) has a unique solution, and hence specifies a neighborhood $\widehat{\mathcal{N}}(r)$.

(b) For each node $r \in V$, the estimate correctly excludes all edges not in the true neighborhood $\mathcal{N}(r)$. Moreover, it includes all edges $(r,t)$ such that $\|\theta^*_{rt}\|_2 \geq \frac{10}{C_{\min}}\lambda_n$.

Proof. Part (a). The proof proceeds along the same lines as that of Theorem 1. We construct a primal-dual pair precisely as before, using an oracle subproblem. However, showing strict dual feasibility is more delicate when the true model has higher-order factors.

Showing strict dual feasibility. By construction, the pair $(\widehat{\theta}_{\setminus r}, \widetilde{Z}_{\setminus r})$ satisfies the stationarity condition (10), as before. We then show that the dual $\widetilde{Z}_{\setminus r}$ is strictly feasible, and also that the solution is unique, with high probability in Lemma 7.

Part (b). Here again, we can argue as in the proof of Theorem 1 to show that all correct edges are included, given the $\|\cdot\|_{\infty,2}$ bound on the error provided in (24).

[Figure 1: Probability of success $\mathbb{P}[\widehat{\mathcal{N}}(r) = \mathcal{N}(r)]$ versus the control parameter $\beta(n,p,d) = \frac{n}{10\, d \log p}$ for discrete graphical models on (a) a line graph (p = 16, node 4; p = 32, node 8) and (b) a grid.]

5 Experiments

In this section, we report a set of synthetic experiments investigating the consequences of the main theorems. These results illustrate the behavior of the structure learning algorithm on various types of graphs. We fix the size of the alphabet to $m = 3$. For a given graph type, we pick a pairwise parameter set $\theta^*$ and generate $n$ samples according to the corresponding probability distribution. We then solve (6) and compare the graph corresponding to the solution with the original graph. If the two graphs are identical, we declare that the algorithm has succeeded.

Pairwise Model: We consider two different classes of graphs: line graphs and grids (Fig. 2). In particular, we consider line graphs of size $p = 16, 32$ and a $\sqrt{p} \times \sqrt{p}$ grid with $p = 16$. In each of these cases, the parameter vector $\theta^*$ is generated by setting each non-zero entry $\theta^*_{rt;jk}$ uniformly at random in $[-0.5, 0.5]$ for the line graphs and in $[0, 5]$ for the grid.

[Figure 2: The line graph (a) and grid (b) are used in studying pairwise graphical model selection; the three-way graph (c) is used for studying higher-order graphical model selection.]

[Figure 3: Probability of success $\mathbb{P}[\widehat{\mathcal{N}}(r) = \mathcal{N}(r)]$ versus the control parameter $\beta(n,p,d) = \frac{n}{10\, d \log p}$ for a higher-order discrete graphical model on the three-way graph.]

To estimate the probability of success, we use 15 batches of samples drawn from the distribution specified by $\theta^*$. We consider two types of simulations:

Neighborhood Recovery: Here, we focus on the recovery of the neighborhood of a particular node in a graph. Fixing a sample batch, for each pair $(p, n)$ we set $\lambda_n = K\bigl(\sqrt{\frac{p-1}{n}} + \frac{m-1}{4\sqrt{n}}\bigr)$, where $K$ is a constant chosen by cross-validation. We compare the graph induced by the estimate $\widehat{\theta}$ with the graph induced by $\theta^*$ to obtain the probability of success. Fig. 1 shows the probability of success in neighborhood recovery. Notice that for different values of $n$ and $p$, the phase-transition curves stack on top of each other; this shows that the scaling of the sample size $n$ is correct.

Higher-Order Model: In this case, we consider a graph with higher-order dependencies and try to estimate it using the pairwise model. We consider the three-way graph (a triangle plus a line graph of size $p-2$) shown in Fig. 2(c). There is only one three-way factor, involving three nodes; the rest of the graph is characterized by pairwise parameters. Solving the pairwise program (6), we investigate the probability of success for neighborhood recovery of the node that connects the line graph and the triangle. Fig. 3 illustrates the result; a sketch of the simulation protocol follows.
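A minimal sketch of this simulation protocol (ours, not the authors' code): samples could be drawn from a pairwise model by Gibbs sampling, the node-wise program (6) fit on each batch (for instance with the `fit_node` sketch from Section 2.1), and the success frequency recorded over the 15 batches. The sampler settings and helper names below are assumptions.

```python
import numpy as np

def gibbs_sample(theta_edge, p, m, n, burn=200, rng=None):
    """Draw n samples from a pairwise m-ary MRF by single-site Gibbs sampling."""
    rng = np.random.default_rng(rng)
    x = rng.integers(1, m + 1, size=p)
    samples = []
    for it in range(burn + n):
        for s in range(p):
            logits = np.zeros(m)              # conditional logits; state m is the reference
            for (a, b), blk in theta_edge.items():
                if s in (a, b):
                    t = b if s == a else a
                    if x[t] != m:
                        for j in range(1, m):
                            logits[j - 1] += blk[j - 1, x[t] - 1] if s == a else blk[x[t] - 1, j - 1]
            pr = np.exp(logits - logits.max()); pr /= pr.sum()
            x[s] = rng.choice(np.arange(1, m + 1), p=pr)
        if it >= burn:
            samples.append(x.copy())
    return np.array(samples)

def success_probability(theta_edge, p, m, r, true_nbhd, n, lam, batches=15):
    hits = 0
    for b in range(batches):
        X = gibbs_sample(theta_edge, p, m, n, rng=b)
        _, _, nbhd = fit_node(X, r, m, lam)   # fit_node: sketch from Section 2.1
        hits += (nbhd == true_nbhd)
    return hits / batches
```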


References

[1] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7:1743-1788, 2006.

[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

[3] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: Some easy observations and algorithms. http://front.math.ucdavis.edu/0712.1402, 2008.

[4] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 2006.

[5] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Info. Theory, 14(3):462-467, 1968.

[6] G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25-39, 1983.

[7] I. Csiszar and Z. Talata. Consistent estimation of the basic neighborhood structure of Markov random fields. The Annals of Statistics, 34(1):123-145, 2006.

[8] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567-1594, 2008.

[9] C. Dahinden, G. Parmigiani, M. C. Emerick, and P. Buhlmann. Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries.

[10] C. Dahinden, M. Kalisch, and P. Buhlmann. Decomposition and model selection for large contingency tables. Biometrical Journal, 52(2):233-252, 2010.

[11] S. Dasgupta. Learning polytrees. In Uncertainty in Artificial Intelligence, pages 134-141, 1999.

[12] D. Donoho and M. Elad. Maximal sparsity representation via $\ell_1$ minimization. Proc. Natl. Acad. Sci., 100:2197-2202, March 2003.

[13] M. Hassner and J. Sklansky. The use of Markov random fields as models of texture. Comp. Graphics Image Proc., 12:357-370, 1980.

[14] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift fur Physik, 31:253-258, 1925.

[15] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6):957-968, 2005.

[16] S. L. Lauritzen. Graphical Models. Oxford University Press, Oxford, 1996.

[17] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using L1-regularization. In Neural Information Processing Systems (NIPS) 19, 2007.

[18] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009.

[19] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[20] L. Meier, S. van de Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70:53-71, 2008.

[21] N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3), 2006.

[22] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of $\ell_{1,\infty}$-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.

[23] A. Y. Ng. Feature selection, $\ell_1$ vs. $\ell_2$ regularization, and rotational invariance. In International Conference on Machine Learning, 2004.

[24] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.

[25] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B, 2009.

[26] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 38(3):1287-1319, 2010.

[27] B. D. Ripley. Spatial Statistics. Wiley, New York, 1981.

[28] A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008.

[29] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, 2000.

[30] N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial Intelligence, 143(1):123-138, 2003.

[31] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Trans. Info. Theory, 51(3):1030-1051, March 2006.

[32] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349-363, 2005.

[33] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

[34] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Info. Theory, to appear. Original version: UC Berkeley Technical Report 709, May 2006.

[35] J. W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846-850, October 1978.

[36] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.

[37] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541-2567, 2007.


Supplementary Material

6 Auxiliary Lemmas: Proof of Lemma 3

Proof. We can rewrite (6) as an optimization problem over the $\ell_1/\ell_2$ ball of radius $C$, for some $C(\lambda_n) < \infty$. Since $\lambda_n > 0$, by the KKT conditions, $\|\widetilde{\theta}_{\setminus r}\|_{1,2} = C$ for any optimal primal solution $\widetilde{\theta}_{\setminus r}$.

By the definition of the $\ell_1/\ell_2$ subdifferential, we know that for any group $u \in V \setminus \{r\}$ we have $\|(\widehat{Z}_{\setminus r})_u\|_2 \leq 1$. Considering the necessary optimality condition $\nabla\ell(\widehat{\theta}_{\setminus r}) + \lambda_n\widehat{Z}_{\setminus r} = 0$ and the complementary slackness condition, we have $\langle\widetilde{\theta}_{\setminus r}, \widehat{Z}_{\setminus r}\rangle - C = \langle\widetilde{\theta}_{\setminus r}, \widehat{Z}_{\setminus r}\rangle - \|\widetilde{\theta}_{\setminus r}\|_{1,2} = 0$. Now, if for some group $u \in V \setminus \{r\}$ we had $\|(\widehat{Z}_{\setminus r})_u\|_2 < 1$ and $(\widetilde{\theta}_{\setminus r})_u \neq 0$, then this would contradict the condition $\langle\widetilde{\theta}_{\setminus r}, \widehat{Z}_{\setminus r}\rangle = \|\widetilde{\theta}_{\setminus r}\|_{1,2}$.

For the restricted problem, if the Hessian sub-matrix is positive definite, then the problem is strictly convex and it has a unique solution.

7 Derivatives of the Log-Likelihood Function

In this section, we point out the key properties of the gradient, Hessian and derivative of the Hessian of the log-likelihood function. These properties are used to prove the concentration lemmas.

7.1 Gradient

By a simple derivation (recalling that $\ell$ is the negative conditional log-likelihood), we have
$$\frac{\partial}{\partial\theta_{rt;\ell k}}\ell^{(i)}(\theta_{\setminus r}; \mathcal{D}) = \mathbb{I}\bigl[x^{(i)}_t = k\bigr]\Bigl(\mathbb{P}_{\theta_{\setminus r}}\bigl(X_r = \ell \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr) - \mathbb{I}\bigl[x^{(i)}_r = \ell\bigr]\Bigr).$$
It is easy to show that $\mathbb{E}_{\theta^*_{\setminus r}}\bigl[\frac{\partial}{\partial\theta_{rt;\ell k}}\ell^{(i)}(\theta^*_{\setminus r}; \mathcal{D})\bigr] = 0$ and $\mathrm{Var}\bigl(\frac{\partial}{\partial\theta_{rt;\ell k}}\ell^{(i)}(\theta^*_{\setminus r}; \mathcal{D})\bigr) \leq \tfrac{1}{4}$. With the i.i.d. assumption on the drawn samples, we have $\mathrm{Var}\bigl(\frac{\partial}{\partial\theta_{rt;\ell k}}\ell(\theta^*_{\setminus r}; \mathcal{D})\bigr) = \mathrm{Var}\bigl(\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta_{rt;\ell k}}\ell^{(i)}(\theta^*_{\setminus r}; \mathcal{D})\bigr) \leq \tfrac{1}{4n}$. Hence, for a fixed $t \in V \setminus \{r\}$, by Jensen's inequality,
$$\mathbb{E}_{\theta^*_{\setminus r}}\Bigl[\bigl\|\nabla_{\theta_{rt}}\ell(\theta^*_{\setminus r}; \mathcal{D})\bigr\|_2\Bigr] \leq \sqrt{\mathbb{E}_{\theta^*_{\setminus r}}\Bigl[\bigl\|\nabla_{\theta_{rt}}\ell(\theta^*_{\setminus r}; \mathcal{D})\bigr\|_2^2\Bigr]} \leq \frac{m-1}{2\sqrt{n}}.$$
Considering the entries associated with the $\theta_{rt;\ell k}$ in the gradient vector of the log-likelihood function, for a fixed $t \in V \setminus \{r\}$ only $m-1$ (out of $(m-1)^2$) values are non-zero. By a simple calculation, we get
$$\max_{t \in V \setminus \{r\}}\bigl\|\nabla_{\theta_{rt}}\ell^{(i)}(\theta^*_{\setminus r}; \mathcal{D})\bigr\|_2 \leq \sqrt{2} \quad \forall i.$$
By the Azuma-Hoeffding inequality, we get
$$\mathbb{P}\Bigl[\bigl\|\nabla_{\theta_{rt}}\ell(\theta^*_{\setminus r}; \mathcal{D})\bigr\|_2 > \frac{m-1}{2\sqrt{n}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{\epsilon^2 n}{4}\Bigr),$$
for all $t \in V \setminus \{r\}$. Using the union bound, we get
$$\mathbb{P}\Bigl[\max_{t \in V \setminus \{r\}}\bigl\|\nabla_{\theta_{rt}}\ell(\theta^*_{\setminus r}; \mathcal{D})\bigr\|_2 > \frac{m-1}{2\sqrt{n}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{\epsilon^2 n}{4} + \log(p-1)\Bigr). \qquad (12)$$
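Gradient formulas of this kind are easy to sanity-check numerically. The sketch below (illustrative, with assumed names and node terms $\theta_{r;j}$ set to zero) compares the analytic per-sample gradient of the negative conditional log-likelihood at one coordinate $\theta_{rt;\ell k}$ against a centered finite-difference approximation.

```python
import numpy as np

def cond_probs(theta, x, r, m):
    """P(X_r = l | x_{\\r}) for l = 1, ..., m under (5), with node terms set to zero.

    theta[t] is an (m-1, m-1) array whose [l-1, k-1] entry is theta_{rt;lk}.
    """
    logits = np.zeros(m)                      # state m has logit 0 (reference state)
    for t, blk in theta.items():
        if x[t] != m:
            logits[: m - 1] += blk[:, x[t] - 1]
    pr = np.exp(logits - logits.max())
    return pr / pr.sum()

rng = np.random.default_rng(0)
p, m, r = 5, 3, 0
theta = {t: rng.normal(size=(m - 1, m - 1)) for t in range(1, p)}
x = rng.integers(1, m + 1, size=p)

t, l, k = 2, 1, 2                            # coordinate theta_{rt;lk} to check
P = cond_probs(theta, x, r, m)
analytic = (x[t] == k) * (P[l - 1] - (x[r] == l))   # gradient of -log P(X_r = x_r | x_{\r})

eps = 1e-6
theta[t][l - 1, k - 1] += eps
nll_plus = -np.log(cond_probs(theta, x, r, m)[x[r] - 1])
theta[t][l - 1, k - 1] -= 2 * eps
nll_minus = -np.log(cond_probs(theta, x, r, m)[x[r] - 1])
theta[t][l - 1, k - 1] += eps                # restore the parameter
print(analytic, (nll_plus - nll_minus) / (2 * eps))  # the two values should agree closely
```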

7.2 Hessian

For the Hessian of the (negative) log-likelihood function, we have
$$\frac{\partial^2\ell^{(i)}(\theta_{\setminus r}; \mathcal{D})}{\partial\theta_{rt_2;\ell_2 k_2}\,\partial\theta_{rt_1;\ell_1 k_1}} = \mathbb{I}\bigl[x^{(i)}_{t_1} = k_1\bigr]\,\mathbb{I}\bigl[x^{(i)}_{t_2} = k_2\bigr]\,\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr),$$
where
$$\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr) := \mathbb{P}_{\theta_{\setminus r}}\bigl(X_r = \ell_1 \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr)\Bigl(\mathbb{I}[\ell_1 = \ell_2] - \mathbb{P}_{\theta_{\setminus r}}\bigl(X_r = \ell_2 \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr)\Bigr).$$
Consider the zero-mean random variable
$$Z^{(i)}_{t_1\ell_1k_1;t_2\ell_2k_2} := \frac{\partial^2\ell^{(i)}(\theta_{\setminus r}; \mathcal{D})}{\partial\theta_{rt_2;\ell_2 k_2}\,\partial\theta_{rt_1;\ell_1 k_1}} - \mathbb{E}\Bigl[\frac{\partial^2\ell(\theta_{\setminus r}; \mathcal{D})}{\partial\theta_{rt_2;\ell_2 k_2}\,\partial\theta_{rt_1;\ell_1 k_1}}\Bigr].$$
Notice that $\mathrm{Var}\bigl(Z^{(i)}_{t_1\ell_1k_1;t_2\ell_2k_2}\bigr) \leq 1$ and consequently, by the i.i.d. assumption, $\mathrm{Var}\bigl(\frac{1}{n}\sum_{i=1}^n Z^{(i)}_{t_1\ell_1k_1;t_2\ell_2k_2}\bigr) \leq \frac{1}{n}$. Hence, for fixed values $t_1, \ell_1, k_1$ and $t_2 \in S_2 \subseteq V \setminus \{r\}$, we have
$$\mathbb{E}_{\theta^*_{\setminus r}}\Bigl[\Bigl\|\frac{1}{n}\sum_{i=1}^n Z^{(i)}_{t_1\ell_1k_1;t_2\cdot\cdot}\Bigr\|_2\Bigr] \leq \sqrt{\mathbb{E}_{\theta^*_{\setminus r}}\Bigl[\Bigl\|\frac{1}{n}\sum_{i=1}^n Z^{(i)}_{t_1\ell_1k_1;t_2\cdot\cdot}\Bigr\|_2^2\Bigr]} \leq \sqrt{\frac{|S_2|}{n}}. \qquad (13)$$
This random variable, for fixed values $t_1, \ell_1, k_1$ and a fixed $t_2$, is bounded; in particular, $\bigl\|\frac{1}{n}\sum_{i=1}^n Z^{(i)}_{t_1\ell_1k_1;t_2\cdot\cdot}\bigr\|_2 \leq 2$. By the Azuma-Hoeffding inequality and the union bound,
$$\mathbb{P}\Bigl[\bigl\|Q^n_{S_rS_r} - Q^*_{S_rS_r}\bigr\|_{\infty,2} > \frac{\sqrt{d_r}}{\sqrt{n}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{\epsilon^2 n}{8} + \log\bigl((m-1)^2 d_r\bigr)\Bigr),$$
$$\mathbb{P}\Bigl[\bigl\|Q^n_{S^c_rS_r} - Q^*_{S^c_rS_r}\bigr\|_{\infty,2} > \frac{\sqrt{d_r}}{\sqrt{n}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{\epsilon^2 n}{8} + \log\bigl((m-1)^2(p - d_r - 1)\bigr)\Bigr). \qquad (14)$$
A similar analysis as for (13), combined with the inequality $\lambda_{\max}(\cdot) \leq \|\cdot\|_{\infty,2}$, shows that
$$\mathbb{P}\Bigl[\lambda_{\max}\bigl(Q^n_{S_rS_r} - Q^*_{S_rS_r}\bigr) > \frac{\sqrt{d_r}}{\sqrt{n}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{\epsilon^2 n}{8} + \log\bigl((m-1)^2 d_r\bigr)\Bigr). \qquad (15)$$
We also need control over the deviation of the inverse sample Fisher information matrix from the inverse of its mean. We have
$$\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1} - \bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr) = \lambda_{\max}\Bigl(\bigl(Q^*_{S_rS_r}\bigr)^{-1}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr)\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr)$$
$$\leq \lambda_{\max}\Bigl(\bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr)\,\lambda_{\max}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr)\,\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr) \leq \frac{\sqrt{d_r}}{C_{\min}\sqrt{n}}\,\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr).$$
By part (B1) of Lemma 1, we have
$$\mathbb{P}\Bigl[\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr) > \frac{1}{C_{\min}} + \epsilon\Bigr] \leq 2\exp\Bigl(-\frac{1}{8}\Bigl(\frac{C_{\min}\epsilon\sqrt{n}}{1 + C_{\min}\epsilon} - \sqrt{d_r}\Bigr)^2 + \log\bigl((m-1)^2 d_r\bigr)\Bigr). \qquad (16)$$
Hence, we get
$$\mathbb{P}\Bigl[\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1} - \bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr) > \frac{\sqrt{d_r}}{C^2_{\min}\sqrt{n}} + \epsilon\Bigr] \leq 4\exp\Bigl(-\frac{1}{8}\Bigl(\frac{C_{\min}\epsilon\sqrt{n}}{1 + C_{\min}\epsilon} - \sqrt{d_r}\Bigr)^2 + \log\bigl((m-1)^2 d_r\bigr)\Bigr). \qquad (17)$$

7.3 Derivative of the Hessian

We want to bound the rate of change of the entries of the Hessian matrix. Let
$$\nabla Q^{(i)}_{t_2\ell_2k_2;t_1\ell_1k_1} := \frac{\partial}{\partial\theta_{\setminus r}}\,\frac{\partial^2\ell^{(i)}(\theta_{\setminus r}; \mathcal{D})}{\partial\theta_{rt_2;\ell_2k_2}\,\partial\theta_{rt_1;\ell_1k_1}} = \mathbb{I}\bigl[x^{(i)}_{t_1} = k_1\bigr]\,\mathbb{I}\bigl[x^{(i)}_{t_2} = k_2\bigr]\,\frac{\partial}{\partial\theta_{\setminus r}}\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr).$$
Recalling the definition of $\eta(\cdot)$ from Section 7.2, we have
$$\frac{\partial\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr)}{\partial\theta_{rt_3;\ell_3k_3}} = \mathbb{I}\bigl[x^{(i)}_{t_3} = k_3\bigr]\,\mathbb{P}_{\theta_{\setminus r}}\bigl(X_r = \ell_1 \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr)\left[\frac{\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr)\,\eta_{\ell_1\ell_3}\bigl(x^{(i)}\bigr)}{\mathbb{P}_{\theta_{\setminus r}}\bigl(X_r = \ell_1 \mid X_{\setminus r} = x^{(i)}_{\setminus r}\bigr)^2} - \eta_{\ell_2\ell_3}\bigl(x^{(i)}\bigr)\right].$$
For any $t_3 \in V \setminus \{r\}$, each entry is bounded by $\tfrac{1}{2}$ and there are only $m-1$ non-zero entries for each $k_3$. Hence, for any $t_3$, one can conclude that $\bigl\|\frac{\partial}{\partial\theta_{rt_3}}\eta_{\ell_1\ell_2}(x^{(i)})\bigr\|_2 \leq \frac{m-1}{\sqrt{2}}$ for all $i$. Finally, for all $\ell_1$ and $\ell_2$ we have
$$\max_{t_3 \in V \setminus \{r\}}\Bigl\|\frac{\partial}{\partial\theta_{rt_3}}\eta_{\ell_1\ell_2}\bigl(x^{(i)}\bigr)\Bigr\|_2 \leq \frac{m-1}{\sqrt{2}}. \qquad (18)$$

8 Proof of Lemma 1

(B1) By the variational representation of the smallest eigenvalue, we have
$$\lambda_{\min}\bigl(Q^*_{S_rS_r}\bigr) = \min_{\|x\|_2 = 1} x^T Q^*_{S_rS_r} x \leq y^T Q^n_{S_rS_r} y + y^T\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr)y,$$
for all $y \in \mathbb{R}^{(m-1)^2 d_r}$ with $\|y\|_2 = 1$, and in particular for the unit-norm minimal eigenvector of $Q^n_{S_rS_r}$. Hence,
$$\lambda_{\min}\bigl(Q^n_{S_rS_r}\bigr) \geq \lambda_{\min}\bigl(Q^*_{S_rS_r}\bigr) - \lambda_{\max}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr).$$
By (15), we get
$$\mathbb{P}\bigl[\lambda_{\min}\bigl(Q^n_{S_rS_r}\bigr) < C_{\min} - \epsilon\bigr] \leq \mathbb{P}\bigl[\lambda_{\max}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr) > \epsilon\bigr] \leq 2\exp\Bigl(-\frac{(\epsilon\sqrt{n} - \sqrt{d_r})^2}{8} + \log\bigl((m-1)^2 d_r\bigr)\Bigr).$$

(B2) We can write
$$Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1} = \underbrace{Q^*_{S^c_rS_r}\bigl(Q^*_{S_rS_r}\bigr)^{-1}}_{T_0} + \underbrace{Q^*_{S^c_rS_r}\Bigl[\bigl(Q^n_{S_rS_r}\bigr)^{-1} - \bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr]}_{T_1} + \underbrace{\Bigl[Q^n_{S^c_rS_r} - Q^*_{S^c_rS_r}\Bigr]\bigl(Q^*_{S_rS_r}\bigr)^{-1}}_{T_2} + \underbrace{\Bigl[Q^n_{S^c_rS_r} - Q^*_{S^c_rS_r}\Bigr]\Bigl[\bigl(Q^n_{S_rS_r}\bigr)^{-1} - \bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr]}_{T_3}.$$
By assumption (A2), $\|T_0\|_{\infty,2} \leq \frac{1-2\alpha}{\sqrt{d_r}}$, and hence it suffices to show that $\|T_i\|_{\infty,2} \leq \frac{\alpha}{3\sqrt{d_r}}$ for $i = 1, 2, 3$. For the first term, we have
$$\|T_1\|_{\infty,2} = \Bigl\|Q^*_{S^c_rS_r}\bigl(Q^*_{S_rS_r}\bigr)^{-1}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr)\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr\|_{\infty,2}$$
$$\leq \Bigl\|Q^*_{S^c_rS_r}\bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr\|_{\infty,2}\,\lambda_{\max}\bigl(Q^*_{S_rS_r} - Q^n_{S_rS_r}\bigr)\,\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigr) \leq \frac{1-2\alpha}{\sqrt{d_r}}\cdot\frac{\sqrt{d_r}}{\sqrt{n}}\cdot\frac{1}{C_{\min}}.$$
The last inequality follows from (14) and (16) with high probability. Setting $\bar{C}_{\min} = \min(C_{\min}, 1)$ and applying the union bound,
$$\mathbb{P}\bigl[\|T_1\|_{\infty,2} > \epsilon\bigr] \leq 4\exp\Bigl(-\frac{1}{8}\Bigl(\bar{C}_{\min}\epsilon\sqrt{n} - \sqrt{d_r} - \frac{1-2\alpha}{\bar{C}_{\min}}\Bigr)^2 + \log\bigl((m-1)^2 d_r\bigr)\Bigr).$$
For the second term, we have
$$\|T_2\|_{\infty,2} \leq \bigl\|Q^n_{S^c_rS_r} - Q^*_{S^c_rS_r}\bigr\|_{\infty,2}\,\lambda_{\max}\Bigl(\bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr) \leq \frac{\sqrt{d_r}}{\sqrt{n}}\cdot\frac{1}{C_{\min}}.$$
The last inequality follows from (14) with high probability. Hence, we have
$$\mathbb{P}\bigl[\|T_2\|_{\infty,2} > \epsilon\bigr] \leq 2\exp\Bigl(-\frac{1}{8}\Bigl(\epsilon\sqrt{n} - \frac{(1+C_{\min})\sqrt{d_r}}{C_{\min}}\Bigr)^2 + \log\bigl((m-1)^2(p-1-d_r)\bigr)\Bigr).$$
For the third term, we have
$$\|T_3\|_{\infty,2} \leq \bigl\|Q^n_{S^c_rS_r} - Q^*_{S^c_rS_r}\bigr\|_{\infty,2}\,\lambda_{\max}\Bigl(\bigl(Q^n_{S_rS_r}\bigr)^{-1} - \bigl(Q^*_{S_rS_r}\bigr)^{-1}\Bigr) \leq \frac{\sqrt{d_r}}{\sqrt{n}}\cdot\frac{\sqrt{d_r}}{C^2_{\min}\sqrt{n}} = \frac{d_r}{C^2_{\min}\, n}.$$
The last inequality follows from (14) and (17). Hence, we have
$$\mathbb{P}\bigl[\|T_3\|_{\infty,2} > \epsilon\bigr] \leq 6\exp\Bigl(-\frac{1}{8}\Bigl(C_{\min}\epsilon\sqrt{n} - \Bigl(1 + \frac{\sqrt{d_r}}{C^2_{\min}\sqrt{n}}\Bigr)\sqrt{d_r}\Bigr)^2 + \log\bigl((m-1)^2(p-1-d_r)\bigr)\Bigr).$$
The result follows by substituting $\epsilon$ with $\frac{\alpha}{3\sqrt{d_r}}$.

(B3) We can write
$$\mathbb{P}\bigl[\lambda_{\max}(J^n) > D_{\max} + \epsilon\bigr] \leq \mathbb{P}\Bigl[\Bigl\|\frac{1}{n}\sum_{i=1}^n\bigl(J^{(i)} - J^*\bigr)\Bigr\|_F > \epsilon\Bigr].$$
Consequently, the same analysis as in part (B1) gives the result.

This concludes the proof of the lemma.

9 Sufficiency Lemmas for Pairwise Dependencies

Lemma 5. The constructed candidate primal-dual pair $(\widehat{\theta}_{\setminus r}, \widetilde{Z}_{\setminus r})$ satisfies the conditions of Lemma 3 with probability $1 - c_1\exp(-c_2 n)$ for some positive constants $c_1, c_2 \in \mathbb{R}$.

Proof. Using the mean-value theorem, for some $\bar{\theta}_{\setminus r}$ in the convex combination of $\widehat{\theta}_{\setminus r}$ and $\theta^*_{\setminus r}$, we have
$$\nabla^2\ell\bigl(\theta^*_{\setminus r}; \mathcal{D}\bigr)\bigl[\widehat{\theta}_{\setminus r} - \theta^*_{\setminus r}\bigr] = \nabla\ell\bigl(\widehat{\theta}_{\setminus r}; \mathcal{D}\bigr) - \nabla\ell\bigl(\theta^*_{\setminus r}; \mathcal{D}\bigr) + \Bigl[\nabla^2\ell\bigl(\theta^*_{\setminus r}; \mathcal{D}\bigr) - \nabla^2\ell\bigl(\bar{\theta}_{\setminus r}; \mathcal{D}\bigr)\Bigr]\bigl[\widehat{\theta}_{\setminus r} - \theta^*_{\setminus r}\bigr]$$
$$= -\lambda_n\widetilde{Z}_{\setminus r} + \underbrace{\bigl(-\nabla\ell\bigl(\theta^*_{\setminus r}; \mathcal{D}\bigr)\bigr)}_{W^n_{\setminus r}} + \underbrace{\Bigl[\nabla^2\ell\bigl(\theta^*_{\setminus r}; \mathcal{D}\bigr) - \nabla^2\ell\bigl(\bar{\theta}_{\setminus r}; \mathcal{D}\bigr)\Bigr]\bigl[\widehat{\theta}_{\setminus r} - \theta^*_{\setminus r}\bigr]}_{R^n_{\setminus r}}.$$
We can rewrite this set of equations as two sets of equations over $S_r$ and $S^c_r$. By Lemma 1, the Hessian sub-matrix on $S_r$ is invertible with high probability, and thus we get
$$Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigl[-\lambda_n\bigl(\widetilde{Z}_{\setminus r}\bigr)_{S_r} + \bigl(W^n_{\setminus r}\bigr)_{S_r} + \bigl(R^n_{\setminus r}\bigr)_{S_r}\Bigr] = -\lambda_n\bigl(\widetilde{Z}_{\setminus r}\bigr)_{S^c_r} + \bigl(W^n_{\setminus r}\bigr)_{S^c_r} + \bigl(R^n_{\setminus r}\bigr)_{S^c_r}.$$
Equivalently, we get
$$\bigl(\widetilde{Z}_{\setminus r}\bigr)_{S^c_r} = \frac{1}{\lambda_n}\Bigl[\bigl(W^n_{\setminus r}\bigr)_{S^c_r} - \bigl(R^n_{\setminus r}\bigr)_{S^c_r}\Bigr] - \frac{1}{\lambda_n}Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1}\Bigl[\bigl(W^n_{\setminus r}\bigr)_{S_r} - \bigl(R^n_{\setminus r}\bigr)_{S_r}\Bigr] + Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1}\bigl(\widetilde{Z}_{\setminus r}\bigr)_{S_r}.$$
Notice that $\bigl\|(\widetilde{Z}_{\setminus r})_{S_r}\bigr\|_{\infty,2} = 1$. Thus, we can establish the following bound:
$$\bigl\|\bigl(\widetilde{Z}_{\setminus r}\bigr)_{S^c_r}\bigr\|_{\infty,2} \leq \Bigl(1 + \bigl\|Q^n_{S^c_rS_r}\bigl(Q^n_{S_rS_r}\bigr)^{-1}\bigr\|_{\infty,2}\sqrt{d_r}\Bigr)\Bigl(\frac{\|W^n_{\setminus r}\|_{\infty,2}}{\lambda_n} + \frac{\|R^n_{\setminus r}\|_{\infty,2}}{\lambda_n} + 1\Bigr) - 1$$
$$\leq (2-\alpha)\Bigl(\frac{\alpha}{4(2-\alpha)} + \frac{\alpha}{4(2-\alpha)} + 1\Bigr) - 1 = 1 - \frac{\alpha}{2} < 1.$$
The second inequality holds with high probability according to Lemma 1 and Lemma 6.

Lemma 6. For quantities defined in the proof ofLemma 5, the following inequalities hold:

P

D

SE

&&&Wn\r

&&&&,2

$n! %

4(2 ( %)

F

TG

3 2 exp

H

IJ(

*#

4(2"#)$n4n ( m"1

2

+2

4+ log(p ( 1)

K

LM

P

D

SE

&&&Rn\r

&&&&,2

$n>

%

4(2 ( %)

F

TG

3 2 exp

H

IJ(

*#

4(2"#)$n4n ( m"1

2

+2

4+ log(p ( 1)

K

LM.

Proof. The first inequality follows directly from(12), for & = #

4(2"#)$n ( m"12'n, provided that

$n ! 2(2"#)#

m"1'n. This probability goes to zero, if

$n ! 8(2"#)#

<>log(p"1)

n + m"14'n

=.

Before we proceed, we want to point out a tech-nical fact that we will use it through the restof the proof. For $n achieves the lower boundmentioned above, any positive value K and

n ! 1K2

64(2"#)2

#2

*Rlog(p ( 1) + m"1

4

+2d 2r , we

have $ndr 3 K. Hence, we can assume $ndr is lessthan any fixed constant K for su#ciently large n.

In order to bound Rn\r, we need to bound****

#"\r

&

Sr

!+""

\r,Sr

****#,2

, using the technique used in

Rothman et al. [28]. Let G : R(m"1)2dr & R be afunction defined as

G#(U)Sr

&:= "

#+""

\r,Sr+(U)Sr

;D&!"

#+""

\r,Sr

;D&

+#n

-***+""

\r,Sr

+(U)Sr

***1,2!***+""

\r,Sr

***1,2

..

By optimality of "\r, it is clear that#U&

Sr

=#"\r

&

Sr

!+""

\r,Srminimizes G. Since G(0) = 0 by

construction, we have G##

U&

Sr

&$ 0. Suppose there

exist an !&/!2 ball with radius Br such that for any***(U)Sr

***#,2

= Br, we have that G*(U)Sr

+> 0. Then,

we can claim that

&&&&*U+

Sr

&&&&&,2

3 Br; because if, in

contrary, we assume that#U&

Sr

is outside the ball,

Page 15: On Learning Discrete Graphical Models using Group-Sparse ...pradeepr/paperz/dgm_aistats.pdf · On Learning Discrete Graphical Models using Group-Sparse Regularization regression [32,

Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, Sujay Sanghavi

then for an appropriate choice of t $ (0, 1), the point

t*U+

Sr

+(1( t)0 lies on the boundary of the ball. By

convexity of G, we have

G

<t*U+

Sr

+ (1 ( t)0

=3 tG

<*U+

Sr

=+(1 ( t)G (0)

3 0.

This is a contradiction to the assumption of thepositivity of G on the boundary of the ball.

Let (U)Sr $ R(m"1)2dr be an arbitrary vector with***(U)Sr

***#,2

= 5Cmin

#n. Applying mean value theorem

to the log liklihood function, for some * $ [0, 1], weget

G*(U)Sr

+=?1

W\r2Sr

, (U)Sr

@

+

U(U)Sr

,02!

<*"#

\r

+

Sr

+ * (U)Sr;D

=(U)Sr

V

+ $n

6&&&&*"#

\r

+

Sr

+ (U)Sr

&&&&1,2

(&&&&*"#

\r

+

Sr

&&&&1,2

8.

(19)We bound each of these three terms individually. ByCauchy-Schwartz inequality, we have

333?1

W\r2Sr

, (U)Sr

@333 3&&&1W\r

2Sr

&&&&,2

&&(U)Sr

&&1,2

3 %

4(2 ( %)$ndr5

Cmin$n

3 5

4Cmindr$

2n.

Moreover, by triangle inequality,

$n

6&&&&*"#

\r

+

Sr

+ (U)Sr

&&&&1,2

(&&&&*"#

\r

+

Sr

&&&&1,2

8

! ($n&&(U)Sr

&&1,2

! ( 5

Cmindr$

2n.

To bound the remaining term, notice that by a Taylor expansion we get
\[
\begin{aligned}
\Lambda_{\min}\Big(\nabla^{2}\ell\big((\theta^{*}_{\backslash r})_{S_r} + \xi\,(U)_{S_r};\,\mathcal{D}\big)\Big)
&\;\ge\; \min_{\xi \in [0,1]} \Lambda_{\min}\Big(\nabla^{2}\ell\big((\theta^{*}_{\backslash r})_{S_r} + \xi\,(U)_{S_r};\,\mathcal{D}\big)\Big) \\
&\;\ge\; \Lambda_{\min}\big(Q^{*}_{S_r S_r}\big)
\;-\; \max_{\xi \in [0,1]} \Lambda_{\max}\!\left(\left\langle \frac{\partial \nabla^{2}\ell\big(\theta_{S_r};\mathcal{D}\big)}{\partial \theta_{S_r}}\bigg|_{(\theta^{*}_{\backslash r})_{S_r} + \xi (U)_{S_r}},\; (U)_{S_r} \right\rangle\right) \\
&\;\ge\; C_{\min} \;-\; \Big(\max_{t' \in V\setminus\{r\}} \Big\|\frac{\partial \eta\big(x^{(i)}\big)}{\partial \theta^{rt'}_{j'k'}}\Big\|_{2}\,\sqrt{d_r}\Big)\,
\Lambda_{\max}(\otimes^{*})\,\sqrt{d_r}\,\big\|(U)_{S_r}\big\|_{\infty,2},
\end{aligned}
\tag{20}
\]
where \(\eta(\cdot)\) is defined in Section 7.2. We know that \(\Lambda_{\max}(\otimes^{*}) = \Lambda_{\max}(J^{*})\) as a property of the Kronecker product. By (18) and the assumption on the maximum eigenvalue of \(J^{*}\), we have
\[
\Lambda_{\min}\Big(\nabla^{2}\ell\big((\theta^{*}_{\backslash r})_{S_r} + \xi\,(U)_{S_r};\,\mathcal{D}\big)\Big)
\;\ge\; C_{\min} \;-\; \frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\big\|(U)_{S_r}\big\|_{\infty,2}
\;\ge\; C_{\min} \;-\; \frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\frac{5}{C_{\min}}\,\lambda_n
\;\ge\; \frac{C_{\min}}{2}
\qquad\Big(\lambda_n d_r \;\le\; \frac{C_{\min}^{2}}{4\cdot 50\,(m-1)\,D_{\max}}\Big).
\]

Hence, from (19), and since \(-\frac{1}{4} + \frac{5}{2} - 1 = \frac{5}{4} > 0\), we get
\[
G\big((U)_{S_r}\big) \;\ge\; \frac{5}{C_{\min}}\, d_r\,\lambda_n^{2}\left(-\frac{1}{4} + \frac{5}{2} - 1\right) \;>\; 0.
\]
We can conclude that
\[
\Big\|\big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{\backslash r}\big)_{S_r}\Big\|_{\infty,2} \;\le\; \frac{5}{C_{\min}}\,\lambda_n \tag{21}
\]

with high probability. With a similar analysis on the maximum eigenvalue of the derivative of the Hessian as in (20), it is easy to show that
\[
\frac{\big\|R^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n}
\;\le\; \frac{1}{\lambda_n}\,\frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\Big\|\big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{\backslash r}\big)_{S_r}\Big\|_{\infty,2}^{2}
\;\le\; \frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\frac{25}{C_{\min}^{2}}\,\lambda_n
\;\le\; \frac{\alpha}{4(2-\alpha)},
\]
provided that \(\lambda_n d_r \le \frac{C_{\min}^{2}}{50\sqrt{2}\,(m-1)\,D_{\max}}\,\frac{\alpha}{2-\alpha}\).


10 Proof of Lemma 4

(D1) By the variational representation of the smallest eigenvalue, we have
\[
\begin{aligned}
\Lambda_{\min}\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)
&\;\ge\; \Lambda_{\min}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)
\;-\; \Lambda_{\max}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r} - \big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)\\
&\;\ge\; C_{\min}(1+\epsilon)
\;-\; \Lambda_{\max}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r} - \big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big).
\end{aligned}
\]
In the second inequality, we used the result of Lemma 1; that is, the inequality holds with the probability stated in Lemma 4. By a Taylor expansion, for some \(\xi \in [0,1]\), and by (23), we get
\[
\begin{aligned}
\Lambda_{\max}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r} - \big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)
&\;\le\; \Lambda_{\max}\!\left(\left\langle \frac{\partial\big[\nabla^{2}\ell(\theta;\mathcal{D})\big]_{S_r S_r}}{\partial\theta}\bigg|_{\theta^{*}_{\backslash r}-\xi\,\theta^{*}_{P^{c}}},\; \theta^{*}_{P^{c}}\right\rangle\right)\\
&\;\le\; \big\|\nabla\eta\big(x^{(i)}\big)\big\|_{\infty}\, D_{\max}\, \big\|\theta^{*}_{P^{c}}\big\|_{1}
\;\le\; \epsilon\, C_{\min}.
\end{aligned}
\]
Note that \(\big\|\nabla\eta\big(x^{(i)}\big)\big\|_{\infty} \le 1\) for \(\eta(\cdot)\) defined in Section 7.3. The last inequality holds as a result of Lemma 1, with the probability stated in Lemma 4. Hence, the result follows.

(D2) We can write
\[
\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r}\,\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1} \;=\; \sum_{i=0}^{3} T_i,
\]
where
\[
\begin{aligned}
T_0 &= \big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r^{c} S_r}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\\
T_1 &= \big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r^{c} S_r}\Big[\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1} - \Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\Big]\\
T_2 &= \Big[\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r} - \big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r^{c} S_r}\Big]\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\\
T_3 &= \Big[\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r} - \big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r^{c} S_r}\Big]\Big[\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1} - \Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\Big].
\end{aligned}
\]

By Lemma 1, we have that \(\|T_0\|_{\infty,2} \le \frac{1-\alpha}{\sqrt{d_r}}\) with the probability stated in Lemma 4. For the second term, we have
\[
\|T_1\|_{\infty,2}
\;\le\; \|T_0\|_{\infty,2}\;
\underbrace{\Lambda_{\max}\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r} - \big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)}_{T_{12}}\;
\underbrace{\Lambda_{\max}\Big(\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\Big)}_{T_{13}}
\;\le\; \frac{1-\alpha}{\sqrt{d_r}}\,\epsilon\, C_{\min}\,\frac{1}{C_{\min}}
\;=\; \frac{(1-\alpha)\,\epsilon}{\sqrt{d_r}}.
\]
We used the result of (D1) for \(\Lambda_{\max}(T_{13}) \le \frac{1}{C_{\min}}\). For the third term, we have
\[
\|T_2\|_{\infty,2}
\;\le\; \Big\|\underbrace{\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r^{c} S_r} - \big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r}}_{T_{21}}\Big\|_{\infty,2}\;
\underbrace{\Lambda_{\max}\Big(\Big(\big[\nabla^{2}\ell(\theta^{*}_{\backslash r};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\Big)}_{T_{22}}
\;\le\; \epsilon\, C_{\min}\,\frac{1}{C_{\min}(1+\epsilon)}
\;=\; \frac{\epsilon}{1+\epsilon}.
\]
For the fourth term, we have
\[
\|T_3\|_{\infty,2} \;\le\; \|T_{21}\|_{\infty,2}\,\Lambda_{\max}(T_{22})\,\Lambda_{\max}(T_{12})\,\Lambda_{\max}(T_{13})
\;\le\; \epsilon\, C_{\min}\,\frac{1}{C_{\min}(1+\epsilon)}\,\epsilon\, C_{\min}\,\frac{1}{C_{\min}}
\;\le\; \frac{\epsilon^{2}}{1+\epsilon}.
\]
Putting all the pieces together, we get the result.

(D3) The result follows directly from Lemma 1.

This concludes the proof of Lemma 4.
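The decomposition used in (D2) is a purely algebraic identity: with \(A\), \(A'\) playing the role of the \(S_r^{c}\times S_r\) Hessian blocks at \(\theta^{*}_{\backslash r}\) and \(\theta^{*}_{P}\), and invertible \(B\), \(B'\) the corresponding \(S_r\times S_r\) blocks, one has \(A'(B')^{-1} = T_0+T_1+T_2+T_3\). The short Python sketch below (illustrative only; random matrices stand in for the Hessian blocks) checks this identity numerically.

    import numpy as np

    rng = np.random.default_rng(0)
    k, s = 6, 4                       # hypothetical sizes of the S_r^c and S_r blocks
    A  = rng.standard_normal((k, s))  # block at theta*_{\r}: rows S_r^c, columns S_r
    Ap = rng.standard_normal((k, s))  # same block at theta*_P
    B  = rng.standard_normal((s, s)) + 5 * np.eye(s)   # invertible S_r x S_r block at theta*_{\r}
    Bp = rng.standard_normal((s, s)) + 5 * np.eye(s)   # invertible S_r x S_r block at theta*_P

    Binv, Bpinv = np.linalg.inv(B), np.linalg.inv(Bp)
    T0 = A @ Binv
    T1 = A @ (Bpinv - Binv)
    T2 = (Ap - A) @ Binv
    T3 = (Ap - A) @ (Bpinv - Binv)

    # the four terms sum back to the quantity decomposed in (D2)
    assert np.allclose(Ap @ Bpinv, T0 + T1 + T2 + T3)

The triangle inequality then reduces the incoherence bound for \(\theta^{*}_{P}\) to the four individual bounds derived above.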

11 Sufficiency Lemmas for Higher Order Dependencies

Lemma 7. The constructed candidate primal-dual pair \(\big(\hat\theta_{\backslash r}, \hat Z_{\backslash r}\big)\) satisfies the conditions of Lemma 3


with probability \(1 - c_1\exp(-c_2 n)\) for some positive constants \(c_1, c_2 \in \mathbb{R}\).

Proof. Using the mean-value theorem, for some \(\bar\theta_{\backslash r}\) in the convex combination of \(\hat\theta_{\backslash r}\) and \(\theta^{*}_{P}\), we have
\[
\begin{aligned}
\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\,\big[\hat\theta_{\backslash r} - \theta^{*}_{P}\big]
&= \nabla\ell(\hat\theta_{\backslash r};\mathcal{D}) - \nabla\ell(\theta^{*}_{P};\mathcal{D})
+ \Big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D}) - \nabla^{2}\ell(\bar\theta_{\backslash r};\mathcal{D})\Big]\big[\hat\theta_{\backslash r} - \theta^{*}_{P}\big]\\
&= -\lambda_n \hat Z_{\backslash r} \;\underbrace{-\,\nabla\ell(\theta^{*}_{P};\mathcal{D})}_{W^{n}_{\backslash r}}
\;+\; \underbrace{\Big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D}) - \nabla^{2}\ell(\bar\theta_{\backslash r};\mathcal{D})\Big]\big[\hat\theta_{\backslash r} - \theta^{*}_{P}\big]}_{R^{n}_{\backslash r}}.
\end{aligned}
\]

We can rewrite this set of equations as two sets of equations over \(S_r\) and \(S_r^{c}\). By Lemma 4, the Hessian sub-matrix on \(S_r\) is invertible with high probability, and thus we get
\[
\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r}\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}
\Big(-\lambda_n\big(\hat Z_{\backslash r}\big)_{S_r} - \big(W^{n}_{\backslash r}\big)_{S_r} + \big(R^{n}_{\backslash r}\big)_{S_r}\Big)
\;=\; -\lambda_n\big(\hat Z_{\backslash r}\big)_{S_r^{c}} - \big(W^{n}_{\backslash r}\big)_{S_r^{c}} + \big(R^{n}_{\backslash r}\big)_{S_r^{c}}.
\]

Notice that \(\big\|(\hat Z_{\backslash r})_{S_r}\big\|_{\infty,2} = 1\) and hence we get
\[
\begin{aligned}
\Big\|\big(\hat Z_{\backslash r}\big)_{S_r^{c}}\Big\|_{\infty,2}
&\;\le\; \Big[1 + \Big\|\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r^{c} S_r}\Big(\big[\nabla^{2}\ell(\theta^{*}_{P};\mathcal{D})\big]_{S_r S_r}\Big)^{-1}\Big\|_{\infty,2}\sqrt{d_r}\Big]
\left(\frac{\big\|W^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n} + \frac{\big\|R^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n} + 1\right) - 1\\
&\;\le\; (2-\alpha)\left(\frac{\alpha}{4(2-\alpha)} + \frac{\alpha}{4(2-\alpha)} + 1\right) - 1
\;=\; 1 - \frac{\alpha}{2} \;<\; 1.
\end{aligned}
\]

The second inequality holds with high probability according to Lemma 4 and Lemma 8.
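As a quick sanity check on the arithmetic in the last display (purely illustrative; the grid of \(\alpha\) values is arbitrary), the following sketch verifies that \((2-\alpha)\big(\tfrac{\alpha}{4(2-\alpha)} + \tfrac{\alpha}{4(2-\alpha)} + 1\big) - 1\) simplifies to \(1 - \tfrac{\alpha}{2}\) and stays strictly below 1 for \(\alpha \in (0,1]\).

    import numpy as np

    alphas = np.linspace(0.01, 1.0, 100)        # arbitrary grid of incoherence parameters
    lhs = (2 - alphas) * (alphas / (4 * (2 - alphas))
                          + alphas / (4 * (2 - alphas)) + 1) - 1
    assert np.allclose(lhs, 1 - alphas / 2)     # the simplification used in the proof
    assert np.all(lhs < 1)                      # strict dual feasibility margin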

Lemma 8. For the quantities defined in the proof of Lemma 7, the following inequalities hold:
\[
\mathbb{P}\!\left[\frac{\big\|W^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n} \;\ge\; \frac{\alpha}{4(2-\alpha)}\right]
\;\le\; 2\exp\!\left(-\frac{1}{4}\Big(\Big(\frac{\alpha}{4(2-\alpha)}\,\lambda_n-\frac{1}{2}\big\|\theta^{*}_{P^{c}}\big\|_{1}\Big)\sqrt{n}-\frac{m-1}{2}\Big)^{2}+\log(p-1)\right)
\]
\[
\mathbb{P}\!\left[\frac{\big\|R^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n} \;\ge\; \frac{\alpha}{4(2-\alpha)}\right]
\;\le\; 2\exp\!\left(-\frac{1}{4}\Big(\Big(\frac{\alpha}{4(2-\alpha)}\,\lambda_n-\frac{1}{2}\big\|\theta^{*}_{P^{c}}\big\|_{1}\Big)\sqrt{n}-\frac{m-1}{2}\Big)^{2}+\log(p-1)\right).
\]

Proof. By a simple derivation, we have
\[
\frac{\partial \ell^{(i)}(\theta^{*}_{P};\mathcal{D})}{\partial \theta^{rt}_{jk}}
\;=\; \mathbb{I}\big[x^{(i)}_t = k\big]\Big(\mathbb{I}\big[x^{(i)}_r = j\big] - \mathbb{P}_{\theta^{*}_{P}}\big[X_r = j \,\big|\, X_{\backslash r} = x^{(i)}_{\backslash r}\big]\Big).
\]

It is easy to show that
\[
\begin{aligned}
\mathbb{E}_{\theta^{*}_{\backslash r}}\!\left[\frac{\partial \ell^{(i)}(\theta^{*}_{P};\mathcal{D})}{\partial \theta^{rt}_{jk}}\right]
&= \mathbb{P}_{\theta^{*}_{\backslash r}}\big(X_r = j \,\big|\, X_t = k,\, X_{\backslash r,t} = x_{\backslash r,t}\big)
- \mathbb{P}_{\theta^{*}_{P}}\big(X_r = j \,\big|\, X_t = k,\, X_{\backslash r,t} = x_{\backslash r,t}\big)\\
&\le \big\|\theta^{*}_{P^{c}}\big\|_{1}\,
\max_{\xi\in[0,1]}\Big\|\nabla\,\mathbb{P}_{\theta^{*}_{\backslash r}-\xi\,\theta^{*}_{P^{c}}}\big(X_r = j \,\big|\, X_t = k,\, X_{\backslash r,t} = x_{\backslash r,t}\big)\Big\|_{\infty}
\;\le\; \frac{1}{4}\,\big\|\theta^{*}_{P^{c}}\big\|_{1},
\end{aligned}
\]
where, with an abuse of notation, \(\theta^{*}_{\backslash r} - \xi\,\theta^{*}_{P^{c}}\) represents the matrix \(\theta^{*}_{\backslash r}\) perturbed only on the entries corresponding to \(\theta^{*}_{P^{c}}\). Also, one can show that \(\mathrm{Var}\big(\partial \ell^{(i)}(\theta^{*}_{P};\mathcal{D})/\partial \theta^{rt}_{jk}\big) \le \frac{1}{4}\). Consequently, with the i.i.d. assumption on the drawn samples, we have \(\mathrm{Var}\big(\partial \ell(\theta^{*}_{P};\mathcal{D})/\partial \theta^{rt}_{jk}\big) \le \frac{1}{4n}\). For a fixed \(t \in V\setminus\{r\}\),


by Jensen's inequality,
\[
\mathbb{E}_{\theta^{*}_{\backslash r}}\!\left[\Big\|\frac{\partial \ell(\theta^{*}_{P};\mathcal{D})}{\partial \theta^{rt}}\Big\|_{2}\right]
\;\le\; \sqrt{\mathbb{E}_{\theta^{*}_{\backslash r}}\!\left[\Big\|\frac{\partial \ell(\theta^{*}_{P};\mathcal{D})}{\partial \theta^{rt}}\Big\|_{2}^{2}\right]}
\;\le\; \frac{1}{2}\sqrt{\frac{(m-1)^{2}}{n} + \big\|\theta^{*}_{P^{c}}\big\|_{1}^{2}}
\;\le\; \frac{m-1}{2\sqrt{n}} + \frac{1}{2}\big\|\theta^{*}_{P^{c}}\big\|_{1}.
\]
We have \(\max_{t\in V\setminus\{r\}} \big\|\partial \ell^{(i)}(\theta^{*}_{P};\mathcal{D})/\partial \theta^{rt}\big\|_{2} \le \sqrt{2}\) for all \(i\), and hence, by the Azuma-Hoeffding inequality and the union bound, we get
\[
\mathbb{P}\!\left[\Big\|\frac{\partial \ell(\theta^{*}_{P};\mathcal{D})}{\partial \theta^{rt}}\Big\|_{\infty,2}
\;>\; \frac{m-1}{2\sqrt{n}} + \frac{1}{2}\big\|\theta^{*}_{P^{c}}\big\|_{1} + \delta\right]
\;\le\; 2\exp\!\left(-\frac{\delta^{2} n}{4} + \log(p-1)\right).
\]
For \(\lambda_n \ge \frac{8(2-\alpha)}{\alpha}\Big(\frac{m-1}{4\sqrt{n}} + \frac{1}{4}\big\|\theta^{*}_{P^{c}}\big\|_{1}\Big)\), the result follows.
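The concentration step above is a standard bounded-increment argument: an average of \(n\) independent terms, each with norm at most \(\sqrt{2}\), deviates from its mean by more than \(\delta\) with probability at most \(2\exp(-\delta^{2}n/4)\). The sketch below is illustrative only; Rademacher increments scaled to magnitude \(\sqrt{2}\) are a stand-in for the gradient terms, and the parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials, delta = 400, 2000, 0.15
    # each increment is +/- sqrt(2) with equal probability: mean zero and bounded
    # by sqrt(2) in absolute value, mimicking the bounded gradient terms above
    increments = np.sqrt(2) * rng.choice([-1.0, 1.0], size=(trials, n))
    deviations = np.abs(increments.mean(axis=1))
    empirical = (deviations > delta).mean()
    bound = 2 * np.exp(-delta ** 2 * n / 4)
    print(empirical, bound)   # the empirical frequency stays below the bound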

In order to bound \(R^{n}_{\backslash r}\), we need to control the estimation error \(\big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{P}\big)_{S_r}\). Let \(H : \mathbb{R}^{(m-1)^{2} d_r} \to \mathbb{R}\) be the function defined as
\[
H(U_{S_r}) \;:=\; \ell\Big(\big(\theta^{*}_{P}\big)_{S_r} + U_{S_r};\,\mathcal{D}\Big) - \ell\Big(\big(\theta^{*}_{P}\big)_{S_r};\,\mathcal{D}\Big)
+ \lambda_n\Big(\big\|\big(\theta^{*}_{P}\big)_{S_r} + U_{S_r}\big\|_{1,2} - \big\|\big(\theta^{*}_{P}\big)_{S_r}\big\|_{1,2}\Big).
\]
By optimality of \(\hat\theta_{\backslash r}\), it is clear that \(U^{*} = \big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{P}\big)_{S_r}\) minimizes \(H\). Since \(H(0) = 0\) by construction, we have \(H(U^{*}) \le 0\). Suppose there exists an \(\ell_\infty/\ell_2\) ball with radius \(B_r\) such that for any \(U\) with \(\|U\|_{\infty,2} = B_r\) we have \(H(U) > 0\). Then we can claim that \(\|U^{*}\|_{\infty,2} \le B_r\); see the proof of Lemma 6 for more discussion of this proof technique. Let \(U_0 \in \mathbb{R}^{(m-1)^{2} d_r}\) be an arbitrary vector with \(\|U_0\|_{\infty,2} = \frac{5}{C_{\min}}\lambda_n\). We have
\[
H(U_0) \;=\; \ell\Big(\big(\theta^{*}_{P}\big)_{S_r} + U_0;\,\mathcal{D}\Big) - \ell\Big(\big(\theta^{*}_{P}\big)_{S_r};\,\mathcal{D}\Big)
+ \lambda_n\Big(\big\|\big(\theta^{*}_{P}\big)_{S_r} + U_0\big\|_{1,2} - \big\|\big(\theta^{*}_{P}\big)_{S_r}\big\|_{1,2}\Big). \tag{22}
\]

We bound each of these three terms individually. Applying the mean value theorem to the log-likelihood function, for some \(\xi \in [0,1]\), we get
\[
\ell\Big(\big(\theta^{*}_{P}\big)_{S_r} + U_0;\,\mathcal{D}\Big) - \ell\Big(\big(\theta^{*}_{P}\big)_{S_r};\,\mathcal{D}\Big)
\;=\; \Big\langle \big(W^{n}_{\backslash r}\big)_{S_r},\, U_0 \Big\rangle
+ \Big\langle U_0,\; \nabla^{2}\ell\Big(\big(\theta^{*}_{P}\big)_{S_r} + \xi\,U_0;\,\mathcal{D}\Big)\, U_0 \Big\rangle.
\]

Note that \(\frac{\alpha}{4(2-\alpha)}\lambda_n \le \frac{1}{4}\lambda_n\) and hence, by our bound on \(W^{n}_{\backslash r}\) and the Cauchy–Schwarz inequality, we have
\[
\Big|\Big\langle \big(W^{n}_{\backslash r}\big)_{S_r},\, U_0 \Big\rangle\Big|
\;\le\; \Big\|\big(W^{n}_{\backslash r}\big)_{S_r}\Big\|_{\infty,2}\,\|U_0\|_{1,2}
\;\le\; \frac{\lambda_n}{4}\, d_r\,\|U_0\|_{\infty,2}
\;\le\; \frac{5}{4\,C_{\min}}\,\lambda_n^{2}\, d_r.
\]

To bound the other term, by a Taylor expansion, we get
\[
\begin{aligned}
\Lambda_{\min}\Big(\nabla^{2}\ell\big(\big(\theta^{*}_{P}\big)_{S_r} + \xi\,U_0;\,\mathcal{D}\big)\Big)
&\;\ge\; \min_{\xi\in[0,1]}\Lambda_{\min}\Big(\nabla^{2}\ell\big(\big(\theta^{*}_{P}\big)_{S_r} + \xi\,U_0;\,\mathcal{D}\big)\Big)\\
&\;\ge\; \Lambda_{\min}\Big(\nabla^{2}\ell\big(\big(\theta^{*}_{P}\big)_{S_r};\,\mathcal{D}\big)\Big)
- \max_{\xi\in[0,1]}\Lambda_{\max}\!\left(\left\langle \frac{\partial \nabla^{2}\ell\big((\theta_{P})_{S_r};\mathcal{D}\big)}{\partial (\theta_{P})_{S_r}}\bigg|_{(\theta^{*}_{P})_{S_r}+\xi U_0},\; U_0\right\rangle\right)\\
&\;\ge\; C_{\min} - \max_{t'\in V\setminus\{r\}}\Big\|\frac{\partial \eta\big(x^{(i)}\big)}{\partial \theta^{rt'}_{j'k'}}\Big\|_{2}\; d_r\,\Lambda_{\max}(\otimes^{*})\,\|U_0\|_{\infty,2}\\
&\;\ge\; C_{\min} - \frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\|U_0\|_{\infty,2}
\;\ge\; \frac{C_{\min}}{2}
\qquad\Big(\lambda_n d_r \le \frac{C_{\min}^{2}}{4\cdot 50\,(m-1)\,D_{\max}}\Big).
\end{aligned}
\tag{23}
\]
Here, we used the fact that \(\Lambda_{\max}(\otimes^{*}) = \Lambda_{\max}(J^{*})\), a property of the Kronecker product, together with our assumption on the maximum eigenvalue of \(J^{*}\). By the triangle inequality,

\[
\lambda_n\Big(\big\|\theta^{*}_{P} + U_0\big\|_{1,2} - \big\|\theta^{*}_{P}\big\|_{1,2}\Big)
\;\ge\; -\lambda_n\,\|U_0\|_{1,2}
\;\ge\; -\lambda_n\, d_r\,\|U_0\|_{\infty,2}
\;\ge\; -\frac{5\,\lambda_n^{2}\, d_r}{C_{\min}}.
\]

Hence, from (22), we get \(H(U_0) \ge \frac{5\,\lambda_n^{2}\, d_r}{4\,C_{\min}} > 0\), and hence
\[
\Big\|\big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{P}\big)_{S_r}\Big\|_{\infty,2} \;\le\; \frac{5}{C_{\min}}\,\lambda_n, \tag{24}
\]
with high probability. With a similar analysis as in (23),


we have
\[
\frac{\big\|R^{n}_{\backslash r}\big\|_{\infty,2}}{\lambda_n}
\;\le\; \frac{1}{\lambda_n}\,\frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\Big\|\big(\hat\theta_{\backslash r}\big)_{S_r} - \big(\theta^{*}_{P}\big)_{S_r}\Big\|_{\infty,2}^{2}
\;\le\; \frac{m-1}{\sqrt{2}}\, d_r\, D_{\max}\,\frac{25}{C_{\min}^{2}}\,\lambda_n
\;\le\; \frac{\alpha}{4(2-\alpha)},
\]
provided that \(\lambda_n d_r \le \frac{C_{\min}^{2}}{50\sqrt{2}\,(m-1)\,D_{\max}}\,\frac{\alpha}{2-\alpha}\).