New skeleton-based approaches for Bayesian structure learning of Bayesian networks

Applied Soft Computing 13 (2013) 1110–1120


Andrés R. Masegosa*, Serafín Moral
Department of Computer Science and Artificial Intelligence, University of Granada, Spain
* Corresponding author. E-mail addresses: [email protected] (A.R. Masegosa), [email protected] (S. Moral).

Article info
Article history: Received 7 May 2012; Received in revised form 6 September 2012; Accepted 28 September 2012; Available online 22 October 2012.
Keywords: Probabilistic graphical models; Bayesian networks; Bayesian structure learning; Markov chain Monte Carlo; Stochastic search.

Abstract

Automatically learning the graph structure of a single Bayesian network (BN) which accurately represents the underlying multivariate probability distribution of a collection of random variables is a challenging task. But obtaining a Bayesian solution to this problem based on computing the posterior probability of the presence of any edge or any directed path between two variables, or any other structural feature, is a much more involved problem, since it requires averaging over all the possible graph structures. For the former problem, recent advances have shown that search + score approaches find much more accurate structures if the search is constrained by a previously inferred skeleton (i.e. a relaxed structure with undirected edges which can be inferred using local search based methods). Based on similar ideas, we propose two novel skeleton-based approaches to approximate a Bayesian solution to the BN learning problem: a new stochastic search which tries to find directed acyclic graph (DAG) structures with a non-negligible score; and a new Markov chain Monte Carlo method over the DAG space. These two approaches are based on the same idea. In a first step, both employ a previously given skeleton and build a Bayesian solution constrained by this skeleton. In a second step, using the preliminary solution, they try to obtain a new Bayesian approximation, but this time in an unconstrained graph space, which is the final outcome of the methods. As shown in the experimental evaluation, this new approach strongly boosts the performance of these two standard techniques, proving that the idea of employing a skeleton to constrain the model space is also a successful strategy for performing Bayesian structure learning of BNs.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Bayesian networks (BN) are statistical models which allow one to graphically represent, by means of a directed acyclic graph (DAG), complex structures of dependencies among stochastic variables [1]. They have been widely employed in a great variety of real world problems because of their excellent properties for reasoning under uncertainty [2].

The problem of automatically learning the structure of a BN from a database has been the subject of a great deal of research [3–5]. Traditionally, two procedures have been considered for this problem: one based on scoring and searching [3,6]; and the other, constraint-based (CB) learning, based on carrying out several independence tests on the learning data set to build a Bayesian network in agreement with the test results [5]. However, in the past years several methods have been proposed which combine aspects of both basic procedures. For example, [7,8] employ Bayesian scores to carry out the statistical tests in a PC-like algorithm. They showed that these scores reduce the average number of structural errors. Other works [9,10] introduce some PC variants which use a greedy procedure to introduce links, similar to the K2 algorithm [6], in order to make independence tests only relative to the added links.

But the most successful strategy has been based on the induction of a BN skeleton (i.e. a graph with undirected edges) by means of CB approaches and, in a second step, running a score + search method to find a maximum scoring structure over the DAG space constrained by this skeleton. To the best of our knowledge, Van Dijk et al. [11] were the first to propose a learning method based on this strategy. But Max-Min Hill Climbing (MM-HC) [12] is probably the best-known method of this kind. The idea of finding high scoring BNs in a restricted or constrained search space has been further pursued in many works [13–16]. For example, [16] presents a method which is able to find the BN model constrained to a given skeleton with the globally optimal score. At the same time, many other outperforming methods have been proposed in order to elicit the skeleton of a BN by means of local structure discovery methods (see [17] for a recent review).

In parallel to these works, researchers have also focused on the computation of full Bayesian solutions for the problem of learning BNs from data. They were motivated by the uncertainty in the model selection, which is especially notorious in problems where the size of the learning data sets is usually small compared to the super-exponential size of the space of DAGs. Rather than looking for a single BN maximizing a given score, like score + search approaches, the problem now is computing the marginal posterior probability of the edges of a BN [18] (or any other structural feature) and making predictions using a Bayesian model averaging approach.

Two different sets of approaches have been employed to obtain these Bayesian solutions. Markov chain Monte Carlo (MCMC) methods [18–22], whose aim is to sample different DAG structures according to the stationary distribution of the defined Markov chain; and stochastic search methods [23–25] which, unlike MCMC, do not have the goal of converging to a stationary distribution (which may be unattainable, and is usually impossible to assess), but simply of listing and scoring a collection of high scoring models which are visited during the search process.

In this work we aim to show that the successful Max-Min Hill Climbing decomposition idea [12] for finding maximum scoring single BNs can also be applied to develop new methods for obtaining Bayesian solutions to the BN structure learning problem. More precisely, what we show is that the computation of a Bayesian approximation in two steps, firstly over the DAGs constrained by a previously given skeleton and, in a second step, over an unconstrained DAG space but using the information of the first step, is a very successful strategy for this problem. Two different implementations of this idea are evaluated in this work. The first one is a stochastic search method which, as a first step, carries out a search constrained by the skeleton, performing random movements according to the Bayesian scores of the alternative local movements. In the second step, this stochastic search is run again, starting randomly at any of the high scoring models found in the first step and without constraining the random local moves. The second method introduced in this work is a new MCMC in the DAG space. Similarly to the other method, we firstly run an MCMC constrained by the skeleton of the BN which should identify high scoring constrained DAG models. After that, in a second step, a standard MCMC over the DAG space is run without any constraint. However, this MCMC is conducted by a new proposal distribution, which introduces global movements using the samples generated by the first MCMC over the constrained DAG space.

This paper is structured as follows. In Section 2, we provide background knowledge on Bayesian scores and skeletons, as well as on Bayesian structure learning methods. Subsequently, in Section 3, we present the details of our two skeleton-based approaches. The experimental evaluation is given in Section 4. Finally, in Section 5, we provide the main conclusions and future research.

2. Background knowledge

2.1. The Bayesian score of a BN

Let us assume we are given a vector of n random variables X = (X1, ..., Xn), each taking values in some finite domain Val(Xi). A BN is defined by a directed acyclic graph, denoted as G, which represents the dependency structure among the network variables. More precisely, this graph G is specified by means of a vector with the parent sets, Πi ⊂ X, of each variable Xi ∈ X: G = (Π1, ..., Πn). The parent set Πi is represented in G by those variables with an edge pointing to Xi. The definition of a BN model is completed with a numerical vector, denoted by θ, which contains the parameters of the conditional probability distributions encoded in this graph G: θij is a vector of length |Val(Xi)| (|·| is the cardinality operator) associated to the conditional multinomial distribution P(Xi|Πi = j), where Πi = j denotes the jth assignment of the variables in Πi. We also use |Val(Πi)| to denote the number of all the possible combinations.

Let us also assume we are given a fully observed multinomial data set D. To compute the marginal likelihood of the data given the graph structure, P(D|G) = ∫ P(D|G, θ)P(θ|G)dθ, the most common settings [3] define a prior Dirichlet distribution for each parameter θij with parameter vector αij, θij ∼ Dir(αij). They also assume a set of parameter independence assumptions in order to factorize the joint probabilities and make feasible the computation of the multidimensional integral.

In that way, the marginal likelihood of the data given a graph structure and a set of vectors of Dirichlet parameters, αij, has the following well-known closed-form equation:

P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{|Val(\Pi_i)|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|Val(X_i)|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}    (1)

where Nijk is the number of data instances in D consistent with the jth assignment of Πi and Xi = k, while Nij = Σk Nijk and αij = Σk αijk. In the case of the Bayesian Dirichlet equivalent metric, or BDeu metric, these αijk are set to αijk = 1/(|Val(Πi)||Val(Xi)|). The relevance of these settings relies on the following property of this Bayesian metric, known as likelihood equivalence: if two different BN models encode the same conditional independencies, then the score metric assigns the same score value to each model.

With the definition of a prior distribution for the graph structures, P(G), we fully specify the Bayesian score metric of a graph structure: score(G|D) = P(G)P(D|G). Furthermore, if the prior distribution is locally decomposable, P(G) = ∏i Pi(Πi), then this score can also be made locally decomposable: score(G|D) = ∏i score(Xi, Πi|D).

The prior over graph structures, P(G), is usually taken to be uniform. However, as pointed out in several works [26–28], this uniform prior is not optimal, especially because it does not account for the problem of "multiplicity correction" (i.e. if the number of candidate parents for a variable Xi grows, the probability of edge inclusion should be decreased in order to control the number of false positive edges). So, these authors propose the following prior, which is also the one employed by the approaches presented in this work: P(G) ∝ ∏i C(n − 1, |Πi|)^{−1}, where C(·,·) denotes the binomial coefficient, i.e. the inverse of the number of parent sets of size |Πi| that can be chosen from the n − 1 candidate parents of Xi.
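
As a small illustration (not code from the paper), the multiplicity-correction prior and the local marginal likelihoods combine additively in log space; the sketch below assumes the candidate parent pool of each Xi is the remaining n − 1 variables, which is our reading of the binomial coefficient above.

from math import comb, log

def log_structure_prior(parent_sets, n):
    """log P(G) up to a constant: sum_i -log C(n-1, |Pi_i|)."""
    return sum(-log(comb(n - 1, len(pi))) for pi in parent_sets)

def log_score(parent_sets, local_log_likelihoods, n):
    """Decomposable log score(G|D) = log P(G) + sum_i log P(D_i | Pi_i)."""
    return log_structure_prior(parent_sets, n) + sum(local_log_likelihoods)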

2.2. Learning the skeleton of a BN

In this subsection we give a brief overview of the main ideas behind the methods employed in this work to build the skeleton of the BN as a previous step to constrain the subsequent search in the DAG space. We will use SK to denote this skeleton, which is defined by a set of undirected edges between pairs of variables, SK = {Xi ↔ Xj : Xi, Xj ∈ X}. We also use NeighborsSK(Xi) to denote the set of variables which have an undirected edge in SK connected to Xi (i.e. the neighbors of Xi in SK). As pointed out in [16], the main property that this skeleton should satisfy in order to define a correct constrained DAG search space, denoted by GSK, is that it must be a super-structure of the true DAG generating the data.
We say that SK is a super-structure of a graph G if, for any directed edge Xi → Xj ∈ G, there is an undirected edge Xi ↔ Xj ∈ SK (SK can contain edges which are not included in G). In that way, a subsequent search method restricted by this skeleton has the possibility to find the true DAG.

The main proposed approaches to elicit this skeleton are based on a local search scheme [17]. These methods are individually applied to each variable in order to infer its set of neighbors, NeighborsSK(Xi). Once the set of neighbors of each variable is available, the skeleton is composed by joining all these sets together. Two schemes can be used to carry out this join operation: the AND-Scheme, Xi ↔ Xj ∈ SK ⇔ (Xi ∈ NeighborsSK(Xj) AND Xj ∈ NeighborsSK(Xi)); and the OR-Scheme, Xi ↔ Xj ∈ SK ⇔ (Xi ∈ NeighborsSK(Xj) OR Xj ∈ NeighborsSK(Xi)).
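
The two join schemes amount to an intersection or a union test on the locally inferred neighbor sets. A minimal Python sketch (the function and argument names are ours, not the paper's):

def build_skeleton(neighbors, scheme="OR"):
    """Join per-variable neighbor sets into an undirected skeleton SK.

    neighbors: dict mapping each variable to the set NeighborsSK(X)
    returned by the local discovery method.  Returns a set of frozensets
    {X, Y}, one per undirected edge.
    """
    variables = list(neighbors)
    skeleton = set()
    for x in variables:
        for y in variables:
            if x == y:
                continue
            in_x, in_y = y in neighbors[x], x in neighbors[y]
            keep = (in_x and in_y) if scheme == "AND" else (in_x or in_y)
            if keep:
                skeleton.add(frozenset((x, y)))
    return skeleton

# Example with three variables.
print(build_skeleton({"A": {"B"}, "B": set(), "C": {"B"}}, scheme="OR"))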

One of the possible ways of defining this set of neighbors for each variable, NeighborsSK(Xi), is to use the concept of Markov boundary. A Markov boundary (MB) of a variable Xi is a minimal variable subset such that, when all other variables are conditioned on it, they are probabilistically independent of Xi. If these MBs are joined together, they form a moral graph (the moral graph is the moralized counterpart of a DAG obtained by connecting nodes that have a common child), which satisfies the super-structure property. Several approaches have been proposed to find the MB of a given variable in order to build this moral graph [17]. In this work, we employ the so-called incremental association Markov boundary algorithm (IAMB) [10], as it is one of the first methods proposed for inferring MBs that is able to scale to data sets containing thousands of variables and provides very competitive results. This method is guaranteed to retrieve the true MB of a random variable if the conditional independence tests are correct and the underlying distribution generating the data satisfies the so-called composition property, which is a weaker assumption than the so-called faithfulness assumption [1]. More precisely, we employ a modified version of this method, which uses Bayesian scores instead of statistical dependency tests to decide when two variables are conditionally (in)dependent and an OR-Scheme to generate the skeleton. As shown in [7,8,29], the employment of Bayesian scores for this purpose gives rise to more accurate inferences for this kind of learning task.
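
To make the IAMB scheme concrete, the following sketch shows its grow/shrink loop. The association measure and the conditional (in)dependence decision are left as caller-supplied functions (`assoc` and `independent`); in the variant used in the paper that decision is taken with Bayesian scores rather than classical statistical tests, and these stub names are ours.

def iamb(target, variables, assoc, independent):
    """Grow/shrink estimate of the Markov boundary of `target`.

    assoc(X, target, cond) returns an association strength of X with the
    target given the conditioning set; independent(X, target, cond) returns
    True when X and the target are judged conditionally independent.
    """
    mb = set()
    changed = True
    while changed:                      # growing phase
        changed = False
        candidates = [v for v in variables if v != target and v not in mb]
        if not candidates:
            break
        best = max(candidates, key=lambda v: assoc(v, target, mb))
        if not independent(best, target, mb):
            mb.add(best)
            changed = True
    for v in list(mb):                  # shrinking phase
        if independent(v, target, mb - {v}):
            mb.discard(v)
    return mb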

Another possibility to build the skeleton is to use a method that infers the set of parents and children (those nodes in a BN which are directly connected by an edge to the target node) of a random variable as its set of neighbors. This is the approach pursued in the Max-Min Hill Climbing (MM-HC) method [12], and it also satisfies the super-structure property. In this method, the inference of the set of parents and children of a random variable is carried out by means of the so-called MM-PC algorithm [30]. This method guarantees the retrieval of the correct set of parents and children of a random variable if and only if the conditional independence tests are correct and the faithfulness assumption [1] is verified.

As previously mentioned, the main purpose of this skeleton is to constrain or restrict the subsequent search in the DAG space (any edge of a DAG visited by the search method should be previously included in the skeleton). In the case of the MM-HC method, this subsequent search is a simple Hill Climbing method with multiple re-starting. But there are other possibilities. For example, [15] proposed a method that performs a set of conditional independence tests in order to orient the edges of the skeleton in a way similar to the PC algorithm.

2.3. Bayesian structure learning of BNs

As previously mentioned, when inferring the structure of a BN from a real data set we can easily find that the Bayesian scores support several DAG models. In those cases the selection of one single model may give rise to unwarranted inferences about the structure. Therefore, the employment of a full Bayesian solution is desirable. In this case, the goal is to compute the posterior probability of some structural feature f (e.g. the presence of a directed edge between two variables Xi and Xj) as an expected posterior mean:

E(f \mid D) = \sum_{G} f(G)\, P(G \mid D)    (2)

where f(G) = 1 if the structural feature holds in G (e.g. G contains a given edge between variables Xi and Xj) and f(G) = 0 otherwise. In these cases, the prediction of a future unseen data vector x also has to be computed by averaging over the different graph structures:

P(x \mid D) = \sum_{G} p(x \mid G)\, P(G \mid D)    (3)

In this case, the underlying problem for obtaining a Bayesian solution is mainly the super-exponential size of the DAG space. Several Metropolis–Hastings MCMC algorithms have been proposed to approximate the sums above [19,18,21].

We now briefly describe how to define an MCMC over the space of DAGs. The first step is to define a proposal distribution q(G′|G) which determines which graphs G′ are proposed to be visited once the Markov chain is at the graph G. The first proposed MCMC [19] defined as its proposal distribution a uniform random sampling over the local neighborhood of graph G (adding, removing, and reversing an edge), denoted by Nb(G). However, the mixing and convergence rates offered by this MCMC are poor [18]. A different and more recent approach, labeled as dynamic programming MCMC (DP-MCMC), was studied in [20]. It is based on the definition of a new proposal distribution for the Markov chain which is a combination of the previous local distribution and a global proposal distribution:

q_{hybrid}(G' \mid G) = \begin{cases} q_{local}(G' \mid G) & \text{with prob. } \beta \\ q_{global}(G') & \text{with prob. } 1 - \beta \end{cases}    (4)

where qlocal is the aforementioned local proposal distribution while qglobal is the global proposal distribution, which proposes new graphs independently of the graph currently visited by the Markov chain. These global movements are performed by sequentially sampling the edges of a graph G′ according to a precomputed value of the posterior probability of these edges, pij = E(Xi → Xj|D). These pij values are precomputed using a different, exact Bayesian averaging method which analytically marginalizes over all the possible variable linear orders [31] using a dynamic programming based technique. The main aim of these global movements is to prevent the Markov chain from getting trapped in local optima, which are difficult to overcome using only the local proposal distribution qlocal (locally add/delete/reverse an edge). However, this method follows a counter-intuitive approach to perform these global moves, since before running the Markov chain it must first solve the Bayesian learning problem to obtain the pij values using an exact analytical method, and then it employs this exact solution to obtain an approximate solution (which is the one obtained by the MCMC). They provide a justification of their approach based on some limitations of these exact Bayesian averaging methods [31]: they cannot compute the posterior probability of some structural features, such as path features; they cannot be used assuming a uniform prior probability over the graph structures; and they are extremely slow when computing predictive densities. Finally, we point out that this approach is computationally unfeasible for problem domains with more than 25 or 30 variables because these exact Bayesian averaging methods [31] have time and space complexity exponential in the number of nodes of the BN.


Once a new graph G′ is proposed for being visited by the Markov chain using any of these proposal distributions, this movement of the chain is accepted with a probability α, which is computed as follows:

\alpha = \min\left\{ 1, \frac{P(D \mid G')\, P(G')\, q(G \mid G')}{P(D \mid G)\, P(G)\, q(G' \mid G)} \right\}    (5)

Both Markov chains have as stationary distribution the posterior probability over graph structures, P(G|D), and they converge to it because their proposal distributions are aperiodic and irreducible.
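
In practice the acceptance test of Eq. (5) is evaluated in log space to avoid overflow; a minimal sketch (our own, with caller-supplied log scores and log proposal densities):

import math, random

def mh_accept(log_score_new, log_score_cur, log_q_back, log_q_fwd):
    """Metropolis-Hastings test of Eq. (5): log_score_* = log P(D|G)P(G),
    log_q_fwd = log q(G'|G), log_q_back = log q(G|G')."""
    log_alpha = min(0.0, log_score_new - log_score_cur + log_q_back - log_q_fwd)
    return math.log(random.random()) < log_alpha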

A completely different approach, first pursued in [18], consists of building a Markov chain that traverses the space of possible variable orderings (e.g. (X3, X2, X1, X4) is a variable order in a problem with four variables) instead of the DAG space. In this case, the model space is defined by the set of possible variable orders or permutations. A single variable order will be denoted by <. The proposal distribution q(<′|<) is simply defined as a random swapping in the current variable order (e.g. given that the MC is in state (X2, X4, X1, X3), a possible proposal could be (X2, X3, X1, X4), where variables X3 and X4 are swapped). The new proposed variable order is accepted with probability:

\alpha = \min\left\{ 1, \frac{P(D \mid <')\, P(<')\, q(< \mid <')}{P(D \mid <)\, P(<)\, q(<' \mid <)} \right\}    (6)

where P(<) is the prior distribution for the variable order and P(D|<) is the probability of the data given a variable order. This likelihood can be computed in polynomial time, O(n^(K+1)), assuming that a given variable has no more than a fixed number K of parents and that the prior distribution for the graph structure is modular: P(G|<) ∝ ∏i P(Πi|<) whenever G is compatible with < (i.e. the parents Πi of each Xi are predecessors of Xi in the variable order <), using the following expression:

P(D \mid <) = \sum_{G \in \mathcal{G}_{<}} P(D \mid G, <)\, P(G \mid <) = \prod_{i} \sum_{\Pi \subseteq X_{i,<}} score(X_i, \Pi \mid D)    (7)

where G< is the set of graph structures compatible with the variable order <, and Π ⊆ X_{i,<} are the possible variable subsets created using the variables preceding variable Xi in the order < and containing no more than K variables.
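
A direct, non-optimized sketch of Eq. (7): for every variable the inner sum runs over all parent sets of size at most K drawn from its predecessors in the order, and the per-variable sums are combined in log space. The function `family_log_score(X, parents)` stands for log score(Xi, Π|D) and is assumed to be precomputed or cached; the helper names are ours.

from itertools import combinations
from math import exp, log

def log_sum_exp(values):
    m = max(values)
    return m + log(sum(exp(v - m) for v in values))

def order_log_likelihood(order, family_log_score, K):
    """log P(D | <) of Eq. (7) for a variable ordering `order`."""
    total = 0.0
    for pos, x in enumerate(order):
        predecessors = order[:pos]
        terms = [family_log_score(x, subset)
                 for size in range(0, min(K, len(predecessors)) + 1)
                 for subset in combinations(predecessors, size)]
        total += log_sum_exp(terms)
    return total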

3. Skeleton-based approaches for Bayesian structure learning of BNs

The starting point of our approaches is the elicitation of the skeleton, SK, of a BN by means of any of the methods detailed in Section 2.2. As previously mentioned, when looking for a single optimal BN, the DAG space constrained to this skeleton is correct if we assume that the skeleton is a super-structure of the true DAG generating the data. However, when looking for Bayesian solutions, this constrained DAG space is not valid, since the problem requires evaluating not only maximum scoring DAGs but also any other DAG with a non-negligible score, and the skeletons of such DAGs may not be compatible with SK. This is why our proposed methods use this constrained DAG space only as a first step, as opposed to the MM-HC method, where this constrained space is the only one explored by the subsequent search method.

In that way, our strategy is based on computing, in a first step, an approximation of the posterior distribution over the constrained DAG space, P(GSK|D), which is easier to achieve by the MCMC or stochastic search approaches because the search space is much smaller (in fact, the skeletons used were very sparse). Then, in a second step, a search over the unconstrained DAG space is carried out, taking into account the results of the previous step.

In the following two sections we detail the stochastic search and the MCMC methods for Bayesian structure learning, designed following the aforementioned strategy.

3.1. A new skeleton-based stochastic search method (SK-SS)

MCMC methods and other Monte Carlo methods such as Gibbs sampling are based on random sampling of the different models proportionally to their posterior probability. However, when the posterior over the models is highly skewed and only a reduced proportion of models have non-negligible posterior probability, Monte Carlo methods will tend to sample the same small set of models repeatedly and become very inefficient. In those cases, if a stochastic search method is powered by a search heuristic able to properly explore the model space, it can perform better because its goal is simply to visit and collect high scoring models.

In our problem, once the proposed stochastic search method has finished, it will retrieve a set of distinct visited DAG structures, which will be denoted by VG. Using this set we can compute any estimation associated to this problem, assuming that the different sums of Eqs. (2) and (3) are carried out using only the models in VG. For example, we can get an approximation of the posterior probability over the different DAG models using the following equation:

P(G \mid D) = \frac{score(G \mid D)}{\sum_{G' \in V_G} score(G' \mid D)}    (8)

while the posterior probability of a structural feature f is computed as follows:

E(f \mid D) = \sum_{G \in V_G} f(G)\, P(G \mid D)    (9)

We can also group together the DAGs with the same conditional independencies in order to have more compact representations of the posterior probability. This is also worthwhile when searching for the maximum a posteriori (MAP) model. In this case the posterior probabilities of the equivalent DAGs are added and the group is considered as a single model.

3.1.1. The stochastic search method

As previously mentioned, our method starts by searching over a DAG space constrained by a previously given skeleton, SK. In Algorithm 1, we give the details of this first stochastic search method, labeled as partialSK-SS. This method performs I global iterations, each starting from an empty DAG, G0. For each global iteration, in a first phase (lines 3–11), the addition to G0 of each possible edge of the skeleton is evaluated. In a second phase (lines 12–25), several iterations are carried out (loop of line 12) to evaluate new edge additions (if the edge is not present in G0) or new edge reversals or removals (if the edge is present in G0). Each of these movements (lines 7, 16, and 18) is randomly carried out according to the score of the alternative graphs they generate. The random adding method is detailed in lines 27–32 and the random remove/reversal method in lines 33–38. As can be seen, these random movements are simply obtained by sampling a three-case categorical variable (labeled State) using probabilities proportional to the scores of the networks.


Algorithm 1. Stochastic search over GSK (partialSK-SS)
1: INPUT: SK, a graph skeleton.
2: VG = ∅;
3: for i = 1 to I do
4:   G0 = EmptyDAG; {Initialize to an empty DAG}
     {Firstly, the addition to G0 of each edge of the skeleton is evaluated}
5:   for all X in X do {following a random permutation (r.p.)}
6:     for all Y ∈ Neighbors(X, SK) do {following a r.p.}
7:       G′ = performRandomAdding(G0, X, Y);
8:       VG = VG ∪ G′;
9:       G0 = G′;
10:    endfor
11:  endfor
     {Secondly, the reversal/addition of each edge of the skeleton to G0 is evaluated again}
12:  for j = 1 to J do
13:    for all X in X do {following a r.p.}
14:      for all Y ∈ Neighbors(X, SK) do {following a r.p.}
15:        if (X, Y) ∈ G0 then
16:          G′ = performRandomReverseRemove(G0, X, Y);
17:        else
18:          G′ = performRandomAdding(G0, X, Y);
19:        endif
20:        VG = VG ∪ G′;
21:        G0 = G′;
22:      endfor
23:    endfor
24:  endfor
25: endfor
26: return VG;

27: METHOD: performRandomAdding(G0, X, Y)
28:   State(0) = score(G0, D); {No addition}
29:   State(1) = score(G0 ∪ {X → Y}, D); {One possible addition}
30:   State(2) = score(G0 ∪ {Y → X}, D); {Another possible addition}
31:   G′ = sampleMovement(G0, State);
32:   return G′;

33: METHOD: performRandomReverseRemove(G0, X, Y)
34:   State(0) = score(G0, D); {No change}
35:   State(1) = score(G0 \ {X → Y} ∪ {Y → X}, D); {Reversal}
36:   State(2) = score(G0 \ {X → Y}, D); {Removal}
37:   G′ = sampleMovement(G0, State);
38:   return G′;
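
The random movements of Algorithm 1 (lines 27–38) reduce to sampling one of three candidate graphs with probability proportional to its score. A minimal Python sketch of that step, working with log scores for numerical stability; the DAG representation (a dict mapping each node to its parent set) and the helper names are our own, and cycle checking is omitted here for brevity:

import math, random

def add_edge(g, x, y):
    """Return a copy of DAG g (node -> parent set) with the edge x -> y added."""
    g2 = {node: set(parents) for node, parents in g.items()}
    g2[y].add(x)
    return g2

def sample_movement(candidates, log_scores):
    """Pick one candidate graph with probability proportional to its score."""
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    return random.choices(candidates, weights=weights, k=1)[0]

def perform_random_adding(g0, x, y, log_score):
    """Lines 27-32: no addition, add X -> Y, or add Y -> X."""
    candidates = [g0, add_edge(g0, x, y), add_edge(g0, y, x)]
    return sample_movement(candidates, [log_score(g) for g in candidates])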

A new stochastic search then starts from the DAGs found in the previous search and tries to add new edges without the restrictions defined by the skeleton of the previous level. In Algorithm 2 we give a pseudo-code description of this new method, labeled as SK-SS. As can be seen, the input of this algorithm is the set of visited DAGs of the previous level, VG, which is used to sample a new DAG in each iteration of the main loop (line 3) using the approximation of the posterior probability given by Eq. (8). As can also be seen, the edges which are already fixed will not be evaluated anymore (lines 7 and 8). In this sense, those areas of the search space previously explored will not be evaluated again in this new search. That is to say, at this level we remove some constraints but introduce others. We maintain the idea that a search should not be carried out in the whole DAG space at once.

This level returns again a set of visited DAGs, which is joined to the set of DAGs found in the previous level. This new set of DAGs is used to finally obtain an approximation of the posterior probability over the DAG space using Eq. (8).

Algorithm 2. Skeleton-based stochastic search (SK-SS)
1: INPUT: Set VG returned by Algorithm 1;
2: V′G = ∅;
3: for i = 1 to I do
4:   G0 = SAMPLE-DAG(VG);
5:   for all (X, Y) ∈ X do {Iterate over each pair of variables following a random order}
6:     {Only edges not present in G0 are evaluated}
7:     if {X → Y} ∈ G0 OR {Y → X} ∈ G0 then
8:       continue;
9:     endif
10:    G′ = performRandomAdding(G0, X, Y);
11:    V′G = V′G ∪ G′;
12:    G0 = G′;
13:  endfor
14: endfor
15: return VG ∪ V′G;

3.2. A new skeleton-based Markov Chain Monte Carlo (SK-MCMC)

Similarly to the above stochastic search approach, we run two different Markov chains. The first one, labeled as partialSK-MCMC, will be run over the constrained DAG space, GSK, and will perform standard local movements constrained by the skeleton SK. Its proposal distribution is defined as follows:

q_{SK}(G' \mid G) = \frac{1}{|Nb_{SK}(G)|}\, I(G' \in Nb_{SK}(G))    (10)

where NbSK(G) is the set of graphs G′ that are obtained from G by any edge deletion or reversal, and by any edge addition Xi → Xj such that Xi ↔ Xj ∈ SK (so SK will always be a super-structure of G′), and I(G′ ∈ NbSK(G)) indicates whether G′ belongs to NbSK(G). It is easy to see that the stationary distribution of this Markov chain is P(GSK|D) and that it converges to it, since this MC is aperiodic and irreducible.
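
A sketch of the constrained neighborhood NbSK(G) behind Eq. (10): deletions and reversals of existing edges, plus additions restricted to pairs present in the skeleton, keeping only acyclic results. The DAG is again represented as a dict mapping each node to its parent set; this representation and the helper names are our choice, not the paper's.

def has_path(g, src, dst):
    """True if dst is reachable from src following directed edges (parent -> child)."""
    children = {v: set() for v in g}
    for child, parents in g.items():
        for p in parents:
            children[p].add(child)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False

def constrained_neighborhood(g, skeleton):
    """Nb_SK(G): every deletion/reversal of an existing edge and every
    skeleton-compatible addition that keeps the graph acyclic."""
    neighbors = []
    edges = [(p, c) for c, parents in g.items() for p in parents]
    for p, c in edges:                       # deletions and reversals
        deleted = {n: set(ps) for n, ps in g.items()}
        deleted[c].discard(p)
        neighbors.append(deleted)
        if not has_path(deleted, p, c):      # reversing to c -> p stays acyclic
            reversed_g = {n: set(ps) for n, ps in deleted.items()}
            reversed_g[p].add(c)
            neighbors.append(reversed_g)
    for x, y in ((a, b) for a in g for b in g if a != b):   # additions
        if frozenset((x, y)) in skeleton and x not in g[y] and y not in g[x]:
            if not has_path(g, y, x):        # adding x -> y keeps acyclicity
                added = {n: set(ps) for n, ps in g.items()}
                added[y].add(x)
                neighbors.append(added)
    return neighbors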

The second Markov chain, labeled as SK-MCMC, will employ this approximation in combination with unconstrained local moves to define its proposal distribution. This proposal distribution is similar to the one used by the DP-MCMC method [20] previously described in Section 2.3. It is computed as follows:

q_{hybrid}(G' \mid G) = \begin{cases} q_{local}(G' \mid G) & \text{with prob. } \beta \\ q_{global}(G') & \text{with prob. } 1 - \beta \end{cases}    (11)

where β defines the proportion between local and global moves, qlocal is the classical local proposal distribution in the DAG space, and qglobal is the probability distribution used to propose the global moves. In this case, qglobal is a smooth approximation of the stationary probability distribution of the Markov chain partialSK-MCMC, P(GSK|D), and it is computed as follows:

q_{global}(G) = \prod_{i} \prod_{j>i} (p_{ij} + p_{ji})^{I_{X_i \rightarrow X_j \vee X_j \rightarrow X_i}(G)} \prod_{i,j} q_{ij}^{\, I_{X_i \rightarrow X_j}(G)}    (12)

where I is the indicator function, (pij + pji) ≤ 1 refers to the probability of the existence of an edge between Xi and Xj, and qij = pij/(pji + pij) is the probability that this edge is oriented from Xi to Xj. The posterior edge probabilities pij are computed using the previously built Markov chain partialSK-MCMC, pij = E_{P_SK(·|D)}(Xi → Xj|D) (see Eq. (2)), and they are also truncated to avoid edge marginals equal or too close to 0 or 1 (i.e. any pij < C is set to C, while any pij > 1 − C is set to 1 − C, with C = 1e−4). This truncation is introduced in order to guarantee the smoothness of qglobal (∀G ∈ G, qglobal(G) ≠ 0) and to increase the acceptance rate of the global moves (i.e. the fraction of proposed global movements which are accepted, see Eq. (5)) in this Markov chain. To see why, consider defining qglobal(G′) = PSK(G′|D) without any smoothing: if a global move G′ is proposed (in this case, it will always occur that G′ ∈ GSK) while the model G currently visited by the Markov chain satisfies G ∉ GSK and G ∉ Nb(G′) (i.e. G is not compatible with the skeleton SK, which is possible because the MC is running in an unconstrained DAG space, and G is not a neighbor of G′, which is also likely because G′ is sampled independently of G), then the acceptance probability α of this global movement is null (see Eq. (5)) because qhybrid(G|G′) = 0.
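
A sketch of one global proposal: for every unordered pair the edge is included with probability pij + pji and, if included, oriented from Xi to Xj with probability qij = pij/(pij + pji), using marginals truncated to [C, 1 − C]. It reuses the has_path helper from the previous sketch, and it simply skips any orientation that would close a directed cycle, which is a simplification of ours rather than a detail taken from the paper.

import random

def truncate(p, c=1e-4):
    return min(max(p, c), 1.0 - c)

def sample_global_proposal(variables, p, c=1e-4):
    """Draw a graph from q_global given directed edge marginals p[(i, j)] = p_ij."""
    g = {v: set() for v in variables}
    for a in range(len(variables)):
        for b in range(a + 1, len(variables)):
            x, y = variables[a], variables[b]
            p_xy = truncate(p.get((x, y), 0.0), c)
            p_yx = truncate(p.get((y, x), 0.0), c)
            if random.random() < min(p_xy + p_yx, 1.0):        # edge exists
                src, dst = (x, y) if random.random() < p_xy / (p_xy + p_yx) else (y, x)
                if not has_path(g, dst, src):                  # keep the graph acyclic
                    g[dst].add(src)
    return g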

It is easy to prove that if β > 0 this Markov chain is aperiodic and irreducible because the local proposal distribution, qlocal, has these two properties [19], while if β = 0, then the chain is also aperiodic and irreducible because qglobal is an independence sampler and ∀G ∈ G, qglobal(G) ≠ 0 (we point out that this would not happen if qglobal were not smoothed).

The idea behind this approach is similar to that presented in [20] (see Section 2.3). The introduction of global moves aims to help prevent the Markov chain from getting trapped in local maxima and to focus it on those parts of the model space which are more promising. The strong point of this strategy is that the first MC should be able to obtain an accurate and computationally cheap approximation of P(GSK|D), since it is run in a much smaller DAG space. So, if the inferred skeleton defines a constrained DAG space around which most of the high scoring DAGs are placed, these global moves will greatly improve the convergence and mixing rates of the second MC, as we will see in the experimental evaluation.

The analysis of the computational complexity of the proposed methods is not straightforward because they are stochastic methods. The computational complexity of the approaches used for learning the skeleton of the Bayesian network (see Section 2.2) mainly depends on the number of variables n. For example, the MM-PC algorithm has a polynomial computational complexity where the exponent of the polynomial depends on the maximum number of variables that can make a variable conditionally independent of another, if this is the case [30]. But in general they are known to be very efficient approaches for these problems because they have been extensively used to analyse high dimensional data sets [17]. The computational complexity of the stochastic phase, either the stochastic search or the Monte Carlo based approach, built upon the previously elicited skeleton mainly depends on the number of edges of this skeleton (the model space they try to explore is constrained by this skeleton). The key point is that this skeleton is very sparse, because the Markov boundary of a variable usually involves a small subset of variables, as has been extensively shown for many different domain problems [30,17]. Moreover, although it is not explored in this work, these methods could be further sped up by applying the recursive approach pursued in [14]. In that work, Xie et al. show how the DAG space can be partitioned when it is constrained by a given graph skeleton, leading to a recursive method for learning DAG models which strongly reduces the size of the problem.

4. Experimental evaluation

4.1. Experimental settings

We evaluate the different contributions on artificially generated data sets from 5 standard BNs whose main features are depicted in Table 1. We employ this kind of data set because this way we know the true model generating the data and we can measure the accuracy of the different approaches when recovering this model. For each of these networks, and by means of logic sampling, we randomly generate 10 data samples with the same size in order to get estimates of the error measures for a fixed sample size. Different sample sizes are considered: 100, 500, 1000, and 5000 samples (they are displayed on the X-axes of the figures). Each of these data samples is used to run the different proposals. In the figures of the experimental evaluation, averaged values of the different performance measures across the different BNs are reported.

The quality of a Bayesian solution is evaluated through the computation of the posterior probability of the edge and path features (i.e. f = 1 if there is a directed path between two nodes i and j). Following [20,31], we threshold these posterior features at different levels to trade off sensitivity and specificity. The resulting ROC curves are summarized in a single number, namely the area under the curve. They are labeled in the figures as "AUC-Edge" and "AUC-Path", respectively.
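
For reference, the AUC values can be obtained directly from the posterior edge (or path) probabilities without explicitly sweeping thresholds, via the rank-based (Mann–Whitney) formulation; this is a generic sketch, not the evaluation code used by the authors.

def auc(scores_and_labels):
    """Area under the ROC curve from (posterior probability, is_true_feature) pairs."""
    pos = [s for s, label in scores_and_labels if label]
    neg = [s for s, label in scores_and_labels if not label]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: posterior probabilities for four candidate edges, two of them true.
print(auc([(0.9, True), (0.7, False), (0.6, True), (0.1, False)]))   # 0.75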

Our contribution is also experimentally validated as a search method which looks for an optimal single BN model (the MAP inference problem), in order to show that our proposed methods are also able to retrieve quite accurate BN models. For this problem, we employ a combination of the precision and recall of the PDAG edges (an extension of DAGs where some edges are undirected if either of the directions encodes the same conditional independencies). More precisely, we consider the Euclidean distance from perfect precision/recall, which is labeled in the figures as "P/R distance": distance = √((1.0 − precision)² + (1.0 − recall)²).
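
The P/R distance is simply the Euclidean distance to the ideal point (precision, recall) = (1, 1), for example:

from math import hypot

def pr_distance(precision, recall):
    """Distance from perfect precision/recall; lower is better."""
    return hypot(1.0 - precision, 1.0 - recall)

print(pr_distance(0.8, 0.75))   # 0.3201...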

The settings of our approaches are as follows. Firstly, we use the IAMB method with the BDeu score and the OR-Scheme (see Section 2.2) to fix the skeleton, either for the stochastic search or for the MCMC methods. The stochastic search methods were run with a number of iterations I = 30 · ln(n) and J = 30, where n is the number of variables. The MCMC methods were run for a total number of iterations I = 2000 · ln(n); the first 10% of these iterations were discarded (burn-in phase), and then we collected a DAG every 5 iterations. We also fixed the parameter β = 0.1 of Eq. (11) following the indications of [20].

4.2. Evaluating the effect of constraining to a skeleton

In this first part of the evaluation we want to show that the solutions found by both stochastic methods over the model space constrained by a previously inferred skeleton are much better than the solutions found by these same methods over an unconstrained DAG space. We also aim to show that the constrained search can be improved if we apply a posterior unconstrained optimization.

In Fig. 1 we show the performance of the following methods: our skeleton-based approaches, labeled as SK-SS and SK-MCMC; these same approaches but only with the results obtained using the approximation in the constrained DAG space, labeled as partialSK-SS and partialSK-MCMC (i.e. without the subsequent search in the unconstrained DAG space); and the classic MCMC and stochastic search directly over the unconstrained DAG space, labeled as unconstr-MCMC and unconstr-SS respectively (i.e. obtained by applying the same algorithms as in partialSK-SS and partialSK-MCMC but without any constraints, e.g. assuming that the skeleton is a fully connected graph).

The main conclusions drawn from this analysis are the following:

• Methods partialSK-SS and partialSK-MCMC are able to obtain outperforming MAP models. As can be seen, they have a performance similar to the full versions, SK-SS and SK-MCMC, so a subsequent exploration of the unconstrained DAG space does not seem to be necessary if we only want to obtain a single MAP DAG model. On the other hand, the unconstrained versions unconstr-MCMC and unconstr-SS perform very poorly for this task and their performance grows very slowly when increasing the size of the learning data.

• Methods partialSK-SS and partialSK-MCMC are able to obtain more accurate probability estimates of edge and path features than unconstr-MCMC and unconstr-SS (which also exhibited a poor performance in this task). But this is not the case for very small sample sizes (100 data samples). This probably happens because at very small sample sizes the skeleton is very sparse and the search space is over-restricted. In fact, if we look at the recall (i.e. percentage of edges recovered from the true skeleton) of the inferred skeleton for 100 data samples with the IAMB method, we find that it is only 32%, while for 500 data samples it is higher than 50%.

• Methods SK-SS and SK-MCMC outperformed the others. This shows that running subsequent searches over the unconstrained DAG space (but starting from the previously found solutions) is worthwhile when looking for a Bayesian solution and helps to correct some of the deficiencies that may be introduced when the skeleton over-restricts the DAG space. As can be seen, the performance we get for small sample sizes improves strongly. We also point out that the stochastic search method, SK-SS, works better than the MCMC methods across different sample sizes, although both offer a similar performance for the biggest sample size.

Table 1. Bayesian networks employed in the experimental evaluation.

Name         N. nodes   Av. values   N. links   Av. parents   Max. parents   Av. conf.
Alarm        37         2.8          46         1.2           4              6.6
Boblo        23         2.7          24         1.0           2              5.1
Boerlage92   23         2.0          36         1.6           4              3.7
Hailfinder   56         4.0          66         1.2           4              19.4
Insurance    27         3.3          52         1.9           3              15.2

Fig. 1. Evaluation of skeleton-based methods for Bayesian structure learning of BNs.


We now analyze the impact that using a skeleton to constrain the DAG space has on the convergence and the mixing rates of the MCMC methods employed in this evaluation. To this aim, we took 500 data samples from the Alarm and the Hailfinder networks, and conducted 10 independent runs of the Markov chains on each one of them. In Fig. 2 we plot the log-score of the visited models along the different iterations. Black lines correspond to iterations of unconstr-MCMC; blue lines to iterations of partialSK-MCMC; and red lines to iterations of SK-MCMC. As can be seen, partialSK-MCMC mixes extremely quickly compared to unconstr-MCMC in both data samples. It also converges to the same probability levels in all the runs (except for one run in the Alarm data set), as opposed to unconstr-MCMC, which shows a very bad convergence behavior.

Fig. 2. Mixing and convergence of MCMC methods for Alarm and Hailfinder data samples. Black lines correspond to iterations of unconstr-MCMC; blue lines to iterations of partialSK-MCMC; and red lines to iterations of SK-MCMC. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Again, we would like to stress that these very significant improvements are obtained just by constraining the DAG space using the inferred skeleton. On the other hand, we can see that SK-MCMC also mixes very quickly, although not as quickly as partialSK-MCMC, but it converges to a set of models with a higher score. Moreover, we can see that the run on the Alarm data set where partialSK-MCMC fails to converge is corrected by SK-MCMC, since all runs reach similar probability levels. Finally, we want to point out that the very quick mixing of partialSK-MCMC can be used to speed up the overall execution of SK-MCMC, since the Markov chain of partialSK-MCMC used to compute the proposal distribution would only need to be run for a very small number of iterations.

Summarizing, we have shown that the application of standard stochastic methods, such as stochastic search or MCMC, to a constrained DAG space greatly boosts the performance of these methods. This is the same effect that was found with the MM-HC algorithm when looking for a single optimal BN model.

4.3. Benchmarking using state-of-the-art structure learning methods

We now benchmark our skeleton-based approaches against two state-of-the-art methods for Bayesian structure learning: the MCMC method over the space of variable orders [18] (labeled as Order-MCMC) and the dynamic programming MCMC method [20] (labeled as DP-MCMC); see Section 2.3 for details. We took publicly available implementations of both methods.¹ The DP-MCMC method was run using the same settings as SK-MCMC, while the Order-MCMC method was run using the following settings: K = 4 (maximum number of parents per variable), burn-in steps = 10,000, thinning = 100 (number of MC iterations before a variable order is collected), total number of steps = 20,000. This method was run with a higher number of iterations in order to guarantee the convergence of this Markov chain, as pointed out in [18].

Fig. 3(a)–(d) shows the results of comparing SK-SS, SK-MCMC and Order-MCMC using the AUC-Edge measure (AUC-Path was not used because it cannot be directly computed using Order-MCMC) for the Bayesian networks Alarm, Boblo, Boerlage92 and Insurance, respectively (see Table 1). The Hailfinder network was not employed because we were not able to obtain the results for the Order-MCMC method due to the excessive computational load associated with the size of this BN (see comments in the next paragraphs). We are not aware of any similar experiment using this BN. Fig. 3(e) displays the results of the same comparison averaging over the previous four Bayesian networks.

¹ http://www.cs.helsinki.fi/u/tzniinim/bns/; http://www.cs.ubc.ca/~murphyk/StructureLearning.

The comparison with the DP-MCMC method was not carried out using any of the previous BNs because, as commented in Section 2.3, this method needs to first run an exact Bayesian averaging method [31] whose memory space complexity is exponential in the number of nodes, and the available implementation crashes for any of the BNs depicted in Table 1.² For this reason, we took the Child network, which has 20 nodes and was the BN employed in the original experimental evaluation where DP-MCMC was presented [20]. The results of the comparison with this BN are displayed in Fig. 3(h) and (i). As a baseline method, we also included in this comparison the results obtained from the exact Bayesian averaging approach (labeled as DP-Exact) used by DP-MCMC to define its proposal distribution (see Section 2.3).

When benchmarking against Order-MCMC, we can see that our approaches perform poorly for small sample sizes in most of the networks. As we mentioned in the analysis above, this is due to the low quality of the inferred skeletons (they are very sparse and over-constrain the DAG space). However, for larger sample sizes, the performance of both methods (SK-SS and SK-MCMC) quickly improves and outperforms Order-MCMC. On average, our approaches consistently exhibit a better performance for medium to large sample sizes than Order-MCMC, while SK-SS shows the best performance for intermediate sample sizes. At the same time, our methods are much less computationally expensive than Order-MCMC, which requires computing a very high number of score values (see Eq. (7)): the number of possible parent sets whose size is equal to or lower than 4 (we point out that our approach is not constrained by this parameter). This implies that, in the case of the Alarm network, Order-MCMC requires computing more than two million score values (i.e. O(n^5)), whereas our methods, which feature a cache mechanism, only require computing approximately 6000 scores for SK-MCMC and 5000–50,000 (depending on the sample size) for SK-SS. In the case of Hailfinder, Order-MCMC would require computing more than 20 million score values, while our approaches will not require more than 12,000 scores in the worst case.

² Experiments were run on a computer with 8 GB RAM.


Fig. 3. Benchmarking using state-of-the-art structure learning methods.


When benchmarking against DP-MCMC in the Child network, we can see that it performs quite similarly to our approaches. For small sample sizes its performance is far from Order-MCMC and from DP-Exact (we recall that this is the exact optimal solution for this problem), but it quickly improves as the size of the learning data grows. For large sample sizes, DP-MCMC and SK-MCMC perform quite similarly and a bit better than SK-SS. However, we recall that the key difference between our approach, SK-MCMC, and DP-MCMC is that the latter defines its proposal distribution using an exact Bayesian averaging method (DP-Exact), which has an exponential cost in time and space, while our approach uses a much more efficient MCMC running over a constrained DAG space. This allows our approaches to scale to problems with a larger number of variables.

Finally, we also benchmark against the Max-Min Hill Climbing algorithm [12] (labeled as MM-HC), a state-of-the-art method for inferring a MAP Bayesian network model, using the experimental settings detailed in Section 4.1. We also took a publicly available and validated implementation in Java³ of this method with default settings (MM-PC with α = 0.05 and a Hill Climbing search method with a tabu list and 10 multiple restarts). In Fig. 3(f) we show the results of this last comparison for inferring a MAP Bayesian network model. As we can see, our two skeleton-based approaches clearly get a better P/R distance than MM-HC across the different sample sizes. This is another advantage provided by our methods, since they are able to retrieve either a high quality single DAG model or a high quality Bayesian approximation using the same algorithm.

³ http://kdl.cs.umass.edu/powerbayes/.

5. Conclusions and future work

In this work we have presented novel skeleton-based approaches for Bayesian structure learning of BNs. These approaches are supported by standard techniques to address this problem, such as stochastic search or Metropolis–Hastings methods. They start with the inference of a graph skeleton using a specific method. With this skeleton, they decompose the Bayesian approximation problem into two simpler problems. The first one consists in computing the Bayesian approximation over the space of DAGs constrained by this skeleton. This problem is simpler because the model space is much smaller. Then, the Bayesian solution is inferred over the unconstrained DAG space but using the previous approximation which, as shown in the experimental evaluation, makes it easier for the approximation techniques (stochastic search and the MCMC method) to get higher quality solutions. In consequence, we show that this idea, previously employed by the MM-HC algorithm to find high quality single DAG models [12], can also be extended to perform Bayesian structure learning.

Future research will focus on the application of this methodology for performing Bayesian structure learning on high dimensional problems (problems with several hundreds of variables). Other lines of research will also be followed to improve the performance of our skeleton-based methods. For example, the edge reversal move employed by [21] for improving MCMC methods in the DAG space can also be integrated into SK-MCMC. Following the ideas pursued in [14], we also plan to develop new approaches based on dividing the graph skeleton into smaller graphs involving a lower number of nodes, computing a Bayesian approximation for each of them, and then using all of these approximations to build a better proposal distribution. The integration of expert/domain knowledge in order to refine the inferred models will also be considered in future work.


Acknowledgements

This work has been jointly supported by the research programme Consolider Ingenio 2010, the Spanish Ministerio de Ciencia e Innovación and the Consejería de Innovación, Ciencia y Empresa de la Junta de Andalucía under projects CSD2007-00018, TIN2010-20900-C04-01, TIC-6016 and P08-TIC-03717, respectively. We also thank the reviewers for their constructive suggestions.

References

[1] J. Pearl, Probabilistic Reasoning with Intelligent Systems, Morgan & Kaufman, San Mateo, 1988.
[2] O. Pourret, B. Marcot, P. Naim, Bayesian Networks: A Practical Guide to Applications, Statistics in Practice, Wiley, Chichester, 2008.
[3] D. Heckerman, D. Geiger, D. Chickering, Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning 20 (3) (1995) 197–243.
[4] R.E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2004.
[5] P. Spirtes, C. Glymour, R. Scheines, Causation, Prediction and Search, Springer-Verlag, Berlin, 1993.
[6] G. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992) 309–347.
[7] J. Abellan, M. Gomez-Olmedo, S. Moral, Some variations on the PC algorithm, in: M. Studeny, J. Vomlel (Eds.), Proceedings of the Third European Workshop on Probabilistic Graphical Models, 2006, pp. 1–8.
[8] J. Gámez, J. Puerta, Constrained score + (local) search methods for learning Bayesian networks, in: L. Godo (Ed.), Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Lecture Notes in Computer Science, vol. 3571, Springer, Berlin/Heidelberg, 2005, p. 470.
[9] A. Cano, M. Gómez-Olmedo, S. Moral, A score based ranking of the edges for the PC algorithm, in: M. Studeny, J. Vomlel (Eds.), Proceedings of the Fourth European Workshop on Probabilistic Graphical Models, 2006, pp. 1–8.
[10] I. Tsamardinos, C.F. Aliferis, A.R. Statnikov, Algorithms for large scale Markov blanket discovery, in: I. Russell, S.M. Haller (Eds.), FLAIRS Conference, AAAI Press, 2003, pp. 376–381.
[11] S. van Dijk, L.C. van der Gaag, D. Thierens, A skeleton-based approach to learning Bayesian networks from data, in: N. Lavrac, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), PKDD, Lecture Notes in Computer Science, vol. 2838, Springer, 2003, pp. 132–143.
[12] I. Tsamardinos, L.E. Brown, C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm, Machine Learning 65 (1) (2006) 31–78.
[13] J.A. Gámez, J.L. Mateo, J.M. Puerta, Learning Bayesian networks by hill climbing: efficient methods based on progressive restriction of the neighborhood, Data Mining and Knowledge Discovery 22 (2011) 106–148.
[14] X. Xie, Z. Geng, A recursive method for structural learning of directed acyclic graphs, Journal of Machine Learning Research 9 (2008) 459–483.
[15] J.-P. Pellet, A. Elisseeff, Using Markov blankets for causal structure learning, Journal of Machine Learning Research 9 (2008) 1295–1342.
[16] E. Perrier, S. Imoto, S. Miyano, Finding optimal Bayesian network given a super-structure, Journal of Machine Learning Research 9 (2008) 2251–2286.
[17] C.F. Aliferis, A.R. Statnikov, I. Tsamardinos, S. Mani, X.D. Koutsoukos, Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: Algorithms and empirical evaluation, Journal of Machine Learning Research 11 (2010) 171–234.
[18] N. Friedman, D. Koller, Being Bayesian about Bayesian network structure: a Bayesian approach to structure discovery in Bayesian networks, Machine Learning 50 (1–2) (2003) 95–125.
[19] D. Madigan, J. York, Bayesian graphical models for discrete data, International Statistical Review 63 (1995) 215–332.
[20] D. Eaton, K. Murphy, Bayesian structure learning using dynamic programming and MCMC, in: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI'07), 2007, pp. 1–8.
[21] M. Grzegorczyk, D. Husmeier, Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move, Machine Learning 71 (2–3) (2008) 265–305.
[22] T. Niinimaki, P. Parviainen, M. Koivisto, Partial order MCMC for structure discovery in Bayesian networks, in: Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), AUAI Press, Corvallis, Oregon, 2011, pp. 557–564.
[23] G. Scott, C.M. Carvalho, Feature-inclusion stochastic search for Gaussian graphical models, Journal of Computational and Graphical Statistics 17 (2008) 790–808.
[24] C. Hans, A. Dobra, M. West, Shotgun stochastic search for "large p" regression, Journal of the American Statistical Association 102 (478) (2007) 507–516.
[25] B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, M. West, Experiments in stochastic computation for high dimensional graphical models, Statistical Science 20 (2004) 388–400.
[26] J.G. Scott, J.O. Berger, An exploration of aspects of Bayesian multiple testing, Journal of Statistical Planning and Inference 136 (7) (2006) 2144–2162, http://dx.doi.org/10.1016/j.jspi.2005.08.031.
[27] C.M. Carvalho, J.G. Scott, Objective Bayesian model selection in Gaussian graphical models, Biometrika 96 (3) (2009) 497–512.
[28] J.O. Berger, J. Bernardo, D. Sun, Reference priors for discrete parameter spaces, Tech. Rep., Universidad de Valencia, Spain, 2009.
[29] A. Masegosa, S. Moral, A Bayesian stochastic search method for discovering Markov boundaries, Knowledge-Based Systems 35 (2012) 211–223.
[30] I. Tsamardinos, C.F. Aliferis, A.R. Statnikov, Time and sample efficient discovery of Markov blankets and direct causal relations, in: L. Getoor, T.E. Senator, P. Domingos, C. Faloutsos (Eds.), KDD, ACM, 2003, pp. 673–678.
[31] M. Koivisto, K. Sood, Exact Bayesian structure discovery in Bayesian networks, Journal of Machine Learning Research 5 (2004) 549–573.