
International Journal of Approximate Reasoning 54 (2013) 1168–1181

Contents lists available at SciVerse ScienceDirect

International Journal of Approximate Reasoning

journal homepage: www.elsevier.com/locate/ijar

An interactive approach for Bayesian network learning using domain/expert knowledge

Andrés R. Masegosa ∗, Serafín Moral

Department of Computer Science and Artificial Intelligence, University of Granada, Spain

ARTICLE INFO

Article history:
Received 13 February 2012
Revised 13 March 2013
Accepted 18 March 2013
Available online 8 April 2013

Keywords:
Probabilistic graphical models
Bayesian networks
Interactive structure learning
Domain expert knowledge
Stochastic search

ABSTRACT

Using domain/expert knowledge when learning Bayesian networks from data has been considered a promising idea since the very beginning of the field. However, in most of the previously proposed approaches, human experts do not play an active role in the learning process. Once their knowledge is elicited, they do not participate any more. The interactive approach for integrating domain/expert knowledge we propose in this work aims to be more efficient and effective. In contrast to previous approaches, our method performs an active interaction with the expert in order to guide the search-based learning process. This method relies on identifying the edges of the graph structure which are most unreliable considering the information present in the learning data. Another contribution of our approach is the integration of domain/expert knowledge at different stages of the learning process of a Bayesian network: while learning the skeleton and when directing the edges of the directed acyclic graph structure.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Bayesian networks (BN) [26] are a state-of-the-art model for reasoning under uncertainty in the machine learning field. They are especially useful in real-world problems composed of many different variables with a complex dependency structure. Examples of areas where these models have been successfully applied include genomics, text classification, automatic robot control, fault diagnosis, etc. (see [28] for a good review of practical applications).

Every Bayesian network has a qualitative part and a quantitative part. The qualitative part (i.e., the structure of the BN) consists of a directed acyclic graph (DAG) where the nodes correspond to the variables of the problem domain and the edges between two variables correspond to direct probabilistic dependencies. The quantitative part consists of the specification of the conditional probability distributions that are stored in the nodes of the network.

One of the main challenges in this research field is the problem of learning the structure (the qualitative part) of a Bayesian network from a previously given set of observational data. This problem has been the subject of a great deal of research [13,24,32]. In many of these approaches, humans only participate in the definition of the problem, and the structural learning is carried out automatically, without human intervention, as usually happens with most machine learning models [11]. However, Bayesian networks provide a graphical representation of the dependencies among the variables that can be easily interpreted by humans [26]. This key property opened the possibility of human intervention during the learning process. In fact, from the very beginning of the field [13,6] until recent years [10,3,2], many approaches have been proposed to introduce domain or expert (d/e) knowledge to boost the reliability of automatic learning methods. This is also becoming an emerging trend in other relevant fields, such as gene expression data mining [3], where there is a growing interest in exploiting the large amount of domain knowledge available in the literature, and especially in knowledge repositories such as KEGG [25] or MIPS [22].

∗ Corresponding author. E-mail addresses: [email protected] (A.R. Masegosa), [email protected] (S. Moral).
0888-613X/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ijar.2013.03.009

However, the methodologies for introducing d/e knowledge that have been proposed so far demand or request this knowledge a priori. In some cases, this knowledge is provided pursuing a Bayesian approach with the definition of informative prior distributions over the model space [6,13,15]. For example, in a recent work [2], the authors employ stochastic logic programs to define these priors. In other cases the knowledge is given by defining fixed restrictions on the model space, by explicitly enumerating the existence/absence of some edges between the variables [10] or, as proposed in [5], by defining directed path constraints which encode some previously known causal relationships between variables of the domain problem. This kind of knowledge is extracted from experimental data or from some previously built ontology [4]. Other works [29] also explore the problem of unreliable d/e knowledge and propose methods which combine (unreliable) knowledge from independent sources as an effective way to improve the overall quality of the elicited information. But in all of these works the role of the expert in the learning process ends once s/he has provided the required knowledge.

The problem we find in these approaches is that human experts do not receive any support from the learning system when they introduce their d/e knowledge. As mentioned before, this knowledge is usually introduced by giving information about particular edges. But the number of possible edges in a domain problem with many variables is very large, and the elicitation of this knowledge can be quite costly [3] (for any elicited edge, a direct dependency relationship between two variables needs to be asserted). Furthermore, the presence of a particular link is not an isolated event that can be asserted separately from the rest of the graph structure. The simplest restriction is that directed cycles are not allowed, but it could also happen that the existence of a link between two variables depends on the absence of an alternative path joining them. So, context information (what is already known about the graph structure) can be useful when introducing d/e knowledge. What we show in this paper is that many edges can be reliably inferred by simply analyzing the learning data, so we do not need to elicit any prior d/e knowledge about them, while other edges remain very uncertain using only the information present in the data sample and introduce a lot of noise in the inferred DAG structure. In this work we argue that the efforts of experts should be focused on these conflicting edges in order to boost the quality of the learnt Bayesian networks, and that the certain information already extracted from the data can be useful in this interactive phase.

Following similar ideas, the so-called active learning approaches use experimental data as a complementary source of information to the given observational data [23,33,21]. These methods assume that some variables in the domain problem can be intervened on (i.e., their value can be fixed to a predetermined value) when collecting data samples. Hence, the collected data are experimental, not observational. These works propose alternative strategies to decide how to perform these interventions and how many experimental samples must be collected. They show that the need for experimental data can be minimized by first analyzing the available observational data, and conclude that using experimental data is only worthwhile if it contains information that is not already present in the observational data. For example, [23] proposed a decision-theoretic approach for deciding which interventions should be performed. This approach translates to selecting, in each step, the intervention which most reduces the conditional entropy of the posterior over the graph structures given the experimental data. They apply an online Markov chain Monte Carlo (MCMC) method to estimate the posterior over the alternative graph structures, and use importance sampling to find the best action to perform in each step.

Our work is along the lines of these last mentioned works. We propose an interactive methodology to identify which edges of the DAG model cannot be reliably inferred from the information present in the given observational data. We then assume that this information can be obtained from an expert (who might not be fully reliable) and integrated into our data learning process. Under this methodology, there is a close, direct interaction between the human and the learning system: the human answers questions submitted by the system, and the system performs the structure learning guided by the information provided by the expert. As mentioned before, one of the main advantages of this interactive procedure is that the system only requests information about those edges whose presence in the inferred model cannot be discerned from the information present in the data. Therefore, this procedure reduces the amount of d/e knowledge that must be requested.

This paper also shows that the integration of d/e knowledge can be carried out at different levels of the model space. In the first level, a skeleton (i.e., an undirected graph which may contain cycles) is learnt with the help of d/e knowledge; then, with the constraint of this initial skeleton, a BN model is inferred, using d/e knowledge as well. We will show that the integration of d/e knowledge at every level boosts the quality of the learnt BN w.r.t. the model inferred using only the information present in the data or integrating d/e knowledge at only one of the levels. In this way, we extend the ideas previously presented in [7,8] for this problem. In particular, we remove the restriction imposed by the previously presented methods, where the BN learning process has to be carried out assuming that a total order of the variables is given in advance. This change is quite relevant, since now the model space is much larger (i.e., from a space of exponential size to one of super-exponential size) and the learning problem is much more challenging. In addition, the methodology to integrate d/e knowledge is also extended to refine the initial skeleton structure that is inferred to constrain the BN model space. The presented methodology is only developed for multinomial Bayesian networks, but it can be extended to deal with continuous Gaussian variables.

The paper is structured as follows. Section 2 reviews the previous knowledge. Our approach is presented in Section 3 and experimentally validated in Section 4. Finally, Section 5 includes the main conclusions and future work.


2. Previous knowledge

2.1. The Bayesian score of a BN

Let us assume we are given a vector of $n$ random variables $\mathbf{X} = (X_1, \ldots, X_n)$, each taking values in some finite domain $Val(X_i)$. A BN is defined by a directed acyclic graph, denoted by $G$, which represents the dependency structure among the variables in the BN. More precisely, this graph $G$ is represented by means of a vector $G = (\Pi_1, \ldots, \Pi_n)$, where $\Pi_i \subset \mathbf{X}$ is the parent set of variable $X_i$ (those variables with an edge pointing to $X_i$). The definition of a BN model is completed with the set, denoted by $\Theta$, of the conditional probability distributions of each variable $X_i$ given its parents $\Pi_i$.

Let us also assume we are given a fully observed multinomial data set $D$. To compute the marginal likelihood of the data given the graph structure, $P(D|G) = \int P(D|G, \Theta)P(\Theta|G)\,d\Theta$, the most common settings, as provided in [13], assume a prior Dirichlet distribution for each parameter defining the different conditional probability tables. They also assume a set of parameter independence assumptions in order to factorize the joint probabilities and make the computation of the multidimensional integral feasible. In that way, the marginal likelihood of the data given a graph structure has the following well-known closed-form expression:

$$P(D|G) = \prod_{i=1}^{n} \prod_{j=1}^{|\Pi_i|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|X_i|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})} \qquad (1)$$

where $N_{ijk}$ is the number of data instances in $D$ consistent with the $j$th joint assignment of the variables in $\Pi_i$ and with the $k$th value of $X_i$; $N_{ij} = \sum_k N_{ijk}$; $|X_i|$ is the number of values of $X_i$; $|\Pi_i| = \prod_{X_j \in \Pi_i} |X_j|$ is the total number of joint assignments of the variables in $\Pi_i$; $(\alpha_{ij1}, \ldots, \alpha_{ij|X_i|})$ is the set of parameters of the prior Dirichlet distribution of the probabilities with which $X_i$ takes its values given that $\Pi_i$ takes the $j$th assignment; and $\alpha_{ij} = \sum_k \alpha_{ijk}$. In the case of the Bayesian Dirichlet equivalent metric, or BDeu metric, these $\alpha_{ijk}$ are set to $\alpha_{ijk} = \frac{S}{|\Pi_i||X_i|}$, where $S$ is the so-called equivalent sample size [13].
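Since Eq. (1) factorizes over variables and parent configurations, the BDeu contribution of a single variable can be computed from its count table alone. The following sketch is our own illustration (the function name and count-table layout are not from the paper):

```python
from math import lgamma

def bdeu_local_score(counts, ess=1.0):
    """Log BDeu marginal likelihood for one variable (inner factors of Eq. (1)).
    counts[j][k] = N_ijk: instances with the j-th parent assignment and k-th value."""
    q = len(counts)          # |Pi_i|: number of parent configurations (1 if no parents)
    r = len(counts[0])       # |X_i|: number of values of the variable
    a_jk = ess / (q * r)     # alpha_ijk = S / (|Pi_i| |X_i|)
    a_j = ess / q            # alpha_ij  = sum_k alpha_ijk
    score = 0.0
    for row in counts:
        n_j = sum(row)
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in row:
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score

# A parent that perfectly predicts the child scores higher than no parent:
with_parent = bdeu_local_score([[50, 0], [0, 50]])   # counts split by parent value
no_parent = bdeu_local_score([[50, 50]])             # single (empty) configuration
assert with_parent > no_parent
```

Working in log space avoids overflowing the Gamma functions for realistic sample sizes.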

Finally, we employ a non-uniform prior distribution over the graph structures, $P(G) = \prod_i P(\Pi_i) \propto \prod_i \binom{n-1}{|\Pi_i|}^{-1}$, in order to account for the problem of “multiplicity correction” [31]. So, the Bayesian score of a graph structure $G$ is computed as the product of this prior and the marginal likelihood, $score(G|D) = P(D|G)P(G)$, while the posterior probability is computed by normalizing this score as follows:

$$P(G|D) = \frac{score(G|D)}{\sum_{G' \in \mathcal{G}} score(G'|D)} \qquad (2)$$

where $\mathcal{G}$ denotes the space of all possible DAG structures over the variables in $\mathbf{X}$.

The score of a graph can be decomposed as a product of factors, one for each variable:

$$score(G|D) = \prod_{X_i \in \mathbf{X}} score(G, X_i|D)$$

where

$$score(G, X_i|D) = \binom{n-1}{|\Pi_i|}^{-1} \prod_{j=1}^{|\Pi_i|} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{|X_i|} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}. \qquad (3)$$

The value $score(G, X_i|D)$ only depends on the data, variable $X_i$, and its set of parents $\Pi_i$ in $G$, and therefore will be denoted as $score(X_i, \Pi_i|D)$. If two graphs, $G$ and $G'$, are equal except for the set of parents of one variable $X_i$, which is $\Pi_i$ in $G$ and $\Pi'_i$ in $G'$, then the ratio of their probabilities can be easily computed as

$$\frac{P(G|D)}{P(G'|D)} = \frac{score(X_i, \Pi_i|D)}{score(X_i, \Pi'_i|D)} \qquad (4)$$
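Eq. (4) is what makes local search moves cheap: when two DAGs differ only in the parent set of one variable, their global (log-)score ratio reduces to the ratio of that variable's local scores. A small sketch with made-up local log-scores (the numbers and variable names are purely illustrative):

```python
# Hypothetical table of local log-scores, log score(X_i, Pi_i | D), for three variables.
local = {
    ("A", ()): -10.0,
    ("B", ("A",)): -7.5,
    ("B", ()): -9.0,
    ("C", ("A",)): -6.0,
}

def log_score(graph):
    """Global log-score as the sum of local factors (log of the product in Eq. (3))."""
    return sum(local[(v, parents)] for v, parents in graph.items())

g1 = {"A": (), "B": ("A",), "C": ("A",)}  # contains the edge A -> B
g2 = {"A": (), "B": (), "C": ("A",)}      # identical except for Pi_B

# Eq. (4) in log form: the global difference equals the local difference for B.
assert log_score(g1) - log_score(g2) == local[("B", ("A",))] - local[("B", ())]
```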

2.2. Automatic learning of Bayesian networks

Most of the previously proposed methods for the automatic learning of BNs retrieve one single structure after the learning process. For the induction of this structure, most of these learning methods fall into one of two families: constraint-based (CB) methods, which carry out several independence tests on the learning data set to find a Bayesian network in agreement with the test results [32]; and score-based methods, which employ a specialized search method in the DAG space that tries to recover the structure with the highest Bayesian score.


One of the most successful methods, the so-called MaxMin Hill-Climbing (MM-HC) algorithm [34], combines these two basic procedures. This method employs a CB approach to elicit a skeleton of the BN (i.e., a graph with undirected edges). This skeleton is then used to restrict or constrain the DAG space of a hill-climbing search procedure which looks for a BN model maximizing a Bayesian score. The advantages of this approach are that, firstly, the search for the structure of the BN is less prone to errors and far more efficient than non-restricted score-and-search approaches, especially if this skeleton is not very dense. And, secondly, the elicitation of the skeleton can be approached using methods that locally search the part of the skeleton around each single variable, which makes this problem much more efficient and scalable for high-dimensional data sets [1].

There is another family of BN learning methods based on the claim that the selection of a single model may give rise to unwarranted inferences about the structure if the Bayesian scores give support to several DAG models (i.e., they explain the data similarly well). Therefore, the employment of a full Bayesian solution is desirable. Now, the goal is to compute the posterior probability of some structural feature $f$ (e.g. the presence of a directed edge between two variables $X$ and $Y$) as an expected posterior mean:

$$P(f|D) = \sum_{G \in \mathcal{G}} f(G)P(G|D) \qquad (5)$$

where $f(G) = 1$ if the structural feature holds in $G$ (e.g. $G$ contains the edge) and $f(G) = 0$ otherwise.

In this case, the underlying difficulty in obtaining a Bayesian solution is mainly caused by the super-exponential size of the DAG space $\mathcal{G}$. Several Markov chain Monte Carlo (MCMC) approaches have been proposed in the literature to overcome this issue [12,18]. However, when the stationary distribution is too complex, it is really hard for MCMC methods to achieve convergence in a limited number of iterations [30,17]. In order to approach this problem, other methods based on stochastic search have been proposed [30,17]. In contrast to MCMC, where the goal is to converge to a stationary distribution, these methods simply list and score a collection, $\mathbf{G} \subset \mathcal{G}$, of high-scoring models. As shown in [20] by the authors of this work, good approximations of the aforementioned posterior probabilities of structural features, $P(f|D)$, can be obtained using this set $\mathbf{G}$ if it contains all the models with a non-negligible Bayesian score.
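Given such a collection of high-scoring models, the posterior of a structural feature can be approximated by restricting the sums in Eqs. (2) and (5) to the collection. A minimal sketch (function names and graph representation are our own; subtracting the maximum log-score keeps the normalization numerically stable):

```python
from math import exp

def feature_posterior(scored_graphs, feature):
    """Approximate P(f|D) = sum_G f(G) P(G|D) over a set of scored DAGs.
    scored_graphs: list of (graph, log_score); feature: graph -> bool."""
    m = max(ls for _, ls in scored_graphs)             # subtract max for stability
    weights = [exp(ls - m) for _, ls in scored_graphs]
    z = sum(weights)
    num = sum(w for (g, _), w in zip(scored_graphs, weights) if feature(g))
    return num / z

# Two candidate structures, the better-scoring one containing the edge A -> B:
models = [({("A", "B")}, -100.0), (set(), -102.0)]
p_edge = feature_posterior(models, lambda g: ("A", "B") in g)
assert 0.88 < p_edge < 0.89   # e^0 / (e^0 + e^-2)
```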

3. Interactive learning of Bayesian networks

This method is along the lines of a previously proposed approach [8] to integrate d/e knowledge while learning BNs which are consistent with a previously given causal order of the variables. In this work we go further and omit this strong requirement (in many domain problems, it is not clear how this causal order can be elicited from an expert), which makes the problem much more difficult. For this reason, this method approaches the interactive learning of BNs by decomposing the learning process into different levels. In a first level, the learning process is focused on inferring a skeleton of the BN (i.e., an undirected graph which may contain cycles). More precisely, the skeleton is built by combining the Markov boundaries of each variable, which are independently induced from the data using d/e knowledge. We point out that, in this case, the queries for the expert are directly related to the structure of the skeleton, not the edges of the DAG. In a second level, we also use d/e knowledge to try to learn DAGs constrained by the previously inferred skeleton. That is to say, we assume the expert is able to answer whether two variables should be connected by a directed edge. This does not always imply that the expert has to elicit causal relationships, since the direction of an edge can be determined in many cases by the conditional independencies encoded in the alternative graphs. In a third level, we try to learn DAGs with the help of d/e knowledge, but in this case the DAGs are not constrained by any skeleton. This additional learning step, which starts from the approximation obtained in the second level, tries to further improve the overall result. The idea is that, when learning, decisions made at a given point of the process depend on the available knowledge. So, while starting an unconstrained search from the beginning can be a bad idea considering the huge size of the search space, it can be useful to extend a given solution constrained by the skeleton, since now we have more knowledge about the presence and orientation of the edges, and this can help to override previous decisions about the skeleton. This method will be referred to as multi-level stochastic search (ML-SS). A graphic description of this approach is depicted in Fig. 1. In [20] we give empirical arguments in favor of this multi-level approach as a robust method for approximating the posterior over the graphs, $P(G|D)$, and over the structural features, $P(f|D)$ (see Section 2.2).

In the next subsection, we first detail our method for the interactive integration of d/e knowledge, which is valid for integrating knowledge either when learning the Markov boundary of a given variable or when learning the DAG structure of a BN. In Section 3.2, we then detail how this methodology is applied to learn BNs from data using the ML-SS method.

3.1. Interactive integration of domain/expert knowledge

The setup of the problem is as follows. We are given a data set $D$ and a family of statistical models, denoted by $\mathcal{M}$. We also denote by $M$ a single statistical model. Moreover, each model is defined by a different vector of components, $M = (m_1, \ldots, m_T)$, where $T$ is the number of possible components. This framework is flexible enough to embrace different learning problems. For example, when inferring the Markov boundary (MB) of a target variable $X$, $T$ is the number of candidate variables ($T = n - 1$), and $m_k = 1$ if the $k$th variable is in the true Markov boundary of $X$, and $m_k = 0$ otherwise.


Fig. 1. Multi-level stochastic search with expert interaction.

Fig. 2. Influence diagram modelling the interaction process.

If we are learning BNs, then $T$ is the total number of possible edges, and $m_k$ corresponds to the $k$th edge and takes values in $\{0, -1, 1\}$ depending on whether the edge is absent or oriented in one of the two possible directions, respectively.

The methodology proposed in this work requires interacting with an expert to infer more accurate statistical models from the learning data. We model this interaction process as a decision making problem (Fig. 2 shows a graphical description using an influence diagram [16]) composed of the following elements:

• $M$ refers to a random variable whose state space is the family of statistical models used for learning.
• $D$ denotes the observed data set. It depends on the variable $M$, under the assumption that the available learning data set has been generated by one of these models. The incoming and outgoing arcs are plotted as dashed lines to denote that the decision making problem is only solved for the particular data set given. $D$ is thus a variable which is always observed in this decision problem.
• $m_k^e$ refers to the information provided by the expert about the $k$th component, $m_k$, of a model $M$. This variable is only observed if the expert is asked. This condition is expressed by the incoming arc from the decision variable $Ask_k$. Its conditional distribution also depends on the model which generates the data and on the variable $R_k$.
• $R_k$ models whether the expert is reliable or not when providing information about component $m_k$. If the expert is reliable, the conditional probability of $m_k^e$ given $M$ is a deterministic function: $P(m_k^e = i | R_k = yes, M) = 1$ if $m_k = i$. If the expert is not reliable, then we assume the expert's answer is distributed uniformly over the wrong values of $m_k$ (e.g. if the model contains a link between $X$ and $Y$, and the expert is wrong, then we assume her/his answer will be either that there is a link from $Y$ to $X$ or that there is no link, with equal probability). We assume that the reliability of the expert is independent of the model and specific for each component of the model (i.e., in the case of BNs, there are probabilistic relationships that might be much harder to elicit than others, and the reliability of the expert might vary).
• $Ask_k$ is a decision variable with two states: ‘ask’ or ‘do not ask’ the expert about her/his belief on the model component $m_k$ which s/he thinks is generating the learning data. This decision only depends on the data sample and, as mentioned above, it determines whether the variable $m_k^e$ is observed or not (i.e., it is only observed if we decide to ask the expert).
• $U_k$ refers to the utility associated with the decision $Ask_k$, which also depends on the model $M$, on the information provided by the expert $m_k^e$ and, implicitly, on the data $D$.
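The answer model for $m_k^e$ described above can be simulated directly: a reliable expert returns the true component value, and an unreliable one picks uniformly among the wrong values. A sketch of this (the function is our own illustration, not part of the paper):

```python
import random

def expert_answer(true_value, values, tau, rng=random):
    """Sample m_k^e given the true m_k, its value set, and tau = P(R_k = reliable).
    values: e.g. (0, -1, 1) for a BN edge component."""
    if rng.random() < tau:                    # expert is reliable: report the truth
        return true_value
    wrong = [v for v in values if v != true_value]
    return rng.choice(wrong)                  # unreliable: uniform over wrong values

# With tau = 1 the answer is always correct; with tau = 0 it is always wrong.
assert expert_answer(1, (0, -1, 1), 1.0) == 1
assert expert_answer(1, (0, -1, 1), 0.0) in (0, -1)
```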

Since our primary goal is recovering the true model structure, a natural choice for the utility function is the logarithm of the posterior, $U_k(Ask_k, m_k^e, M, D) = \ln P(M|m_k^e, D)$ (i.e., the larger this value, the stronger our belief in $M$), minus the cost of asking the expert about component $m_k$: $C_{A_k}$. In this case, the value or expected utility associated with the decision of asking the expert is computed as follows:

$$V(Ask_k) = \sum_{m_k^e} \sum_{M} P(M|D)P(m_k^e|M, D) \ln P(M|m_k^e, D) - C_{A_k} \qquad (6)$$

Similarly, the value of not asking about $m_k$ is computed as follows:

$$V(\overline{Ask_k}) = \sum_{M} P(M|D) \ln P(M|D) \qquad (7)$$

In this case, the posterior $P(M|D)$ does not depend on $m_k^e$ because this information will not be available (i.e., this can be modelled in the influence diagram by fixing the variable $m_k^e$ to an artificially added “not-observed” state when $Ask_k = No$), and there is no additional cost because no action is performed.

Our strategy is then to find the asking decision with the maximum value or expected utility, and to submit this query to the expert if the value of asking is higher than the value of not asking, $V(Ask_k) > V(\overline{Ask_k})$. Because the value of not asking is the same for all the asking decisions, the above strategy reduces to finding the decision, $Ask_k$, with the highest difference between both actions, $V(Ask_k) - V(\overline{Ask_k})$, and this difference turns out to be equal to the information gain between $M$ and $m_k^e$ minus the cost of asking:

$$V(Ask_k) - V(\overline{Ask_k}) = IG(M : m_k^e|D) - C_{A_k} = H(M|D) - \sum_{m_k^e} P(m_k^e|D)H(M|m_k^e, D) - C_{A_k} \qquad (8)$$

The computation of the information gain for component $m_k^e$ can be simplified to computing its entropy plus some constant terms, due to the following equality:

$$IG(M : m_k^e|D) = H(m_k^e|D) - H(R_k) - (1 - \tau_k)\ln(|m_k| - 1) \qquad (9)$$

where $|m_k|$ is the number of values of a component (e.g. in the case of BNs, a component has 3 different values) and $\tau_k$ is the probability that the expert is reliable when giving information about $m_k$, $P(R_k = reliable) = \tau_k$.

The computation of $H(m_k^e|D)$ requires estimating the posterior probability of the answer provided by the expert, $P(m_k^e = i|D)$, which is computed by marginalizing out $R_k$: $P(m_k^e = i|D) = \tau_k P(m_k = i|D) + (1 - \tau_k)P(m_k \neq i|D)/(|m_k| - 1)$. The posterior of a component, $P(m_k = i|D)$, is computed by summing up the posteriors of the models $M$ that contain this component, as previously mentioned in Eq. (5):

$$P(m_k = i|D) = \sum_{M} I_{[m_k = i]}(M)P(M|D) \qquad (10)$$

where $I_{[m_k = i]}(M)$ is the indicator function for component $m_k$ in model $M$. This posterior probability indicates how many hypotheses support the presence/absence of these components. In that way, when the entropy of this posterior is high, we are in a situation where a similar number of hypotheses or models support either the presence or the absence of this component in the model that generates the data.
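Combining Eqs. (9) and (10), the information gain of querying one component can be computed from its posterior $P(m_k = i|D)$ and the reliability $\tau_k$ alone, since $H(m_k^e|M, D)$ is a constant. A sketch under these assumptions (the function names are ours, and wrong answers are taken uniform over the remaining values, as in the expert model above):

```python
from math import log

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

def info_gain(post, tau):
    """IG(M : m_k^e | D) from post[i] = P(m_k = i | D) and tau = P(R_k = reliable)."""
    r = len(post)
    # Marginalize R_k out: P(m_k^e = i | D)
    pe = [tau * post[i] + (1 - tau) * (1 - post[i]) / (r - 1) for i in range(r)]
    # H(m_k^e | M, D) is the constant H(R_k) + (1 - tau) ln(|m_k| - 1)
    return entropy(pe) - entropy([tau] + [(1 - tau) / (r - 1)] * (r - 1))

# A component the data already settles yields no gain; an undecided one yields the most:
assert abs(info_gain([1.0, 0.0, 0.0], 0.9)) < 1e-9
assert info_gain([1/3, 1/3, 1/3], 0.9) > info_gain([0.9, 0.05, 0.05], 0.9)
```

This matches the intuition stated above: queries are only worth submitting for components whose posterior is close to uniform.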

So, according to Eq. (8), we finally select the component about which we are going to ask the expert, denoted by $m_{max}^e$, as the component with the highest information gain minus the cost of the query. If this difference is positive, $IG(M : m_{max}^e|D) > C_{A_{max}}$ (i.e., the expected utility of asking is higher than the expected utility of not asking), we submit the query to the expert.

If we decide to submit the query to the expert, the answer is gathered and this new evidence is integrated in the inference

process by updating the posterior P(M|D). We use E to denote the set of answers given by the expert so far. The posterior

update is made using Bayes rule and assuming that the probability of the data is independent of the expert knowledgewhen

the model is given (this assumption is encoded in Fig. 2):

P(M|E, D) = P(D|M) P(M|E) / Σ_{M′ ∈ M} P(D|M′) P(M′|E)    (11)

As can be seen in the above equation, the integration of expert knowledge is achieved by updating the prior probability over the model space, P(M|E), with the information provided by the expert. This updating is made as follows:

P(M|E) ∝ P(M) P(E|M) = P(M) Π_{m_k^e ∈ E} P(m_k^e | M)    (12)


Fig. 3. Interactive integration of D/E knowledge.

where P(M) is the initial prior and P(E|M) is the likelihood of the expert answers given that the real model generating the data is M. These answers are assumed to be independent given the model M. The probability of an expert's answer is computed by marginalizing out the expert reliability variable, R_k: P(m_k^e | M) = Σ_{R_k} P(m_k^e | M, R_k) P(R_k).

After the above description, we now jointly detail the different steps of our proposed methodology for interacting with an expert. To simplify the exposition, we assume a constant cost of answering for any component m_k, which will be denoted by λ (λ = CA_k, ∀k). A flowchart of this process is shown in Fig. 3. Details about the different steps and conditions are given in the following paragraphs:

Step 0: A priori, there is no information given by the expert: E = ∅. We have to fix a minimum information gain threshold, λ, to submit a query (i.e., λ is related to the cost associated with submitting a query). Then, the method starts at Step 1.

Step 1: Compute an approximation of P(M|D, E) using any Monte Carlo or stochastic search method (see Section 2.2) to obtain a set of models containing all the models with a non-negligible conditional probability. When E is not empty, the approximation is computed as usual but employing the posterior conditioned on this information, P(M|E), as shown in Eqs. (11) and (12). Next, go to Step 2.

Step 2: Using the current estimation of P(M|D, E), compute the component m_max^e with the highest information gain with respect to M. Then go to Condition 1.

Condition 1: If the information gain of the previously selected m_max^e is lower than the information gain threshold, IG(m_max^e) < λ, go to Condition 2. Otherwise, go to Step 3.

Step 3: Ask the expert about the value of component m_max^e. Using Eqs. (11) and (12), update the posterior probability with the new expert knowledge: update E to E ∪ E(m_max^e) (where E is the previous set of submitted answers and E(m_max^e) is the expert's last answer about m_max^e) and compute the new conditional probability over the models, P(M|D, E). Next, go to Step 2.

Condition 2: If, immediately after obtaining a new approximation of P(M|D, E) in Step 1, we did not perform any query because there was no component m_k^e whose information gain was higher than λ (i.e., the process went from Step 1 to Condition 2 without passing through Step 3), the interaction is finished and we return the estimation of the posterior, P(M|D, E). Otherwise, go to Step 1 and recompute a new approximation of the posterior over the model space using the new set of answers given by the expert so far.

As we have seen, within the inner loop (i.e., Step 2, Condition 1 and Step 3), we try to recursively fix those uncertain components found in a first approximation of the posterior. After that, we use the new knowledge provided by the expert to recompute this approximation (Step 1). As new information is available apart from the learning data, a better approximation of the posterior should be obtained, and new uncertain components can be found in this new iteration (Step 2, Condition 1). So, once again, the system asks the expert about them (Step 3). We iterate until there are no more components with information gain above the threshold λ (Condition 2).

So, this methodology uses the interaction with the expert to guide the learning process, since at each iteration the updated prior P(M|E) discards the models that are not in agreement with the information provided by the expert.
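The flowchart of Fig. 3 can be summarized in a few lines of code. The sketch below is ours, not the authors' implementation: all building blocks are passed in as placeholder callables (`approximate_posterior` for the stochastic search of Step 1, `best_query` for the information gain computation of Step 2, `ask_expert` for the expert, and `reweight` for the Bayes update of Eqs. (11) and (12)).

```python
def interactive_learning(data, ask_expert, lam,
                         approximate_posterior, best_query, reweight):
    """Sketch of the interaction loop of Fig. 3 (Steps 0-3, Conditions 1-2).
    Returns the final approximation of P(M | D, E)."""
    E = {}                                  # Step 0: no expert answers yet
    while True:
        P = approximate_posterior(data, E)  # Step 1: stochastic search
        asked = False
        while True:
            comp, gain = best_query(P)      # Step 2: most informative query
            if gain < lam:                  # Condition 1: not worth asking
                break
            E[comp] = ask_expert(comp)      # Step 3: query the expert and
            P = reweight(P, E)              #   update P(M | D, E)
            asked = True
        if not asked:                       # Condition 2: no query since the
            return P                        #   last re-approximation: stop
```

The inner loop applies the cheap Bayesian reweighting of Eqs. (11) and (12) to the current model set, while the outer loop triggers a full re-approximation of the posterior once the expert's answers have been gathered.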

3.2. Multi-level stochastic search of BNs with the help of an expert

As mentioned before, this method decomposes the BN learning problem into three different levels (see Fig. 1). This same multi-level approach for learning BNs, but without using d/e knowledge, was first proposed and evaluated in [20]. In that work two different approaches were proposed, one based on Markov chain Monte Carlo techniques and the other based on a stochastic search algorithm, labelled skeleton-based stochastic search of BNs (SS-BN), which is the one that we employ in this work. In addition, we also use another previously proposed method for learning Markov boundaries, labelled Bayesian stochastic search of Markov boundaries (SS-MB) [19]. We employ this method in order to integrate d/e knowledge when inferring the skeleton of the BN (which is built by joining the Markov boundaries of all the variables). The SS-MB method scales very well to problems with hundreds of thousands of variables, while the SS-BN method has been shown to give accurate approximations for problems of up to one hundred variables. Further details about the computational efficiency of these approaches can be found in [20,19].

In the next subsections we give the details of how d/e knowledge is included at each level, and we also briefly revise the fundamentals of the SS-BN and SS-MB methods.

3.2.1. Level 1: interactive learning of the skeleton of a BN

As previously mentioned, the aim of this first level is to induce a skeleton of the BN as a preliminary step, in order to constrain the subsequent search in the DAG space. A skeleton, denoted by SK, is composed of a set of undirected edges between pairs of variables, SK = {X − Y : X, Y ∈ X}. More precisely, this skeleton will be induced by joining the Markov boundary (MB) of every variable: X − Y ∈ SK iff X ∈ MB of Y or Y ∈ MB of X. These MBs are in turn inferred individually from the data. We recall that a Markov boundary of a variable X is a minimal variable subset, conditioned on which all other variables are probabilistically independent of X. When these MBs are joined together, they form a moral graph (the moralized counterpart of a DAG, obtained by connecting nodes that have a common child [26]), which acts in our case as the desired skeleton. As pointed out in [27], the main property that this skeleton should satisfy in order to define a correct constrained DAG search space, denoted by G_SK, is that it must be a super-structure of the true DAG that generates the data. We say that SK is a super-structure of a graph G if, for any directed edge X → Y ∈ G, there is an undirected edge X − Y ∈ SK (SK can contain edges which are not included in G). In that way, a subsequent search method in the space of DAGs constrained by this skeleton, G_SK, has the possibility of finding the true DAG. In our case, it is straightforward to verify that the moral graph satisfies this super-structure property.
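As an illustration (ours; the helper names are hypothetical), joining per-variable Markov boundaries into a skeleton and checking the super-structure property takes only a few lines:

```python
def skeleton_from_mbs(mb):
    """Join per-variable Markov boundaries into an undirected skeleton:
    X - Y is included iff Y is in MB(X) or X is in MB(Y)."""
    edges = set()
    for x, boundary in mb.items():
        for y in boundary:
            edges.add(frozenset((x, y)))
    return edges

def is_superstructure(skeleton, dag_edges):
    """SK is a super-structure of G if every directed edge X -> Y of G
    appears as an undirected edge X - Y in SK."""
    return all(frozenset((x, y)) in skeleton for x, y in dag_edges)

# toy DAG: A -> C <- B; joining the MBs adds the moral 'marriage' edge A - B
mbs = {"A": {"C", "B"}, "B": {"C", "A"}, "C": {"A", "B"}}
sk = skeleton_from_mbs(mbs)
assert is_superstructure(sk, [("A", "C"), ("B", "C")])
assert frozenset(("A", "B")) in sk   # extra moral edge, absent from the DAG
```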

Thus, our goal now is to infer the MB of each variable X ∈ X from the learning data with the help of an expert. For this purpose we employ the interactive scheme detailed in Section 3.1. The model space is denoted by MB_X and is composed of all the alternative MBs for the random variable X ∈ X. One particular MB of X is denoted by MB_X and is defined by a particular subset of the variables in X \ {X}. Each MB_X can be defined by a vector of T = n − 1 components, where m_k = 1 if the kth candidate variable of X is included in MB_X and m_k = 0 otherwise.

The next step is defining a method to compute the posterior probability over the space of the alternative MBs of X, P(MB_X|D). As already mentioned, the SS-MB method [19] is employed for this purpose. The SS-MB method uses the following formalization of a Markov boundary in terms of conditional independence statements. A MB of a variable X is a subset of variables MB_X ⊆ X \ {X} satisfying the following two sets of conditional independence statements: {X ⊥ Z | MB_X : Z ∈ X \ ({X} ∪ MB_X)} ∪ {X ⊥̸ Y | MB_X \ {Y} : Y ∈ MB_X} (the first set guarantees that X is independent of the rest of the variables given MB_X, while the second set guarantees that this MB_X is minimal). Instead of employing classic hypothesis tests in order to accept/reject each of the previous conditional independence statements, this method adopts a Bayesian perspective and computes the posterior probability of each conditional independence statement, P([X ⊥ Z|C]|D).

These posterior probabilities can be easily computed if they are interpreted as the comparison of two alternative models generating the data, one in which the parents of variable X are MB_X (independence) and another in which they are C ∪ {Z} (dependence). The normalized ratio of these two probabilities can be computed according to expression (4), obtaining

P([X ⊥ Z|C]|D) = score(X, C|D) / (score(X, C|D) + score(X, C ∪ {Z}|D))    (13)
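Since Bayesian scores are normally handled in log space, expression (13) is best evaluated with the usual max-shift trick for numerical stability. A small sketch (ours, not the SS-MB implementation; the scores are assumed to be given in log form):

```python
import math

def prob_independent(log_score_C, log_score_CZ):
    """P([X ⊥ Z | C] | D) as the normalized ratio of Eq. (13),
    computed from log-scores in a numerically stable way."""
    m = max(log_score_C, log_score_CZ)
    a = math.exp(log_score_C - m)       # score(X, C | D)
    b = math.exp(log_score_CZ - m)      # score(X, C ∪ {Z} | D)
    return a / (a + b)

# if both parent sets explain the data equally well, the test is undecided
assert abs(prob_independent(-100.0, -100.0) - 0.5) < 1e-12
```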

These probabilities are then employed to perform random movements (i.e., adding/removing candidate variables) following a specific stochastic search method, with the aim of visiting different alternative MBs of variable X. The search [19] first produces a random ordering of the variables and sets MB_X to the empty set. Then it visits the variables in the given order. If Z is not in the current Markov boundary, it decides about including it, taking into account the probability of dependence (one minus the value computed in (13), where C is MB_X). If Z is already in the current Markov boundary, then it decides about excluding the variable, taking into account the probability of independence given in expression (13), where C is MB_X \ {Z}.

In this way, several alternative MBs are generated, since the conditional independencies are accepted/rejected with a given probability. The final result is a set of plausible Markov boundaries. We then associate a score value, score(X, MB_X|D), with each visited MB, which can be computed as in (3) (in [19] we give arguments and more specific details on this approach).

If we normalize these score values, we can obtain an approximation of the posterior probability of each MB_X, under the assumption that the Markov boundaries found by SS-MB are the only ones with a non-negligible score, which is computed as follows:

P(MB_X|D) = score(X, MB_X|D) / Σ_{MB′_X ∈ MB_X} score(X, MB′_X|D)    (14)

where MB_X (in the sum) is the set of different MBs of X found by SS-MB, and score(X, MB_X|D) is computed as in expression (3), taking MB_X, the current Markov boundary, as the parent set of X.
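A single pass of this random walk can be sketched as follows (our illustration; `p_independent` is a placeholder for the Bayesian test of Eq. (13)):

```python
import random

def ss_mb_pass(x, candidates, p_independent, rng=random):
    """One pass of the SS-MB random walk for target variable x.
    Starting from an empty Markov boundary, each candidate is visited in
    a random order and added with the probability of dependence (one
    minus Eq. (13) on the current boundary), or removed with the
    probability of independence given the boundary without it."""
    order = list(candidates)
    rng.shuffle(order)
    mb = set()
    for z in order:
        if z not in mb:
            # add with probability of dependence: 1 - P(x ⊥ z | MB)
            if rng.random() < 1.0 - p_independent(x, z, frozenset(mb)):
                mb.add(z)
        else:
            # remove with probability of independence given MB \ {z}
            if rng.random() < p_independent(x, z, frozenset(mb - {z})):
                mb.remove(z)
    return mb
```

Repeating such passes yields the set of plausible Markov boundaries whose normalized scores give the approximation of Eq. (14).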

So, our interactive methodology can now be applied to this problem. The expert will be asked whether a particular variable belongs to the true MB of X. This interactive methodology will provide us with an approximation of the posterior probability over the space of the alternative MBs of X, P(MB_X|D, E), with the help of the answers submitted by the expert. We also have to consider that, due to the symmetries present in the variables inside a MB, Y ∈ MB_X ↔ X ∈ MB_Y,¹ the answers of an expert about the MB of a given target variable can be used as expert knowledge when recovering the MBs of other variables (i.e., if an expert says that Y belongs to the MB of X, when inferring the MB of Y we will directly include X in the MB of Y as available expert knowledge). The final output of this level is a set of posterior probabilities {P(MB_X1|D, E), . . . , P(MB_Xn|D, E)} over the different MBs of the variables.

3.2.2. Level 2: interactive learning of constrained DAG structures

In this new level we try to build an approximation of the posterior probability P(G_SK|D), where G_SK is the DAG search space constrained by a skeleton SK. We need these approximations in order to integrate d/e knowledge at this new level using the interactive methodology described in Section 3.1.

As mentioned at the beginning of this section, we employ for this purpose a specific skeleton-based stochastic search method, the SS-BN method [20]. The aim of this stochastic search method is to find a set of high-scoring models, G_SK ⊂ G_SK. If G_SK contains all the DAG models with a non-negligible score, good approximations of the posterior probability of the DAGs can be computed as follows:

P(G_SK|D) = score(G_SK|D) / Σ_{G′ ∈ G_SK} score(G′|D)    (15)

A rough description of this method for collecting high-scoring constrained DAGs is as follows. The method begins by sampling a skeleton SK: a MB is sampled for each variable Xi according to P(MB_Xi|D), which was computed at the previous level, and all the MBs are joined together. Then the stochastic search starts from an empty graph G0 and, in a first phase, each edge of the skeleton, X − Y ∈ SK, is evaluated for being added to the currently visited graph G in either of the two directions. In a second phase, several iterations are carried out to evaluate new edge additions (if the edge is present in SK but not included in the currently visited graph G) or edge reversals or removals (if the edge is present in G). Each of these movements (i.e., adding/removing/reversing an edge) is randomly carried out with probability equal to the normalized Bayesian score of the alternative DAG it generates. This whole process is repeated a large number of times, and the set of different visited constrained DAGs is collected in the set G_SK to compute the posterior P(G_SK|D).

In that way, we perform the expert interaction in order to learn a constrained DAG structure. Now the model space, G_SK,

is composed of all the possible constrained DAG structures over the set of variables in X. Any given DAG structure can also be defined by a vector of T = n(n−1)/2 components, one for every possible pair of variables (Xi, Xj). Each single component has three possible states: no edge between Xi and Xj, and the two states corresponding to the two alternative directions of the edge. So the expert will be asked whether there is an edge between a given pair of variables and, if so, in which direction. We point out that at this step the expert will not receive questions about any edges that are not contained in SK.
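The component encoding used at this level can be illustrated as follows (a sketch of ours; state 0 encodes "no edge", 1 encodes Xi → Xj and 2 encodes Xj → Xi, for each pair with Xi before Xj in a fixed order):

```python
from itertools import combinations

def dag_to_components(variables, edges):
    """Encode a DAG as the Level 2 component vector: one three-state
    component per unordered pair, T = n(n-1)/2 components in total."""
    comp = {}
    for xi, xj in combinations(sorted(variables), 2):
        if (xi, xj) in edges:
            comp[(xi, xj)] = 1          # Xi -> Xj
        elif (xj, xi) in edges:
            comp[(xi, xj)] = 2          # Xj -> Xi
        else:
            comp[(xi, xj)] = 0          # no edge
    return comp

c = dag_to_components(["A", "B", "C"], {("A", "C"), ("B", "C")})
assert c == {("A", "B"): 0, ("A", "C"): 1, ("B", "C"): 1}
assert len(c) == 3 * 2 // 2             # n(n-1)/2 pairs
```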

Once the interaction process ends, the output of this level is an approximation of the posterior P(G_SK|D, E) using d/e knowledge.

3.2.3. Level 3: interactive learning of unconstrained DAG structures

In the last level we try to build a final approximation of the posterior probability P(G|D, E) over the whole DAG space with the help of d/e knowledge. Following the SS-BN method [20], at this level a new stochastic search is run starting from the DAGs found at the previous level. It tries to add new edges without the constraints defined by the skeletons of the previous level. A rough description is as follows. The method begins by sampling a graph G_SK from the posterior of the previous level, P(G_SK|D, E). The edges included in G_SK are copied to the initial graph G0. These sampled edges are fixed and will not be evaluated any more during the search. Then, several iterations are carried out to evaluate new edge additions (if the edge is not present in the currently visited graph G) or edge reversals or removals (if the edge is present in G but not in G_SK). As at the previous level, each of these movements is randomly carried out with probability equal to the normalized Bayesian score of the alternative DAG it generates. This whole process is repeated a large number of times, and the set of different visited DAGs is collected in a new set G, which is used to compute the posterior P(G|D) as in Eq. (15).

In this case, we also perform expert interaction in order to learn a new DAG structure. Once the interaction process ends,

the output of this level is the final approximation of the posterior P(G|D, E) using d/e knowledge.
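The move-selection rule shared by Levels 2 and 3, picking among edge additions, removals and reversals with probability proportional to the normalized Bayesian score of the resulting DAG, can be sketched as follows (our illustration; `log_score` is a placeholder for the score of a visited DAG, given in log form):

```python
import math
import random

def choose_move(moves, log_score, rng=random):
    """Pick one candidate move with probability proportional to the
    Bayesian score of the DAG it produces. `moves` maps a move label
    (e.g. 'add X->Y') to the resulting DAG."""
    labels = list(moves)
    logs = [log_score(moves[m]) for m in labels]
    mx = max(logs)
    ws = [math.exp(l - mx) for l in logs]   # normalized Bayesian scores
    r = rng.random() * sum(ws)
    for label, w in zip(labels, ws):
        r -= w
        if r <= 0:
            return label, moves[label]
    return labels[-1], moves[labels[-1]]    # numerical safety fallback
```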

4. Experimental validation

4.1. Experimental set-up

We validate our interactive approach for learning BNs using d/e knowledge experimentally, on synthetic data samples generated from five BNs that are commonly used in this kind of experimental setting [24]: the alarm network with 37 variables; the boblo network with 23 variables; the boerlage92 network with 23 variables; the hailfinder network with 56 variables; and the insurance network with 27 variables. For each of these networks, and by means of logic sampling [14], we randomly generated 100 data samples of the same size, and we considered different sample sizes: 100, 500, 1000 and 5000 cases (these are displayed on the X-axes of the figures). The different methods were evaluated on each data sample, and values averaged across the different BNs are displayed.

¹ This symmetry is always satisfied for positive probabilities, in which case there is only one Markov boundary for each variable. This will always be our case, as the estimation of probabilities from data will always be positive.

One of the main advantages of using artificially generated data samples is that we know the model that generated the data, so we can simulate the interaction with an expert by accessing the true BN model and checking whether the inquired edge was actually absent or present in the model. Thus, although in this experimental validation we assume that experts never give wrong answers, the methodology can be adapted to deal with this issue, as previously detailed in [8].
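Such a simulated expert can be sketched in a few lines (our illustration; the reliability parameter anticipates the unreliable-expert experiments of Section 4.2, where wrong answers are given with probability 0.1):

```python
import random

def make_oracle(true_edges, reliability=1.0, rng=random):
    """Simulated expert: answers whether a queried pair has an edge
    Xi -> Xj (1), Xj -> Xi (2) or no edge (0) in the true DAG, giving
    a wrong answer with probability 1 - reliability (uniformly among
    the two wrong states)."""
    def answer(pair):
        xi, xj = pair
        if (xi, xj) in true_edges:
            truth = 1
        elif (xj, xi) in true_edges:
            truth = 2
        else:
            truth = 0
        if rng.random() < reliability:
            return truth
        return rng.choice([s for s in (0, 1, 2) if s != truth])
    return answer

oracle = make_oracle({("A", "C")})
assert oracle(("A", "C")) == 1
assert oracle(("C", "A")) == 2        # same edge, queried the other way
assert oracle(("A", "B")) == 0
```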

We employ different error measures to evaluate the quality of an inferred BN with and without expert interaction. Firstly, we evaluate the quality of the maximum a posteriori (MAP) DAG structure using the structural Hamming distance ("SHD"), which is equal to the number of edge deviations (missing plus additional plus orientation errors) between the learnt and the true PDAG models. An acyclic partially directed graph (PDAG) is an extension of a DAG where some edges can be undirected if any of the alternative directions of those undirected edges create DAGs which encode the same conditional independencies [9]. However, this metric does not allow us to evaluate whether, apart from an improvement in the number of structural errors, there is any reduction in the uncertainty of the posterior probability over the edge features (i.e., if Xi → Xj is in the true model and the posterior of an edge with no expert interaction is 0.8 but becomes equal to 1 after the interaction, this effect is not measured by this metric, as both MAP models, before and after the interaction, contain this edge). For this purpose we also consider the so-called L1 edge error to compare the structural errors of a mixture of models [23,33], and the overall edge entropy to measure to what extent the posterior is concentrated around a single model. The L1 error is computed as follows:

L1(P(G|D)) = Σ_{i, j<i} I_{G∗}(Xi → Xj)(1 − P_{i→j}) + I_{G∗}(Xj → Xi)(1 − P_{j→i}) + I_{G∗}(Xi ↮ Xj)(1 − P_{i↮j})

Similarly, the edge entropy is computed as follows:

EdgeEntropy(P(G|D)) = −Σ_{i, j<i} [P_{i→j} ln P_{i→j} + P_{j→i} ln P_{j→i} + P_{i↮j} ln P_{i↮j}]

where P_{j→i} refers to the posterior probability that there is an edge from Xj to Xi, P_{i↮j} to the probability that there is no edge between Xi and Xj, G∗ is the true graph, and I_{G∗}(A) = 1 if A holds in G∗ and zero otherwise.
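Both metrics can be computed directly from per-pair edge posteriors. The sketch below is our reading of the definitions above: each unordered pair has a three-state posterior, the L1 error accumulates the probability mass placed away from the true state, and the edge entropy sums the per-pair entropies.

```python
import math

def edge_metrics(posteriors, true_state):
    """L1 edge error and overall edge entropy from per-pair posteriors.
    `posteriors[pair]` is (P_no_edge, P_i_to_j, P_j_to_i) and
    `true_state[pair]` is the index of the state holding in the true
    graph G*."""
    l1, h = 0.0, 0.0
    for pair, dist in posteriors.items():
        l1 += 1.0 - dist[true_state[pair]]   # mass off the true state
        h -= sum(p * math.log(p) for p in dist if p > 0)
    return l1, h

post = {("A", "B"): (0.1, 0.9, 0.0), ("A", "C"): (1.0, 0.0, 0.0)}
truth = {("A", "B"): 1, ("A", "C"): 0}
l1, h = edge_metrics(post, truth)
assert abs(l1 - 0.1) < 1e-12
assert h > 0.0                  # only the uncertain pair contributes
```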

Furthermore, in this experimental validation we also try to evaluate the effect of the integration of the d/e knowledge when inferring the skeleton of the BN (see Section 3.2.1). To this aim we employ a combination of the precision and recall of the DAG edges, but without considering the direction of the edges (i.e., only the presence/absence of an undirected edge). More precisely, we consider the Euclidean distance from perfect precision/recall, which is labelled in the figures as "P/R SK Distance": distance = √((1.0 − precision)² + (1.0 − recall)²).

The SS-BN and SS-MB methods are run with default settings as detailed in [19,20].

4.2. Evaluating the effect of the interaction methodology

In this first analysis, we aim to measure the effect of applying the proposed interactive methodology to integrate d/e

knowledge. The first results of this evaluation are displayed in Fig. 4, in which we plot the three different error measures

employed in these experiments for the following methods (we now simulate an expert who always gives the right answer):

NoQuery: The baseline SS-BN method, where only stochastic search is employed, with no expert interaction.

DAGQuery.Q0.8: The learning method where the interaction concerns only the DAG structure (Level 2 and Level 3), not the skeleton (i.e., the skeleton is learnt in the same way as in the SS-BN method). We stop the interaction when the information gain of every edge is lower than λ0.8 = −(0.8 ln 0.8 + 0.2 ln 0.2).

SKDAGQuery.Q0.8: The learning method where the interaction is carried out first about the graph skeleton and, subsequently, about the DAG structure. We stop the interaction about the skeleton or about the DAG when the information gain of every element m_k (i.e., an edge or a variable Y ∈ MB(X)) is lower than λ0.8.

SKDAGQuery.Q0.9: Same as the previous method, but with a lower entropy level, λ0.9 = −(0.9 ln 0.9 + 0.1 ln 0.1), as the stop condition for DAG queries.
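Reading λ0.8 and λ0.9 as (positive) binary entropies, their numerical values are easy to check (a sketch of ours):

```python
import math

def binary_entropy(p):
    """H(p) = -(p*ln(p) + (1-p)*ln(1-p)), in nats."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# stop thresholds used by the *.Q0.8 and *.Q0.9 methods
lam_08 = binary_entropy(0.8)   # ≈ 0.500 nats
lam_09 = binary_entropy(0.9)   # ≈ 0.325 nats: a stricter stop condition,
assert lam_09 < lam_08         # so more queries are submitted
```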

As can be seen in Fig. 4, all of the interaction methods reduce the structural errors of the inferred models, both for the MAP model (SHD measure) and for the mixture of models (L1 edge error). At the same time, this reduction is stronger for lower sample sizes, which is somewhat expected, since for low sample sizes the model uncertainty is higher. In addition, we can see that there is an improvement when we ask about both the DAG and the skeleton, compared to asking only about the DAG structure. This improvement is especially significant if we look at the P/R SK distance. As can be seen, when asking only about the DAG there is hardly any improvement in the skeleton of the BN with respect to the case where there is no interaction. That is to say, in this case the information of the expert mainly affects the structural errors of the DAG edges which are included in the initial skeleton inferred using the SS-BN method (we recall that in this method there is a step of the stochastic search that is not constrained by the initial skeleton). However, when there is interaction about the skeleton, there is a larger improvement in this error measure. Finally, if we fix a lower information gain threshold, from λ0.8 to λ0.9, as the stopping point for the expert interaction process, the different error metrics improve, especially for lower sample sizes. In


Fig. 4. Querying about the DAG structure versus Querying about the DAG and the skeleton.

Table 1
Comparison between the methods of Fig. 4 using a paired t-test with p = 0.05 for 100 data samples. Each entry in the table indicates for how many BNs, out of the five BNs employed in the analysis, the method in the row is significantly different from the method in the column. A positive value indicates differences in favor of the method of the row, while a negative value indicates differences in favor of the method of the column. This comparison is made for the three error measures of the above figure: structural Hamming distance (SHD), L1 error and P/R SK distance, respectively. For example, the "−4/−5/−3" entry in the third column of the second row of the table indicates that the NoQuery method has a statistically significantly higher SHD than DAGQuery0.8 in four out of the five analyzed BNs; a statistically significantly higher L1 error than DAGQuery0.8 in five out of the five analyzed BNs; and a statistically significantly higher P/R SK distance in three BNs.

SHD/L1/PR        NoQuery     DAGQuery0.8   SKDAGQuery0.8   SKDAGQuery0.9
NoQuery          -           −4/−5/−3      −4/−5/−5        −5/−5/−5
DAGQuery0.8      +4/+5/+3    -             −4/−4/−4        −5/−5/−5
SKDAGQuery0.8    +4/+5/+5    +4/+4/+4      -               −5/−5/−5
SKDAGQuery0.9    +5/+5/+5    +5/+5/+5      +5/+5/+5        -

Fig. 5. Effect on the number of queries with respect to sample size.

Table 1 we give support to the above conclusions by comparing the methods with a paired t-test with p = 0.05 (full details

of the comparison are given in the table header).

In Fig. 5(a), we plot the number of queries submitted to an expert by the different methods with respect to the size of the learning data samples. The DAGQuery and SKDAGQuery series refer to the number of queries about the DAG structure, while SKQuery refers to the number of queries about the skeleton (see Section 3.2.1). In Fig. 5(b), we plot the percentage of interactions over the total number of possible queries, which is equal to n(n−1)/2, either about the DAG or about the skeleton. As can be seen, the number of queries is in any case small, because we only ask about a few edges, which represent a small percentage of the total number of possible queries. Moreover, the number of interactions decreases as more data samples are available. This is in agreement with the previous results, where the interaction has a smaller impact on the quality of the inferred models when larger sample sizes are available.

In Fig. 6 we show the effect of our interaction methodology when the query threshold is increased. In this analysis we also evaluate the impact of an unreliable expert on the performance of our methodology (series "DAGQuery-UnreliableExpert").


Fig. 6. Analyzing the effect of the query threshold and of an unreliable expert.

This last case is simulated by providing wrong answers to the system with probability 0.1. To simplify the analysis we only

display the results using 100 data samples.

Looking at the above figures, we can see that higher query thresholds reduce the structural errors of the inferred models but increase the number of queries submitted to the expert. At the same time, the edge entropy is also reduced and our certainty about the MAP model increases, which is one of the positive effects of our interaction methodology. The same conclusions can be drawn when dealing with an unreliable expert. More specifically, the presence of an unreliable expert increases the number of submitted queries for the same query threshold. This is expected behavior, since the expert's answers are less informative and, in consequence, more queries are needed to obtain the same information gain. Curiously, this increment in the number of queries means that, in some cases, the averaged number of structural errors will be lower than for a fully reliable expert.

4.3. Evaluating the efficiency of the interaction methodology

In the previous subsection we looked at the effect of the interaction, measured via different error metrics, and we saw that the interaction with an expert can help to discover more accurate BN models. In this subsection, we try to evaluate the efficiency of this interactive methodology compared to standard approaches proposed previously which, as mentioned in the introduction, take advantage of the d/e knowledge by fixing some parts of the DAG model a priori, before the learning process takes place. That is to say, we try to answer the following question: what is more efficient, trying to collect, before analyzing any data, as much d/e knowledge as possible and employing it as prior knowledge in a given BN learning algorithm, or running an interactive learning algorithm first and collecting d/e knowledge only for the queries submitted by this interactive system? We consider efficiency in terms of the improvement in the structural errors of the inferred model with respect to the quantity of d/e knowledge needed to achieve this improvement.

In order to answer the above question, we conduct the following experiment. We randomly pick 10% of the n(n−1)/2 possible edges of a DAG (i.e., some are really present in the true model and some are not), and denote the resulting set by K. As we seek the true model that generates the data, we define a prior P(G|K) which gives probability 0 to any graph which is not in agreement with the answers to the queries in K. So, this prior encodes the d/e knowledge that we have simulated. We then run the SS-BN algorithm without any further expert interaction but using this informative prior. At the same time, we also run the SS-BN algorithm with a non-informative prior and allow for interaction about the DAG structure with λ0.9. However, in this case, the only interactive queries that can be answered are the ones in K. Both executions, using the same K, were run over 10 data sets with 500 samples, and values averaged across the five different gold-standard BNs were considered. The whole experiment was repeated five times with different randomly built Ki sets, i ∈ {1, . . . , 5}. The results of this evaluation are displayed in Table 2. Prior-Knowledge refers to the L1 edge error obtained using the Ki sets as

prior knowledge, while Interactive-Knowledge refers to the error obtained using the interactive methodology where queries are restricted to the ones in the corresponding Ki. For this last method, we also display the percentage of queries, over the whole number of possible queries, that are actually submitted to the expert. For both methods we also compute the range of the L1 error over the different Ki for each BN, range_BN = max_Ki(L1(BN)) − min_Ki(L1(BN)), where L1(BN) refers to the average L1 error over the 10 data sets of 500 data samples for one of the five gold-standard BNs (see Section 4.1). In the column "Range" of Table 2, we report the mean of the range_BN values over the five gold-standard BNs for both methods.

The results of this evaluation are quite clear. Interactive-Knowledge's performance is quite similar to that of Prior-Knowledge, but it uses a much smaller quantity of d/e knowledge. As can be seen in Table 2, in every case the Interactive-Knowledge method asks the expert less than 0.5% of the possible queries, in contrast with the 10% of queries employed by the Prior-Knowledge method. Moreover, the performance of Prior-Knowledge is much more dependent on the particular Ki, as can be seen in the range values for both methods.


Table 2

Results of the evaluation of the efficiency of the interactive methodology (Section 4.3).

Method                  K1      K2      K3      K4      K5      Range
Prior-Knowledge         29.48   31.28   30.25   29.85   29.97   4.81
Interactive-Knowledge   30.83   30.51   30.85   31.19   30.21   1.42
% of queries            0.221   0.199   0.273   0.328   0.406   -

Table 3

Comparison between MCMC and DAGQuery using the SHD of the final selected BN with different numbers of queries. The displayed values correspond to the averaged SHD across the 5 gold-standard BNs.

N. Queries    5       10      15      20
DAGQuery      30.01   28.86   28.23   28.05
MCMC          44.39   43.54   43.17   42.45

Summarizing, as we have seen in this experiment, not all d/e knowledge is equally worthy for boosting the performance of a BN learning algorithm. As shown in this analysis, employing an interactive methodology which looks for those edges that have a high entropy, and only requesting knowledge about them, is much more efficient than trying to collect as much d/e knowledge as possible a priori, without looking at the data, as it may happen that much of this knowledge is not really needed, since its information may already be found in the data sample.

4.4. Comparing to active learning methods

In this last analysis we compare our method to an adapted method inspired by previously published approaches for active learning of Bayesian networks. More precisely, we adapt the method proposed in [23] (labelled here as MCMC) to select the query that is submitted to the expert. As mentioned in the introduction, this method selects the intervention that most reduces the entropy of the posterior probability over the different DAG structures (the same strategy we use in our interactive methodology). The entropy of the posterior probability is computed by means of an MCMC algorithm over the DAG space [18] (see Section 2.2). The adapted method is thus identical to the one proposed in Section 3.1, but using an MCMC method instead of the specific multi-level method described in Section 3.2.
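The entropy estimate that the adapted MCMC baseline relies on can be sketched as follows. This is a simplified illustration, not the implementation of [23] or [18]: it assumes the MCMC chain has already produced a list of sampled DAGs (each represented as a hashable set of directed edges) and estimates the posterior entropy empirically from their frequencies:

```python
from collections import Counter
import math

def posterior_entropy(dag_samples):
    """Empirical entropy (in bits) of the posterior over DAG structures,
    estimated from MCMC samples; each sample is a hashable edge set."""
    counts = Counter(dag_samples)
    n = len(dag_samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy chain of 4 sampled structures over variables X and Y; in practice
# these would come from an MCMC walk over the DAG space.
samples = [
    frozenset({("X", "Y")}),
    frozenset({("X", "Y")}),
    frozenset({("Y", "X")}),
    frozenset(),
]
h = posterior_entropy(samples)  # 1.5 bits for frequencies (2, 1, 1) / 4
```

A query is then chosen so as to maximally reduce this entropy; the cost of the approach lies in the MCMC sampling itself, which must mix well over a very large DAG space to yield a reliable estimate.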

The results of this comparison are given in Table 3. In this table we detail the average structural Hamming distance (SHD) for the generated data sets with 500 samples across the different BNs. Both methods, MCMC and DAGQuery, are compared using a previously fixed number of queries: 5, 10, 15 and 20 (the information provided by the expert contains no errors in this experiment). Using a paired t-test with p = 0.05, we found that DAGQuery consistently outperforms the MCMC method for each of the five evaluated BNs and for each of the different numbers of queries.
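For reference, the SHD metric used throughout this evaluation can be computed with a short Python sketch. This is a generic illustration under the usual definition (missing adjacencies, extra adjacencies, and wrongly oriented shared adjacencies each count 1), not the authors' evaluation code, and the two toy DAGs are hypothetical:

```python
def shd(true_edges, learned_edges):
    """Structural Hamming distance between two DAGs given as sets of
    directed (u, v) edges: each missing or extra adjacency counts 1,
    and each shared adjacency with the wrong orientation counts 1."""
    def skeleton(edges):
        return {frozenset(e) for e in edges}

    ts, ls = skeleton(true_edges), skeleton(learned_edges)
    dist = len(ts ^ ls)  # adjacencies to add plus adjacencies to delete
    for adj in ts & ls:  # shared adjacencies: check orientation agreement
        u, v = tuple(adj)
        if ((u, v) in true_edges) != ((u, v) in learned_edges):
            dist += 1
    return dist

true_dag = {("A", "B"), ("B", "C")}
learned = {("B", "A"), ("B", "C"), ("C", "D")}
# ("A","B") is reversed (+1) and ("C","D") is an extra adjacency (+1)
d = shd(true_dag, learned)  # -> 2
```

Lower SHD therefore means the learned structure is closer to the gold-standard BN, which is why the DAGQuery rows in Table 3 dominate the MCMC rows.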

5. Conclusions and future work

D/e knowledge is generally employed to boost the reliability of BN learning methods, especially for model selection problems with a very high number of variables and a low number of data samples, where pure data-oriented methods usually fail to recover reliable models. However, previous approaches provide non-interactive mechanisms to introduce this d/e knowledge into the learning process, and this poses severe limitations. We have therefore presented and evaluated a methodology for interactively learning BNs from data with the help of an expert. This methodology allows integrating knowledge about the skeleton of the BN or about the DAG structure, thus offering greater flexibility. It also aims to employ a low number of interactions with the expert while obtaining, at the same time, the maximum effect on the learning process.

As future work, we will examine different aspects of our approach. For example, we want to extend this methodology to consider other utility or cost functions, similar to the structural Hamming distance, instead of the currently used one, based on the logarithm of the model probability. Another issue that requires further examination is the interrelationship between the d/e knowledge at the different levels of the algorithm. For example, when the expert introduces information about a directed edge X → Y in the second or the third level, this also indicates that X belongs to the Markov boundary of Y and vice versa. In this case, we want to analyze whether it is worthwhile to run the first step of the algorithm again and recompute the posterior probability of the different Markov boundaries of X and Y. We also want to explore alternative queries that could be submitted to an expert, based on questions about particular conditional independencies. The possibility of using an ontology as a source of d/e knowledge, as previously done in [4], is another line of future research.

Acknowledgements

This work has been jointly supported by the research programme Consolider Ingenio 2010, the Spanish Ministerio de Ciencia e Innovación and the Consejería de Innovación, Ciencia y Empresa de la Junta de Andalucía under projects CSD2007-00018, TIN2010-20900-C04-01, TIC-6016 and P08-TIC-03717, respectively. We also thank the reviewers for their insightful and constructive comments.

References

[1] C.F. Aliferis, A.R. Statnikov, I. Tsamardinos, S. Mani, X.D. Koutsoukos, Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: Algorithms and empirical evaluation, Journal of Machine Learning Research 11 (2010) 171–234.

[2] N. Angelopoulos, J. Cussens, Bayesian learning of Bayesian networks with informative priors, Annals of Mathematics and Artificial Intelligence 54 (1) (2008) 53–98.

[3] R. Bellazzi, B. Zupan, Towards knowledge-based gene expression data mining, Journal of Biomedical Informatics 40 (6) (2007) 787–802.

[4] M. Ben Messaoud, P. Leray, N. Ben Amor, Integrating ontological knowledge for iterative causal discovery and visualization, Symbolic and Quantitative Approaches to Reasoning with Uncertainty (2009) 168–179.

[5] G. Borboudakis, I. Tsamardinos, Incorporating causal prior knowledge as path-constraints in Bayesian networks and maximal ancestral graphs, arXiv preprint arXiv:1206.6390, 2012.

[6] W. Buntine, Theory refinement on Bayesian networks, in: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 1991, pp. 52–60.

[7] A. Cano, A. Masegosa, S. Moral, An importance sampling approach to integrate expert knowledge when learning Bayesian networks from data, in: IPMU'10, 13th International Conference on Processing and Management of Uncertainty in Knowledge-Based Systems, Dortmund, Germany, June 28–July 2, 2010, pp. 685–695.

[8] A. Cano, A. Masegosa, S. Moral, A method for integrating expert knowledge when learning Bayesian networks from data, IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 41 (2011) 1382–1394.

[9] D.M. Chickering, Optimal structure identification with greedy search, Journal of Machine Learning Research 3 (2003) 507–554.

[10] L.M. de Campos, J.G. Castellano, Bayesian network learning algorithms using structural restrictions, International Journal of Approximate Reasoning 45 (2) (2007) 233–254.

[11] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.

[12] N. Friedman, D. Koller, Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks, Machine Learning 50 (1–2) (2003) 95–125.

[13] D. Heckerman, D. Geiger, D. Chickering, Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning 20 (3) (1995) 197–243.

[14] M. Henrion, Propagating uncertainty by logic sampling in Bayes' networks, in: J. Lemmer, L. Kanal (Eds.), Uncertainty in Artificial Intelligence, vol. 2, Amsterdam, 1988, pp. 149–164.

[15] S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, S. Miyano, Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks, in: CSB'03: Proceedings of the IEEE Computer Society Conference on Bioinformatics, IEEE Computer Society, Washington, DC, USA, 2003, p. 104.

[16] F.V. Jensen, Bayesian Networks and Decision Graphs. Statistics for Engineering and Information Science, Springer-Verlag, New York, USA, 2001.

[17] B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, M. West, Experiments in stochastic computation for high dimensional graphical models, Statistical Science 20 (2004) 388–400.

[18] D. Madigan, J. York, Bayesian graphical models for discrete data, International Statistical Review 63 (1995) 215–332.

[19] A.R. Masegosa, S. Moral, A Bayesian stochastic search method for discovering Markov boundaries, Knowledge-Based Systems 35 (2012) 211–223.

[20] A.R. Masegosa, S. Moral, New skeleton-based approaches for Bayesian structure learning of Bayesian networks, Applied Soft Computing 13 (2) (2013) 1110–1120.

[21] S. Meganck, P. Leray, B. Manderick, Learning causal Bayesian networks from observations and experiments: a decision theoretic approach, Modeling Decisions for Artificial Intelligence (2006) 58–69.

[22] H.W. Mewes, D. Frishman, K.F.X. Mayer, M. Münsterkötter, O. Noubibou, T. Rattei, M. Oesterheld, V. Stümpflen, MIPS: analysis and annotation of proteins from whole genomes, Nucleic Acids Research 32 (2004) 41–44.

[23] K.P. Murphy, Active learning of causal Bayes net structure, Technical report, 2001.

[24] R.E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2004.

[25] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, M. Kanehisa, KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Research 27 (1) (1999) 29–34.

[26] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, 1988.

[27] E. Perrier, S. Imoto, S. Miyano, Finding optimal Bayesian network given a super-structure, Journal of Machine Learning Research 9 (2008) 2251–2286.

[28] O. Pourret, P. Naim, B. Marco, Bayesian Networks: A Practical Guide to Applications, John Wiley & Sons, New York, 2008.

[29] M. Richardson, P. Domingos, Learning with knowledge from multiple experts, in: Machine Learning – International Workshop then Conference, vol. 20, 2003, p. 624.

[30] G. Scott, C.M. Carvalho, Feature-inclusion stochastic search for Gaussian graphical models, Journal of Computational and Graphical Statistics 17 (2008) 790–808.

[31] J.G. Scott, J.O. Berger, An exploration of aspects of Bayesian multiple testing, Journal of Statistical Planning and Inference 136 (7) (2006) 2144–2162.

[32] P. Spirtes, C. Glymour, R. Scheines, Causation, Prediction and Search, Springer-Verlag, Berlin, 1993.

[33] S. Tong, D. Koller, Active learning for structure in Bayesian networks, in: International Joint Conference on Artificial Intelligence, 2001, pp. 863–869.

[34] I. Tsamardinos, L.E. Brown, C.F. Aliferis, The max–min hill-climbing Bayesian network structure learning algorithm, Machine Learning 65 (1) (2006) 31–78.