
Swiss Federal Institute of Technology Zurich
Seminar for Statistics

Department of Mathematics

Master Thesis Summer 2015

Marco Eigenmann

A Score-Based Method for Inferring

Structural Equation Models

with Additive Noise


Submission Date: August 13th 2015

Co-Adviser: Jan Ernest
Adviser: Prof. Dr. Peter Bühlmann


Preface

The (directed) path which has brought me to this thesis and, in particular, to this topic, began with a seminar in causal inference held in 2013. During this seminar I had the opportunity to present the PC algorithm to the audience. I was so enthusiastic about the topic that I looked for a related subject on which to write my bachelor's thesis. Looking back, I have been very fortunate to have asked Prof. Dr. Marloes Maathuis for a topic in this field. Indeed, she suggested that I write my bachelor's thesis on a very recent and interesting paper studying the Markov Blanket Fan Search algorithm, which is directly derived from the PC algorithm. Towards the end of my bachelor's thesis I slowly started to have a good understanding of the linear Gaussian model related to causality. At the same time, it was also clear that in order to deal with different and more involved models I would need better knowledge of many topics of statistical analysis surrounding causality. Consequently, I deepened my theoretical and computational proficiency in statistics during my master's studies. This thesis would not have been the same without the comprehensive variety of courses that only ETH Zurich could have made available.

As I have just mentioned, this thesis is the result of a long journey which, from a personal perspective, would not have been possible without the support and guidance of many people. First of all I would like to thank my parents for their unconditional support during my whole stay in Zurich and in particular at ETH. I am equally thankful to my girlfriend, who directly supported me and stood by me during my highs and lows, even if sometimes she could not understand the reasons behind them. Finally, I would like to thank “I Tavoli”, literally “The Tables”, for the moral support and the great time spent together. They represent the place and the people I worked with during the whole period of my studies at ETH Zurich.

With regard to this thesis, I would sincerely like to thank Prof. Dr. Peter Bühlmann, who accepted to supervise my master's thesis even though he was already extremely busy with his research and also head of the Department of Mathematics. He paid great attention to my wishes concerning the topic and the computational challenges I wanted to engage in. The result is a highly interesting theme dealing with state-of-the-art research problems which not all master's students have the opportunity to encounter. My deep and sincere gratitude goes also to Jan Ernest, who closely followed me during the whole thesis. He trusted my abilities, giving me the possibility to experiment, and at the same time he made himself available, always very pleased to help me. His advice during the whole thesis and his help during the debugging phase of the algorithms have been crucial to the success of this thesis.


Moreover, he provided a first draft of the PCLiNGAM algorithm, which undoubtedly enhanced the validity of the comparisons presented in the thesis. I would also like to thank Christopher Nowzohour, who provided the R code of his paper (Nowzohour and Bühlmann (2014)). Even though he was not my co-adviser, he was always very willing to help, answering my questions about both the theoretical and the algorithmic aspects of the paper. Finally, I would like to thank all the people at the Seminar for Statistics, who provided an extraordinary working climate and made me feel at home.


Abstract

We implement and analyse a new score-based algorithm for inferring linear structural equation models with a mixture of both Gaussian and non-Gaussian distributed additive noise. After introducing some well-known algorithms, providing theory, pseudo-code, main advantages and disadvantages as well as some examples, we extensively cover the technical part which supports the ideas behind our new algorithm. Finally, we present our algorithm in great detail, describing its R implementation and showing its performance compared to the algorithms introduced in the previous chapters.


Contents

1 Introduction

2 Basic concepts and notation
  2.1 Introduction
    2.1.1 Markov properties and the faithfulness assumption
    2.1.2 Structural equation models and properties of DAGs

3 Linear SEMs with Additive Gaussian or non-Gaussian Noise
  3.1 Identifiability of linear models
  3.2 Linear SEMs with additive Gaussian noise
    3.2.1 The PC algorithm
    3.2.2 The GES algorithm
  3.3 Linear SEMs with additive non-Gaussian noise
    3.3.1 The LiNGAM algorithm

4 Linear SEMs with Additive Mixed Noise
  4.1 The PCLiNGAM
  4.2 The score-based approach
    4.2.1 The linearity assumption
      4.2.1.1 Motivation
      4.2.1.2 Estimation of the precision matrix
    4.2.2 Consistency results

5 Simulations
  5.1 The algorithms
  5.2 Results
    5.2.1 Empirical analysis of the parameters
    5.2.2 Performance

6 Summary
  6.1 Future Work

Bibliography

A Equivalence of the Markov Conditions
  A.1 Equivalence of the Markov conditions for undirected graphs
  A.2 Equivalence of the Markov conditions for DAGs under separation
  A.3 Equivalence of the Markov conditions for DAGs under d-separation


Chapter 1

Introduction

In the past decades, the importance of causality as a research field has become more and more evident. Applications of causality can be found in several areas such as biostatistics, machine learning and the social sciences. This importance stems from the need to know what to do in order to achieve a certain result, and from the need to know what the effect of a certain change in the system will be. In more technical terms, and quoting Pearl (1993), “whereas the joint distribution tells us how probable events are and how probabilities would change with subsequent observations, the causal model also tells us how these probabilities would change as a result of external interventions in the system.”

Example 1.0.1: Consider a scenario where we would like to know the consequences of having some particular aspects of an experiment set to a precise level. However, assume that one cannot do this with an experiment because of the costs or because of some ethical aspects.¹ Still, knowing the causal structure underlying the experiment would allow us to compute such effects. In this way, causal inference can be used to come up with proposals, for instance to achieve a desired population level or to prevent a disease, that are based on solid scientific argumentation.

Knowing the underlying causal model also opens the possibility of inferring what would have happened if we had done something differently. Such scenarios are called counterfactuals and can be very interesting for the social sciences. However, sometimes they are misleading and have to be used carefully. Let us consider the following example.

Example 1.0.2: Consider a patient with a deadly disease for which only two treatments exist. Treatment A succeeds in 99% of the cases, but in the remaining 1% the patient dies. Treatment B succeeds only in the cases where treatment A fails, and when treatment B fails, the patient dies as well. Assume that a doctor prescribes treatment A to the patient and the patient dies. As a counterfactual we have that if the doctor had prescribed treatment B, the patient would have survived.

¹ It is not accepted to kill people to reduce the population of a certain region or to infect people to study the resistance or diffusion of a disease.


This does not necessarily mean that the doctor made the wrong decision. A problem is that human beings often try to blame someone else as the cause of what happened. Our work will not investigate these social and ethical aspects any further. Nevertheless, they illustrate potential (mis)applications of this field of study and are therefore worth mentioning.

The study of causality is continuously evolving, and theory and algorithms for several scenarios have already been proposed and analysed. Standard references are Lauritzen (1996), Spirtes et al. (2000), Pearl (2009) and Koller and Friedman (2009). Two famous and well-known settings are those where the interactions between the variables are linear and the noises are Gaussian and non-Gaussian, respectively. The linear Gaussian setting is a classic approach since it simplifies several tasks and is therefore a good model to start thinking about the problem. Indeed, theory and fast algorithms exist and are well established. In particular, we will look at the PC algorithm (Spirtes et al., 2000) and at the GES algorithm (Chickering and Boutilier, 2002). Among the advantages we have the correspondence between partial correlation and conditional independence, whereas the biggest limitation, as we will see in Section 3.1, consists of the restricted identifiability of the model: several models describe the same joint probability distribution. The linear non-Gaussian setting, on the other hand, does not have the identifiability problem, but in order to perform inference in this model we need additional mathematical tools. We will focus on the LiNGAM algorithm (Shimizu et al., 2006), which uses independent component analysis. There are more problems when we try to mix the two settings, that is, if we allow for both Gaussian and non-Gaussian noise. The idea of combining two already existing algorithms from the two frameworks has already been proposed and we will illustrate it. Apart from that, our main work will consist of merging and implementing the ideas developed in Loh and Bühlmann (2014) and in Nowzohour and Bühlmann (2014). We will cover both the theoretical as well as the practical aspects behind the two papers by reviewing the theory and implementing a new algorithm which will be compared with existing and well-performing algorithms.

In Chapter 2 we will introduce the basic terminology and notation commonly used in graph theory and causality as well as in this thesis.

In Chapter 3 we will present the problems and the solutions which already exist for the linear Gaussian and the linear non-Gaussian setting. We will start by addressing the problem of identifiability in Section 3.1. In Section 3.2 we will introduce the PC algorithm and the GES algorithm, explaining the fundamental ideas behind them. Finally, in Section 3.3 we will do the same type of analysis for the LiNGAM algorithm, which is the algorithm we are going to consider for the linear non-Gaussian setting.

In Chapter 4 we will start by presenting the PCLiNGAM algorithm in Section 4.1. This algorithm combines the PC algorithm and the LiNGAM algorithm and has been proposed in Hoyer et al. (2012). In Section 4.2 we will start with our main work by presenting the theory and by explaining and discussing the assumptions we are going to make. The section, and with it the chapter, will end with a proof of the consistency of our approach.

In Chapter 5 we will enter the computational and empirical part of our work.


In Section 5.1 we will present our algorithms in greater detail, also explaining the problems and the reasoning we dealt with during the implementation. In Section 5.2 we will analyse our algorithms from an empirical point of view. We start by discussing the choice of some key parameters and end by comparing the accuracy and efficiency of our method with those of the methods presented in Chapter 3 and Section 4.1.


Chapter 2

Basic concepts and notation

In this chapter we introduce the notation we are going to use throughout the thesis as well as the basic concepts that are needed to understand the material. We start with some essential definitions, which then allow us to introduce the Markov properties and the faithfulness assumption. Finally, we introduce the structure of the models we will consider.

2.1 Introduction

We start by giving the definition of the most fundamental object for our work.

Definition 2.1.1: An (undirected) graph G = (V,E) is an ordered pair, where V denotes the set of nodes and E ⊂ V × V the set of edges.

We will always assume that V = {x1, . . . , xp} if not stated differently. Beware that there is some potential for confusion, as the elements of V are also random variables and would therefore usually be denoted by capital letters. We will, however, use lower-case letters, as this is the standard notation. An element of E will be denoted by (i, j), where i, j ∈ {1, . . . , p}. If not stated differently, (i, j) will always be considered an ordered pair and thus represents a directed edge. This also suggests that we will use directed graphs and, indeed, we will work with directed acyclic graphs.

Definition 2.1.2: A directed acyclic graph, denoted hereafter by DAG, is a directedgraph without any cycle, i.e., without a directed path from and to the same node.

The intuition should already be clear, but we of course need a precise definition of a directed path.

Definition 2.1.3: Given a DAG G = (V,E), a path from xi to xj is a sequence of nodes (xπ(1), . . . , xπ(q)), for some permutation π : V → V and with q ≥ 2, such that π(1) = i, π(q) = j, and for all k ∈ {1, . . . , q − 1}, either (π(k), π(k + 1)) ∈ E or (π(k + 1), π(k)) ∈ E. Similarly, given a DAG G = (V,E), a directed path from xi to xj is a sequence of nodes (xπ(1), . . . , xπ(q)), for some permutation π : V → V and with q ≥ 2, such that π(1) = i, π(q) = j, and for all k ∈ {1, . . . , q − 1}, (π(k), π(k + 1)) ∈ E. The length of both a path and a directed path is the number of edges it contains, hence q − 1.

To avoid possible confusion, let us recapitulate the notation used: xi is a node, (i, j) is a directed edge, whereas (xi, . . . , xj) is a directed path from xi to xj. So edges are described by the index set of the variables and paths by the variables themselves. The only exception, or abuse of notation, are directed paths of length 1, as they are also edges. However, this should not cause any trouble, as edges and directed paths are mostly used in different contexts. Finally, we use the following multi-index notation to describe sets of nodes: xI is used as a shorthand for the set {xi : i ∈ I}.

The next definitions are easy and quite intuitive, but nevertheless very useful to facilitate the syntax when describing directed graphs.

Definition 2.1.4: Given a DAG G = (V,E), and a node xi ∈ V, we denote by

1. Pa(xi) the set of all j ∈ {1, . . . , p} such that (j, i) ∈ E. When referring to xPa(xi), we call this set the parents of xi.

2. Ch(xi) the set of all j ∈ {1, . . . , p} such that (i, j) ∈ E. When referring to xCh(xi), we call this set the children of xi.

3. Des(xi) the set of all j ∈ {1, . . . , p} such that there exists a directed path from xi to xj. When referring to xDes(xi), we call this set the descendants of xi.

4. Anc(xi) the set of all j ∈ {1, . . . , p} such that there exists a directed path from xj to xi. When referring to xAnc(xi), we call this set the ancestors of xi.

5. Nd(xi) the set {1, . . . , p} \ Des(xi). When referring to xNd(xi), we call this set the non-descendants of xi.

6. Adj(xi) the set of all j ∈ {1, . . . , p} such that either (i, j) ∈ E or (j, i) ∈ E.

There are two other definitions that are also quite intuitive, but this time they concern the structure of a DAG.

Definition 2.1.5: The skeleton of a DAG G = (V,E) is the undirected graph consisting of the same nodes and edges as G but without the orientations.

Definition 2.1.6: The moralized graph of a DAG G = (V,E), denoted by M(G), is the undirected graph obtained by connecting the parents of every node and then deleting the orientations of all edges.

The name “moralized” arises from the fact that we connect exactly those nodes representing two parents that are not linked. Such a triple, two non-adjacent parents and a child, is also called an immorality. Therefore, once we have connected all the unlinked parents, the DAG has been “moralized”. From now on, as is also common in the literature, we call such a triple a v-structure.

Definition 2.1.7: Given a DAG G = (V,E), a v-structure is a triple of nodes (xi, xj, xk) with {xi, xj, xk} ⊆ V and {j, k} ⊆ Pa(xi), but (j, k) ∉ E and (k, j) ∉ E.


Example 2.1.8: We summarize the notation introduced above with an example. Consider Figure 2.1, where we can see the differences between the generating DAG, the skeleton and the moralized graph. As we will see in Section 2.1.2, every DAG induces a topological order (Definition 2.1.15), and we can easily see from Figure 2.1(a) that (1, 4, 3, 2, 5) is a valid topological order. We can also see that Pa(x5) = {2, 3, 4}, Ch(x3) = {2, 5}, Des(x1) = {2, 3, 5}, Anc(x2) = {1, 3}, and Nd(x4) = {1, 2, 3, 4}. Moreover, (x5, x2, x4) is the only v-structure in the DAG.

[Figure: (a) Original DAG; (b) Skeleton of the DAG; (c) Moralized graph of the DAG.]

Figure 2.1: An example of a DAG, its skeleton and its moralized graph.

2.1.1 Markov properties and the faithfulness assumption

As already said, the nodes in V represent random variables. This already suggests that we work with both probabilities and graphs. In order to infer causal structures we need a fundamental property linking these two frameworks: the Markov property. In order to properly define the Markov property, we first need to introduce the concept of d-separation.

Definition 2.1.9: Let a DAG G = (V,E), two nodes xi, xj ∈ V and a set S ⊂ V containing neither xi nor xj be given. Then we say that xi and xj are d-separated given S, and denote this by xi d-Sep xj | S, if every path between xi and xj contains at least one triple of nodes which corresponds to one of the situations illustrated by the solid lines in Figure 2.2. Similarly, two disjoint sets A, B ⊂ V are d-separated given a third set S, which also has to be disjoint from both A and B, if for every element xi ∈ A and every element xj ∈ B, xi d-Sep xj | S.

The notion of d-separation is very central; in fact, we will promptly see that it is intrinsic to the two concepts that underlie the whole thesis. For this reason, it is a good idea to look at an example in order to become proficient with this definition.

Example 2.1.10: Consider the DAG in Figure 2.3. Applying the first or the second rule pictured in Figure 2.2(a) and (b) respectively, we can easily see that, for instance, x1 d-Sep x5 | x4, but also that x1 d-Sep x5 | {x3, x4} and many others. Using the third rule we can observe that x5 d-Sep x6 | x4 and, also in this case, we can condition on bigger sets as long as we do not violate the fourth rule; for instance, x5 d-Sep x6 | {x2, x3, x4} does also hold. Using the fourth rule we can recognize that, for instance, x1 d-Sep x2 | x4 does not hold, because x4 is a descendant of x3, which in turn lies in the center of a v-structure. In fact, x1 d-Sep x2 | ∅ is the only way to d-separate x1 from x2.
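Under faithfulness, d-separation statements such as these translate into vanishing partial correlations in a linear Gaussian SEM. The following is a minimal numerical sketch; the DAG is a hypothetical reconstruction of Figure 2.3 that is consistent with the statements above, and all coefficients are illustrative.

    # Hypothetical reconstruction of the DAG in Figure 2.3:
    # x1 -> x3 <- x2, x3 -> x4, x4 -> x5, x4 -> x6; unit edge weights.
    set.seed(3)
    n  <- 1e5
    x1 <- rnorm(n); x2 <- rnorm(n)
    x3 <- x1 + x2 + rnorm(n)
    x4 <- x3 + rnorm(n)
    x5 <- x4 + rnorm(n)
    x6 <- x4 + rnorm(n)
    # x1 d-Sep x5 | x4: the partial correlation of x1 and x5 given x4 is close to 0
    cor(resid(lm(x1 ~ x4)), resid(lm(x5 ~ x4)))
    # x1 and x2 are not d-separated given x4 (x4 is a descendant of the collider x3),
    # so their partial correlation given x4 is clearly non-zero
    cor(resid(lm(x1 ~ x4)), resid(lm(x2 ~ x4)))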


[Figure: four panels illustrating the d-separation rules. (a) At least one element of S lies on the directed path from xi to xj. (b) At least one element of S lies on the directed path from xj to xi. (c) At least one element of S is the node on the path between xi and xj with no incoming edges. (d) None of the elements of S is the node on the path between xi and xj with two incoming edges; moreover, there is no directed path from this node to a node in S.]

Figure 2.2: The first three cases are the easiest. The dashed lines mean that there may be other nodes in between and we do not care about the orientation of the edges connecting them (there are no arrowheads). Furthermore, xi and xv as well as xj and xw need not be different. Apart from that, we only require one node of S to lie on the path without being the center of a v-structure. The fourth case tells us that none of the elements of S can be involved in a v-structure. Moreover, also the descendants of the node in the center of the v-structure have to lie outside of S.

As mentioned above, d-separation is a very fundamental concept; for instance, it is part of the definition of the (global) Markov property. This property, which we state as a definition because it is something we will assume, is one of the most important concepts in causality and in particular for this thesis.

Definition 2.1.11: A joint probability distribution P on V is said to satisfy the global Markov property with respect to a DAG G = (V,E) if for all disjoint subsets A, B, and C of V,

A d-Sep B | C ⇒ A ⊥⊥ B | C.

There are other Markov properties which, under a certain condition, are equivalent to the above one. An example is the following one, which naturally arises for DAGs:

Definition 2.1.12: A joint probability distribution P on V is said to satisfy the Markov factorization property with respect to a DAG G = (V,E) if P has a density p and

p(x1, . . . , xp) = ∏_{i=1}^{p} p(xi | xPa(xi)).
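As a simple illustration, for a chain x1 → x2 → x3 the factorization reads p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2).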

These are the two Markov properties we are going to use most of the time, and we will assume them to be equivalent. For the other Markov properties and a rigorous treatment, including the necessary condition that needs to hold for the Markov properties to be equivalent, we refer the interested reader to Appendix A.


[Figure: a DAG on the nodes x1, . . . , x6.]

Figure 2.3: An example of a DAG from which we can read several d-separation statements.

This is a good point to make clear that we will always assume the Markov properties throughout the thesis. The same is true for the faithfulness assumption, which is tightly connected to the global Markov property. In fact, it assumes the other direction of the implication.

Definition 2.1.13: A joint probability distribution is said to be faithful to the DAG G = (V,E) if for all disjoint subsets A, B, and C of V,

A ⊥⊥ B | C ⇒ A d-Sep B | C.

Although we will always make the faithfulness assumption, it is fair to remark that while the Markov property is considered a valid assumption, the faithfulness assumption is more often criticized. In Figure 2.4 we can see a case where this assumption does not hold: the two paths, having the opposite effect on the target variable with exactly the same magnitude, cancel each other out. However, from a theoretical point of view, we can write the effect on the target variable as a polynomial and, assuming that the coefficients are randomly sampled from a continuous distribution, the effect will almost surely be non-zero. This is actually the classical argument in favour of the faithfulness assumption. Nevertheless, this does not always reflect reality, and other assumptions like the strong faithfulness assumption are made. This assumption, for instance, takes the finite-sample problem into account, but it can be shown to be violated with positive probability. See also Uhler et al. (2013) for a deeper insight. Another drawback of the faithfulness assumption arises in the presence of hidden variables, where we might obtain DAGs to which the true underlying distribution is faithful but with wrong (conditional) independence statements. Figure 2.5 illustrates this issue. In fact, the only independence statement when considering only the three observed variables is x1 ⊥⊥ x5 | ∅, which is also represented by the DAG pictured in Figure 2.5(a). The problem with this result is that a change in x1 or in x5 will definitely not cause a change in x3. Note that the underlying joint probability distribution is Markov with respect to both Figure 2.5(b) and Figure 2.5(c). Here, the faithfulness assumption leads us to wrong causal implications; dropping the assumption would allow us to recover a DAG without wrong causal statements, which is still a better result than a DAG with only wrong causal implications.


[Figure: a DAG on x1, x2, x3, x4 with two directed paths from x1 to x4; the edge weights shown are 2, −1, 1 and 2.]

Figure 2.4: This is a classical situation where the faithfulness assumption is violated. Indeed, x1 ⊥⊥ x4, but x1 and x4 are not d-separated given the empty set. The reason is that the two paths cancel each other out.
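A minimal numerical sketch of this cancellation (assuming the weights are assigned so that the two path effects cancel, 2 · 1 + (−1) · 2 = 0; the exact assignment in Figure 2.4 may differ):

    # Simulation of a faithfulness violation: two directed paths from x1 to x4
    # whose effects cancel exactly. Coefficients are illustrative.
    set.seed(42)
    n  <- 1e5
    x1 <- rnorm(n)
    x2 <-  2 * x1 + rnorm(n)            # path 1: x1 -> x2 -> x4 with effect 2 * 1
    x3 <- -1 * x1 + rnorm(n)            # path 2: x1 -> x3 -> x4 with effect -1 * 2
    x4 <-  1 * x2 + 2 * x3 + rnorm(n)
    cor(x1, x4)                         # approximately 0, although x1 and x4 are
                                        # not d-separated given the empty set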

[Figure: (a) True underlying DAG on x1, . . . , x5; (b) Markov and faithful DAG on the observed nodes x1, x3, x5; (c) Markov but not faithful DAG on x1, x3, x5.]

Figure 2.5: The true underlying DAG consists of all 5 nodes, but only the white nodes have been observed. On the left we have a DAG with respect to which the true underlying joint probability distribution is Markov and faithful, but which contains only wrong causal implications. On the right we have a DAG with respect to which the true underlying joint probability distribution is Markov but not faithful. This time we do not have any wrong causal implications.

2.1.2 Structural equation models and properties of DAGs

First of all we introduce the definition of a structural equation model, denoted hereafter by SEM. We consider only recursive¹ SEMs, which means that there are no causal feedbacks. We will use the concept of acyclicity when referring to the absence of feedbacks. This also means that we can represent a recursive SEM with a DAG. Formally, we have the following definition.

Definition 2.1.14: (Loh and Bühlmann, 2014, p. 3069) We say that a random vector X = (x1, . . . , xp) ∈ Rp satisfies a linear SEM if there is a permutation π : V → V such that, after reordering the variables and the noise terms according to π, it holds that

X = BX + ε, (2.1.1)

where B is a strictly lower triangular p × p matrix and ε is a p-dimensional random noise vector with εj ⊥⊥ {x1, . . . , xj−1} for all j ∈ {2, . . . , p}.

¹ See also Bollen (1989, p. 81).
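To make the definition concrete, the following is a minimal sketch of how one could generate data from a linear SEM with a strictly lower triangular B and a mixture of Gaussian and non-Gaussian noise; the coefficients and noise distributions are illustrative and not those used in the simulations of Chapter 5.

    # Sketch: data generation from a linear SEM X = BX + eps with mixed noise.
    set.seed(1)
    n <- 1000
    p <- 4
    B <- matrix(0, p, p)                  # strictly lower triangular coefficient matrix
    B[2, 1] <- 0.9; B[3, 1] <- -0.6; B[4, 3] <- 0.7
    eps <- cbind(rnorm(n),                # Gaussian noise for x1
                 runif(n, -1, 1),         # non-Gaussian (uniform) noise for x2
                 rnorm(n),                # Gaussian noise for x3
                 runif(n, -1, 1))         # non-Gaussian noise for x4
    # Since B is strictly lower triangular, 1 - B is invertible and X = (1 - B)^{-1} eps.
    X <- t(solve(diag(p) - B) %*% t(eps))
    colnames(X) <- paste0("x", 1:p)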


Note that requiring the matrix to be strictly lower triangular implies that there will be no feedback, i.e. no cycles, and the strictness excludes auto-regulation, which is also a form of feedback. For DAGs such a permutation always exists. This is due to the fact that every DAG induces a topological order, as stated in Lemma 2.1.16. If not specified differently, we will assume that the variables are already ordered, hence the permutation is the identity. We will see later that this can be done without loss of generality, since our algorithm is not going to use this information. Nevertheless, this convention will simplify the understanding of our algorithm, since we will simulate the data in this way. It is important to keep in mind, though, that finding a correct order is, to some extent, the hardest part of a causal learning procedure. Therefore, we should always be careful when making this assumption. Let us formalize these concepts.

Definition 2.1.15: (Loh and Bühlmann, 2014, p. 3069) Given a DAG G = (V,E), a permutation π : V → V is a topological order for G if π(i) < π(j) whenever (i, j) ∈ E.

Lemma 2.1.16: Every DAG G = (V,E) induces a topological order.

Proof. We are going to construct the topological order π : V → V inductively. First note that any DAG G = (V,E) has a node xi ∈ V with Pa(xi) = ∅. This is because of the acyclicity property: start at any node x ∈ V and go to one of its parents; repeat this with the selected parent and continue in the same way until the selected node has no parents. If this never happens, then after p steps, with p = |V|, we necessarily come back to an already visited node. This would imply that we have a cycle, which is a contradiction. Therefore, there always exists at least one node without parents in a DAG. We call these nodes root nodes. Let xi be one of these root nodes and set π(i) = 1. Now consider G1 = (V1, E1), where V1 = V \ {xi} and E1 is E without all edges incident to xi. G1 is again a DAG and thus we can apply the same reasoning as above. Let xj now be the next selected node and assume that xi1, . . . , xik, for k < p, have already been selected and deleted from G. We aim to set π(j) = k + 1. Suppose that there was an edge between xj and at least one of the already deleted nodes. Since the deleted nodes had no parents at the moment of their deletion, the only possibility is that the edge was directed from the deleted node towards xj, that is, (il, j) ∈ E for some l ∈ {1, . . . , k}. Hence, from the construction, it follows that π(i) < π(j) whenever (i, j) ∈ E.

Example 2.1.17: In this example we show how to apply the construction method stated in the proof, and we also show the non-uniqueness of the topological order. For this, consider Figure 2.6. We have to start with x2, as it is the only node without parents, so π(2) = 1. Now, looking at Figure 2.6(b), we see that we could take π(1) = 2 as well as π(3) = 2. Since we never claimed that the topological order has to be unique, this does not create any kind of problem. Setting π(1) = 2, we quickly see that the other values of π are determined. Indeed we have π(3) = 3, π(4) = 4 and π(5) = 5. One can easily check that this is indeed a topological order for the DAG represented by Figure 2.6(a).
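The constructive argument in the proof of Lemma 2.1.16 translates directly into a small routine. The following sketch (illustrative helper name; adjacency-matrix encoding amat[i, j] = 1 for the edge (i, j)) repeatedly removes a node without parents; since the exact DAG of Figure 2.6 is not fully determined by the text, the usage example at the end uses a simple three-node DAG instead.

    # Sketch of the constructive proof: repeatedly pick a node without parents,
    # append it to the order, and delete it from the graph.
    topological_order <- function(amat) {
      remaining <- seq_len(nrow(amat))
      ord <- integer(0)
      while (length(remaining) > 0) {
        sub  <- amat[remaining, remaining, drop = FALSE]
        root <- remaining[colSums(sub) == 0][1]   # a node with no parents left
        ord  <- c(ord, root)
        remaining <- setdiff(remaining, root)
      }
      ord                                         # ord[k] is the k-th node in a topological order
    }

    # Illustrative DAG: x1 -> x2, x2 -> x3, x1 -> x3.
    amat <- matrix(0, 3, 3)
    amat[1, 2] <- 1; amat[2, 3] <- 1; amat[1, 3] <- 1
    topological_order(amat)   # 1 2 3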

Until now we started with a model, like a SEM, and we argued that we could represent such a model with a DAG.


[Figure: (a) Starting DAG on x1, . . . , x5; (b) after the first step (x1, x3, x4, x5 remain); (c) after the second step (x3, x4, x5); (d) after the third step (x4, x5); (e) last step (x5).]

Figure 2.6: An example of the construction of a topological order for a DAG. Note that, in general, the topological order is not unique, as we can see in sub-figure (b).

On the other hand, a DAG also induces a family of joint probability distributions.

Definition 2.1.18: Denote by P_G the set of joint probability distributions generated by a DAG G = (V,E). In other words, following Nowzohour and Bühlmann (2014, p. 5), P has a density of the form p(x1, . . . , xp) = ∏_{i=1}^{p} pi(xi − (B · X)i),² where every pi is potentially any univariate continuous density but can also be restricted, as we will see in Chapter 3.

This latter definition allows us to compare DAGs; in particular, it allows us to define when two DAGs are equivalent.

Definition 2.1.19: Two DAGs G1 = (V,E1) and G2 = (V,E2) are said to be equivalent if they induce the same set of joint probability distributions.

This definition of equivalence is very intuitive but not straightforward to use. We will see in Section 3.1 that in the linear Gaussian setting, this definition can be simplified using properties of the DAG instead of properties of the underlying probability distribution.

² Note that here we heavily use the SEM structure as well as the Markov factorization property.


Chapter 3

Linear SEMs with Additive Gaussian or non-Gaussian Noise

In this chapter, we look at two well-known special cases of linear SEMs: linear SEMs with additive Gaussian and non-Gaussian noise, respectively. Restricting the underlying model is a common procedure when inferring causal structures. In Chapter 4, we allow a mixture of both Gaussian and non-Gaussian noises. For this reason, it is very important for comparisons to have well-performing algorithms in the two extreme cases. We will compare our algorithm with the PC algorithm, with the GES algorithm and with the LiNGAM algorithm.¹ All these algorithms are well known, and our aim is to summarize the most important ideas behind them.

First of all we discuss the problem of the identifiability of a given model in Section 3.1. In this first section we will see that Gaussian noise does, in general, not allow for full identification of the model, whereas non-Gaussian noise does. This is also the main reason to split these two cases. In Section 3.2 we analyse the Gaussian case and introduce the PC algorithm and the GES algorithm, whereas in Section 3.3 we look at the non-Gaussian case and the corresponding algorithm, called LiNGAM.

3.1 Identifiability of linear models

Before we distinguish between the two cases, we need to consider the identifiability problem. In some settings, the underlying true DAG is not identifiable and therefore there is not a unique correct solution. This is for instance the case when working with linear SEMs and Gaussian distributed noises.

Example 3.1.1: Assume that we are given two variables that are related through the following linear SEM:

¹ In Chapter 4 we will also introduce a fourth algorithm that we will use for comparisons. This is the PCLiNGAM algorithm, which already allows for mixed noises; therefore, it does not fit in this section.


x1 = ε1
x2 = a · x1 + ε2,

where ε1 ∼ N(0, σ1²) and ε2 ∼ N(0, σ2²). Our aim is to construct a new linear SEM of the form

x1 = ã · x2 + ε̃1
x2 = ε̃2,

which represents the same joint distribution as the first one. In Figure 3.1 the two DAGs are represented.

[Figure: (a) First SEM, represented by G = (V,E): x1 → x2; (b) Second SEM, represented by G̃ = (V, Ẽ): x2 → x1.]

Figure 3.1: These two DAGs represent the two SEMs discussed in Example 3.1.1.

In this case, we have ε̃1 ∼ N(0, σ̃1²) and ε̃2 ∼ N(0, σ̃2²). For the first SEM it holds that:

E[x1] = E[x2] = 0
var(x1) = σ1²,   var(x2) = a²·σ1² + σ2²
cov(x1, x2) = a·σ1².

Hence, (x1, x2)′ ∼ N2(0, Σ) with

Σ = ( σ1²     a·σ1²
      a·σ1²   a²·σ1² + σ2² ).

Since the Gaussian distribution is completely determined by the first two moments, it is enough to require the following:

E[x1] = E[x2] = 0                                          (3.1.1)
var(x1) = var(ã·x2 + ε̃1) = ã²·σ̃2² + σ̃1² = σ1²              (3.1.2)
var(x2) = var(ε̃2) = σ̃2² = a²·σ1² + σ2²                      (3.1.3)
cov(x1, x2) = cov(ã·ε̃2 + ε̃1, ε̃2) = ã·σ̃2² = a·σ1².           (3.1.4)

Equation 3.1.1 is trivial, and we actually already used this result above when defining ε̃1 and ε̃2. Solving Equation 3.1.4 we obtain

ã = a·σ1² / σ̃2².


Inserting Equation 3.1.3 directly, we obtain

ã = a·σ1² / (a²·σ1² + σ2²).

Note also that from Equation 3.1.2 we obtain

σ̃1² = σ1² − ã²·σ̃2² = σ1² − (a²·σ1⁴ / σ̃2⁴)·σ̃2² = σ1² − a²·σ1⁴ / (a²·σ1² + σ2²) = σ1²·σ2² / (a²·σ1² + σ2²).

This shows that it is possible to find a coefficient ã and variances σ̃1² and σ̃2² such that (x1, x2)′ ∼ N2(0, Σ̃) where, by construction, Σ̃ = Σ. This means that the two DAGs generate the same joint probability distribution and hence both directions give a correct result; thus, the model is not identifiable.
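A quick numerical check of this non-identifiability (a sketch with illustrative parameter values): for Gaussian data generated in the direction x1 → x2, the backward factorization x2 → x1 attains exactly the same maximised Gaussian log-likelihood.

    # Both directions of a linear Gaussian SEM fit the data equally well.
    set.seed(7)
    n  <- 1e5
    a  <- 1.5
    x1 <- rnorm(n, sd = 1)               # sigma1 = 1
    x2 <- a * x1 + rnorm(n, sd = 0.5)    # sigma2 = 0.5

    gauss_loglik <- function(z, mu, s2) sum(dnorm(z, mu, sqrt(s2), log = TRUE))
    loglik_direction <- function(cause, effect) {
      fit <- lm(effect ~ cause)
      s2_cause <- mean((cause - mean(cause))^2)    # ML variance of the cause
      s2_resid <- mean(residuals(fit)^2)           # ML residual variance of the effect
      gauss_loglik(cause, mean(cause), s2_cause) +
        gauss_loglik(effect, fitted(fit), s2_resid)
    }
    loglik_direction(x1, x2)   # forward model  x1 -> x2
    loglik_direction(x2, x1)   # backward model x2 -> x1: the same value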

Based on the previous considerations, it is useful to consider another definition of equivalence.

Definition 3.1.2: Two DAGs G1 = (V,E1) and G2 = (V,E2) are considered to be Markov equivalent if and only if for any triple (A, B, S) of disjoint subsets of V,

A d-Sep B | S in G1 ⇐⇒ A d-Sep B | S in G2.

We speak of the Markov equivalence class of a DAG G = (V,E) when we refer to the set of all DAGs which are Markov equivalent to G.

In other words, under the faithfulness assumption, two DAGs are Markov equivalent if they share the same conditional independence statements with respect to the global Markov property. This type of equivalence is easier to characterize and also easier to use. The characterization comes from Verma and Pearl (1991) and reads as follows:

Proposition 3.1.3: Two DAGs G1 = (V,E1) and G2 = (V,E2) are Markov equivalent if and only if they share the same skeleton and the same v-structures.

In order to represent such an equivalence class, we need to introduce another type of graph. We will use completed partially directed acyclic graphs, CPDAGs hereafter. For us, an undirected edge will be considered as bi-directed.

Definition 3.1.4: (Kalisch and Bühlmann, 2007, pp. 615-616) A partially directed acyclic graph, hereafter denoted by PDAG, is a graph in which all edges are either directed or bi-directed and it is not possible to create a cycle using the direction of the directed edges and any direction for the bi-directed edges.
A completed partially directed acyclic graph representing a Markov equivalence class is a partially directed graph where every directed edge is also directed (and in the same direction) in every DAG belonging to the Markov equivalence class. Moreover, both orientations of any bi-directed edge are represented by at least one DAG in the Markov equivalence class.

The next result, which can be found for instance in Theorem 5.2.6 in Pearl (2009), clarifies the reason why we did all this work in order to represent Markov equivalence classes.

Theorem 3.1.5: For linear Gaussian SEMs, two DAGs are equivalent if and only if they are Markov equivalent.

This allows us to represent the Markov equivalence class of a DAG in a simple way. At the same time, we are able to construct all the DAGs in a Markov equivalence class represented by a CPDAG. This is a useful tool, but the drawback is again the limited identifiability of such models. Since a Markov equivalence class can contain a lot of DAGs, it is impossible to gain a clear view of the structure of the causal dependencies by looking only at a CPDAG. Fortunately, the linear Gaussian setting is also the only problematic case we will encounter in this thesis. In the linear non-Gaussian setting, the true underlying DAG is fully identifiable, as shown in Shimizu et al. (2006).

3.2 Linear SEMs with additive Gaussian noise

In this section we deal with linear interactions and multivariate Gaussian distributed variables. These two assumptions are standard and there are well-established results for them. For this particular setting we will present two algorithms: the PC and the GES algorithm.

3.2.1 The PC algorithm

The PC algorithm is based on (conditional) independence tests.² The algorithm starts by testing each pair of nodes for marginal independence and then, if no independence was found, it continues with conditional independence tests. Even without knowing the details, we can take a look at an example in order to become familiar with what these independence tests look like. Consider the following example, where we assume to have an oracle instead of a real test procedure.

Example 3.2.1: Let us consider the true DAG in Figure 3.2. We can consider several different independence tests. For instance, we can start with the marginal independence tests. In this case we can test x ⊥⊥ y, x ⊥⊥ z, and y ⊥⊥ z. Using the rules for d-separation stated in Definition 2.1.9, we can see that none of these three independence statements is true. Now, we can look for conditional independences. In this case we test x ⊥⊥ y | z, x ⊥⊥ z | y, and y ⊥⊥ z | x and, using again the same rules as above, we can see that only x ⊥⊥ z | y holds. Therefore, we know that there is no edge between x and z. Those were already all possible independence tests and hence we are already done. The result is the CPDAG in Figure 3.2.

² Sometimes we speak about conditional independence tests although they may only be marginal independence tests. In these cases we tacitly assume that we are conditioning on the empty set.


[Figure: (a) True underlying DAG on x, y, z; (b) Final result without orientations.]

Figure 3.2: On the left we have the true underlying DAG, whereas on the right we can see the CPDAG which is returned by the PC algorithm.

This was a very simple example, but it already involved the most important ideas behind the PC algorithm. Clearly, the PC algorithm does not have an oracle answering the conditional independence statements. The default implementation of the PC algorithm in R, in the package pcalg, uses Fisher's z-transform, i.e.

Z(ρ) = (1/2) · log((1 + ρ)/(1 − ρ)),

to test for non-vanishing partial correlation between two nodes. In the linear Gaussian setting, vanishing partial correlation is equivalent to conditional independence. As a reference we use Kalisch and Bühlmann (2007), which establishes the consistency of the PC algorithm in the high-dimensional setting. We adapted a version of the PC algorithm found in this paper to write the pseudo-code presented in Algorithm 3.1. Moreover, the z-transformation can be found on page 618. This paper is also the reference for the R package pcalg, but, of course, the algorithm can also be found in more standard references like, for instance, Spirtes et al. (2000).
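As a small sketch of how such a test is carried out in practice (the helper function below is ours, using the test statistic of Kalisch and Bühlmann (2007)): the sample partial correlation of two nodes given a separating set S is z-transformed and compared with a standard normal quantile.

    # Sketch of the Fisher z-test for zero partial correlation.
    # rho_hat: sample partial correlation of x_i and x_j given a set S,
    # n: sample size, size_S: |S|, alpha: significance level.
    fisher_z_test <- function(rho_hat, n, size_S, alpha = 0.01) {
      z <- 0.5 * log((1 + rho_hat) / (1 - rho_hat))
      # reject H0: rho = 0 when sqrt(n - |S| - 3) * |Z(rho_hat)| exceeds
      # the (1 - alpha/2) standard normal quantile
      sqrt(n - size_S - 3) * abs(z) > qnorm(1 - alpha / 2)
    }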

The only essential parameter in the PC algorithm is the coefficient α, which serves as tuning parameter for the conditional independence tests. The PC algorithm tests the null hypothesis H0 : Z(ρ) = 0 against the two-sided alternative hypothesis H1 : Z(ρ) ≠ 0, where ρ is the partial correlation coefficient between two nodes. To be more precise, let n be the size of the separating sets; then the algorithm checks each ordered pair of nodes (xi, xj), with |Adj(xi)| − 1 > n, for conditional independence given, potentially, any set of size n. Again, this is done by testing the hypothesis stated above and hence working with the partial correlation coefficients. The size of the separating sets is sequentially increased: first marginal independence is checked, then conditional independence given sets of size one, and so on until there is no node left with as many as n + 1 adjacent nodes. All the separating sets found to be useful to separate two nodes are saved for the second part of the algorithm, which involves the orientation of the edges. This second part orients any pair of edges satisfying a precise condition to form a v-structure. The necessary condition is pictured in Figure 3.3. Basically, whenever there is a triple as the one in Figure 3.3(b), we can orient the edges only if the node in the middle is not in the separating set of the other two nodes.


The final step consists of applying some orientation rules. The standard reference for these rules is Meek (1995a). Figure 3.4 illustrates these rules, which can be proved to be sound. The rules have to be read as follows: consider as starting point the graph with an unoriented edge instead of the red oriented edge. Then, orient this edge as suggested in red.

[Figure: (a) Starting point: the triple xi, xk, xj with unoriented edges; (b) Case 1: xk is not present in the separating set; (c) Case 2: xk is present in the separating set.]

Figure 3.3: Rule for the orientation of the v-structures. If xk is not present in the separating set of xi and xj, we can orient the two edges as shown in figure (a). Otherwise, we only know that this is not the right orientation and hence we cannot orient the two edges.

[Figure: Meek's four orientation rules, one per panel: (a) Rule 1; (b) Rule 2; (c) Rule 3; (d) Rule 4.]

Figure 3.4: Orientation rules taken from Meek (1995a). The rules have to be read as follows: consider as starting point the graph with an unoriented edge instead of the red oriented edge. Then, orient this edge as suggested in red.

It can be shown that the PC algorithm is consistent and outputs the Markov equivalence class of the true underlying DAG. The interested reader can find this result in Theorem 5.1 in Spirtes et al. (2000).
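For completeness, a minimal usage sketch of the pcalg implementation mentioned above (the data matrix X, its column names and the choice α = 0.01 are illustrative):

    # Minimal sketch of running the PC algorithm from the pcalg package on a
    # numeric data matrix X (rows = observations, columns = variables).
    library(pcalg)
    suffStat <- list(C = cor(X), n = nrow(X))        # sufficient statistics for the Gaussian CI test
    pc.fit <- pc(suffStat, indepTest = gaussCItest,  # Fisher z partial-correlation test
                 alpha = 0.01, labels = colnames(X))
    pc.fit                                           # the estimated CPDAG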

3.2.2 The GES algorithm

The Greedy Equivalence Search (GES) algorithm has been introduced by Chickering (Chickering and Boutilier, 2002). The main idea is to take advantage of the result provided in Theorem 3.1.5. In other words, since a DAG is only a representative of a Markov equivalence class⁴ and each DAG in this Markov equivalence class is equivalent in a joint distributional sense to all the other DAGs, we can directly work at the level of Markov equivalence classes instead of the level of DAGs.

The GES algorithm consists of a two-stage greedy search where, starting from the empty DAG, i.e. the Markov equivalence class representing the empty DAG, we add edges in the first stage, whereas we delete edges in the second one. Adding and deleting edges will cause a change of the Markov equivalence class.

⁴ It can also happen that the Markov equivalence class of a DAG contains only one DAG.


Algorithm 3.1: PC Algorithm

input : Vertex set V, conditional independence test I
output: Pattern G′, separating sets S(·, ·)

Create a complete (fully connected) graph G′ = (V, E′) on the vertex set V.
Initialise n = −1, G = G′.
while ∃ xi, xj ∈ V with (i, j) ∈ E′ such that |Adj_G(xi) \ {xj}| ≥ n do
    n = n + 1
    while ∃ a new ordered pair (xl, xm) with (l, m) ∈ E′ such that |Adj_G(xl) \ {xm}| > n
          and ∃ S ⊆ Adj_G(xl) \ {xm} with |S| = n to test for conditional independence do
        Select such a new pair of vertices (xl, xm).
        while edge (l, m) has not been deleted and not all subsets of Adj_G(xl) \ {xm} of size n have been chosen do
            Choose a new S ⊆ Adj_G(xl) \ {xm} with |S| = n.
            if I(xl, xm, S) then
                E′ = E′ \ {(l, m)},  S(xl, xm) = S,  S(xm, xl) = S
            end
        end
    end
end
for all xi, xj, xk ∈ V with (i, k) ∈ E′, (j, k) ∈ E′, (i, j) ∉ E′ and (j, i) ∉ E′ do
    if xk ∉ S(xi, xj) then
        Orient (i, k) and (j, k) such that they form a v-structure at xk.
    end
end
while not all possible edges have been oriented do
    Apply Meek's orientation rules.
end
Return G′, S(·, ·)

To decide which Markov equivalence class to move to, GES uses a score function. In order to expose more details of the algorithm, we need to introduce some notation. We will work with Markov equivalence classes and write G ≈ H, where G = (V, EG) and H = (V, EH), if G and H encode the same conditional independence statements with respect to the global Markov property, that is, if G and H are Markov equivalent; see also Definition 3.1.2. Moreover, we write G ≤ H if all the constraints present in H are also present in G.⁵ Denote by D(G) the equivalence class containing the DAG G. From this, we say that

D̃ ∈ D⁺ if ∃ G ∈ D, G̃ ∈ D̃ with G ≤ G̃ and |Ẽ \ E| = 1.

In other words, D⁺ contains all the equivalence classes that we can reach by adding an edge to any of the representatives of the equivalence class D. D⁻ is defined analogously. Indeed, we say that

⁵ The less-or-equal sign can be interpreted as G having at most as many edges as H. The number of constraints, or conditional independence statements with respect to the Markov property, behaves differently; in fact, it increases.


D̃ ∈ D⁻ if ∃ G ∈ D, G̃ ∈ D̃ with G̃ ≤ G and |E \ Ẽ| = 1.

In other words, D⁻ contains all the equivalence classes that we can reach by removing an edge from any of the representatives of the equivalence class D.

We still have to describe how we select the edge to add or delete, respectively. This is done using a Bayesian scoring criterion. In particular,

S_B(G, X) = log(p(G)) + log(p(X | G)) (3.2.1)

is used. In Equation 3.2.1, G is a DAG and X is the dataset representing the observations of the random vector X, whereas p(G) and p(X | G) are the prior and the marginal likelihood, respectively. We are not going to discuss the possible choices of the prior and its approximations; the interested reader can find more insights in Chickering and Boutilier (2002, Chapter 4). With this score function we assign a score to any Markov equivalence class we encounter. Keep in mind that every representative of the Markov equivalence class has the same score since, by model assumption, Markov equivalence is the same as equivalence; see also Theorem 3.1.5. For instance, in the first stage we look at all equivalence classes that we can reach by adding an edge to a representative of the current equivalence class. In turn, we compute the equivalence class of the new DAG and then compute its score. It can be shown that this two-stage procedure is asymptotically optimal and that the choice of the score function is valid. For both, we refer to Chickering and Boutilier (2002). Now, we are ready to look at the pseudo-code of the GES algorithm.

Algorithm 3.2: GES algorithm

input : Dataset X, score function
output: CPDAG D

Initialise D as the equivalence class of the empty DAG G = (V, E) with E = ∅.
while an element of D⁺ has a higher score do
    Find the DAG G̃ = (V, Ẽ) whose equivalence class lies in D⁺ and which increases the score the most.
    Replace D with the equivalence class of G̃.
end
while an element of D⁻ has a higher score do
    Find the DAG G̃ = (V, Ẽ) whose equivalence class lies in D⁻ and which increases the score the most.
    Replace D with the equivalence class of G̃.
end
Return D
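As with the PC algorithm, the pcalg package also ships an implementation of GES. The following is a minimal usage sketch; note that pcalg's GaussL0penObsScore is an l0-penalised Gaussian score, so it is only analogous to, not identical with, the Bayesian score of Equation 3.2.1, and the data matrix X is again illustrative.

    # Minimal sketch of running GES from the pcalg package on a numeric data matrix X.
    library(pcalg)
    score   <- new("GaussL0penObsScore", data = X)   # an l0-penalised Gaussian score
    ges.fit <- ges(score)                            # two-stage greedy search over equivalence classes
    ges.fit$essgraph                                 # the estimated CPDAG (essential graph)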

3.3 Linear SEMs with additive non-Gaussian noise

This is the other “extreme” case of linear SEMs, where results are already well known and understood. In fact, for the linear non-Gaussian setting it can be proven that the underlying causal graph is fully identifiable.


Since we will look at something between the Gaussian and the non-Gaussian setting, this part will also play an important role when we compare performances. In this section there will be only one algorithm, the LiNGAM algorithm.

3.3.1 The LiNGAM algorithm

This time the reference is Shimizu et al. (2006). LiNGAM stands for Linear Non-Gaussian Acyclic Model and, as done for the PC algorithm and the GES algorithm, we will give a brief description of the ideas behind this algorithm.

The fundamental idea is to use Independent Component Analysis (ICA); all subsequent transformations are much more straightforward. Using Hyvärinen et al. (2001) as a reference, let us first define the model ICA deals with. Let X = (x1, . . . , xp)′ be a p-dimensional random vector. ICA aims to find a set of independent components s = (s1, . . . , sp)′ and a p × p matrix of coefficients A such that

X = As. (3.3.1)

This is the model for ICA, but we want to solve the problem represented by Equation 2.1.1. We can quickly see the connection between these two equations by noting that

X = BX + ε ⇐⇒ (1 − B)X = ε ⇐⇒ X = (1 − B)⁻¹ ε. (3.3.2)

Note that since B is strictly lower triangular, (1 − B) is indeed invertible and therefore Equation 3.3.2 is well defined. Coming back to ICA, we can now see how these two equations relate: it is sufficient to set the X of Equation 3.3.1 equal to one observation, A = (1 − B)⁻¹, and s = ε, which is by definition a vector with independent components. Moreover, it can be shown that the model represented by Equation 3.3.1 can be estimated if and only if the components in s are non-Gaussian, which is exactly what we are assuming. To be more precise, as one can see in Hyvärinen et al. (2001, Chapter 7) and in Comon (1994, Theorem 11, p. 294), one Gaussian component is allowed, since all components but the Gaussian one are distinguishable; so having only one Gaussian component makes this component distinguishable as well. The pseudo-code presented in Algorithm 3.3 is taken from Shimizu et al. (2006, p. 2007) and adapted to our notation. Now, we can analyse Algorithm 3.3.

1. We start by centering the data, which is a standard procedure when performing ICA. Then, we perform the ICA, obtaining S and A, where X is an n × p matrix with n observations of the same p variables, S is an n × p matrix containing the n realisations of the p independent components and A is a p × p matrix containing the coefficients. Finally, we define W = A⁻¹. Remembering Equation 3.3.2, S will be a matrix containing the independent errors of the SEM, whereas A will contain the matrix B defining the SEM that interests us.


Algorithm 3.3: LiNGAM Algorithm

input : n × p data matrix X
output: Adjacency matrix B̂

1. Center the columns of X to obtain X̃. Use ICA to compute A and S such that X̃ = SA, where S has the same size as X̃ and contains the independent components in its columns. Finally, define W = A^{-1}.
2. Permute the rows of W, yielding W̃, minimizing ∑_{i=1}^{p} 1/|W̃_{i,i}|.
3. Divide each row of W̃ by its diagonal element, creating W′.
4. Estimate B by B̂ = 1 − W′.
5. Find a permutation P, to be applied to both rows and columns of B̂, such that B̃ = PB̂P^T is as close as possible to a strictly lower triangular matrix, for instance by minimizing ∑_{i≤j} (B̃)²_{i,j}.

2. Coming now to step 2, recall that ICA returns W with its rows in a random order. Here we can take advantage of the structure of Equation 2.1.1. Indeed, we know that B is strictly lower triangular and therefore there is a unique row permutation for which (1 − B) has no zero elements on the diagonal. For this reason we permute the rows of W, penalizing small diagonal elements.

3. In the third step we simply scale the rows of W̃, yielding W′, in order to have ones on the diagonal. This is due to the equality W = (1 − B), which follows from combining Equation 3.3.1 and Equation 3.3.2. Since B is strictly lower triangular, we scale the magnitude of the coefficients so that the diagonal matches the diagonal of (1 − B). Often unit variance of the components is assumed instead, but in the case of SEMs this is of little interest and it is an arbitrary choice. The interested reader can find more details in Hyvarinen et al. (2001, p. 154).

4. Now we can estimate B by reversing the equation found above, obtaining B̂ = 1 − W′.

5. Finally, we permute our estimate again in order to achieve the structure we are looking for, in this case a strictly lower triangular matrix. This last step is necessary because the entries of our matrix which should be zero are only approximately zero; hence, the permutation found in the second step need not yield a lower triangular matrix.
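The following is a minimal R sketch of steps 1–4, assuming the fastICA package is available; the function and variable names are ours, and the row/column conventions can differ between ICA implementations, so this is only an illustration of the idea, not the original LiNGAM code.

library(fastICA)

# all permutations of a vector (feasible only for small p)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    out <- c(out, lapply(perms(v[-i]), function(q) c(v[i], q)))
  out
}

lingam_sketch <- function(X) {
  p   <- ncol(X)
  Xc  <- scale(X, center = TRUE, scale = FALSE)      # step 1: center the data
  ica <- fastICA(Xc, n.comp = p)                     # step 1: ICA with Xc = S %*% A
  # In the column-vector convention A^{-1} equals (1 - B) up to row permutation
  # and scaling; the transpose accounts for fastICA's data-matrix convention.
  W   <- t(solve(ica$A))
  # step 2: permute the rows of W, minimizing sum_i 1/|W_ii| over all permutations
  pl    <- perms(seq_len(p))
  costs <- sapply(pl, function(pi) sum(1 / abs(diag(W[pi, , drop = FALSE]))))
  Wt    <- W[pl[[which.min(costs)]], , drop = FALSE]
  Wp    <- Wt / diag(Wt)                             # step 3: ones on the diagonal
  diag(p) - Wp                                       # step 4: estimate of B
}

For larger p the exhaustive permutation search would have to be replaced by the assignment-problem heuristics used in the original LiNGAM implementation.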


Chapter 4

Linear SEMs with Additive Mixed Noise

This is our main chapter, where we present the three core points of our work: the use of a score-based approach, the restriction to linear SEMs, and the greedy search. The inspiration for these three points comes from Loh and Buhlmann (2014) and Nowzohour and Buhlmann (2014), and these two papers also serve as references for the whole chapter. In Section 4.1 we present the PCLiNGAM algorithm, which combines the PC algorithm and the LiNGAM algorithm and allows for a mixture of Gaussian and non-Gaussian noise distributions. Afterwards, in Section 4.2, we present the ideas as well as the theory behind our new algorithm.

4.1 The PCLiNGAM

One attempt at designing an algorithm for linear SEMs with a mixture of both noise distributions, Gaussian and non-Gaussian, has already been made in Hoyer et al. (2012) by combining the PC algorithm and the LiNGAM algorithm. The assumptions are the same as for a linear SEM, but this time we allow for continuous random noise with arbitrary continuous density. The idea is simple: we start with the PC algorithm in order to find the Markov equivalence class of the true underlying DAG. Then we use the idea behind the LiNGAM algorithm to orient all the edges that are still unoriented. At the end we perform a Gaussianity test for each of the noises and, with this information, we recover the distributional equivalence class by making some directed edges bi-directed, as discussed in Section 3.2.

Now let us analyse the procedure from a more theoretical point of view, following Hoyer et al. (2012). Since our model is linear, it is enough to look for vanishing partial correlations in order to infer conditional independence; hence, the PC algorithm is perfectly suited for this task. Still, the PC algorithm cannot exploit all the information contained in the data. Indeed, as we saw in Section 3.1, for non-Gaussian noise it is possible to fully identify the causal relations; hence, using an ICA score function as suggested in Hoyer et al. (2012), we can orient any edge connecting at least one node with non-Gaussian noise. At this point we have a fully oriented DAG. What we still have to do is to check which nodes have Gaussian residuals. This is necessary because in the last step we oriented all edges, including those between Gaussian nodes which should not be oriented, see Definition 4.1.2. Nevertheless, as illustrated in Section 3.1, the model can be fitted independently of the orientation, and hence the residuals of a wrongly oriented edge between two Gaussian nodes, which should not have been oriented, are still Gaussian. Therefore, testing for Gaussianity allows us to find out which variables are Gaussian distributed, and with this information, looking only at these Gaussian variables, we can recover the true underlying distributional equivalence class. We are mixing the concepts of Markov and distributional equivalence classes as well as algorithms for Gaussian and for non-Gaussian noise. Let us bring some order by defining a new type of DAG.

Definition 4.1.1: (Hoyer et al., 2012, p. 3) An ngDAG is a pair (G, ng) where G = (V, E) is a DAG and ng is a binary vector of length |V | in which a 0 corresponds to a node with Gaussian noise and a 1 corresponds to a node with non-Gaussian noise.

Fundamentally, this definition contains the information about d-separation in the DAG and, in the vector ng, the information needed to find the distributional equivalence class, which need not be the same as the Markov equivalence class. The distributional equivalence class can then be represented by a particular type of PDAG which is called the ngDAG pattern.

Definition 4.1.2: (Hoyer et al., 2012, p. 3) An ngDAG pattern representing an ngDAG (G, ng) is a PDAG obtained in the following way:

• Derive the CPDAG representing the Markov equivalence class of the DAG G.

• Orient any unoriented edge which emanates from, or terminates in, a node corresponding to a 1 in ng. Orient it as it is oriented in the DAG G.

• Finally, try to orient any unoriented edges using the rules pictured in Figure 3.4.

Note how this construction mirrors the steps of the PCLiNGAM algorithm: first we derive the Markov equivalence class, which in the PCLiNGAM algorithm is done by the PC algorithm. Then we orient the edges involving at least one non-Gaussian node, which in the PCLiNGAM algorithm is done by running the LiNGAM algorithm on the CPDAG found by the PC algorithm. Finally, we use the rules pictured in Figure 3.4, which in the PCLiNGAM algorithm is done after first testing for Gaussianity. Note also that if ng contains only zeros, we end up with the same result as the PC algorithm, since in the second step of Definition 4.1.2 nothing happens. Similarly, if ng contains only ones, we end up with the same result as the LiNGAM algorithm, since in the second step of Definition 4.1.2 we orient every edge and, consequently, nothing happens in the last step. This last definition is not only important because it allows us to represent the distributional equivalence class of this new, more involved, model represented by ngDAGs; it also provides a full characterization of it.

Theorem 4.1.3: (Hoyer et al., 2012, p. 3) Two ngDAGs are distributionally equivalent if and only if they are represented by the same ngDAG pattern.

The interested reader can find the proof in the appendix of Hoyer et al. (2012).

4.2 The score-based approach

For this section we follow the paper of Nowzohour and Buhlmann (2014). Let us suppose that we deal with p random variables X = (x1, . . . , xp) and that we have n i.i.d. observations, or realizations. The goal is to recover the true underlying DAG D0 from the true probability distribution1 P0(X). Of course, we know neither D0 nor P0(X). For this reason, the idea is to look at all possible DAGs on p nodes and to estimate, for each of these DAGs, the marginal densities p0i, i = 1, . . . , p. We then use a maximum likelihood approach to assign a score to every considered DAG by looking at how well the estimated density describes the n i.i.d. realizations.

As already mentioned previously, we assume the global Markov property and all its equivalent versions. This allows us to factorize the joint density with the Markov factorization property. We obtain that

p0(x1, . . . , xp) = p0(x1) · ∏_{i=2}^{p} p0(xi | xi−1, . . . , x1)
                 = p0(x1) · ∏_{i=2}^{p} p0(xi | xPa(xi))
                 = ∏_{i=1}^{p} p0(xi | xPa(xi)).

We are considering Equation 2.1.1, from which we can also see that the randomness of the variables X is actually encoded in the error vector ε. We can rewrite Equation 2.1.1 as

ε = (1 − B)X

and hence we can write

p0ε(ε1, . . . , εp) = p0ε(x1 − (BX)1, . . . , xp − (BX)p) = ∏_{i=1}^{p} p0εi(xi − (BX)i).

Having this, we can estimate the joint probability distribution via the marginal probability distributions. These marginal probability distributions are also involved when it comes to computing the score.

1 Note that since we assume the Markov property, as stated in Chapter 2, we implicitly assume that P has a density. This is required by Theorem A.2.2.


Given a DAG G = (V, E), we score it using the following score function:

S_n = (1/n) · ∑_{j=1}^{n} log( p̂_n(x^j_1, . . . , x^j_p) ) − |E| / log(n),    (4.2.1)

where n is the number of observations and p̂_n is the estimated joint density, built from the estimated densities of the residuals.

In Section 4.2.2 it will become clear why this is a good choice: we will show that the true underlying DAG D0 attains the best score in the large sample limit. To be sure to find the best scoring DAG we would have to compute the score of all possible DAGs with p nodes. However, even for small values of p this is not feasible, as the number of DAGs grows super-exponentially in p. Therefore, we use a greedy approach: at each step we look for the edge whose addition improves the score the most. The idea is somewhat similar to what the GES algorithm does, in the sense that we also use a greedy approach and a score function to decide which candidate is better. A sketch of how such a score can be computed is given below.
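As a concrete illustration, the following R sketch computes the score of Equation 4.2.1 for a fixed DAG, using lm for the regressions and density for the residual densities, in the spirit of the compscoreDAG routine described in Section 5.1; the function name and the treatment of root nodes are our own simplifications.

# dag is a p x p adjacency matrix with dag[i, j] = 1 meaning x_i -> x_j
score_dag <- function(X, dag) {
  n <- nrow(X); p <- ncol(X)
  loglik <- 0
  for (j in seq_len(p)) {
    pa  <- which(dag[, j] == 1)
    res <- if (length(pa) == 0) X[, j] - mean(X[, j])          # root node: centre only
           else residuals(lm(X[, j] ~ X[, pa, drop = FALSE]))  # regress on the parents
    dens   <- density(res)                                     # kernel density estimate
    loglik <- loglik + sum(log(approx(dens$x, dens$y, xout = res)$y))
  }
  loglik / n - sum(dag) / log(n)                               # penalty |E| / log(n)
}

In the greedy search this score is then compared across all candidate single-edge additions.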

4.2.1 The linearity assumption

This section is based on Loh and Buhlmann (2014). The linearity assumption can play a big role in terms of computational efficiency. It is well known that for additive noise models, hereafter ANMs, with multivariate Gaussian noise, a zero entry of the inverse covariance matrix, or precision matrix, corresponds to conditional independence of the two variables given all the others. Loh and Buhlmann (2014) extended this result to linear ANMs with arbitrary noise distributions.

4.2.1.1 Motivation

Consider a, not necessarily linear2, SEM, i.e., something of the form (Nowzohour and Buhlmann, 2014)

xi = fi(xPa(i)) + εi,

where i ∈ {1, . . . , p} and ε ∼ N(µ, Σ), i.e. the errors follow a p-dimensional Gaussian distribution. It is well known that in this case there is a one-to-one correspondence between the zero entries of the precision matrix and the conditional independences given all remaining variables. This result is stated as Proposition 5.2 in Lauritzen (1996), where the whole theory behind it can also be found.

Loh and Buhlmann (2014) showed that this is also true for almost any joint probability distribution of the errors, provided that X satisfies a linear SEM. Indeed, we have the following result.

Theorem 4.2.1: (Loh and Buhlmann, 2014, p. 3070) Assume that X satisfies a linear SEM and denote by G = (V, E) the corresponding DAG. Assume also that Σ = Cov(X) exists and define3 Θ = Σ^{-1}. Then, for i, j ∈ {1, . . . , p}, i ≠ j,

(i, j) ∉ M(G) ⇒ Θ_{i,j} = 0,    (4.2.2)

i.e., by the contrapositive, a non-zero entry of Θ can only occur at the position of an edge in the moralized graph M(G).

2 We dropped the linearity assumption, but we stick with the recursivity assumption, i.e. acyclicity.

Proof. (Loh and Buhlmann, 2014, p. 3070) Before we start with the main argument, let us first compute Θ explicitly. Define Ω = Cov(ε) = diag(σ_1^2, . . . , σ_p^2) and recall from Equation 3.3.2 that X = (1_{p×p} − B)^{-1} ε. Therefore, we get that

Σ = Cov(X) = (1_{p×p} − B)^{-1} Ω (1_{p×p} − B)^{-T}.    (4.2.3)

Now we compute Θ by simply inverting Equation 4.2.3:

Θ_{i,j} = (Σ^{-1})_{i,j} = ( (1_{p×p} − B)^T Ω^{-1} (1_{p×p} − B) )_{i,j}.

Writing out the two factors, (1_{p×p} − B)^T is upper triangular with ones on the diagonal and entries −b_{j,i} above it, while Ω^{-1}(1_{p×p} − B) is lower triangular with σ_i^{-2} on the diagonal and entries −b_{i,j} · σ_i^{-2} below it. Carrying out the multiplication yields

Θ_{i,j} = σ_i^{-2} + ∑_{k=i+1}^{p} b_{k,i}^2 · σ_k^{-2}                       for i = j,
Θ_{i,j} = −σ_i^{-2} · b_{i,j} + ∑_{k=i+1}^{p} b_{k,j} · b_{k,i} · σ_k^{-2}    for i > j.    (4.2.4)

Now we are ready for the main argument. Suppose that for i ≠ j we have (i, j) ∉ M(G), and assume without loss of generality4 that i > j. Since (i, j) ∉ E, we have b_{i,j} = 0 by construction. Moreover, i and j cannot share a common child, because we are working with the moralized graph and, if they did, (i, j) would be in M(G). Hence, for every k > i, b_{k,i} and/or b_{k,j} has to be equal to zero. From the first observation, the term outside the sum in the second line of Equation 4.2.4 is zero; from the second, at least one of the two factors in each summand is zero. Thus Θ_{i,j} = 0 for all such i > j, as desired.

Note that this result is very much in the spirit of the global Markov property. The global Markov property, in contrapositive form, tells us that whenever two variables are dependent given some conditioning set, the corresponding nodes cannot be d-separated by that set. Similarly, Theorem 4.2.1 asserts that whenever an entry of the precision matrix is non-zero, i.e. the partial correlation between the two corresponding random variables given all the others does not vanish5, the two corresponding nodes are adjacent in the moralized graph.

3 This is well-defined assuming independent, non-deterministic additive random errors, which is a reasonable assumption that we have already made. In this case the covariance matrix is positive definite, not only positive semi-definite, and hence it is invertible.

4 In this way we are still consistent with B being lower triangular, and since in M(G) the edges are undirected, it does not matter whether we consider (i, j) or (j, i).

5 There is some linear dependence, as the partial correlation coefficient measures linear dependence; hence it suffices to exclude marginal independence.


The converse of Theorem 4.2.1 is not necessarily true and has to be assumed. In fact, it is a type of faithfulness assumption, see Remark 3 in Loh and Buhlmann (2014). Faithfulness tells us that every conditional independence statement that holds for the underlying distribution is also reflected in the representing DAG. Here, we assume that a zero entry in the precision matrix, which for a multivariate Gaussian distribution would imply conditional independence, is reflected by the two corresponding nodes not being adjacent in the moralized graph, i.e. they can be d-separated given some set. It is essentially a necessary assumption in order to achieve useful results, and it can be argued, similarly to the discussion in Section 2.1.1, that in the case where the non-zero entries of B are a realisation of independent continuous random variables, the set where Assumption 1 does not hold has Lebesgue measure zero. This argument is also widely used in standard references such as Spirtes et al. (2000, p. 41) and Koller and Friedman (2009, p. 74). We omit the formal proof, which can be found in Meek (1995b, Theorem 7); the point is that the set of coefficient values for which a non-constant polynomial vanishes has Lebesgue measure zero.

Assumption 1: (Loh and Buhlmann, 2014, p. 3071) Under the same assumptions as in Theorem 4.2.1, we require that

Θ_{i,j} = 0 ⇒ (i, j) ∉ M(G),

i.e., Θ_{i,j} can only vanish if the term outside the sum in the second line of Equation 4.2.4 is zero and each product b_{k,j} · b_{k,i}, for k > i, is zero as well.

This assumption simply rules out those special cases where the entries of B happen, by coincidence, to produce a zero entry of Θ without being zero themselves. With this result and this assumption we can reduce the size of the DAG search space by ruling out all edges between pairs of nodes whose entry in the precision matrix is zero. We hope that this will lead to more accurate results or at least to a computational speed-up.

4.2.1.2 Estimation of the precision matrix

We now turn to estimating the precision matrix. We distinguish two cases: the classical low-dimensional case and the high-dimensional case. In both cases we have to restrict the possible distributions of the errors; nevertheless, we still allow joint distributions with Gaussian and non-Gaussian marginals at the same time. For sub-Gaussian error distributions one can bound the maximum norm of the difference between the estimated and the true precision matrix with high probability. This also allows us to simply use a cut-off, setting small values to zero. The high-dimensional case is not very different in the sense that we obtain a bound for the same norm and we also use a cut-off; the difference is that instead of computing the covariance matrix directly and then inverting it, we use the graphical LASSO.

Definition 4.2.2: (Loh and Buhlmann, 2014, p. 3079) A random variable x is sub-Gaussian with parameter σ² if, for all λ ∈ R,

E[ exp( λ(x − E[x]) ) ] ≤ exp( σ²λ² / 2 ).


The bound is achieved by controlling the smallest entry (in absolute value) of the true precision matrix, Θ_0^min.

Let us introduce some notation. We denote by Σ_0 and Θ_0 the true covariance and precision matrices, respectively; Σ̂ and Θ̂ will denote their estimates. For the low-dimensional case we will use

Σ̂ = (1/n) ∑_{i=1}^{n} (x^i)(x^i)^T = (1/n) X^T X   and   Θ̂ = Σ̂^{-1},

as Σ̂ is invertible almost surely. For the high-dimensional case we will use the graphical LASSO, i.e.

Θ̂ = arg min_{Θ ≻ 0} [ tr(Θ Σ̂) − log(det(Θ)) + λ ∑_{i ≠ j} |Θ_{i,j}| ],

see also Loh and Buhlmann (2014, p. 3080). These two estimators have the desired properties; indeed, we have the following two lemmas.

Lemma 4.2.3: (Loh and Buhlmann, 2014, p. 3080) Assume that the error terms are all i.i.d. sub-Gaussian random variables with parameter σ². Then

P( ‖Θ̂ − Θ_0‖_∞ ≤ c_0 σ² √(p/n) ) ≥ 1 − c_1 e^{−c_2 p},

and thresholding Θ̂ at level τ = c_0 σ² √(p/n) recovers supp(Θ_0) provided Θ_0^min > 2τ.

Similarly,

Lemma 4.2.4: (Loh and Buhlmann, 2014, p. 3080) Assume that the error terms are all sub-Gaussian random variables with parameter σ². Moreover, assume that n > C d log(p). Then

P( ‖Θ̂ − Θ_0‖_∞ ≤ c_0 σ² √(log(p)/n) ) ≥ 1 − c_1 e^{−c_2 log(p)},

and thresholding Θ̂ at level τ = c_0 σ² √(log(p)/n) recovers supp(Θ_0) provided Θ_0^min > 2τ.

For the two proofs we refer to Vershynin (2010) for Lemma 4.2.3 and to Ravikumar et al. (2011) for Lemma 4.2.4.
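A minimal R sketch of this thresholding step is given below, with direct inversion in the low-dimensional case and the graphical LASSO (assuming the glasso package) in the high-dimensional case; the function and argument names are ours, and the regularisation parameter lambda is purely illustrative.

estimate_moral_graph <- function(X, tau, lambda = 0.1, high_dim = (ncol(X) >= nrow(X))) {
  S     <- cov(X)
  Theta <- if (high_dim) glasso::glasso(S, rho = lambda)$wi   # sparse precision estimate
           else solve(S)                                      # direct inversion
  M <- abs(Theta) > tau          # threshold at tau, cf. Lemmas 4.2.3 and 4.2.4
  diag(M) <- FALSE
  M                              # TRUE where an edge of the moralized graph is allowed
}

The resulting logical matrix can then be passed to the search algorithm to exclude edges, as described above.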

4.2.2 Consistency results

The theory of this section is entirely based on Nowzohour and Buhlmann (2014). We will prove that our procedure is consistent under certain conditions. In particular, we will prove that, under the assumptions listed in Assumption 2, the probability that the score of the true underlying DAG is smaller than or equal to the score of any other DAG goes to zero. But before we look at the assumptions in detail, we need some additional concepts and definitions.

We follow van de Geer (2000) to introduce some notational shortcuts that make this section easier to read. Recall that in our setting we have p random variables, X = (x1, . . . , xp), and let x^1_i, . . . , x^n_i be the n independent observations of the ith variable. Assume that P is the probability distribution of xi and that f is a function defined on the sample space of xi. Then we write

P f = ∫ f dP.

Similarly, for the empirical distribution P_n, we write

P_n f = (1/n) ∑_{j=1}^{n} f(x^j_i).

Definition 4.2.5: (Nowzohour and Buhlmann, 2014, p. 5) Given a DAG G = (V, E), the class of induced joint probability densities is defined as

P_G = { ∏_{i=1}^{p} p_i( x_i − q(x_{Pa(x_i)}) ) },

where p_i ranges over the univariate probability density functions allowed by the model6 and q over the polynomials of degree 1 in each variable7.

We also need to specify the space over which we will search for the “best” density.

Definition 4.2.6: (Evans, 2010, p. 258) The Sobolev space W^{r,s}(R^n) is defined by

W^{r,s}(R^n) := { f ∈ L^r(R^n) : D^α(f) ∈ L^r(R^n) for all |α| ≤ s },

where r, s ∈ N* and α = (α_1, . . . , α_n) is a multi-index. Moreover, D^α denotes the partial derivative ∂^{|α|} / (∂x_1^{α_1} · · · ∂x_n^{α_n}) in the weak sense.

In particular, the Sobolev space we will use is a weighted version of this, namely, as suggested in Nowzohour and Buhlmann (2014, p. 5),

W^{r,s}(R^n, 〈·〉_β) := { f ∈ L^r(R^n) : D^α(f · 〈·〉_β) ∈ L^r(R^n) for all |α| ≤ s },

where 〈x〉_β = (1 + ‖x‖_2^2)^{β/2} for some β ≤ 0. Note that with β = 0 we obtain the original Sobolev space of Definition 4.2.6. The assumptions are:

Assumption 2: (Nowzohour and Buhlmann, 2014, pp. 7-8)

1. The model should be identifiable8.

2. Causal minimality: G = (V, E) is said to be causally minimal with respect to P if P is Markov with respect to G but P is not Markov with respect to any G̃ = (V, Ẽ) with Ẽ ⊊ E.

6 In our case we will allow only sub-Gaussian distributed densities.

7 We should really only consider polynomials of degree 1 in each variable. Nevertheless, the proof of our main result will also cover the case of an edge of zero magnitude.

8 In our model this assumption will be violated, but we will discuss this issue later. The interested reader can find the original assumption in Nowzohour and Buhlmann (2014) and more insights about identifiability in Peters et al. (2014).


3. Smoothness of the log-densities: For all DAGs G = (V, E), the log-densities in P_G, restricted to their respective supports, are elements of a bounded Sobolev space. This means that there exist r > 1, s > p, β < 0, C > 0 such that for all p ∈ P_G,

∑_{|α| ≤ s} ‖ D^α( 〈·〉_β · 1_{p>0} · log(p) ) ‖_r < C.

4. Moment condition for the densities: Given r, s, p and β as in the previous point, for all DAGs G and all p ∈ P_G we require that there exists γ > s − p/r such that

‖ p · 〈·〉_{γ−β} ‖_r < ∞.

5. Uniformly bounded variance of the log-densities: For all DAGs G and all p_0 ∈ P_G, there exists K > 0 such that

sup_{p ∈ P_G} var_{p_0}( log(p(x_1, . . . , x_p)) ) < K.

6. Closedness of the density classes: For all DAGs G, the induced density class P_G is a closed set with respect to the topology induced by the Kullback–Leibler divergence

d_{KL}( p(x) ‖ q(x) ) = ∫ p(x) log( p(x) / q(x) ) dx.

The first assumption implies that - since we are dealing with linear models - we would like to avoid cases where the true underlying DAG can only be identified up to some equivalence class; indeed, we analysed this problem in Section 3.1. Stated more precisely, the equivalence classes, as defined in Definition 2.1.19, should contain only a single DAG. For our model this is a strong assumption and not really meaningful, since we allow for a mixture of Gaussian and non-Gaussian noise distributions, which can lead to several distribution equivalent DAGs, as seen in Section 3.1. However, we will see later that this does not cause trouble. For the moment we continue with this identifiability constraint.

As before, we assume faithfulness. For this reason the second assumption is not necessary, since causal minimality is implied by faithfulness, as stated in Lemma 4.2.7. We included the causal minimality assumption for completeness and to point out that, for this particular result, the faithfulness assumption can be weakened.

Lemma 4.2.7: If a joint probability distribution P is faithful to a DAG G = (V, E), then G is also causally minimal with respect to P.

Proof. Assume that G = (V, E) is faithful with respect to P and that there exists G̃ = (V, Ẽ) such that P is Markov with respect to G̃ and Ẽ ⊊ E. Let (i, j) ∈ E with (i, j) ∉ Ẽ. This missing edge in Ẽ necessarily creates a new d-separation statement9 in G̃, i.e. xi d-sep xj | S for some suitable S ⊂ V \ {xi, xj}. Since P is Markov with respect to G̃, we then have xi ⊥⊥ xj | S. This immediately contradicts the fact that in G the nodes xi and xj are adjacent and hence, by faithfulness, always dependent.

9 It is not possible to d-separate two adjacent nodes, but we can always d-separate two nodes which are not adjacent.

Although it has already been clarified in Chapter 2, note again that we are also assuming the Markov condition. Indeed, we are tacitly using that P is Markov with respect to G = (V, E), as otherwise G could never be causally minimal and the statement would be wrong. We omitted this assumption from the statement because the Markov condition is such a central assumption for the whole work that it is implicit in every statement.

Assumptions 3 to 5 are technical assumptions that play a role in the following four lemmas, which are auxiliary results needed to prove the main theorem.

The last assumption is also technical and is needed to guarantee the existence of the maximizers, used below, for each density class. In this way it also guarantees that the true density p0 has a positive distance from the wrong classes of densities.

Before we start with the four lemmas we are going to use in the main proof, we still need some notation.

Definition 4.2.8: (Nickl and Potscher, 2007, p. 182) For two real-valued functions f and g, we write f(x) ≲ g(x) if there exists a positive constant C ∈ R such that f(x) ≤ C · g(x) for every x > 0.

Definition 4.2.9: (Nowzohour and Buhlmann, 2014, p. 15) For a Borel measure µ and a positive and finite ε, we say that two µ-measurable functions f^L, f^U : R^p → R form an ε-bracket for a Borel-measurable function f ∈ F if f^L ≤ f ≤ f^U and ‖f^U − f^L‖_{1,µ} < ε, where ‖·‖_{1,µ} is the standard L1-norm with respect to µ.

Definition 4.2.10: Given the same functions and the same measure as in Definition 4.2.9, and for some 1 ≤ r ≤ ∞, we denote by N_[](ε, F, ‖·‖_{1,µ}) the L^r(µ)-bracketing number, which is the minimal number of ε-brackets needed to cover F.

When referring to N_[](ε, F, ‖·‖_{1,µ}) we will simply write N_[] to ease the notation. The above definition means that there are pairs of functions (f^L_i, f^U_i), i ∈ {1, . . . , N_[]}, such that for every f ∈ F there exists 1 ≤ i ≤ N_[] for which (f^L_i, f^U_i) forms an ε-bracket for f.

Definition 4.2.11: Given the definition above, we define the bracketing entropy of F by H_[](ε, F, ‖·‖_{1,µ}) = log( N_[](ε, F, ‖·‖_{1,µ}) ).

Now we come to the first lemma.

Lemma 4.2.12: (Nowzohour and Buhlmann, 2014, p. 15) Assume the following:

1. There exists 0 ≤ α < 1 such that H_[](ε, F, ‖·‖_{1,p^0}) ≲ ε^{−α} for all ε > 0.

2. There exists C ∈ R such that var(f(x_1, . . . , x_p)) < C for all f ∈ F.


Then F satisfies the following uniform law of large numbers (ULLN):

P( sup_{f∈F} |(P_n − P^0) f| > δ_n ) → 0 as n → ∞,

where δ_n = c / log(n) for some c > 0.

Proof. (Nowzohour and Buhlmann, 2014, pp. 15-16) Instead of looking directly at f, we can also look at the bracketing functions. Given the δ_n-brackets (f^L_i, f^U_i) of f ∈ F, we have the following two inequalities:

(P_n − P^0) f < (P_n − P^0) f^U_i + δ_n   and   (P_n − P^0) f > (P_n − P^0) f^L_i − δ_n.

Having this, we can easily bound the absolute value:

|(P_n − P^0) f| < max_{i ∈ {1,...,N_[]}} ( |(P_n − P^0) f^L_i|, |(P_n − P^0) f^U_i| ) + δ_n.

Therefore,

sup_{f∈F} |(P_n − P^0) f| ≤ max_{f ∈ {f^L_i, f^U_i : i ∈ I}} |(P_n − P^0) f| + δ_n.    (4.2.5)

We conclude the proof with the following series of inequalities:

P( sup_{f∈F} |(P_n − P^0) f| > 2δ_n )
  ≤ P( max_{f ∈ {f^L_i, f^U_i : i ∈ I}} |(P_n − P^0) f| > δ_n )    (4.2.6)
  ≤ 2 N_[](δ_n) · max_{f ∈ {f^L_i, f^U_i : i ∈ I}} P( |(P_n − P^0) f| > δ_n )    (4.2.7)
  ≲ exp(δ_n^{−α}) · C² / (n δ_n²),    (4.2.8)

where in Inequality 4.2.6 we have just inserted Inequality 4.2.5, in Inequality 4.2.7 we used Boole's inequality, and in the last inequality, Inequality 4.2.8, we used both assumptions: the first one for the first factor and the second one in order to apply Chebyshev's inequality. Inserting δ_n = c / log(n) we obtain

P( sup_{f∈F} |(P_n − P^0) f| > 2δ_n ) ≲ exp( (c / log(n))^{−α} ) · C² log(n)² / (n c²)
  ∝ exp( log(n)^α c^{−α} − log(n) ) · log(n)²
  = exp( log(n) · (log(n)^{α−1} c^{−α} − 1) ) · log(n)²
  = n^{ log(n)^{α−1} c^{−α} − 1 } · log(n)²  →  0  as n → ∞.


The convergence follows from the fact that α − 1 < 0 by assumption; therefore, the exponent of n becomes negative for n large enough.

The next lemma, the second one in our list, is part of Theorem 1 in Nickl and Potscher (2007, p. 182), and we are going to use it without reproducing the proof. The main reason is that the proof is based on entropy methods; these methods are not really related to our topic, and the result is for us only a tool, hence not of primary interest.

Lemma 4.2.13: (Nowzohour and Buhlmann, 2014, p. 16) Suppose that F is a non-empty and bounded subspace of the weighted Sobolev space W^{r,s}(R^p, 〈·〉_β) for some β < 0. Furthermore, suppose that there exists γ > s − p/r such that ‖〈·〉_{γ−β}‖_{1,µ} = ‖µ(x)〈x〉_{γ−β}‖_1 < ∞, where µ is a Borel measure on R^p. Then

H_[](ε, F, ‖·‖_{1,µ}) ≲ ε^{−p/s}.

Proof. See proof of Theorem 1 in Nickl and Potscher (2007, p. 182).

We make a particular choice of both the set of functions F and the Borel measure µ: we consider F = { 1_{p>0} log(p) | p ∈ P^i } and µ = p^0. Moreover, for each density class P^i we define the entropy maximizer and the maximum likelihood estimator by

p^i = arg max_{p ∈ P^i} P(log(p))    (4.2.9)

and

p^i_n = arg max_{p ∈ P^i} P_n(log(p)),    (4.2.10)

respectively. See also Nowzohour and Buhlmann (2014, p. 14).

Lemma 4.2.14: (Nowzohour and Buhlmann, 2014) Suppose that the following uniform law of large numbers holds:

P( sup_{p ∈ P^i} |(P_n − P) 1_{p>0} log(p)| > δ_n ) → 0 as n → ∞.

Then

P( |P_n(log(p^i_n)) − P(log(p^i))| > δ_n ) → 0 as n → ∞.

Proof. From the definitions it is immediate that

P_n(log(p^i_n)) ≥ P_n(log(p^i)) = P(log(p^i)) + (P_n − P)(log(p^i)).    (4.2.11)


Now define P^i_n as the restriction of P^i to the densities whose support contains the data. Thus, we obtain

P_n(log(p^i_n)) = max_{p ∈ P^i} P_n(log(p))
              = max_{p ∈ P^i_n} P_n(log(p))
              = max_{p ∈ P^i_n} ( P(log(p)) + (P_n − P)(log(p)) )
              ≤ P(log(p^i)) + sup_{p ∈ P^i_n} (P_n − P)(log(p)).    (4.2.12)

The first equality follows by definition: inserting the arg max into the function yields the maximum. The second equality holds because any density whose support does not include the data evaluates to −∞ under both the maximum likelihood and the maximum entropy criterion, so such densities can be excluded; the third equality simply decomposes P_n as P + (P_n − P). Finally, in the last inequality we split the sum, bounding the first term by Equation 4.2.9 and replacing the max with a sup in the second term.

Now, from Equation 4.2.11 we obtain

−( P_n(log(p^i_n)) − P(log(p^i)) ) ≤ −(P_n − P)(log(p^i))
                                   ≤ |(P_n − P)(log(p^i))|
                                   ≤ sup_{p ∈ P^i_n} |(P_n − P)(log(p))|.

Furthermore, from Equation 4.2.12 we directly obtain

P_n(log(p^i_n)) − P(log(p^i)) ≤ sup_{p ∈ P^i_n} (P_n − P)(log(p))
                              ≤ sup_{p ∈ P^i_n} |(P_n − P)(log(p))|.

Finally, combining these two results, we obtain

|P_n(log(p^i_n)) − P(log(p^i))| ≤ sup_{p ∈ P^i_n} |(P_n − P)(log(p))|
                                ≤ sup_{p ∈ P^i} |(P_n − P) 1_{p>0} log(p)|,

and thus, using the ULLN we assumed,

P( |P_n(log(p^i_n)) − P(log(p^i))| > δ_n ) ≤ P( sup_{p ∈ P^i} |(P_n − P) 1_{p>0} log(p)| > δ_n ) → 0 as n → ∞,

as desired.


We are almost ready to begin with the proof of the main theorem; we only need the following lemma.

Lemma 4.2.15: (Nowzohour and Buhlmann, 2014, p. 17) Let a, b, a′, b′ ∈ R and ε > 0. Assume either

a − b > ε and a′ − b′ ≤ 0,    or    a − b < ε and a′ − b′ > 2ε.

Then

|a − a′| > ε/2   or   |b − b′| > ε/2.

Proof. Assume first that a − b > ε and a′ − b′ ≤ 0. Subtracting the two inequalities yields

ε < a − b − (a′ − b′) = |a − b − (a′ − b′)| ≤ |a − a′| + |b − b′|.

Hence at least one of the two terms on the right-hand side has to be larger than ε/2.

Assume now that a − b < ε and a′ − b′ > 2ε. Then, again subtracting the two inequalities, we obtain

ε < a′ − b′ − (a − b) = |b − b′ − (a − a′)| ≤ |b − b′| + |a − a′|.

Hence, again, at least one of the terms on the right-hand side has to be larger than ε/2.

Theorem 4.2.16: Assume the first five conditions of Assumption 2 and define the penalized likelihood score of the i-th DAG G^i = (V, E_i) in D by

S^i_n = (1/n) ∑_{j=1}^{n} log( p^i_n(x^j_1, . . . , x^j_p) ) − |E_i| · a_n,

where a_n = 1/log(n). Furthermore, denote G^{i_0} := G^0. Then we have

P( S^{i_0}_n ≤ S^i_n ) → 0 as n → ∞, for all i ≠ i_0.

Proof. Assume that i ≠ i_0 and denote by δ_n = (|E_i| − |E_{i_0}|) / log(n) the difference in penalization between the two DAGs. We consider two cases:

1. p^0 ∈ P^i.

2. p^0 ∉ P^i.

In the first case p^i = p^0, so both classes of probabilities, P^i and P^{i_0}, contain p^0. They are different by assumption and, since D^0 is causally minimal with respect to p^0 and hence also with respect to P^i, D^i has to have at least as many edges as D^0. Note that the Markov assumption ensures that any DAG representing a density in P^i contains D^0. This means that D^i has to have strictly more edges than D^0 and thus δ_n > 0¹⁰. We then have

P( S^{i_0}_n ≤ S^i_n ) = P( P_n(log(p^i_n)) − P_n(log(p^{i_0}_n)) ≥ δ_n )
  ≤ P( |P_n(log(p^{i_0}_n)) − P(log(p^0))| > δ_n/4  ∨  |P_n(log(p^i_n)) − P(log(p^i))| > δ_n/4 )
  ≤ P( |P_n(log(p^{i_0}_n)) − P(log(p^0))| > δ_n/4 ) + P( |P_n(log(p^i_n)) − P(log(p^i))| > δ_n/4 )
  → 0 as n → ∞.

For the convergence we used Lemma 4.2.14, whereas for the first inequality we used Lemma 4.2.15 with ε = δ_n/2 in order to have a strict inequality, as required in the first condition.

In the second case, p^0 ∉ P^i and hence P(log(p^0)) > P(log(p^i)). Therefore there exists δ > 0 such that P(log(p^0)) > P(log(p^i)) + 4δ. Let N > 0 be such that |E_{i_0}| / log(n) < δ for all n > N. Then we have

P( S^{i_0}_n ≤ S^i_n ) = P( P_n(log(p^{i_0}_n)) − P_n(log(p^i_n)) ≤ −δ_n )
  ≤ P( P_n(log(p^{i_0}_n)) − P_n(log(p^i_n)) < δ_n )
  ≤ P( |P_n(log(p^{i_0}_n)) − P(log(p^0))| > δ_n  ∨  |P_n(log(p^i_n)) − P(log(p^i))| > δ_n )
  ≤ P( |P_n(log(p^{i_0}_n)) − P(log(p^0))| > δ_n ) + P( |P_n(log(p^i_n)) − P(log(p^i))| > δ_n )
  → 0 as n → ∞.

The convergence follows again from Lemma 4.2.14, and the second inequality from Lemma 4.2.15 with ε = 2δ.

This theorem tells us that the true underlying DAG will have the best score over the whole search space. Allowing for not fully identifiable models would not change this feature; the problem is that we may no longer have a unique DAG scoring as well as the true underlying DAG. This follows directly from the fact, explained in Example 3.1.1, that some models can be fitted in multiple directions. Since all those models induce the same joint probability distribution, and the whole theory presented in this section depends only on this distribution, they would all attain the same score. This, in turn, means that we would recover only one representative of the ngDAG pattern representing the true underlying joint probability distribution.

10 This is, however, only possible if we allow edges of zero magnitude. If we do not allow this, we actually only have to deal with the second case.


Chapter 5

Simulations

We implemented our algorithm step by step, resulting in several sub-algorithms, each with its own purpose. In this chapter we start by describing these sub-algorithms in Section 5.1: we present the pseudo-code of the most important ones and summarize the purposes of the others. Moreover, all possible inputs are discussed in detail. In Section 5.2 we first discuss and select two parameters which play a central role for performance and efficiency; afterwards, we present and discuss the results of our simulations. The whole thesis, but in particular this chapter, is written in the spirit of transparency: our aim is to provide a first analysis of this new algorithm, highlighting both positive and negative aspects.

5.1 The algorithms

In this section we explain how our algorithms work. We present pseudo-code for the most important ones and give the main idea for the others, which can be seen as auxiliary algorithms. The pseudo-code is meant to provide an intuition of how an algorithm works; all details regarding input and output parameters as well as internal checks are discussed separately.

The two fundamental algorithms are edgeloop and compscoreDAG. In edgeloop, given a starting DAG, we insert an edge between two nodes and compute, with compscoreDAG, the score of the new DAG. We do this for every possible edge and for every possible orientation which does not create a cycle. At the end we keep the DAG with the smallest score. A sketch of one such greedy step is given below.
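The following R sketch illustrates one greedy step; score_dag refers to the scoring sketch of Section 4.2 (there a higher score is better, whereas our implementation keeps the smallest value of a sign-flipped score), and the acyclicity check via reachability is our own simplification of the checks performed internally.

# is there a directed path from node `from` to node `to` in dag (dag[i, j] = 1: i -> j)?
reaches <- function(dag, from, to) {
  visited <- rep(FALSE, ncol(dag)); frontier <- from
  while (length(frontier) > 0) {
    visited[frontier] <- TRUE
    children <- which(colSums(dag[frontier, , drop = FALSE]) > 0)
    frontier <- setdiff(children, which(visited))
  }
  visited[to]
}

greedy_step <- function(X, dag, score_fun = score_dag) {
  p    <- ncol(dag)
  best <- list(score = score_fun(X, dag), dag = dag)
  for (i in seq_len(p)) for (j in seq_len(p)) {
    if (i == j || dag[i, j] == 1 || dag[j, i] == 1) next
    if (reaches(dag, j, i)) next                 # adding i -> j would create a cycle
    cand <- dag; cand[i, j] <- 1
    s <- score_fun(X, cand)
    if (s > best$score) best <- list(score = s, dag = cand)
  }
  best                                           # repeat until the DAG stops changing
}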

edgeloop only requires an n × p matrix containing the data of the p nodes. Other input options are:

• matrix: a p × p matrix with zeros and ones. It should be symmetric and it represents the edges to be checked. In other words, by specifying this matrix we can force the absence of some edges. It is meant to be used with the moralized graph estimation procedure discussed in Section 4.2.1.1.

Algorithm 5.1: Edgeloop

input : Dataset, DAG, Tuning parameters
output: The estimated DAG

while There are still edges to check do
    Select such an edge, add it to the current DAG and compute the score of the new DAG using compscoreDAG.
end
Select the best scoring DAG.
if The new DAG scores better than the input DAG then
    Check the score of the DAG which has the newly added edge oriented in the opposite direction.
    if The two scores are very similar then
        Recall edgeloop starting, one at a time, with both DAGs.
        Select the best scoring DAG.
    end
else
    Return the input DAG.
end
Recall edgeloop with the new DAG.

• currentMatrix: represents the DAG we start from. It is meant to be used only internally, as the algorithm is recursive. It can also be used to force the presence of some edges and therefore allows us to take prior knowledge into account.

• memory1 and memory2: These two vectors are only for internal use. The fitted linear regressions are stored in “memory1” and the corresponding scores in “memory2”. This should allow us to be more efficient.

• alpha coef: We decided not to use this parameter, but it is still implemented. When specified, this coefficient serves as a cut-off for too small coefficients in the linear regressions. If any coefficient other than the intercept of a linear regression is below alpha coef, the score of this regression is set to infinity and we no longer consider this set of parents. The idea was that too small coefficients would not be identifiable anyway, because the noise, together with a small sample size, can easily produce a spuriously large regression coefficient. The performance was quite good, but in the end we judged this option too unrealistic for real-world applications. We also tried the same with respect to the p-values of the regression coefficients; this time the performance did not change significantly, so we abandoned this approach as well. Note also that using this coefficient would cause the algorithm to no longer be fully greedy, since some choices would no longer be accepted.


• alpha ratio: When specified, this coefficient governs the splitting procedure. If a new edge is selected because it improves the score the most and should therefore be added to the DAG,1 we compute the score of the DAG having the selected edge oriented in the opposite direction. If the two scores are very similar, as remarked at the end of Chapter 4, we may be in the presence of an edge between two nodes with Gaussian distributed noise which should be bi-directed. For this reason we initialise two DAGs corresponding to the two orientations and let the algorithm proceed with both DAGs recursively. Hence, we have a split.

• True DAG: This is a matrix representing the true underlying DAG, which we use to test the accuracy of our algorithm when we run controlled simulations.

Algorithm 5.2: compscoreDAG

input : data, DAG, alpha
output: Score

Initialize Score = 0.
Using the DAG matrix, compute the linear regression of every variable on its parents.
if DAG contains a cycle then
    Set the score of this DAG to infinity.
    Return the score.
end
for Each variable in DAG do
    Estimate the density of the residuals after the regression has been applied.
    Add to Score the log of this density evaluated at each observation of data.
end
Apply the penalization to Score.
Return Score.

For compscoreDAG we had the opportunity to look at the code written for Nowzohour and Buhlmann (2014), from which we took inspiration. In compscoreDAG we want to compute the score of the DAG represented by “matrix”. compscoreDAG is normally called by edgeloop when we look for the best scoring DAG with an additional edge. It is very important that “matrix” has 1, . . . , p as column and row names, in the order corresponding to the order of the nodes. When called by edgeloop it is already given as an upper triangular matrix, therefore possibly not with the names in ascending order from 1 to p. In any case, the algorithm works with the matrix sorted so as to be upper triangular, thus let us just assume that the input is already an upper triangular matrix. It is also important that “data” has x1, . . . , xp as column names. Here is why: since the matrix is upper triangular, we go through all columns and check which entries of each column are equal to 1. The variables corresponding to the rows of these entries are the parents of the variable corresponding to that column. With this information we fit a linear model, using the function lm in R. We then use the residuals of the linear regressions to estimate the densities of the nodes; for this we use the command density in R. Afterwards we use Equation 4.2.1 to sum up the scores of every single node and finally we add the penalization. Note that at the beginning the algorithm checks DAG for acyclicity and, in the case where DAG is not acyclic, it returns a value of ∞. For the sake of completeness we remark that the algorithm also takes alpha coef into account if it is provided; since we are not using it, as discussed above, we omit this step in the pseudo-code.

1Which is the DAG represented by “currentMatrix”.


beginning, the algorithm checks DAG for acyclicity and, in the case where DAG is notacyclic, it will return a value of ∞. For the sake of completeness we would like to remarkthat the algorithm takes into account alpha coef if it is provided. Since we are not usingit as discussed above, we omit this step in the pseudo-code.

Algorithm 5.3: EquivClassSearch

input : DAG, data, alpha, G var
output: Distributional equivalence class

if G var is not specified then
    • Test, at level alpha, each variable for Gaussian residuals given the data and the DAG structure.
    • List all edges present between two nodes found to be Gaussian.
    • Using Check_v_structures, eliminate from the list all those edges involved in a v-structure.
    • Try to reverse the edges in the list by creating 2^k DAGs, where k is the length of the list, and by checking that neither new v-structures nor cycles are created.
    • Update DAG by removing the orientation of all those edges that have been reversed successfully.
else
    Do the same but taking G var as the result of the Gaussianity test.
end
Return DAG.

With EquivClassSearch we aim to construct the distributional equivalence class of a given DAG. We proceed in a similar way as described in Definition 4.1.2 for constructing the ngDAG pattern of an ngDAG. In fact, we need to know which nodes have a Gaussian distributed noise. For the Gaussianity test we use the function shapiro.test2 implemented in R and the level alpha3. Knowing the nodes with Gaussian distributed noise, we select all edges between two of them. From the theory of the linear Gaussian setting, an edge between two Gaussian nodes can be reversed if the Markov equivalence class does not change; in fact, for the case of mixed models, this is exactly what Theorem 4.1.3 states. In order to stay within the same Markov equivalence class, we cannot reverse any edge already involved in a v-structure. For this reason we eliminate these edges from the list and test all the others, creating all possible configurations. It is easy to see that there are 2^k combinations if k is the number of selected edges. To go through this list we use a binary representation; in this way we do not have to initialize a potentially huge amount of memory, which in some cases4 can cause memory problems. (0, . . . , 0) represents the original DAG, whereas (1, . . . , 1) represents the DAG where all selected edges have been reversed. Starting with the original DAG and successively adding 1, in a binary sense5, to the number corresponding to the current DAG ensures that we go through all possible DAGs; a sketch of this enumeration is given below. This representation is also directly used to set the corresponding entries in the matrix representing the DAG to 0 and 1, respectively. All accepted DAGs are saved and summed up at the end. The resulting matrix has zeros where the corresponding orientation was never present in any of the 2^k DAGs and positive integers where the corresponding edge was present in at least one saved matrix. Setting all these positive integers to one gives us a PDAG that represents the distributional equivalence class we were looking for. Instead of testing the residuals for Gaussianity, one can provide this information using G var. However, in contrast to Definition 4.1.1, in our case a value of 1 corresponds to a node with Gaussian distributed noise, whereas a value of 0 corresponds to a node with non-Gaussian distributed noise. We still have to specify how the inputs have to be passed to the function: this time it is very important that DAG is an upper triangular matrix with rows and columns named from 1 to p in the appropriate order, and the same order has to be applied to G var. data, on the contrary, only needs to have its columns named from x1 to xp.

2 For the moment there is no special motivation for the choice of the test function.

3 We could also stay completely score-based, avoiding any reliance on statistical tests. Indeed, we could choose to reverse only edges which score similarly in both directions.

4 Doing this procedure in a naive way, i.e. initializing a k × 2^k matrix, will fail on an 8GB RAM machine already for k = 27.
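The enumeration can be sketched in R as follows; apply_pattern stands for a hypothetical routine that builds the candidate DAG from a 0/1 pattern and checks it for new v-structures and cycles, which is not shown here.

# Walk through all 2^k reversal patterns of the k candidate edges without
# materialising a k x 2^k matrix; pattern[i] = 1 means edge i is reversed.
enumerate_patterns <- function(k, apply_pattern) {
  pattern <- integer(k)                 # (0, ..., 0): the original DAG
  repeat {
    apply_pattern(pattern)              # build, check and possibly save this candidate
    i <- 1
    while (i <= k && pattern[i] == 1) { pattern[i] <- 0; i <- i + 1 }   # binary carry
    if (i > k) break                    # all 2^k patterns have been visited
    pattern[i] <- 1
  }
}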

Algorithm 5.4: morgraphsel

input : data, rho
output: Estimated moralized graph Θ̂

if The setting is low-dimensional then
    • Estimate the precision matrix Θ̂ by directly inverting the covariance matrix.
    • Set all entries of Θ̂ which are below rho (in absolute value) to 0 and all the others to 1.
else
    • Estimate a sparse inverse covariance matrix Θ̂ using the graphical LASSO.
    • Set all entries of Θ̂ which are below rho (in absolute value) to 0 and all the others to 1.
end
Return Θ̂.

With morgraphsel we implemented the result described in Section 4.2.1. We estimate the inverse covariance matrix, or precision matrix, either with the graphical LASSO or via the standard empirical covariance estimate and matrix inversion, depending on the setting we are dealing with. In both cases a cut-off is applied: too small entries are set to 0, the others to 1. The choice of the cut-off can be critical and, for the moment, we tuned this parameter based on the empirical analysis presented in Section 5.2.

SBGDS is essentially the algorithm we actually run. The acronym stands for Score-Based Greedy DAG Search, which highlights both what the algorithm does and one big difference between our algorithm and GES. As we can see from Algorithm 5.5, the algorithm may also perform other estimations by calling other algorithms; these estimates are then used to analyse how our algorithm performs. At the moment it is only possible to run the other algorithms with standard parameters. As input we can give:

5 We add 1 to the first digit and, if it becomes 2, we set it to 0 and add 1 to the second digit, and so on. We stop when the last digit becomes a 2.


Algorithm 5.5: SBGDS

input : See the list in the discussion
output: Estimated DAGs/CPDAGs, True CPDAG

• Estimate the true underlying DAG using edgeloop.
• Compute the CPDAG of the estimate using the Gaussianity tests.
if PC ≠ 0 then
    Compute an estimate using the PC algorithm.
end
if GES ≠ 0 then
    Compute an estimate using the GES algorithm.
end
if LiNGAM ≠ 0 then
    Compute an estimate using the LiNGAM algorithm.
end
if PCLiNGAM ≠ 0 then
    Compute an estimate using the PCLiNGAM algorithm.
end
if G var is specified then
    Compute the true underlying CPDAG true. Compute the performance tables for each estimate using true.
end
Return true, the estimates, the tables.

• data: This is the only mandatory input and should be an n × p matrix, where p is the number of variables and n is the number of observations.

• matrix: This is supposed to represent an estimate of the skeleton of the true underlying DAG. We use the estimate of morgraphsel. One can also use this input to account for prior knowledge by excluding some edges.

• alpha: The level used to run the Gaussianity tests. The default value is 0.01.

• alpha coef: See the discussion about edgeloop, Algorithm 5.1.

• alpha ratio: See the discussion about edgeloop, Algorithm 5.1.

• true: This represents the true underlying DAG if it is known. If it is not specified, no comparisons are possible.

• PC: If a value different from 0 is given, an estimate using the PC algorithm is computed.

• alpha pc: This is the tuning parameter used inside the function pc. This is also the only parameter selectable for third-party algorithms and the default value is 0.01.

• LiNGAM: If a value different from 0 is given, an estimate using the LiNGAM algorithm is computed.


• GES: If a value different from 0 is given, an estimate using the GES algorithm is computed.

• PCLiNGAM: If a value different from 0 is given, an estimate using the PCLiNGAM algorithm is computed.

• G var: See the discussion about EquivClassSearch, Algorithm 5.3.

Other sub-algorithms written for this thesis are: Check_v_structures, Compute_Tables, normal_test, pclingam, and Random_DAG_Gen.

Check_v_structures returns a list of the v-structures of the graph given as input. We always use it on DAGs, except within PCLiNGAM, where we also need it to handle the CPDAG output by the PC algorithm.

Compute_Tables takes as input the estimates of SBGDS and outputs tables which count how many edges have been estimated correctly and which ones. The output and the information contained are similar to those presented, for instance, in Table 5.1. We differentiate between bi-directed edges, correctly oriented edges, wrongly oriented edges, and absent edges.

normal_test does what one would expect, and the same is true for pclingam, which we are not going to describe in detail since we did not implement it directly. The only technical remark about normal_test concerns the sample size, which has to lie between 3 and 5000; however, one could easily substitute another Gaussianity test that allows for larger sample sizes.

Random_DAG_Gen generates data according to a random DAG. In this sub-algorithm we have the following input parameters:

• p: Represents the number of nodes in the DAG. The default value is 10.

• N: Represents the number of observations to simulate from the DAG. The default value is 100.

• p edge: Represents the probability of having an edge between two arbitrary nodes. The default value is 0.1.

• p gauss: Represents the probability for a node to have a Gaussian distributed noise. The default value is 0.3.

• p rt: Represents the probability for a node to have a t-distributed noise. The default value is 0.4, whereas the degrees of freedom may vary; see the discussion below for more details.

• p u: Represents the probability for a node to have a uniformly distributed noise. The default value is 0.4.

The pseudo-code of this sub-algorithm is not of much interest; knowing the possible input parameters and how the data are generated is much more useful. First of all, let us specify that when a node has a non-Gaussian distributed noise and neither the t-distribution nor the uniform distribution is used, a chi-squared distribution is used. Upon advice6, we simulate the data with a higher noise variance at the root nodes and a lower noise variance at the other nodes; this should guarantee a more detectable signal throughout the DAG. We chose a standard deviation for the Gaussian distributed noises between 2 and 2.5 for the root nodes and between 0.7 and 1.2 for the other nodes. For the t-distribution we allow 3 degrees of freedom for the root nodes and either 5 or 6 for the other nodes. For the uniform distribution we set the length of the support between 6 and 8 for the root nodes and between 3 and 4 for the other nodes. Finally, for the chi-squared distribution we allow between 2 and 3 degrees of freedom for the root nodes and between 0.3 and 0.7 degrees of freedom for the other nodes. A sketch of such a data generator is given below. Another important remark concerns the assumption of sub-Gaussianity made in Section 4.2.1, and in particular in Lemmas 4.2.3 and 4.2.4. We are aware that we simulated the data from non-sub-Gaussian probability distributions: in fact, only the Gaussian and the uniform distributions are sub-Gaussian. We did so because this assumption only affects the estimation of the moralized graph and not the consistency of the algorithm and, more importantly, because it is of more practical interest, as we aim to allow for arbitrary probability distributions.
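A simplified R sketch of such a generator is given below; it only draws Gaussian and t-distributed noise (the uniform and chi-squared cases are analogous), and the edge-weight range is an assumption of ours, since it is not fixed above.

random_sem_data <- function(p = 10, N = 100, p_edge = 0.1, p_gauss = 0.3) {
  B <- matrix(0, p, p)                                            # strictly lower triangular
  m <- p * (p - 1) / 2
  B[lower.tri(B)] <- rbinom(m, 1, p_edge) * runif(m, 0.5, 1.5)    # illustrative edge weights
  is_root <- rowSums(B != 0) == 0
  X <- matrix(0, N, p)
  for (i in seq_len(p)) {                                         # parents have smaller indices
    eps <- if (runif(1) < p_gauss)
      rnorm(N, sd = if (is_root[i]) runif(1, 2, 2.5) else runif(1, 0.7, 1.2))
    else
      rt(N, df = if (is_root[i]) 3 else sample(5:6, 1))
    X[, i] <- as.vector(X %*% B[i, ]) + eps
  }
  X
}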

Before we look at the results of the simulations, we quickly need to specify the parameters that we selected for the algorithms used for comparisons. We ran the PC algorithm with indepTest = gaussCItest and with maj.rule = TRUE. We have been advised7 that with this rule turned on, the PC algorithm returns more reliable results. Unfortunately, even with this rule turned on, the PC algorithm sometimes returns an invalid CPDAG, i.e. a CPDAG with either a cycle or with at least one bi-directed edge which cannot be oriented without creating a new v-structure or a cycle. For this reason, in PCLiNGAM, when such an invalid CPDAG is returned, we run the PC algorithm again with u2pd = "retry", which ensures that a valid CPDAG is returned. Having a valid CPDAG is important in the PCLiNGAM algorithm because we need to orient all the bi-directed edges while preserving the Markov and the DAG properties; see also the second step in Definition 4.1.2. We ran the GES algorithm with the standard choice for the score function, that is, GaussL0penObsScore, which corresponds to the Bayesian Information Criterion (BIC). The other algorithms do not need special parameters.
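For concreteness, the corresponding calls with the pcalg package look roughly as follows. The significance level alpha and other unstated arguments are placeholders rather than the settings actually used here, and the exact ges() call signature may differ between pcalg versions.

```r
# Sketch of the comparison calls with the pcalg package (alpha and other
# unspecified arguments are placeholders, not the exact settings used here).
library(pcalg)

run_pc <- function(X, alpha = 0.01) {
  suffStat <- list(C = cor(X), n = nrow(X))
  pc(suffStat, indepTest = gaussCItest, alpha = alpha,
     labels = as.character(seq_len(ncol(X))),
     maj.rule = TRUE, u2pd = "relaxed")   # maj.rule is combined with u2pd = "relaxed"
}

run_ges <- function(X) {
  score <- new("GaussL0penObsScore", X)   # BIC-type Gaussian score
  ges(score)
}
```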

5.2 Results

In this section we explain how we ran the simulations and analyse their results. We will speak of default conditions, which are the following: the number of observations is 100, the number of nodes is 10, and we expect 10 edges in the DAG. Moreover, the nodes with non-Gaussian noise are generated with probability 0.4 from a t-distribution, with probability 0.4 from a uniform distribution, and with the remaining probability of 0.2 from a chi-squared distribution.

6 In particular, Jan Ernest and indirectly Dr. Jonas Peters.
7 In this case from Emilija Perkovic and indirectly from Prof. Dr. Marloes Maathuis.


5.2.1 Empirical analysis of the parameters

We want to understand the influence of morgraphsel.R and, in particular, the influence of the cut-off parameter rho used in it. For this reason we compare some results with the cut-off running from 0 to 2 in steps of 0.2. As we can see in Figure 5.1, increments of 0.2 are fine enough to understand the behaviour of both the performance of the algorithm and its computational time. In Figure 5.1 the solid lines represent the performance of the algorithm whereas the dashed lines represent the computational time. In the first graph we plotted the correct selections against the values of rho. By correct selections we mean the correctly selected edges and the correctly absent edges. Note that, since we work with 10 nodes, the number of correct selections can reach at most ∑_{i=1}^{9} i = 45. In the second graph we plotted the ratio of the correct selections to the wrong selections, whereas in the third graph we plot the ratio between the correct selections and a particular choice of wrong selections. In this choice of wrong selections we included:

• edges that are absent although in the true DAG there is a link, i.e. either a bi-directed or an oriented edge;

• edges oriented in the wrong causal direction, i.e. both cases: when in the true DAG there is an oriented edge and we orient it in the wrong direction, but also when in the true DAG the two nodes are not linked and we orient the edge in a direction which is certainly wrong;

• edges found to be bi-directed when in the true underlying DAG there is no edge.

In particular, we do not classify an error as severe if

• an edge is found in the right causal direction but in the true DAG there is none.

• an edge is oriented but in the true DAG it is bi-directed.

This choice is clearly subjective, but we think that it is interesting to consider. Note also that we know the "correct" causal direction because we simulate our data in such a way that a node can only be influenced by preceding nodes; this means that, numbering the nodes from 1 to p, a correct causal relation always points towards a node with a larger index.

We note that the selection has to be done carefully. Indeed, we cannot allow for a too large cut-off, otherwise we end up with worse results. Unfortunately, we cannot say that this selection considerably helps improving the accuracy, but from a computational efficiency point of view we are more fortunate: already with a cut-off of 0.4 we gain more than 50% in terms of computational efficiency. Moreover, 0.4 is a conservative choice; indeed, in all cases we perform slightly better than without performing the estimation of the moralized graph, i.e. with 0 as value of rho. For these reasons, we decided to use the value 0.4 for rho whenever we perform the estimation of the moralized graph. We did not perform this analysis for the high-dimensional case: we experienced less reliable estimates and also some unexpected results, as the graphical LASSO algorithm in R returned non-symmetric matrices for small values of λ.


[Figure 5.1: three panels, "Performance with varying rho in morgraphsel.R", with curves for 0%, 50% and 100% Gaussian noise. Solid lines: performance (#correct selections; Right/Wrong; Sel: Right/Wrong); dashed lines: computational time in seconds; x-axis: rho from 0 to 2.]

Figure 5.1: We plotted the mean performances (solid lines) and the mean computational times (dashed lines) against the values of the parameter rho. The first plot shows the number of correct selections. The second plot shows the ratio between the correct and the wrong selections, whereas in the last plot we can see the ratio between the correct selections and the choice of severe errors discussed at the beginning of Section 5.2.1.


Moreover, a high-dimensional setting would have meant working with far more than 10 nodes, causing a considerable investment of time, since the algorithm is certainly not optimized in terms of speed. For these reasons we dropped the analysis of the high-dimensional case.

Let us remark again that the aim of this estimation is primarily to reduce the computational time without affecting the accuracy of the results, not to directly improve the accuracy. For the latter aim we use the splitting procedure: the idea is to consider both orientations of an edge in those cases where the scores of the corresponding DAGs are very similar. The reasoning behind this procedure arises from the basic fact exposed in Example 3.1.1 or, if one prefers a more theoretical argument, from the result exposed in Theorem 4.1.3. Theoretically, if an edge has to be bi-directed, then the score is the same independently of the chosen orientation, whereas for the other configurations this is not the case. See also Theorem 4.2.16.

Now we look at the splitting procedure in a similar way as we did for the analysis of rho. We will look at the behaviour of the performance and of the computational time of the algorithm when the splitting limit, i.e. alpha_ratio, increases. As values we used 0, 0.0001, 0.001, 0.0025, 0.005, and 0.01. The choice of this parameter is also very critical. This time we do not risk obtaining bad results; on the contrary, we can only obtain better8 DAGs. But if we split too much, the major problem is the feasibility of the algorithm: with a too large parameter we will always split, resulting in turn in an exponential increase9 of the computational time.

Let us look at Figure 5.2 where, again, the solid lines represent the performance of the algorithm whereas the dashed lines represent the computational time. We can clearly see the benefits of splitting: in all cases we observe a substantial increase in the performance. Unfortunately, we also observe a severe increase in the computational time. We should remark that only in the linear Gaussian setting does the time really explode; in the other cases it increases approximately linearly. This increase in the computational time goes hand in hand with the increase in the number of bi-directed edges. Indeed, with only half of the nodes having Gaussian noise, we do not expect many bi-directed edges. This is due to the fact that as soon as one edge connects to a non-Gaussian node, it is orientable. Hence, it is highly probable that the explosion of the time in the linear Gaussian setting corresponds to the explosion of the number of bi-directed edges. We decided to use 0.0025 as value for alpha_ratio when we use the splitting procedure. Also in this case, we tried to stay conservative. With this choice we gain a lot in terms of performance but we also lose quite a lot in terms of computational efficiency. Our aim is to use both tools, the estimation of the moralized graph and the splitting procedure. We want to come up with a combination of parameters that allows us to perform better with comparable times. In Figure 5.2 we also plotted the performances with both the splitting procedure and the estimation of the moralized graph in use. These performances are represented with an "x" in the graphs. We can see that we often perform equally well or better; only in the mixed case of the last graph do we observe a relevant decrease in the performance.

8 With "better" we mean that the score of the DAG will be better, thus, for us, smaller.
9 Simplifying the problem: if without splitting we select k edges, which means that we run our routine k times, with splitting we would need to run it 2^k times.


[Figure 5.2: three panels, "Performance with varying alpha ratio", with curves for 0%, 50% and 100% Gaussian noise. Solid lines: performance (#correct selections; Right/Wrong; Sel: Right/Wrong); dashed lines: computational time in seconds; x-axis: alpha_ratio from 0 to 0.01. Crosses and triangles mark the combined setting rho = 0.4, alpha_ratio = 0.0025.]

Figure 5.2: We plotted the mean performances (solid lines) and the mean computational times (dashed lines) against the values of the parameter alpha_ratio. The first plot shows the number of correct selections. The second plot shows the ratio between the correct and the wrong selections, whereas in the last plot we can see the ratio between the correct selections and the choice of severe errors discussed at the beginning of Section 5.2.1. Moreover, we also plotted the performances (with crosses) and the computational times (with triangles) for rho = 0.4 and alpha_ratio = 0.0025.


This splitting procedure is a possible way to account for the identifiability problem, but we are not claiming that it is the best one. Currently, computing the best scoring DAG when setting alpha_ratio to 0.0025 and having 100 observations takes approximately between 5 and 10 seconds on our computers10. This is actually quite slow for a DAG with only 10 nodes, but we need to make some additional considerations. First of all, we have to point out that we are splitting a lot: we are surely splitting in many cases where the true edge is not bi-directed. Secondly, the implementation of the algorithm can certainly be improved; we cared most about the correctness of the algorithm, which is undoubtedly the most important point. Thirdly, we always also let the algorithm run without those performance improvements, and the results are already promising. All in all, with 0.0025 as value for alpha_ratio, even if the algorithm slows down considerably, it remains fast enough to run a lot of simulations on a home computer and allows us to look at the benefits of the splitting procedure. We also see that combining it with the estimation of the moralized graph allows us to speed the algorithm up without compromising the performance.

In Figure 5.2 we also plot the computational time when both tools, the estimation of the moralized graph and the splitting procedure, are used. These times are represented by triangles and are quite impressive; in fact, they are comparable with the times obtained without any tools. Remember that the splitting procedure causes a severe increase of the computational time whereas the estimation of the moralized graph decreases it. The surprising result is that these times even happen to be smaller than the times obtained by the algorithm without the use of any tools. In fact, without Gaussian distributed noises we are faster, with a proportion of 50% we are equally fast, whereas with exclusively Gaussian distributed noises we are approximately 50% slower.

Note that this selection of the parameters has been done explicitly for the standard case. Such an ad-hoc analysis is maybe harder to do in practice but not impossible. Indeed, both alpha_ratio and rho depend very much on the number of observations and can therefore be tuned also in real-world applications. Indeed, with more observations we would have more precise estimates of the precision matrix, hence rho could be chosen smaller. Similarly, since we look at

    min { |1 − Score1/Score2| , |1 − Score2/Score1| },

with more observations we may obtain different values for the scores, since we sum over more points, but at the same time we should also perform better. In case of larger score values this quantity becomes smaller and therefore we also need a smaller value for alpha_ratio in order not to oversplit; in the other case, the converse holds. Nevertheless, in order to be as transparent as possible, when we use these parameters we also let the algorithm run without using these techniques and we always compare the results of both methods.
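In code, the decision of whether to split an edge based on this quantity can be sketched as follows; the function and argument names are ours, not the implementation's.

```r
# Sketch of the splitting rule: both orientations of an edge are kept when
# the two corresponding DAG scores are nearly equal in the relative sense
# displayed above.
should_split <- function(score1, score2, alpha_ratio = 0.0025) {
  min(abs(1 - score1 / score2), abs(1 - score2 / score1)) < alpha_ratio
}

should_split(101.2, 101.3)   # TRUE: scores nearly identical, keep both orientations
should_split(101.2, 108.0)   # FALSE: one orientation is clearly preferred
```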

10 Which are standard home computers.


[Figure 5.3: plot "Performance under default conditions": mean #correct selections against Gaussianity (0 to 1) for PC, GES, LiNGAM, PCLiNGAM, SBGDS and SBGDS+.]

Figure 5.3: We plotted the mean number of correct selections of all algorithms in the standard setting against the concentration of nodes with Gaussian distributed noise. In the standard setting we have a sample size of 100 and 10 nodes.

5.2.2 Performance

Let us start by considering Figure 5.3. In this graph we plotted the correct selections against the expected concentration of nodes with Gaussian noise for all the algorithms we consider. We can easily see that the GES algorithm is heavily under-performing, and it performs even worse when the number of nodes with Gaussian noise increases. This is surprising, given that it is a well-known algorithm and should provide performances comparable with those of the PC algorithm. For all the simulations we used the settings provided in the example in the help file of the ges function of the pcalg package in R. We were not able to figure out the source of the problem, and also in all other plots we used to analyse and compare the performances of the algorithms, GES was always under-performing. We would like to stress again that we are not saying that the GES algorithm does not work well; unfortunately, we simply cannot consider it any further because of this performance issue. Our efforts have been heavily focused on the implementation of the SBGDS algorithm and we did not have the time to investigate the GES algorithm any further. On the other hand, we are lucky enough to have another algorithm explicitly designed to perform well in the linear Gaussian setting, the PC algorithm. This allows us to still compare our algorithm with well-established algorithms arising from both settings, the linear Gaussian and the linear non-Gaussian one. In Figure 5.4 we can see a first comparison of all the algorithms. In the first graph we plotted again the correct selections against the expected concentration of nodes with Gaussian noise. All the algorithms perform as expected. The performance of the PC algorithm increases as the number of nodes with Gaussian noise increases.


[Figure 5.4: two panels, "Performance under default conditions": mean #correct selections and mean #severe errors against Gaussianity (0 to 1) for PC, LiNGAM, PCLiNGAM, SBGDS and SBGDS+.]

Figure 5.4: We plotted the mean performance of all algorithms in the standard setting against the concentration of nodes with Gaussian distributed noise. The first plot represents the number of correct selections. Note that the plot of the wrong selections is equivalent to this plot with the y-axis mirrored, since the wrong selections are (45 − "correct selections"). In the second plot we plotted the severe errors as discussed at the beginning of Section 5.2.1.

The opposite happens with the LiNGAM algorithm, whereas the PCLiNGAM algorithm appears to be more robust throughout the range of concentrations of nodes with Gaussian noise, as expected. Looking at the basic version of our algorithm, i.e. without performing the estimation of the moralized graph and without splitting, we see a decreasing performance as the number of nodes with Gaussian noise increases. One reason for this lies in the nature of the greedy search. With an increasing number of nodes with Gaussian noise we should have an increasing number of bi-directed edges. This can cause some problems since we search among DAGs, and an oriented edge (instead of a bi-directed one) can prevent a future correct orientation of another edge because of the creation of a cycle or of a v-structure. This issue is partially resolved by allowing for splitting, as this mimics the effect of an edge being bi-directed. Indeed, allowing for splitting considerably increases the performance of the algorithm.


LiNGAM (median)        <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                   0      5      1      4
   ··                    0      0      0     34

LiNGAM (mean)          <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                   0   5.08   0.75   4.13
   ··                    0   0.56   0.32  34.16

PCLiNGAM (median)      <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                   0      5      1      3
   ··                    0      0      0     35

PCLiNGAM (mean)        <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                0.35   5.08   0.93   3.60
   ··                    0   0.24   0.07  34.73

SBGDS (median)         <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                   0      7    0.5      2
   ··                    0      0      0     34

SBGDS (mean)           <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                0.15   6.99   0.82   2.00
   ··                    0   0.75   0.22  34.07

SBGDS+ (median)        <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                   0      8      0      1
   ··                    0      0      0     35

SBGDS+ (mean)          <->    ·->    <-·     ··
   <->                   0      0      0      0
   ·->                0.14   7.41   0.75   1.66
   ··                    0   0.23   0.10  34.71

Table 5.1: These tables report the median and the mean performance of the algorithms in the standard setting, that is, with a sample size of 100 and 10 nodes. Rows refer to the edge type in the true graph and columns to the estimated edge type (<->: bi-directed, ·->: correctly oriented, <-·: wrongly oriented, ··: no edge). These tables are for the non-Gaussian case.

We achieve an almost constant performance throughout the range of concentrations of nodes with Gaussian noise. We remark again that the parameters have been chosen ad-hoc for this setting; therefore, the real performance may be a little less promising. However, as mentioned before, a similar ad-hoc selection procedure may be used when dealing with real data. On the other hand, we should always perform at least as well as without using any additional tools.

The above considerations have been made considering the mean of the 100 runs. Looking at the median gives us more information about the robustness of the algorithms. Let us look at some tables summarizing the performances of the algorithms. In Table 5.1 we compare the performance in the setting where only nodes with a non-Gaussian noise distribution are present. We omitted the results of the PC algorithm since we do not expect good results from it. Similarly, in Table 5.2 we compare the performance in the setting where only nodes with a Gaussian noise distribution are present. For the same reason as above, we omitted the results of the LiNGAM algorithm.

It is worth mentioning that we expect a mean of 10 edges per DAG, meaning that the sum of the first two "diagonal elements" should be 10 for the perfect algorithm. Note also that, when it comes to errors, having a median value which is smaller than the mean value is a good feature.


PC (median)            <->    ·->    <-·     ··
   <->                   3      0      0      0
   ·->                   1      2      0      3
   ··                    0      0      0     35

PC (mean)              <->    ·->    <-·     ··
   <->                2.79   0.03   0.09   0.49
   ·->                1.53   1.48   0.15   3.29
   ··                 0.18   0.11   0.04  34.82

PCLiNGAM (median)      <->    ·->    <-·     ··
   <->                   3      0      0      0
   ·->                   1      2      0      3
   ··                    0      0      0     35

PCLiNGAM (mean)        <->    ·->    <-·     ··
   <->                2.76   0.06   0.09   0.49
   ·->                1.42   1.55   0.19   3.29
   ··                 0.18   0.11   0.04  34.82

SBGDS (median)         <->    ·->    <-·     ··
   <->                   2      0      0      0
   ·->                   0      3      1      2
   ··                    0      0      0   33.5

SBGDS (mean)           <->    ·->    <-·     ··
   <->                2.38   0.28   0.30   0.44
   ·->                0.68   2.67   0.78   2.32
   ··                 0.38   0.91   0.48  33.38

SBGDS+ (median)        <->    ·->    <-·     ··
   <->                   2      0      0      0
   ·->                   0      4      0      1
   ··                    0      0      0     35

SBGDS+ (mean)          <->    ·->    <-·     ··
   <->                2.56   0.26   0.17   0.41
   ·->                0.44   4.39   0.41   1.21
   ··                 0.09   0.28   0.22  34.56

Table 5.2: These tables report the median and the mean performance of the algorithms in the standard setting, that is, with a sample size of 100 and 10 nodes. Rows and columns are as in Table 5.1. These tables are for the Gaussian case.

Indeed, this indicates that in over 50% of the simulations we made fewer errors than the mean value suggests; in turn, this means that we usually perform better than indicated by the mean values, but also that we may perform quite badly when errors occur. However, this is something we expected: an error can easily start a chain of errors trying to account for the first one. A similar reasoning holds for the correct selections, but the other way around: there, larger median values are welcome.

Let us now analyse the first table. We will consider a particular sub-table as a matrix and we will indicate its entries by rows and columns. First of all, we can see that already with SBGDS we clearly outperform both the LiNGAM algorithm and the PCLiNGAM algorithm. From the median tables we can see that all algorithms struggle to find all the edges; in fact, we have high values in the entries of the second row, fourth column of all sub-tables. Apart from that, we can see that our algorithm performs well also in the default version. All the values in the median table corresponding to errors are smaller than in the mean table. For the enhanced version things look even better: in the median we only have one missing edge but otherwise we do not commit errors. This, again, means that in over 50% of the cases we do not orient any edge in the wrong direction, nor do we connect any nodes which should not be connected.

In the Gaussian setting the situation is less clear-cut. We still perform slightly better, but, for instance, with the default version of our algorithm the advantage is minimal. It also happens that the median value of an error entry is higher than its mean value; for that, see the two tables of the SBGDS algorithm, second row, third column. Nevertheless, the SBGDS+ algorithm performs quite well, even outperforming the PC algorithm. Considering that we are comparing our algorithm in situations that are more favourable to the competing algorithms and that we nevertheless perform at least as well, this is already very promising. To have a better understanding of the performances, let us look at the boxplots of the correct selections11. As we will see, the boxplots support the analysis based on the tables above, but they are easier and quicker to read.

[Figure 5.5: two boxplot panels, "Performances for non-Gaussian noise" and "Performances for Gaussian noise": #correct selections for PC, LiNGAM, PCLiNGAM, SBGDS and SBGDS+.]

Figure 5.5: These boxplots represent the correct selections produced by the algorithms in the standard non-Gaussian and in the standard Gaussian setting, respectively. Standard setting means that we have 100 observations and 10 nodes.

Both boxplots confirm that the standard version of our algorithm (SBGDS) is already competitive.

11 Note that the boxplots, as well as the plots, of the wrong selections are the same as those of the correct selections but with the y-axis mirrored. Indeed, the wrong selections are exactly (45 − "correct selections").


Moreover, we can see that the PCLiNGAM algorithm also performs very well compared with the PC algorithm and the LiNGAM algorithm. The PC algorithm and the LiNGAM algorithm perform well in the Gaussian and non-Gaussian setting, respectively, exactly as one would expect. It is maybe a little unexpected that the performance of the LiNGAM algorithm does not drop much in the Gaussian setting.

It is surely interesting to look at the performance of all the algorithms in some mixed cases, since there we hope to have an advantage. We will look at the case where 75% of the noises are Gaussian distributed. We are not looking at the case where only 50% of the noises are Gaussian distributed, because in this case we do not have many bi-directed edges, as already pointed out in Section 5.2.1. We think that in the case where 75% of the noises are Gaussian distributed the other algorithms could have some issues in performing well. Since above we compared our algorithm in settings favourable to the other algorithms, it is fair to compare the performances of the algorithms also in settings which are more difficult to deal with.

[Figure 5.6: boxplots "Correct selections with 75% of Gaussian noise": #correct selections for PC, LiNGAM, PCLiNGAM, SBGDS and SBGDS+.]

Figure 5.6: These boxplots represent the correct selections produced by the algorithms in a standard mixed setting. Standard setting means that we have 100 observations and 10 nodes, whereas by mixed we mean, in this case, that 75% of the noise is Gaussian distributed and the remaining 25% is non-Gaussian distributed.

Figure 5.6 confirms again what we saw, for instance, in Figure 5.4, i.e. that our algorithm is more consistent throughout the different distributions of the noises. Moreover, we see again that already in the default setting we slightly outperform the other algorithms. The enhanced SBGDS goes a step further, noticeably increasing the number of correct selections. Although the differences are smaller, the same holds for the PCLiNGAM algorithm compared with the PC algorithm and the LiNGAM algorithm. As expected from theory, the PCLiNGAM algorithm is also more stable and performs at least as well as the other two algorithms.


Until now we worked with a sample size of 100, but it is certainly interesting to look at the behaviour of all algorithms for different numbers of observations. In fact, from the theory, all algorithms used for comparisons are consistent in their respective settings and should therefore perform better when the number of observations increases. In Figure 5.7 we plotted the performance of all algorithms for three numbers of observations: 25, 100, and 1000. All algorithms perform better as the number of observations increases but, a little surprisingly, our algorithm does not improve as much as the others, at least when it comes to the default version. In fact, SBGDS improves notably when the number of observations increases from N = 25 to N = 100, but increasing it from N = 100 to N = 1000 does not help much. The same is true for the enhanced version of the SBGDS algorithm, although it still performs very well compared to the other algorithms. Indeed, only the LiNGAM algorithm outperforms SBGDS+, and only in the settings with a majority of non-Gaussian distributed noises. All other algorithms perform as one would expect. Their performance improves considerably, with the PC algorithm performing well in the Gaussian setting, the LiNGAM algorithm performing very well in the non-Gaussian setting and also surprisingly well otherwise, as we already saw above, and with the PCLiNGAM algorithm performing well in all cases but also being outperformed by the LiNGAM algorithm in the non-Gaussian setting with N = 1000. The low performance of the SBGDS algorithm in the Gaussian setting can again be explained by the nature of the greedy search and by the search space, which consists of DAGs and not CPDAGs. For SBGDS+ we recall that the parameters are no longer optimal. With N = 25 we probably select too many variables, which is not really a problem, but with N = 1000 we risk rejecting too many variables, which could cause a reduction in the performance. As discussed at the end of Section 5.2.1, it is a priori not clear what happens with the splitting procedure.

Besides the varying sample size, we also looked at the performance with different joint probability distributions. In particular, we looked at the sub-Gaussian setting, since the estimation of the moralized graph assumes it; we did this by allowing only Gaussian and uniformly distributed noises. We also looked at the case where only the t-distribution was allowed as non-Gaussian probability distribution. This case is potentially tricky, as the t-distribution is very similar to the Gaussian distribution. We do not present the results, as they are not very different from the results obtained with the standard setting. Especially for the sub-Gaussianity issue, this result is not very surprising, since this assumption affects only the estimation of the moralized graph and not the main essence of the algorithm.

Looking at the overall picture, we can be very satisfied with the performance of our algorithm. In general we outperform the other algorithms already with the default version. With the enhanced version we performed notably better than the other algorithms. With the analysis of the median and with the help of the boxplots we also observe that the good results are solidly consistent. It seems that only with a large sample size are the classical algorithms able to close the gap, and still only in their respective settings.


[Figure 5.7: three panels, "Performances with N=25,100,1000": mean performance (Right/Wrong ratio resp. #correct selections) against Gaussianity (0 to 1) for PC, LiNGAM, PCLiNGAM, SBGDS and SBGDS+, one panel per sample size.]

Figure 5.7: We plotted the mean values of correct selections of all algorithms in the standard setting against the concentration of nodes with Gaussian distributed noise for three different sample sizes. In the first plot we have a sample size of 25, in the second one of 100, and in the last one of 1000. All DAGs considered have 10 nodes.


Chapter 6

Summary

We introduced the main concepts of causal inference, starting with the basics and also discussing the specific model we then considered, that is, explaining advantages and disadvantages of linear structural equation models with additive noise. After this comprehensive introduction we presented some special cases and some well-known algorithms which already deal very well with these special settings. In particular, we presented the PC algorithm, the GES algorithm and the LiNGAM algorithm. These algorithms as well as these settings are very important for us, since they represent the starting point of our work. In fact, we aimed to extend the settings in which, also from a theoretical point of view, we are allowed to operate. As a generalization of the above special settings, we introduced linear SEMs with mixed Gaussian and non-Gaussian distributed noise, considering the PCLiNGAM algorithm, which has been suggested in Hoyer et al. (2012) but whose performance had, at least to our knowledge, not been tested yet. Then, we stated the ideas behind our algorithm (SBGDS+) and the theory supporting them. Finally, we analysed the performance of our algorithm, comparing it to that of the others in a very transparent way. The results are very promising, since SBGDS+ outperformed every other algorithm in almost all settings under investigation; moreover, most of the time the increase in performance was notably large.

6.1 Future Work

Since this new methodology seems promising, one should try to understand better how the parameters, in particular rho in morgraphsel.R and alpha_ratio, affect the performance for varying distributions and sample sizes. Different approaches to improve the performance and, for instance, to deal with the lack of full identifiability of the model should also be considered. In order to extend the range of applicability of the algorithm, one should look at the performance in the high-dimensional setting as well as in the non-linear setting. The non-linear setting should be easy to implement, since it should be enough to substitute the linear regressions with, for instance, polynomial regressions or even generalized additive models.


The high-dimensional setting is almost already implemented, but a careful analysis is missing, in particular when it comes to the estimation of the moralized graph. Finally, in order to offer a competitive algorithm, we think that the efficiency of the algorithm has to be improved. In addition to the optimization of the current code, one may also consider other approaches such as, for instance, a dynamic programming approach.


Bibliography

Bollen, K. A. (1989). Structural Equations with Latent Variables. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics Section. Wiley.

Chickering, D. M. and C. Boutilier (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3, 507–554.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing 36(3), 287–314.

Evans, L. (2010). Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society.

Hoyer, P. O., A. Hyvärinen, R. Scheines, P. Spirtes, J. Ramsey, G. Lacerda, and S. Shimizu (2012). Causal discovery of linear acyclic models with arbitrary distributions. CoRR abs/1206.3260.

Hyvärinen, A., J. Karhunen, and E. Oja (2001). Independent Component Analysis. New York: Wiley.

Kalisch, M. and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613–636.

Koller, D. and N. Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. Cambridge: MIT Press.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Loh, P. and P. Bühlmann (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research 15, 3065–3105.

Meek, C. (1995a). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, San Francisco, CA, USA, pp. 403–410. Morgan Kaufmann Publishers Inc.

Meek, C. (1995b). Strong completeness and faithfulness in Bayesian networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, San Francisco, CA, USA, pp. 411–418. Morgan Kaufmann Publishers Inc.

Nickl, R. and B. M. Pötscher (2007). Bracketing metric entropy rates and empirical central limit theorems for function classes of Besov- and Sobolev-type. Journal of Theoretical Probability 20, 177–199.

Nowzohour, C. and P. Bühlmann (2014). Score-based causal learning in additive noise models.

Pearl, J. (1993). [Bayesian analysis in expert systems]: Comment: Graphical models, causality and intervention. Statistical Science 8(3), 266–269.

Pearl, J. (2009). Causality: Models, Reasoning and Inference (2nd ed.). New York, NY, USA: Cambridge University Press.

Peters, J., J. M. Mooij, D. Janzing, and B. Schölkopf (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15, 2009–2053.

Ravikumar, P., M. J. Wainwright, G. Raskutti, and B. Yu (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5, 935–980.

Shimizu, S., P. O. Hoyer, A. Hyvärinen, and A. Kerminen (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7, 2003–2030.

Spirtes, P., C. Glymour, and R. Scheines (2000). Causation, Prediction, and Search (2nd ed.). MIT Press.

Uhler, C., G. Raskutti, P. Bühlmann, and B. Yu (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics 41(2), 436–463.

van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge University Press.

Verma, T. and J. Pearl (1991). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI'90, New York, NY, USA, pp. 255–270. Elsevier Science Inc.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok (Eds.), Compressed Sensing, Theory and Applications. Cambridge University Press.


Appendix A

Equivalence of the Markov Conditions

We closely follow Lauritzen (1996). In Section A.1 we first state and prove the equivalence of the Markov properties for undirected graphs with respect to separation, a concept analogous to d-separation but applicable to undirected graphs. Then, in Section A.2, we state the Markov properties for DAGs and show their equivalence with respect to separation. Finally, in Section A.3 we show that separation in the moralized DAG is equivalent to d-separation in the DAG.

A.1 Equivalence of the Markov conditions for undirected graphs

We start by introducing some terminology used for undirected graphs. Undirected graphs are just graphs as defined in Definition 2.1.1 where E is a set of unordered pairs. We call them undirected to stress the difference with the rest of the thesis and, therefore, to avoid confusion. Sometimes we will also use undirected paths; hence, we also need a proper definition, although it is very similar to Definition 2.1.3.

Definition A.1.1: Given a graph G = (V, E), an undirected path linking xi and xj is a sequence of nodes (xπ(1), . . . , xπ(q)), for some permutation π : V → V and with q > 2, such that π(1) = i, π(q) = j, and ∀k ∈ {1, . . . , q − 1}, (π(k), π(k + 1)) ∈ E. The length of an undirected path equals the number of edges, hence q − 1.

Similarly to Definition 2.1.4, we can define the set of adjacent nodes also for undirected graphs.

Definition A.1.2: Given an undirected graph G = (V, E) and a node xi ∈ V, we denote by Adj(xi) the set of all those j ∈ {1, . . . , p} such that (j, i) ∈ E. We say that a node in this set is adjacent to xi.


Now, we also need a different concept of separation.

Definition A.1.3: (Lauritzen, 1996, p. 6) A set S is said to separate two disjoint sets A and B if S is disjoint from A and B and contains at least one element of every path between an arbitrary element of A and an arbitrary element of B.

Now, we have what is needed to state the Markov properties for undirected graphs.

Definition A.1.4: (Lauritzen, 1996, pp. 32-35) A probability distribution P over V is said to satisfy

1. the pairwise Markov property if

∀ xi, xj ∈ V with i, j ∈ {1, . . . , p}, i ≠ j: xi ∉ x_{Adj(xj)} ⇒ xi ⊥⊥ xj | V \ {xi, xj}.

2. the local Markov property if

∀ x ∈ V: x ⊥⊥ V \ (x_{Adj(x)} ∪ {x}) | x_{Adj(x)}.

3. the global Markov property if for any triple of disjoint sets A,B and S,

A Sep B | S ⇒ A ⊥⊥ B | S.

4. the Markov factorization property if P factorizes according to G, that is, for every decomposition of V into disjoint complete subsets W1, . . . , Wk there exist functions ψ1, . . . , ψk, ψi : Wi → R_{>0}, such that P has a density of the form

f(x) = ∏_{i=1}^{k} ψi(x)        (A.1.1)

with respect to a product measure µ = ⊗_{i=1}^{k} µ_{Wi}, which is also assumed to exist.

The Markov conditions are of course different in general, and standard examples showing the differences between them can be found in Lauritzen (1996); but under the assumption that the joint density p of P exists and is positive and continuous, they are equivalent. To prove this result we need only two additional lemmas.
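To fix ideas, here is a minimal illustration of these properties (our own example, in the spirit of those in Lauritzen (1996)). Consider the undirected chain x1 − x2 − x3, whose maximal cliques are {x1, x2} and {x2, x3}. The Markov factorization property then reads

    f(x) = ψ1(x1, x2) ψ2(x2, x3),

and since {x2} separates {x1} from {x3}, the global Markov property yields x1 ⊥⊥ x3 | x2; the same statement also follows directly from the factorization via Lemma A.1.5 below, with h(x1, x2) = ψ1(x1, x2) and k(x3, x2) = ψ2(x2, x3).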

Lemma A.1.5: (Lauritzen, 1996, p. 29) For a continuous density and well-defined conditional densities we have that

xi ⊥⊥ xj |xk ⇐⇒ f(xi, xj , xk) = h(xi, xk)k(xj , xk)

for suitable h and k.

Proof. We actually have

f(xi, xj, xk) = f(xi, xj | xk) f(xk) = f(xi | xk) f(xj | xk) f(xk) = h(xi, xk) k(xj, xk).

Hence, reading the chain of equalities in one direction or the other gives the respective direction of the proof.


Lemma A.1.6: (Lauritzen, 1996, p. 29) If a joint density on the set V, with respect to a product measure, is positive and continuous, then for three random variables xi, xj and xk we have that

xi ⊥⊥ xj |xk and xi ⊥⊥ xk|xj ⇒ xi ⊥⊥ xj , xk.

Proof. (Lauritzen, 1996, pp. 29-30) Assume that the random variables xi, xj and xk have a continuous and positive joint density p(xi, xj, xk) > 0 and, moreover, assume that xi ⊥⊥ xj | xk and xi ⊥⊥ xk | xj. Then, using the first conditional independence statement and Lemma A.1.5, p(xi, xj, xk) = p(xi | xj, xk) p(xj, xk) = f(xi, xk) g(xj, xk) for positive and continuous f and g. Similarly, using the second conditional independence statement and Lemma A.1.5, p(xi, xj, xk) = k(xi, xj) l(xj, xk) for positive and continuous k and l. Thus, p(xi, xj, xk) = f(xi, xk) g(xj, xk) = k(xi, xj) l(xj, xk) and hence,

k(xi, xj) = f(xi, xk) g(xj, xk) / l(xj, xk).

Since the left-hand side does not depend on xk, we can fix xk = x_k^0 and set π(xi) = f(xi, x_k^0) and ρ(xj) = g(xj, x_k^0) / l(xj, x_k^0), which gives k(xi, xj) = π(xi) ρ(xj). Hence, we obtain

p(xi, xj, xk) = k(xi, xj) l(xj, xk) = π(xi) ρ(xj) l(xj, xk),

which proves the result.

We proved this result because the implied condition is exactly what we are going to use in order to establish the equivalence between the Markov properties. Indeed, we show the equivalence of the four properties defined in Definition A.1.4 by proving the chain 4. ⇒ 3. ⇒ 2. ⇒ 1. ⇒ 4., assuming that P has a density p which is positive and continuous with respect to a product measure.

Theorem A.1.7: Let G = (V, E) be an undirected graph and let P be a probability measure on V which has a positive and continuous density p with respect to a product measure µ. Then the four properties defined in Definition A.1.4 are equivalent.

Before we start with the proof, note that, without loss of generality, we can assume that only maximal cliques appear in Equation A.1.1: the maximal cliques form a possible decomposition into complete subsets and are thus a valid choice.

Proof of Theorem A.1.7. (Lauritzen, 1996, pp. 35-37) As announced, we begin with 4. ⇒ 3.
"4. ⇒ 3." Let A, B and S be any triple of disjoint subsets of V. Let Ã be the union of the connected components of the subgraph of G induced by V_S = V \ S which contain elements of A, and define B̃ = V \ (Ã ∪ S). Since A and B are separated by S,1 their elements lie in different connected components of this subgraph.

1 Remember that separation requires at least one node of the separating set on every path between any element of A and any element of B. There is no "free" separation due to v-structures as in d-separation.


Therefore, any clique of G is contained in either Ã ∪ S or B̃ ∪ S. Denote by C_Ã the set of (indices of) cliques contained in Ã ∪ S. Then,

p(x) = ∏_{i=1}^{k} ψi(x) = ∏_{i ∈ C_Ã} ψi(x) · ∏_{i ∈ C_Ã^c} ψi(x) = f(x_{Ã∪S}) g(x_{B̃∪S}).

Hence, Ã ⊥⊥ B̃ | S follows from Lemma A.1.5. Applying indicator functions, which are measurable functions, to both f and g, we obtain2 that A ⊥⊥ B | S.
"3. ⇒ 2." For this implication it is enough to note that, for any x ∈ V, the sets {x}, V \ ({x} ∪ x_{Adj(x)}) and x_{Adj(x)} are disjoint and that

{x} Sep V \ ({x} ∪ x_{Adj(x)}) | x_{Adj(x)}

in G.
"2. ⇒ 1." First note that xj ∈ V \ ({xi} ∪ Adj(xi)), as xi ∉ Adj(xj). Hence,

Adj(xi) ∪ ( (V \ ({xi} ∪ Adj(xi))) \ {xj} ) = V \ {xi, xj}.

In particular, note that the union is a disjoint union. From this fact and the local Markov property it follows that xi ⊥⊥ V \ ({xi} ∪ x_{Adj(xi)}) | V \ {xi, xj}, since the only difference with the local Markov property's expression is the set we condition on, which is a superset of x_{Adj(xi)}. As xj is contained in V \ ({xi} ∪ x_{Adj(xi)}), it follows that xi ⊥⊥ xj | V \ {xi, xj} holds as well.
"1. ⇒ 4." This last part is known as the Hammersley-Clifford theorem. Let us start by assuming that P satisfies the pairwise Markov property with respect to G = (V, E) and by fixing an arbitrary reference point x* of the state space. Note that, taking the logarithm on both sides of Equation A.1.1, the desired factorization can equivalently be written as log(f(x)) = ∑_{W ⊂ V} φ_W(x), where

    φ_W(x) = log(ψ_W(x)) if W is a complete subset of V, and φ_W(x) = 0 otherwise.

For all W ⊂ V we now define H_W(x) = log(f(x_W, x*_{W^c})), i.e. the argument agrees with x on the coordinates in W and with the fixed reference point x* on the coordinates in W^c. Note that, since x* is fixed, the function H_W depends on x only through x_W. Let us also define φ_W(x) = ∑_{U ⊂ W} (−1)^{|W\U|} H_U(x); also in this case the dependence of φ_W on x is only through x_W. Applying the generalized Möbius inversion formula (see Lauritzen, 1996, p. 239) we obtain that log(f(x)) = H_V(x) = ∑_{W ⊂ V} φ_W(x). If φ_W is zero whenever W is not a complete subset of V, then p(x) factorizes as we wish and the theorem is therefore proved. Take two nodes xi and xj in W such that xi ∉ x_{Adj(xj)}, i.e. W is not complete, and define Ū = V \ {xi, xj} and U = W \ {xi, xj}. Now we can write

    φ_W(x) = ∑_{Z ⊂ U} (−1)^{|U\Z|} ( H_Z(x) − H_{Z∪{xi}}(x) − H_{Z∪{xj}}(x) + H_{Z∪{xi,xj}}(x) ).        (A.1.2)

2 Since independence is preserved under measurable functions.


Now, let us compute

H_{Z∪{xi,xj}}(x) − H_{Z∪{xi}}(x)
    = log [ p(xi, xj, x_Z, x*_{Ū\Z}) / p(xi, x*_j, x_Z, x*_{Ū\Z}) ]                                              (A.1.3)
    = log [ p(xi | xj, x_Z, x*_{Ū\Z}) p(xj, x_Z, x*_{Ū\Z}) / ( p(xi | x*_j, x_Z, x*_{Ū\Z}) p(x*_j, x_Z, x*_{Ū\Z}) ) ]   (A.1.4)
    = log [ p(xi | x_Z, x*_{Ū\Z}) p(xj, x_Z, x*_{Ū\Z}) / ( p(xi | x_Z, x*_{Ū\Z}) p(x*_j, x_Z, x*_{Ū\Z}) ) ]             (A.1.5)
    = log [ p(x*_i | x_Z, x*_{Ū\Z}) p(xj, x_Z, x*_{Ū\Z}) / ( p(x*_i | x_Z, x*_{Ū\Z}) p(x*_j, x_Z, x*_{Ū\Z}) ) ]         (A.1.6)
    = log [ p(x*_i, xj, x_Z, x*_{Ū\Z}) / p(x*_i, x*_j, x_Z, x*_{Ū\Z}) ]                                          (A.1.7)
    = H_{Z∪{xj}}(x) − H_Z(x)                                                                                     (A.1.8)

where the equality in Equation A.1.4 is just the chain rule for probability. The equality in Equation A.1.5 is the crucial point, as we used the pairwise Markov property; this is possible since Z ∪ (Ū \ Z) = V \ {xi, xj} and xi ∉ x_{Adj(xj)}. Note also that for a complete set W this would not have been possible, as xi would have been adjacent to xj. In the fourth line, Equation A.1.6, we just replaced a ratio equal to 1 with another ratio equal to 1. In the last two lines the same arguments are applied in the opposite order. What we have achieved is that all the addends in Equation A.1.2 are zero when W is not a complete set. Thus, the proof is completed.

A.2 Equivalence of the Markov conditions for DAGs under separation

Now we can state the Markov properties for DAGs and show their equivalence with respect to separation. After this is done, we link separation and d-separation in order to obtain the desired result. A difference for DAGs is that the pairwise Markov property is not equivalent to the other three; it is therefore insufficient as an assumption. For an example we refer to Lauritzen (1996, p. 51).

Definition A.2.1: (Lauritzen, 1996, pp. 46-47, 50) A probability distribution P is said to satisfy

1. the pairwise Markov property with respect to the DAG G = (V, E) if for any two xi and xj such that xi ∉ x_{Adj(xj)} and xj ∈ x_{Nd(xi)} it holds that xi ⊥⊥ xj | x_{Nd(xi)} \ {xj}.

2. the local Markov property with respect to the DAG G = (V, E) if for any xi ∈ V, xi ⊥⊥ x_{Nd(xi)} \ x_{Pa(xi)} | x_{Pa(xi)}.


3. the global Markov property with respect to the DAG G = (V, E) if for any three disjoint subsets A, B and S of V such that A Sep B | S in M(G̃), where G̃ is the (smallest) ancestral graph of the set A ∪ B ∪ S, it follows that A ⊥⊥ B | S.

4. the Markov factorization property with respect to the DAG G = (V, E) if for any xi ∈ V there exists a non-negative function ki : Xi × X_{Pa(xi)} → R_{>0} such that ∫ ki(xi, x_{Pa(xi)}) µi(dxi) = 1 and P has density p with respect to a product measure µ, where p(x) = ∏_{xi ∈ V} ki(xi, x_{Pa(xi)}). (A short illustrative example is given below.)
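As a minimal illustration of the factorization property (our own example): for the DAG x1 → x2 → x3 we have Pa(x1) = ∅, Pa(x2) = {x1} and Pa(x3) = {x2}, so the factorization reads

    p(x) = k1(x1) k2(x2, x1) k3(x3, x2) = p(x1) p(x2 | x1) p(x3 | x2),

and, for instance, the local Markov property for x3 states x3 ⊥⊥ x1 | x2, since Nd(x3) \ Pa(x3) = {x1}.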

Theorem A.2.2: (Lauritzen, 1996, p. 51) Let G = (V, E) be a DAG and let P be a distribution function on V which admits a density p with respect to a product measure µ. Then all but the first property stated in Definition A.2.1 are equivalent.

Proof. (Lauritzen, 1996, p. 51) We will prove the chain 4. ⇒ 3. ⇒ 2. ⇒ 4.
"4. ⇒ 3." This implication follows from the fact that if P satisfies the Markov factorization property with respect to the DAG G = (V, E), then P satisfies the Markov factorization property with respect to M(G); indeed, for any xi ∈ V, {xi} ∪ x_{Pa(xi)} is complete in M(G) and we can therefore set ψ_{{xi}∪x_{Pa(xi)}} = ki. Hence, by Theorem A.1.7, P satisfies the global Markov property.
"3. ⇒ 2." For this implication it is enough to note that, for any xi ∈ V, {xi} ∪ Nd(xi) is an ancestral set, that xi Sep Nd(xi) \ Pa(xi) | Pa(xi), and that these three sets are disjoint.
"2. ⇒ 4." For this last implication we use an induction argument on the size of |V|. For |V| = 1 there is nothing to prove, as p(x1) is already factorized, being a marginal. Now we proceed with the induction step p − 1 → p. Pick a node without outgoing edges3 and denote it, without loss of generality4, by xp. Let kp = p(xp | x1, . . . , xp−1) be the conditional density given V \ {xp}. Using the local Markov property, kp = p(xp | x_{Pa(xp)}). Then p(x) = p(x1, . . . , xp) = p(xp | x1, . . . , xp−1) p(x1, . . . , xp−1) = kp p(x1, . . . , xp−1). By the induction hypothesis p(x1, . . . , xp−1) factorizes, and thus we are done.

A.3 Equivalence of the Markov conditions for DAGs under d-separation

There is only one last step left. We have described the equivalence with respect to separation but not with respect to d-separation. In particular, we are interested in the global Markov property with respect to d-separation. For this we have the following lemma.

Lemma A.3.1: (Lauritzen, 1996, p. 48) Let G = (V, E) be a DAG and let A, B and S be three disjoint subsets of V. Then A d-Sep B | S ⇐⇒ A Sep B | S in the moralized graph of the (smallest) ancestral graph of A ∪ B ∪ S.

3 Such a node has to exist in a DAG, since otherwise we could construct a directed cycle. See also the argumentation in the proof of Lemma 2.1.16.

4 The order does not play a role here; we are not using any property of the topological order of the DAG.

Proof. Assume that A and B are not d-separated by S. One possibility is that there is a path without v-structures of which S contains no node; in this case the same path is not blocked in the moralized graph either. Another possibility is an unblocked path containing a v-structure. This would imply that the v-structure node or one of its descendants is in the set S and, at the same time, that none of the other, non-v-structure nodes of the path is in S. If it is one of the descendants that lies in S, then in the moralized graph there is no node blocking the path. If it is exactly the v-structure node that lies in S, then, once the sub-graph has been moralized, there will certainly be an edge between the parents of the v-structure node; in this way also this kind of path is not blocked. The last important point needed to make this argument sound is to check that all those paths are included in the ancestral graph of A ∪ B ∪ S. Indeed, any path leaving either A or B will enter the other set, i.e. B or A respectively, or will meet somewhere, creating a v-structure. Since the path is not blocked, this v-structure or one of its descendants lies in S. In this way, all the nodes of such a path are ancestors of the set A ∪ B ∪ S.

[Figure A.1: two panels with nodes A, C, B, S: (a) DAG with the possible unblocked paths; (b) moralized graph with the corresponding unblocked paths.]

Figure A.1: In (a) we have represented the three possible situations for unblocked paths in a DAG. In (b) we have pictured the corresponding situations once the DAG has been moralized. It is straightforward to see that all the possible paths are also unblocked in the moralized DAG.

Also for the converse direction we work with the negation of the claim. Assume therefore that A is not separated from B by S in the moralized graph of the minimal ancestral graph containing A ∪ B ∪ S. Hence, there is a path between A and B not passing through S. This path may contain both edges present also in the DAG and edges marrying the parents of a v-structure in the DAG. If the path contains only edges also present in the DAG and no v-structure, the path is unblocked. The path could also contain only edges also present in the DAG and one (or more) v-structures. In this case the v-structure nodes are surely not in S, since the path is not blocked in the undirected graph; but since such a node has to be an ancestor of one of the three sets, there has to be a directed path from the v-structure node to either A, B or S. In the case where this directed path leads to either A or B, there is an alternative unblocked path in the DAG by construction. If it leads to S, the v-structure has a descendant in S and therefore the path is unblocked in the DAG.


[Figure A.2: four panels with nodes A, B, C, D, E, S: (a) moralized graph with the first possibilities of unblocked paths; (b) DAG with the corresponding possibilities of unblocked paths; (c) moralized graph with the other possibilities of unblocked paths; (d) DAG with the corresponding possibilities of unblocked paths.]

Figure A.2: In (a) we have the starting point in the moralized DAG, leading to the several scenarios represented in (b). The blue path is the simplest and does not need any comment. For the red solid path, we have represented the three possibilities with red dashed edges (or paths); at least one of them has to be present, leading either to a v-structure with a descendant in S or to a directed path without v-structures. In (c) the path we are considering is represented by the red solid edges. Nevertheless, we should be aware that the two black edges are also present, so that in the DAG, figure (d), we have a v-structure in C. Considering one of the red dashed edges (or paths) at a time we can use the reasoning above, and thus we are done.

In case the path contains edges which are not present in the DAG, there is a v-structure outside the path. Since the married edge is not present in the DAG, we can only pass through the v-structure. If this v-structure is in S or has a descendant in S, then the alternative path in the DAG is also unblocked. If not, then there has to be a directed path from this v-structure to either A or B, because this v-structure too has to be an ancestor of one of the three sets. This again creates at least one alternative unblocked path.

Note that in Figure A.1 and Figure A.2 we considered single nodes instead of sets of nodes in order to simplify the notation. The reasoning is the same in case we had sets of nodes instead of single nodes.
