
Computing Integrated Information

Stephan Krohn (1) and Dirk Ostwald (1, 2)

(1) Computational Cognitive Neuroscience Laboratory, Department of Education and Psychology, Freie Universität Berlin

(2) Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin

Abstract

Integrated information theory has received considerable attention as a novel research paradigm for the study of consciousness in recent years. Integrated information theory essentially proposes that quantitative consciousness is identical to integrated information, quantified by a measure called φ. However, this measure has mainly been developed in discrete state systems, impeding its computation based on experimental data. In the current work, based on the most recent instantiation of the theory ("Integrated information theory 3.0"), we make progress towards overcoming this limitation by making the following novel contributions: First, we formulate φ with respect to the general language of probabilistic models. Second, we derive a constructive rule for the decomposition of a system into two disjoint subsystems, central to the definition of φ. Third, we validate this general formulation by evaluating φ in a previously established discrete state example system. Fourth, we show that the general formulation improves parsimony, as it eschews undefined conditional distributions and the need for "virtual elements". Fifth, we transfer φ to the context of continuous linear Gaussian systems, a versatile model class suitable for the modelling of experimental data. Finally, by evaluating φ in parallel with the system state evolution, we explore the temporal dynamics of integrated information in the different model classes. In sum, the present formulation offers the prospect of computing integrated information from experimental data and may serve as a starting point for the further development of integrated information theory in an explicit probabilistic framework.

1 Introduction

Integrated information theory (Tononi, 2004, 2005, 2008, 2012; Oizumi et al., 2014; Tononi, 2015; Tononi et al., 2016) has recently received considerable attention as a novel research paradigm for the study of the physical substrates of consciousness. Integrated information theory essentially proposes that quantitative consciousness, i.e., the degree to which a system is conscious, is identical to its state-dependent level of integrated information, which can be quantified in a measure called "φ". Integration here means that the information generated by the system as a whole is in some measurable sense more than the information generated by its constituent elements and intuitively corresponds to finding an index of the system's functional irreducibility. While this approach is not undisputed (e.g., Aaronson, 2014; Cerullo, 2015), the idea of measuring integrated information has by now gained widespread popularity in the cognitive neuroscience literature and beyond (e.g., Balduzzi and Tononi, 2008, 2009; Deco et al., 2015; Koch et al., 2016; Tegmark, 2016).

Ultimately, the aim of integrated information theory is to evaluate a measure of quantitative consciousness based on empirical data, such as electroencephalographic (EEG) or functional magnetic resonance imaging (fMRI) recordings. However, to date the theory underlying the integrated information measure φ is primarily developed with respect to specific examples, usually low-dimensional discrete state systems that implement logical operations. For the computation of φ for a wide variety of systems and based on different data types, a sufficiently general formulation of φ is required. While there have been some recent efforts to make the computation of φ feasible for empirical data sets (Barrett and Seth, 2011; Oizumi et al., 2016; Tegmark, 2016), no such effort exists for the latest instantiation of integrated information theory, developed in Oizumi et al. (2014) and referred to as "integrated information theory 3.0".

With the current work, we aim to achieve a step in the direction of the experimental applicability of integrated information theory 3.0. Our approach is based on the fundamental idea that the evaluation of integrated information from empirical data can be achieved by estimating the parameters of a probabilistic model from the data, and applying the measure φ to the thus estimated system model (e.g., Ostwald et al., 2010, 2014; Ostwald and Starke, 2016). To compute φ for such a model, we here make the following novel contributions: First, we formulate φ with respect to the general language of probabilistic models, by which we mean joint probability distributions over random entities (e.g., Efron and Hastie, 2016; Gelman et al., 2014; Barber, 2012; Murphy, 2012). Second, we derive a constructive rule for the decomposition of a system into two disjoint subsystems, central to the definition of φ. Third, we validate this general formulation by evaluating φ in a previously established discrete state example system. Fourth, we show that the general formulation improves parsimony, as it eschews undefined conditional probability distributions and the need for "virtual elements" as introduced by Oizumi et al. (2014). Fifth, we transfer φ to the context of continuous linear Gaussian systems (LGS), a versatile model class suitable for the modelling of real-life experimental data (e.g., Roweis and Ghahramani, 1999). Finally, by evaluating φ in parallel with the system state evolution, we explore the temporal dynamics of integrated information in the different model classes. With respect to the refinements of integrated information theory 3.0, we limit ourselves to the treatment of φ ("small phi") throughout. We eschew a detailed discussion of Φ ("capital phi"), because the development of the latter essentially corresponds to a reiteration of the framework for φ over all possible subsets of a system. For a detailed differentiation of the various "phi" measures proposed in the context of integrated information theory 3.0 and their relation to the φ discussed herein, please see Appendix A. Note that henceforth, we shall use the abbreviation "IIT" to refer exclusively to integrated information theory 3.0 as developed in Oizumi et al. (2014), unless noted otherwise.

We proceed as follows. In Section 2 Computing φ : Theory, we first discuss the scope of integrated information and provide an explicit technical definition of φ in the subsections System model and Definition of φ. In a hierarchical fashion, we then detail the conceptual and technical underpinnings of the constituent elements of this definition in the subsections Effect and cause repertoires and System decomposition. Having established a general definition of φ, we then put this definition to work in Section 3 Computing φ : Applications. To validate our approach, we first apply the general framework in a discrete state system setting, rooted in the development of IIT (Oizumi et al., 2014). Subsequently, we apply the same general framework to the model class of linear Gaussian systems and implement the evaluation of φ in Gaussian example systems. Finally, we discuss our approach for computing φ with respect to related work in the literature and potential future developments.

Notation and Implementation

A few remarks on our notation of probabilistic concepts are in order. To denote random variables/vectors and their probability distributions, we use an applied notation throughout. This means that we eschew a discussion of a random entity's underlying measure-theoretic probability space model (e.g., Billingsley, 2008) and focus on the random entity's outcome space and probability distribution. For a random variable/vector X, we denote its distribution by p(X), implicitly assuming that this may be represented either by a probability mass or a probability density function. To denote different distributions of the same random variable/vector, we employ subscripts. For example, p_a(X) is to indicate a probability distribution of X that is different from another probability distribution p_b(X). In the development of φ, stochastic (conditional) dependencies between random variables are central. To this end, we use the common notation that the statement p(X|Y) = p(X) is meant to indicate the stochastic independence of X from Y, and the statement p(X|Y, Z) = p(X|Z) is meant to indicate the (stochastic) conditional independence of X from Y given Z (Dawid, 1979; Geiger et al., 1990). Finally, we use N(X; µ, Σ) to denote the Gaussian probability density function of a d-dimensional random vector X with expectation parameter µ ∈ R^d and positive definite covariance matrix parameter Σ ∈ R^{d×d}, I_d denotes the d × d identity matrix, and 1_d denotes a d-dimensional vector of all ones. All Matlab code (The MathWorks, Inc., Natick, MA, United States) developed to implement our formulation of IIT and generate the technical figures herein is available from the Open Science Framework (https://osf.io/hb4a5/).

2 Computing φ : Theory

System model

IIT models the temporal evolution of a system by a discrete time multivariate stochastic process (Cox and Miller, 1977)

p(X_1, X_2, ..., X_T).    (1)

In the probabilistic model (1), X_t, t = 1, ..., T denotes a finite set of d-dimensional random vectors. Here, the limitation to a finite set of discrete time points is primarily motivated by the eventual goal to apply the concepts of IIT in a data-analytical setting, not by inherent constraints of IIT. Each random vector X_t comprises random variables x_{t_i}, i = 1, 2, ..., d (d ∈ N) that may take on values in one-dimensional outcome spaces 𝒳_1, 𝒳_2, ..., 𝒳_d, such that

X_t = (x_{t_1}, x_{t_2}, ..., x_{t_d})^T    (2)

may take on values in the d-dimensional outcome space 𝒳 := ∏_{i=1}^{d} 𝒳_i. We assume 𝒳 ⊆ R^d throughout. IIT further assumes that the stochastic process fulfils the Markov property, i.e., that the probabilistic model (1) factorizes according to

p(X_1, X_2, ..., X_T) = p(X_1) ∏_{t=2}^{T} p(X_t | X_{t-1}),    (3)

and that the ensuing Markov process is time-invariant, i.e. that all conditional probability distributions p(X_t | X_{t-1}) on the right-hand side of eq. (3) are identical (Figure 1A). We will refer to p(X_t | X_{t-1}) as the system's transition probability distribution in the following. Finally, IIT assumes that the random variables constituting X_t are conditionally independent given X_{t-1}, i.e., that the conditional distribution p(X_t | X_{t-1}) factorizes according to

p(x_{t_1}, x_{t_2}, ..., x_{t_d} | X_{t-1}) = ∏_{i=1}^{d} p(x_{t_i} | X_{t-1}).    (4)

Definition of φ

Based on the assumptions of eqs. (1), (3) and (4), IIT defines the integrated information measure φ of a system state X ∈ 𝒳 as follows:

φ : 𝒳 → R, X ↦ φ(X) := min {φ_e(X), φ_c(X)},    (5)

where φ_e : 𝒳 → R and φ_c : 𝒳 → R are defined as

φ_e(X) := min_{i∈I} { D( p_e(X_t | X_{t-1} = X) || p_e^{(i)}(X_t | X_{t-1} = X) ) }    (6)

and

φ_c(X) := min_{i∈I} { D( p_c(X_{t-1} | X_t = X) || p_c^{(i)}(X_{t-1} | X_t = X) ) },    (7)

respectively. In (6) and (7),

• φ_e(X) and φ_c(X) are referred to as the integrated effect information and the integrated cause information of the system state X,

• p_e(X_t | X_{t-1} = X) and p_c(X_{t-1} | X_t = X) are conditional probability distributions that are constructed from the transition probability distribution p(X_t | X_{t-1}) of the stochastic process as detailed below and are referred to as the effect repertoire and cause repertoire of X, respectively,

• p_e^{(i)}(X_t | X_{t-1} = X) and p_c^{(i)}(X_{t-1} | X_t = X) are "decomposed variants" of the effect and cause repertoires that result from the removal of potential stochastic dependencies in the system's transition probability distribution, as detailed below,


Figure 1: System model and system decomposition. IIT models the system of interest by a time-invariant first-order Markov process, depicted in Panel A as a graphical model (e.g., Bishop, 2006). Nodes denote random vectors and directed links denote the stochastic dependence of the child node on the parent node. Panels B and C display the exemplary decomposition of a three-dimensional system with state vector X_t := (a_t, b_t, c_t) as a graphical model. Here, nodes denote the constituent random variables of the random vectors X_{t-1} and X_t. Panel B depicts the unpartitioned system, in which all potential stochastic dependencies of the elements are visualized. The constituent random variables of X_t are conditionally independent given X_{t-1} (cf. eq. (4)), and the joint distribution p_{ce}(X_{t-1}, X_t) is invoked by the assumption of an uncertain marginal distribution p_u(X_{t-1}) for each t = 2, ..., T. Panel C shows an exemplary decomposition of the system, which is based on the bipartition of (X_{t-1}, X_t) into the subsets Π_1^{(i)} = {a_{t-1}, b_{t-1}, b_t, c_t} (gray inset) and Π_2^{(i)} = {a_t, c_{t-1}}. In the factorized joint distribution p_{ce}^{(i)}(X_{t-1}, X_t) the directed links across the partition boundary are removed, while the links within each partition remain.

• I is an index set, the elements of which index the "decomposed variants" of the effect and cause repertoires, and

• D : 𝒫 × 𝒫 → R_+, (p_1, p_2) ↦ D(p_1 || p_2) denotes a divergence measure between (conditional) probability distributions over the same random entity, with 𝒫 indicating the set of all possible distributions of this entity. In practice, D corresponds to the earth mover's distance (EMD) for discrete state systems (Levina and Bickel, 2001; Mallows, 1972) and to the Kullback-Leibler divergence (Kullback and Leibler, 1951) for continuous state systems.

We next discuss the intuitive and technical underpinnings of the constituents of the definition of φ by eqs. (5) to (7) in further detail.

Effect and cause repertoires

Intuitively, IIT aims to quantify how a given system state constrains both future and previous system states. Formally, this is achieved by constructing the effect and cause repertoire conditional probability distributions as follows.

The effect repertoire is defined as the conditional distribution of X_t given X_{t-1} and thus corresponds to the stochastic process' transition probability distribution

p_e(X_t | X_{t-1}) := p(X_t | X_{t-1}).    (8)

To construct the cause repertoire, IIT implicitly defines a joint distribution p_{ce} over X_{t-1} and X_t by multiplication of the Markov transition probability distribution p(X_t | X_{t-1}) with a marginal distribution p_u(X_{t-1}), i.e.

p_{ce}(X_{t-1}, X_t) := p_u(X_{t-1}) p(X_t | X_{t-1}).    (9)

Here, the marginal distribution p_u(X_{t-1}) is meant to represent a maximum of uncertainty about X_{t-1} and, for the case of a finite outcome space 𝒳, corresponds to the uniform distribution over all states. Based on the joint distribution of (9), the cause repertoire is then defined as the conditional distribution of X_{t-1} given X_t:

p_c(X_{t-1} | X_t) := p_{ce}(X_{t-1}, X_t) / ∑_{X_{t-1}} p_{ce}(X_{t-1}, X_t).    (10)

Note that the effect repertoire may also be recovered from (9) by conditioning on X_{t-1}.
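To make the construction of eqs. (9) and (10) concrete for a discrete state system, the following is a minimal numpy sketch. It assumes the transition probability distribution is available as a matrix tpm with tpm[i, j] = p(X_t = j | X_{t-1} = i) over an enumeration of the outcome space; the function name and layout are illustrative and not part of IIT.

    import numpy as np

    def cause_repertoire(tpm, x_t):
        """p_c(X_{t-1} | X_t = x_t) under a uniform p_u(X_{t-1}), cf. eqs. (9)-(10)."""
        n_states = tpm.shape[0]
        p_u = np.full(n_states, 1.0 / n_states)   # maximally uncertain past states
        joint_col = p_u * tpm[:, x_t]             # p_ce(X_{t-1}, X_t = x_t), eq. (9)
        norm = joint_col.sum()                    # denominator of eq. (10)
        if norm == 0:                             # x_t unreachable: repertoire undefined
            return np.full(n_states, np.nan)      # (cf. the red tiles in Figure 2E below)
        return joint_col / norm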

System decomposition

IIT's central concept of integrated information is based on the intuition that integrated information is "information generated by the system above and beyond the information generated by its parts" (Oizumi et al., 2014). To evaluate integrated information, IIT first considers all possible ways to decompose a system into two subsystems that do not influence each other. IIT then aims to identify the system decomposition which, for a given system state, is most similar to the actual system in terms of the divergence between the system state's effect and cause repertoires (cf. eqs. (6) and (7)). In the following, we first specify what this intuitive notion of integration means in technical terms with respect to the system model of eqs. (1), (3) and (4). We then present a formal rule for the evaluation of the necessary decompositions.

In technical terms, the "system" to be decomposed corresponds to the collection of random variables and their conditional dependencies that define the discrete time multivariate stochastic process (cf. eq. (1)). Because of the process' time-invariant Markov property (cf. eq. (3)), the relevant random variables are the constituents of two time-adjacent random vectors X_{t-1} and X_t. As seen above, based on an uncertain marginal distribution over X_{t-1}, one may define a joint distribution p_{ce}(X_{t-1}, X_t) of these vectors for each t = 2, ..., T. Note that the joint distribution p_{ce}(X_{t-1}, X_t) can equivalently be regarded as a joint distribution over the set of all constituent random variables of the random vectors X_{t-1} and X_t,

(X_{t-1}, X_t) := {x_{t-1_1}, x_{t-1_2}, ..., x_{t-1_d}, x_{t_1}, x_{t_2}, ..., x_{t_d}}.    (11)

IIT then uses the intuitive appeal of graphical models (Lauritzen, 1996; Jordan, 1998) to introduce the idea of "cutting a system" into two independent parts. Technically, cutting the graphical model of p_{ce}(X_{t-1}, X_t) corresponds to (a) partitioning the set of random variables in eq. (11) into two disjoint subsets and (b) removing all stochastic dependencies across the boundary between the resulting random variable subsets while retaining all conditional dependencies within each subset (Figure 1B and 1C). Notably, there are k := 2^{2d-1} - 1 unique ways to bipartition a set of cardinality 2d (Appendix B). This corresponds to k ways of cutting the corresponding graphical model and thus induces a set of k differently factorized joint distributions p_{ce}^{(i)}(X_{t-1}, X_t), i = 1, ..., k, which form the basis for the decomposed effect and cause repertoires p_e^{(i)}(X_t | X_{t-1}) and p_c^{(i)}(X_{t-1} | X_t) in the definition of φ (cf. eqs. (6) and (7)).
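As a sanity check on the count k = 2^{2d-1} - 1, the unique bipartitions can be enumerated exhaustively; the following short Python sketch (generator name ours, not IIT terminology) fixes element 1 in the first subset to avoid double counting.

    from itertools import combinations

    def bipartitions(n):
        """Yield each unordered bipartition (I1, I2) of {1, ..., n} exactly once
        by always keeping element 1 in I1 (cf. Appendix B)."""
        rest = list(range(2, n + 1))
        for r in range(n - 1):
            for extra in combinations(rest, r):
                I1 = {1, *extra}
                I2 = set(rest) - set(extra)
                yield I1, I2

    d = 3  # three system elements, hence 2d = 6 variables in (X_{t-1}, X_t)
    assert sum(1 for _ in bipartitions(2 * d)) == 2 ** (2 * d - 1) - 1  # k = 31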

We next formalize the construction of p_{ce}^{(i)}(X_{t-1}, X_t) for i = 1, ..., k. To this end, first recall that a partition of a set S is a family of sets P with the properties

∅ ∉ P,  ⋃_{M∈P} M = S,  and if M, M' ∈ P and M ≠ M', then M ∩ M' = ∅.    (12)

Let Π^{(i)} denote a bipartition of the set of random variables (X_{t-1}, X_t), i.e.

Π^{(i)} := (Π_1^{(i)}, Π_2^{(i)}),    (13)

where

Π_1^{(i)}, Π_2^{(i)} ⊂ (X_{t-1}, X_t),  Π_1^{(i)} ∩ Π_2^{(i)} = ∅,  and  Π_1^{(i)} ∪ Π_2^{(i)} = (X_{t-1}, X_t).    (14)

Let further

p_{ce}(Π_1^{(i)}) = ∑_{Π_2^{(i)}} p_{ce}(X_{t-1}, X_t)  and  p_{ce}(Π_2^{(i)}) = ∑_{Π_1^{(i)}} p_{ce}(X_{t-1}, X_t)    (15)

denote the marginal distributions of p_{ce}(X_{t-1}, X_t) (cf. eq. (9)) of the random variables contained in Π_1^{(i)} and Π_2^{(i)}, respectively. Then the elements of the set of factorized variants of the joint distribution p_{ce}(X_{t-1}, X_t) are given by

p_{ce}^{(i)}(X_{t-1}, X_t) := p_{ce}(Π_1^{(i)}) p_{ce}(Π_2^{(i)})  for i = 1, 2, ..., k.    (16)

In summary, the measure of integrated information φ rests on a standard probabilistic model approach to dynamical systems: a multivariate stochastic process that fulfils the Markov property (cf. eqs. (1), (3) and (4)). On this background, the integrated information of a system state is defined as the minimum of the state's integrated effect and cause information. These quantities are in turn defined as the minima of the divergences between the state-conditional probability distributions, evaluated directly in terms of the system's transition probability distribution (cf. eq. (8)) and from an approximation of the joint distribution over adjacent time points (cf. eq. (9)), and the variants of these distributions that result from the removal of stochastic dependencies in the system's transition probability distribution (cf. eq. (16)). In the following sections, we show how this general definition of φ can be applied in the context of both discrete and continuous state systems.

3 Computing φ : Applications

In the current section, we consider two different types of probabilistic models and evaluate φ in exemplary instances of each type. We first apply the theoretical development of φ to a system with discrete state space which is defined non-parametrically by the explicit definition of the transition probability distribution factors. This system corresponds to the exemplary system discussed in Oizumi et al. (2014) and serves as a validation of our formulation for computing φ. Subsequently, we consider the class of linear Gaussian systems, which constitute a highly versatile model class in the analysis of empirical data, such as EEG or fMRI measurements.

3.1 Discrete state systems

In discrete state systems, the random variables that model the system's elements take on a finite number of states with a certain probability mass. Discrete state systems constitute the main model class that has been employed throughout the development of integrated information theory (Balduzzi and Tononi, 2008, 2009; Oizumi et al., 2014). As an exemplary discrete state system, we consider a system presented in Oizumi et al. (2014). This system is three-dimensional, and, in concordance with Oizumi et al. (2014), we denote its state vector by X_t = (a_t, b_t, c_t) (Figure 2A). The system is defined in terms of the marginal conditional distributions of its component variables (cf. eq. (4)). Specifically, the variables a_t, b_t and c_t may take on values in {0, 1}, such that the outcome space 𝒳 is defined as {0, 1}³, and implement logical operations on the state of their predecessors a_{t-1}, b_{t-1} and c_{t-1}. As shown in Figure 2B, a_t implements a logical OR, b_t implements a logical AND, and c_t implements a logical XOR operation. Note that in this case, the relevant distributions of eq. (3) correspond to probability mass functions, which can be represented on the implementational level by high-dimensional numerical arrays.
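For illustration, the transition probability distribution of this system can be assembled as follows; this is a hypothetical sketch that assumes the wiring of Oizumi et al. (2014), in which each element reads the other two elements of the previous time step.

    import numpy as np
    from itertools import product

    states = list(product((0, 1), repeat=3))   # outcome space {0,1}^3
    tpm = np.zeros((8, 8))                     # tpm[i, j] = p(X_t = j | X_{t-1} = i)
    for i, (a, b, c) in enumerate(states):
        succ = (int(b or c),                   # a_t = OR(b_{t-1}, c_{t-1})
                int(a and c),                  # b_t = AND(a_{t-1}, c_{t-1})
                int(a != b))                   # c_t = XOR(a_{t-1}, b_{t-1})
        tpm[i, states.index(succ)] = 1.0       # deterministic gates: unit probability mass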

As discussed above, the effect repertoire p_e(X_t | X_{t-1}) of the system corresponds to the product of the marginal conditional distributions and is identical with the system's forward transition probability matrix (cf. eq. (4)). This distribution is shown in Figure 2C. The joint distribution p_{ce}(X_{t-1}, X_t) is derived by multiplication of the transition probability with a maximally uncertain distribution over past states p_u(X_{t-1}) (cf. eq. (9)). In this example, the maximally uncertain distribution is given by the uniform distribution over past states, i.e. p_u(X_{t-1} = X*) := |{0, 1}³|^{-1} for all X* ∈ {0, 1}³ (cf. Figure 2D). From the ensuing joint distribution p_{ce}(X_{t-1}, X_t) = p(a_{t-1}, b_{t-1}, c_{t-1}, a_t, b_t, c_t), the cause repertoire p_c(X_{t-1} | X_t) can be evaluated by conditioning on X_t. The resulting distribution is shown in Figure 2E. Note that there are some undefined entries (displayed in red). These undefined entries correspond to system states that cannot have been caused by any previous state due to the constraints placed by the logical operations of the system variables.

Figure 2: Evaluation of φ in an exemplary discrete state system. The system is identical to that presented in Oizumi et al. (2014) (e.g., Figures 1 and 4 therein). Panel A shows the system comprising three random variables that implement the logical operations OR, AND, and XOR. Panel B visualizes the corresponding marginal conditional probability distributions, with black tiles indicating a probability mass of 0 and white tiles indicating a probability mass of 1. The product of these marginal conditional probability distributions yields the conditional distribution p(X_t | X_{t-1}) depicted in Panel C, i.e. the state transition probability matrix, which is identical to the effect repertoire p_e(X_t | X_{t-1}). By multiplication with a maximally uncertain distribution over past states, i.e. p_u(X_{t-1}), the joint distribution p_{ce}(X_{t-1}, X_t) of Panel D is obtained. Here, dark gray tiles indicate a probability mass of 0.125. For the current example, p_u(X_{t-1}) corresponds to the uniform distribution over system states. The joint distribution forms the basis for all k factorizations in the calculation of φ. Finally, conditioning p_{ce}(X_{t-1}, X_t) on X_t yields the cause repertoire p_c(X_{t-1} | X_t) as shown in Panel E. Here, white tiles indicate a probability mass of 1, gray tiles a probability mass of 0.5, and red tiles represent undefined entries. These entries correspond to states of X_t that cannot have been caused by any of the states of X_{t-1} due to the logical structure of the network.

Decomposed effect and cause repertoires

To compute φ, the decomposed effect and cause repertoires associated with the factorized joint distributions p_{ce}^{(i)}(X_{t-1}, X_t), i = 1, ..., k (cf. eq. (16)) have to be evaluated by conditioning these distributions on X_{t-1} or X_t, respectively. Notably, for each bipartition of (X_{t-1}, X_t) the factorization of the joint distribution p_{ce}(X_{t-1}, X_t) (cf. eq. (9)) also induces a factorization of the respective marginal distributions of the constituent random variables of X_{t-1} or X_t, which needs to be accounted for. In the following, we first specify how the resulting decomposed effect and cause repertoires can be evaluated in a well-defined and general manner and then illustrate this process in the example system under consideration.

Let Θ ⊂ (X_{t-1}, X_t) denote the set of random variables that the factorized joint distribution p_{ce}^{(i)}(X_{t-1}, X_t) (cf. eq. (16)) is to be conditioned on, i.e. Θ := (X_t) or Θ := (X_{t-1}). Let further

p_{ce}(Θ) = ∑_{(X_{t-1}, X_t) \ Θ} p_{ce}(X_{t-1}, X_t)    (17)

be the marginal distribution of eq. (9) of the random variables in Θ. Define

Θ_1^{(i)} = Θ ∩ Π_1^{(i)}  and  Θ_2^{(i)} = Θ ∩ Π_2^{(i)}.    (18)

Note that because (Π_1^{(i)}, Π_2^{(i)}) is a partition of (X_{t-1}, X_t), it follows that

Θ_1^{(i)} ∪ Θ_2^{(i)} = Θ  and  Θ_1^{(i)} ∩ Θ_2^{(i)} = ∅,    (19)

and that, possibly, Θ_1^{(i)} = ∅ or Θ_2^{(i)} = ∅. By definition, the marginal distributions of p_{ce}(Θ) with respect to Θ_1^{(i)} and Θ_2^{(i)} are given by

p_{ce}(Θ_1^{(i)}) = ∑_{Θ_2^{(i)}} p_{ce}(Θ)  and  p_{ce}(Θ_2^{(i)}) = ∑_{Θ_1^{(i)}} p_{ce}(Θ).    (20)

Finally, define

p_{ce}(Θ^{(i)}) := { p_{ce}(Θ_1^{(i)})                     if Θ_2^{(i)} = ∅
                   { p_{ce}(Θ_2^{(i)})                     if Θ_1^{(i)} = ∅
                   { p_{ce}(Θ_1^{(i)}) p_{ce}(Θ_2^{(i)})   otherwise.    (21)

Based on eq. (15) and eq. (21), the decomposed effect and cause repertoires in the definition of φ (cf. eqs. (6) and (7)) are then defined as

p_e^{(i)}(X_t | X_{t-1}) = p_{ce}(Π_1^{(i)}) p_{ce}(Π_2^{(i)}) / p_{ce}(Θ^{(i)})  for Θ = (X_{t-1})    (22)

and

p_c^{(i)}(X_{t-1} | X_t) = p_{ce}(Π_1^{(i)}) p_{ce}(Π_2^{(i)}) / p_{ce}(Θ^{(i)})  for Θ = (X_t).    (23)

Note that for the evaluation of φ_e and φ_c, the random variables constituting X_{t-1} and X_t on the right-hand sides of eqs. (22) and (23) have to be set to the values defined by the system state X.
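For illustration, the factorization of eq. (16) and the conditioning of eqs. (21) and (22) reduce to marginalization and broadcasting over a joint probability array. The following minimal numpy sketch assumes d = 3 binary variables stored in the axis order of eq. (11), with axes 0-2 indexing X_{t-1} and axes 3-5 indexing X_t; all function names and the placeholder joint are illustrative.

    import numpy as np

    def factorized_joint(p_ce, part1_axes):
        """Eq. (16): product of the marginals of p_ce over a bipartition."""
        part2_axes = tuple(ax for ax in range(p_ce.ndim) if ax not in part1_axes)
        m1 = p_ce.sum(axis=part2_axes, keepdims=True)         # p_ce(Pi_1^(i))
        m2 = p_ce.sum(axis=tuple(part1_axes), keepdims=True)  # p_ce(Pi_2^(i))
        return m1 * m2                 # broadcasting restores the full shape

    def decomposed_effect_repertoire(p_ce, part1_axes, d=3):
        """Eqs. (21)-(22): condition the factorized joint on Theta = X_{t-1}."""
        p_i = factorized_joint(p_ce, part1_axes)
        denom = p_i.sum(axis=tuple(range(d, 2 * d)), keepdims=True)  # p_ce(Theta^(i))
        return np.divide(p_i, denom, out=np.full_like(p_i, np.nan), where=denom > 0)

    # Example: the bipartition of eq. (25) / Figure 1C, with variable order
    # (a_{t-1}, b_{t-1}, c_{t-1}, a_t, b_t, c_t): Pi_1 = {a_{t-1}, b_{t-1}, b_t, c_t}.
    p_ce = np.full((2,) * 6, 1 / 64)   # placeholder joint; any array summing to one
    p_e_1 = decomposed_effect_repertoire(p_ce, part1_axes=(0, 1, 4, 5))

The decomposed cause repertoire follows analogously by conditioning on Θ = (X_t), i.e. summing over axes 0 to d-1 in the denominator.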

Example

To illustrate the above, we consider the exemplary system of Figure 2. Here, the concatenated state vector over two adjacent time points is given by (cf. eq. (11))

(X_{t-1}, X_t) = {a_{t-1}, b_{t-1}, c_{t-1}, a_t, b_t, c_t}.    (24)

One of the k = 2^{6-1} - 1 = 31 bipartitions of eq. (24) (which we label here as i := 1) is given by

Π_1^{(1)} = {a_{t-1}, b_{t-1}, b_t, c_t}  and  Π_2^{(1)} = {c_{t-1}, a_t}.    (25)

Note that this corresponds to the partition depicted in Figure 1. Hence, with eq. (15),

p_{ce}(Π_1^{(1)}) = p_{ce}(a_{t-1}, b_{t-1}, b_t, c_t)  and  p_{ce}(Π_2^{(1)}) = p_{ce}(c_{t-1}, a_t).    (26)

For the decomposed effect repertoire corresponding to the bipartition given in eq. (25), we have Θ = (X_{t-1}), and thus obtain

Θ_1^{(1)} = {a_{t-1}, b_{t-1}}  and  Θ_2^{(1)} = {c_{t-1}}.    (27)

Because Θ_1^{(1)} ≠ ∅ and Θ_2^{(1)} ≠ ∅, we have (cf. eq. (20) and eq. (21))

p_{ce}(Θ^{(1)}) = p_{ce}(a_{t-1}, b_{t-1}) p_{ce}(c_{t-1}),    (28)

and finally

p_e^{(1)}(X_t | X_{t-1}) = p_{ce}(a_{t-1}, b_{t-1}, b_t, c_t) p_{ce}(c_{t-1}, a_t) / ( p_{ce}(a_{t-1}, b_{t-1}) p_{ce}(c_{t-1}) )
                         = p_{ce}(b_t | a_{t-1}, b_{t-1}) p_{ce}(c_t | a_{t-1}, b_{t-1}) p_{ce}(a_t | c_{t-1}).    (29)

Further, for the decomposed cause repertoire resulting from eq. (25), we have Θ = (X_t), and thus obtain with eq. (18)

Θ_1^{(1)} = {b_t, c_t}  and  Θ_2^{(1)} = {a_t}.    (30)

Because Θ_1^{(1)} ≠ ∅ and Θ_2^{(1)} ≠ ∅, we have

p_{ce}(Θ^{(1)}) = p_{ce}(b_t, c_t) p_{ce}(a_t),    (31)

and finally

p_c^{(1)}(X_{t-1} | X_t) = p_{ce}(a_{t-1}, b_{t-1}, b_t, c_t) p_{ce}(c_{t-1}, a_t) / ( p_{ce}(b_t, c_t) p_{ce}(a_t) )
                         = p_{ce}(a_{t-1}, b_{t-1} | b_t, c_t) p_{ce}(c_{t-1} | a_t).    (32)

For further illustration of this constructive process, a more exhaustive example is provided in the Supporting Information (Figures S1 and S2). There, the effect and cause repertoires corresponding to the seven bipartitions of a two-dimensional system are considered in detail.

Computation of φ

Based on the above, we are now in the position to compute φ for arbitrary system states in the exemplary system. As an example, Figure 3 depicts the evaluation of φ(X) for the system state X = (1, 0, 0). Specifically, Figure 3A depicts the probability mass functions underlying the evaluation of the integrated effect information φ_e((1, 0, 0)) and Figure 3B depicts the probability mass functions underlying the evaluation of the integrated cause information φ_c((1, 0, 0)). The right panel of Figure 3A depicts the decomposed variant of the effect repertoire which results in the minimum EMD φ_e((1, 0, 0)) = 0.25 between the original and decomposed effect repertoire variants (cf. eq. (6)). Notably, for the current system there are 2^{6-1} - 1 = 31 possible ways to bipartition the graphical model of p_{ce}(X_{t-1}, X_t), and hence 31 versions of p_e^{(i)}(X_t | X_{t-1}). For the system state X = (1, 0, 0), the ensuing partitions and distributions are shown in Figure S3. Likewise, the left panel of Figure 3B depicts the cause repertoire of the system state X = (1, 0, 0), which directly follows from the distribution shown in Figure 2E. The right panel of Figure 3B depicts a decomposed variant of the cause repertoire which results in a minimum EMD of φ_c((1, 0, 0)) = 0.5 between the original and decomposed cause repertoire variants (cf. eq. (7)). As shown in Figure S4, there are in fact seven partitions and ensuing distributions for which this minimum EMD is attained. Evaluating the minimum (cf. eq. (5)) then results in an integrated information of

φ((1, 0, 0)) = 0.25    (33)

for the current system and system state of interest. Notably, this value is identical to that reported in Figure 6 of Oizumi et al. (2014).
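For completeness, the divergence underlying these values is the earth mover's distance. A small sketch, assuming a Hamming ground metric between the binary states (as is common in IIT implementations), phrases the EMD as a transportation linear program:

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    states = list(product((0, 1), repeat=3))
    # Ground metric: Hamming distance between binary states.
    D = np.array([[sum(u != v for u, v in zip(s, t)) for t in states] for s in states])

    def emd(p, q):
        """Earth mover's distance between distributions p and q over `states`."""
        n = len(states)
        A_eq, b_eq = [], []
        for i in range(n):                 # mass leaving source state i
            row = np.zeros((n, n)); row[i, :] = 1
            A_eq.append(row.ravel()); b_eq.append(p[i])
        for j in range(n):                 # mass arriving at target state j
            col = np.zeros((n, n)); col[:, j] = 1
            A_eq.append(col.ravel()); b_eq.append(q[j])
        res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=(0, None), method="highs")
        return res.fun

φ_e((1, 0, 0)) is then the minimum of emd(repertoire, decomposed variant) over all 31 bipartitions (cf. eq. (6)), and φ_c((1, 0, 0)) follows analogously via eq. (7).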


Figure 3: State-specific evaluation of φ in an exemplary discrete state system. The figure depicts the probability mass functions required for the evaluation of φ for the system state X = (1, 0, 0) in the discrete state system introduced in Figure 2. Panel A visualizes the relevant distributions required for the evaluation of φ_e((1, 0, 0)) (cf. eq. (6)) and Panel B visualizes the relevant distributions required for the evaluation of φ_c((1, 0, 0)) (cf. eq. (7)). The minimum of these two values is defined as the integrated information φ((1, 0, 0)) = 0.25 (cf. eq. (5)) of the system state.

As a more exhaustive example, we considered the behaviour of φ as the exemplary system evolves over time. To this end, we precomputed φ for each of the eight system states of the example system and monitored the state evolution for each possible initial state X_1 ∈ 𝒳 for t = 1, ..., 30. Figure 4 summarizes the results of this simulation. Panel A depicts the evolution of the system state as a function of the initial state (subpanels). Notably, after less than 10 time steps, the system is either locked in an all-off state (for initial state X_1 = (0, 0, 0)) or oscillates between states (1, 0, 0) and (0, 0, 1) (for all other initial states). Panel B depicts the corresponding evaluation of φ(X_t). Mirroring the temporal evolution of the system state, φ(X_t) takes on the value 0 for all time points for initial state X_1 = (0, 0, 0) or oscillates between 0.25 and 0 for all other initial states.
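The deterministic state evolution behind Figure 4A can be reproduced in a few lines; as above, this sketch assumes the canonical OR/AND/XOR wiring of Oizumi et al. (2014).

    from itertools import product

    def step(state):
        a, b, c = state
        return (int(b or c), int(a and c), int(a != b))   # OR, AND, XOR gates

    for x1 in product((0, 1), repeat=3):
        traj = [x1]
        for _ in range(29):               # T = 30 time points
            traj.append(step(traj[-1]))
        # The trajectory ends in the all-off fixed point or the (1,0,0)/(0,0,1)
        # cycle; phi then mirrors this, taking the values 0 and 0.25 (Figure 4B).
        print(x1, "->", traj[-2:])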

On empty conditional distributions and virtual elements

In contrast to the approach introduced by Oizumi et al. (2014), we consider the decomposition of a system at the level of the joint distribution p_{ce}(X_{t-1}, X_t) and not at the level of the effect and cause repertoires p_e(X_t | X_{t-1}) and p_c(X_{t-1} | X_t). In this section, we discuss the advantages of our formulation for computing φ and relate it to the approach discussed in Oizumi et al. (2014). To this end, we consider a fixed bipartition Π = (Π_1, Π_2) of (X_{t-1}, X_t) (cf. eq. (13)), where we omit the superscript (i) for notational convenience. We further use

Π_{t-1}^1 := X_{t-1} ∩ Π_1  and  Π_{t-1}^2 := X_{t-1} ∩ Π_2    (34)

to denote the elements of X_{t-1} that are part of the bipartition subsets Π_1 and Π_2, respectively, and

Π_t^1 := X_t ∩ Π_1  and  Π_t^2 := X_t ∩ Π_2    (35)

to denote the elements of X_t that are part of the bipartition subsets Π_1 and Π_2, respectively.

Firstly, Oizumi et al. (2014) consider the decompositions of the effect and cause repertoires in the forms

p_e^{(i)}(X_t | X_{t-1}) := p(Π_t^1 | Π_{t-1}^1) p(Π_t^2 | Π_{t-1}^2)    (36)

and

p_c^{(i)}(X_{t-1} | X_t) := p(Π_{t-1}^1 | Π_t^1) p(Π_{t-1}^2 | Π_t^2)    (37)

(see for example eq. (13) in the Supplementary Methods of Oizumi et al. (2014)). However, because depending on the bipartition any of the sets Π_t^1, Π_t^2, Π_{t-1}^1, or Π_{t-1}^2 may be empty (consider for example the case of a two-dimensional random vector X_t and a bipartition of the form Π_1 = {x_{t-1_1}} and Π_2 = {x_{t-1_2}, x_{t_1}, x_{t_2}}), this approach can result in undefined system decompositions. Our formulation eschews this, because here only non-empty distributions are multiplied (cf. eq. (12)) and conditioned on (cf. eq. (21)).

Secondly, Oizumi et al. (2014) use the (non-probabilistic) concept of "virtual elements" to induce a maximally uncertain distribution over the system state X_{t-1}. To this end, the authors define random variable sets

X_{t-1}^{V_1} := X_{t-1} \ Π_{t-1}^1  and  X_{t-1}^{V_2} := X_{t-1} \ Π_{t-1}^2    (38)


Figure 4: Evaluation of the temporal dynamics of φ in the discrete state example system. Panel A shows the evolution of the example system over T = 30 time points based on all possible states of X_1. After less than 10 time steps, the system is either locked in an all-off state (for X_1 = (0, 0, 0)) or oscillates between system states (1, 0, 0) and (0, 0, 1) (for all other states). Panel B shows the parallel evaluation of φ(X_t), again for all possible states of X_1.

and assume a uniform distribution p_u(X_{t-1}^{V_1}, X_{t-1}^{V_2}) when evaluating the cause and effect repertoires and their decompositions (see for example Figure 1 in the Supplementary Methods of Oizumi et al. (2014)). However, because

(X_{t-1}^{V_1}, X_{t-1}^{V_2}) = (Π_{t-1}^2, Π_{t-1}^1) = X_{t-1},    (39)

this assumption is equivalent to the assumption of eq. (9) in our formulation, which follows standard probabilistic reasoning.

3.2 Linear Gaussian Systems

As a further application of the general framework for the evaluation of φ, we consider the broad class of linear Gaussian systems (LGSs) (Roweis and Ghahramani, 1999). In contrast to the discrete state systems discussed above, LGSs are directly applicable to the continuous data recordings of brain imaging techniques such as EEG or fMRI. Evaluating φ in these kinds of probabilistic models thus paves the way for the computation of φ based on experimental data sets.

LGSs are basic building blocks of a wide variety of standard models in signal processing and machine learning and take the general form (cf. eq. (1) and eq. (3))

p(X_1, X_2, ..., X_T) = p(X_1) ∏_{t=2}^{T} p(X_t | X_{t-1}),    (40)

where

p(X_t | X_{t-1}) = N(X_t; A X_{t-1}, Σ_{t|t-1}).    (41)

In eq. (41), A ∈ R^{d×d} denotes the state transition matrix, which maps the state of X_{t-1} onto the conditional expectation of X_t given by µ_{t|t-1} := A X_{t-1}. Further, Σ_{t|t-1} ∈ R^{d×d} denotes the state disturbance matrix, which, embedding the assumption of conditional independence of the state random variables (cf. eq. (4)), takes the spherical form

Σ_{t|t-1} := σ²_{t|t-1} I_d,  σ²_{t|t-1} > 0.    (42)

Note that in their structural formulation,

X_t = A X_{t-1} + ε_t  with  p(ε_t) = N(ε_t; 0, Σ_{t|t-1}),    (43)

LGSs are sometimes also referred to as autoregressive processes (e.g., Tegmark, 2016).
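As an illustration of the structural form of eq. (43), the following sketch simulates an LGS trajectory; the parameter values anticipate eq. (56) below and are otherwise arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 2, 200
    A = np.array([[0.5, -1.0],
                  [0.5,  0.5]])            # state transition matrix
    Sigma = 1e2 * np.eye(d)                # spherical state disturbance matrix

    X = np.zeros((T, d))
    X[0] = np.array([1.0, 1.0])            # initial state
    for t in range(1, T):
        eps = rng.multivariate_normal(np.zeros(d), Sigma)   # eps_t ~ N(0, Sigma)
        X[t] = A @ X[t - 1] + eps          # eq. (43)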

Effect and cause repertoires

For an LGS, the effect repertoire is given by

p_e(X_t | X_{t-1}) = N(X_t; A X_{t-1}, Σ_{t|t-1})    (44)

and is thus fully specified in terms of the system's transition and disturbance matrices. Because the concept of a uniform distribution is not meaningful in the context of continuous state variables, the specification of the cause repertoire in an LGS is more intricate. A standard option from the theory of Bayesian statistics is to use a zero-centred Gaussian with high variance for the uncertain distribution, i.e. to set

p_u(X_{t-1}) := N(X_{t-1}; 0, Σ_{t-1})    (45)

with Σ_{t-1} := σ²_{t-1} I_d and large σ²_{t-1} (Broemeling, 1984; Gelman et al., 2014). Based on the theory of multivariate Gaussian distributions, the cause repertoire under this assumption evaluates to

p_c(X_{t-1} | X_t) = N(X_{t-1}; µ_{t-1|t}, Σ_{t-1|t}),    (46)

where

µ_{t-1|t} = Σ_{t-1|t} (A^T Σ_{t|t-1}^{-1} X_t)  and  Σ_{t-1|t} = (Σ_{t-1}^{-1} + A^T Σ_{t|t-1}^{-1} A)^{-1}.    (47)

For a summary of these results from the theory of multivariate Gaussian distributions, please refer to Appendix C. In the limit of σ²_{t-1} → ∞, and thus maximal uncertainty for the distribution p_u(X_{t-1}), eq. (47) simplifies to

µ_{t-1|t} = Σ_{t-1|t} (A^T Σ_{t|t-1}^{-1} X_t)  and  Σ_{t-1|t} = (A^T Σ_{t|t-1}^{-1} A)^{-1}.    (48)

In other words, one may disregard Σ_{t-1}^{-1} in eq. (47), because for very large σ²_{t-1} the precision Σ_{t-1}^{-1} approaches zero (e.g., Friston et al., 2002). Note, however, that depending on the form of A, the covariance matrix Σ_{t-1|t} in (48) may be undefined, resulting in an undefined value of φ. One of these cases is an all-zero matrix A, which is intuitive because there are no possible system transitions at all (and thus one might say that "there is no system to begin with" (Oizumi et al., 2014)).
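Under eq. (48), the cause repertoire parameters can be evaluated directly; a minimal sketch (function name ours) is:

    import numpy as np

    def gaussian_cause_repertoire(A, Sigma_cond, X):
        """mu and Sigma of p_c(X_{t-1} | X_t = X) in the flat-prior limit, cf. eq. (48)."""
        P = A.T @ np.linalg.inv(Sigma_cond)   # A^T Sigma_{t|t-1}^{-1}
        Sigma_post = np.linalg.inv(P @ A)     # undefined if A is singular, e.g. A = 0
        mu_post = Sigma_post @ (P @ X)
        return mu_post, Sigma_post

    A = np.array([[0.5, -1.0], [0.5, 0.5]])
    mu, Sigma = gaussian_cause_repertoire(A, 1e2 * np.eye(2), np.array([1.0, 1.0]))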

System decomposition

To compute φ for a system state X ∈ R^d, the system under consideration has to be decomposed as discussed in Section 2 Computing φ : Theory. For an LGS, the crucial model component that induces stochastic dependencies between the system state vectors X_{t-1} and X_t is the system transition matrix A. In the following, we discuss how the system transition matrix has to be manipulated in order to construct the set of factorized probability distributions p_{ce}^{(i)}(X_{t-1}, X_t) (cf. eq. (16)). As described above, using the standard theory of multivariate Gaussians, this set of distributions can then directly be employed to compute the decomposed effect and cause repertoires.

We first note that for Gaussian random variables, uncorrelatedness (zero covariance) is equivalent to stochastic independence (Lauritzen, 1996). For LGSs, the covariances between the constituents of the random vectors X_{t-1} and X_t are encoded in the off-diagonal block matrices of the vectors' joint distribution covariance matrix (cf. Appendix C), i.e.

cov(X_{t-1}, X_t) = A Σ_{t-1} ∈ R^{d×d}.    (49)

Setting elements of the covariance matrix cov(X_{t-1}, X_t) to zero thus corresponds to factorizing the joint distribution p_{ce}(X_{t-1}, X_t). However, to compute φ, a well-defined mapping from the desired factorization (corresponding to a bipartition of the set (X_{t-1}, X_t)) to the form of the ensuing decomposed transition matrix A^{(i)} is required. We next discuss a constructive scheme that achieves this.

Let n := 2d. We first note that by enumerating each random variable in (X_{t-1}, X_t) from 1 to n, each bipartition Π^{(i)} = (Π_1^{(i)}, Π_2^{(i)}) of (X_{t-1}, X_t) can equivalently be represented by a set of two index sets

I^{(i)} := (I_1^{(i)}, I_2^{(i)}),  i = 1, ..., 2^{n-1} - 1,    (50)

where I_1^{(i)}, I_2^{(i)} ⊂ N_n, and I_1^{(i)} and I_2^{(i)} comprise the indices of the elements in Π_1^{(i)} and Π_2^{(i)}, respectively. With the properties of bipartitions, it then follows immediately that

I_1^{(i)} ≠ ∅,  I_2^{(i)} ≠ ∅,  I_1^{(i)} ∩ I_2^{(i)} = ∅,  and  I_1^{(i)} ∪ I_2^{(i)} = N_n.    (51)

Let I_n denote the set of all possible bipartition index sets I^{(i)}. Using this index set representation of the bipartitions defined in eq. (13), we then define the mapping from bipartitions to system transition matrix variants by setting

f : I_n → R^{n×n},  Π^{(i)} ↦ f(Π^{(i)}) := cov(X_{t-1}, X_t)^{(i)},    (52)

where, with

A^{(i)} := v_{I_1^{(i)}} v_{I_1^{(i)}}^T + (1_d − v_{I_1^{(i)}})(1_d − v_{I_2^{(i)}})^T    (53)

and v_{I_m^{(i)}} ∈ R^d for m = 1, 2 given by

(v_{I_m^{(i)}})_j := 1 if j ∈ I_m^{(i)} and 0 if j ∉ I_m^{(i)}, for 1 ≤ j ≤ d,    (54)

we instantiate the decomposed system covariance matrix by setting

cov(X_t, X_{t-1})^{(i)} := A^{(i)} ◦ Σ_{t-1},    (55)

with ◦ denoting the Hadamard matrix product. Note that in the case of an undefined Σ_{t-1|t}, the necessary system decomposition is still well-defined in terms of the parameters of the cause repertoire (cf. eq. (47)). To illustrate the mapping of eqs. (52) to (55), we visualize all ensuing covariance structures cov(X_{t-1}, X_t)^{(i)} for each bipartition of (X_{t-1}, X_t) in a three-dimensional LGS in Figure S5. Note that the components of X_{t-1} are indexed by 1, 2, 3 and the components of X_t are indexed by 4, 5, 6 (cf. eq. (50)). Black tiles indicate zero elements and white tiles indicate arbitrary elements of cov(X_{t-1}, X_t)^{(i)}.

Computation of φ

Based on the above, we are now in the position to evaluate φ for exemplary LGSs. We here limit ourselves to some exploratory simulations in the class of two-dimensional (planar) LGSs (e.g., Meiss, 2007; Hirsch et al., 2012; Arnold, 2013). While the exhaustive evaluation of φ over state space is no longer feasible, we can compute φ for exemplary state trajectories. To this end, we consider planar LGSs and study how exemplary parameter changes and the associated variations of the state trajectories qualitatively affect φ. The simulations are visualized in Figure 5. Each of the six subpanels A-F instantiates a different parameter setting. Within each subpanel, the left panel visualizes the system evolution in R² for t = 1, ..., 200 time steps. The upper right two panels visualize the axes projections of the system state, reminiscent of empirical data recording channels (e.g. EEG electrodes or fMRI voxel time-courses). The lowermost panel depicts φ as a function of the temporal evolution of the system state.

Figure 5: Evaluation of φ in exemplary linear Gaussian systems. Within each figure panel A-F, the left panel visualizes the system evolution in its two-dimensional state space, while the right side displays the trajectory of each system component and the evolution of φ. Panel A shows an exemplary LGS, defined by the parameters of (56). Increasing the variance in the disturbance matrix by some orders of magnitude results in a scaling effect of the system trajectories (B), without a considerable effect on φ. In contrast, altering the transition matrix has pronounced effects on φ. Setting the diagonal (C) or off-diagonal (D) elements of A in (56) to zero results in markedly different state trajectories and in φ = 0 for all observed states. Panels E and F demonstrate that the system parameters may also interact. Here, the state transition matrix of (57) increases the sensitivity of φ to alterations of the disturbance and precision matrices, as shown in Panel F, where the precision is decreased by four orders of magnitude.

Panel A of Figure 5 visualizes the state evolution of a system with state transition, state disturbance, and marginal precision matrices

A := [0.5 −1; 0.5 0.5],  Σ_{t|t-1} := [10² 0; 0 10²],  and  Σ_{t-1}^{-1} := [10^{-1} 0; 0 10^{-2}],    (56)

respectively, with initial state X_0 := (1, 1)^T. Changing the state disturbance matrix by six orders of magnitude to Σ_{t|t-1} := 10⁸ I_2 only affects the scale of the trajectory, but does not affect φ, as shown in Figure 5B. Likewise, for this system, different choices of the marginal precision matrix leave φ unchanged (data not shown). Naturally, changing the state transition matrix has profound effects on both the system state trajectory and the observed values of φ. In Figures 5C and 5D, a system with parameters as in (56) is depicted, where the diagonal and off-diagonal entries of the state transition matrix A were set to zero, respectively, resulting in the vanishing of φ over the simulation interval. Finally, the state transition and marginal precision parameters may also interact in their effect on φ: Panels E and F of Figure 5 visualize simulations of a planar system with state transition and state disturbance matrices

A := [1 1; −1 1]  and  Σ_{t|t-1} := [10² 0; 0 10²].    (57)

Changing the marginal precision matrix from Σ_{t-1}^{-1} := 10^{-2} I_2 (Figure 5E) to Σ_{t-1}^{-1} := 10^{-6} I_2 (Figure 5F) in this system has quantitative effects on φ, reducing its peaks to a plateau-like behaviour while retaining its qualitative properties.

In summary, based on the above formulation, φ can be evaluated in LGSs and, as demonstrated by simulations, shows a rich temporal diversity and exhibits a dependence on the LGS parameter settings. In the further development of IIT, the class of planar LGSs may form a useful analytical test bed for the qualitative and quantitative dependence of φ on a system's parameterization, because, at least in the deterministic scenario, planar dynamical systems are well understood (e.g., Meiss, 2007; Hirsch et al., 2012).

4 Discussion

In the present work, we have developed a general probabilistic formulation of integrated information φ, starting from the most recent instantiation of IIT (Oizumi et al., 2014). All necessary mathematical operations in the derivation of φ are based on a system's joint distribution p_{ce}(X_{t-1}, X_t) over two adjacent points in time. We presented a constructive rule for the decomposition of a system into two subsystems, which corresponds to a flexible factorization of this joint distribution. Our formulation is readily applied to non-parametric discrete state systems, as validated in the exemplary system from IIT 3.0, in which we also showed that φ mirrors the state evolution over time. As a consequence of our formulation, undefined conditional distributions are eschewed and virtual elements are not necessary. Furthermore, to make progress towards the computation of φ based on experimental data, we applied our formulation to continuous variables in the versatile model class of LGSs. For the Gaussian case, we approached the problem of a maximum uncertainty distribution over past states by defining a prior distribution with near-zero precision, in line with standard Bayesian strategies. We showed that, for LGSs, the system decomposition corresponds to mapping all possible partitions to covariance matrix forms of system states at adjacent time points. Finally, we explored the qualitative effect of different LGS parameters on integrated information and evaluated the evolution of system states and φ over time, revealing non-trivial temporal dynamics. In summary, our formulation is, at least in principle, general enough to be applicable to different model classes and to investigate the dynamics of these models. In the following, we start by discussing several related studies and then turn to some open questions regarding the computation of integrated information.

The theoretical development of IIT has so far mainly rested on discrete state systems (Tononi, 2004, 2008; Balduzzi and Tononi, 2008, 2009; Tononi, 2012; Oizumi et al., 2014). However, this model class is not suitable in most empirical settings, because neuroimaging and other functional data are usually not limited to discrete states. In consequence, a number of efforts have recently been targeted at the evaluation of φ for continuous state systems (Barrett and Seth, 2011; Oizumi et al., 2016; Mediano et al., 2016; Tegmark, 2016).

Barrett and Seth (2011) were among the first in this regard and propose a measure of integrated information that does allow for the evaluation of integration in time-series data. The main difference to the formulation presented here is that Barrett and Seth focus on an earlier version of the theory, integrated information theory 2.0 (Balduzzi and Tononi, 2008), which is based on entropy-related measures, while we start explicitly from the most recent instantiation of IIT. As has been pointed out by Tegmark (2016) and Oizumi et al. (2016), however, the integrated information measure proposed by Barrett and Seth (2011) can become negative. This somewhat complicates its interpretation, although the interesting question has been raised of whether "negative integration" could reflect redundancy in a system (Barrett, 2015). Note that φ as presented herein is bounded below by zero, because both the EMD and the KL divergence cannot be negative (Levina and Bickel, 2001; Cover and Thomas, 2012). Introducing theoretical requirements for the lower and upper bounds of φ was one of the main motivations for recent work by Oizumi et al. (2016), who also evaluate their measure of information integration in monkey electrocorticogram data. The major differences to our work are that the authors develop their measure based on IIT 2.0 and that it involves "atomic partitioning", which we discuss in more detail below.

An impressive collection of different mathematical options to measure information integration was presented recently by Tegmark (2016). Similarities between our approach and that of Tegmark (2016) include the application of φ to Markov process system models and the ensuing construction of cause and effect repertoires. We also use the Kullback-Leibler divergence for the Gaussian case, because the EMD is computationally infeasible for continuous variables. We differ, however, in the constructive algorithm for the system decomposition and in that we do define φ as presented in IIT 3.0 for the continuous case by using a (near-)zero precision distribution over past states. Importantly, we consider all possible partitions in the system decomposition, both symmetrical and asymmetrical, which we discuss in more detail below.

In summary, while there is recent interest in the evaluation of integrated information based on experimental data, so far there has been no convergence on a gold-standard measure. This is of course closely related to the fact that the theoretical framework of integrated information itself, and specifically its application to consciousness research, remains under development. For the remainder of the discussion, we focus on three aspects that in our opinion leave room for further consolidation of the theory.

The normalization problem

Some concerns have been raised regarding the necessity of normalization in IIT. First of all, it is useful to distinguish what needs to be normalized. The Scholarpedia entry for IIT introduces a normalization constant for the cause repertoires, i.e. p_c(X_{t-1} | X_t), in order to ensure that "the repertoire sums to one" (Tononi, 2015). A similar normalization constant is not introduced for the effect repertoire p_e(X_t | X_{t-1}). However, it is the very nature of conditional probability distributions that they sum to one, so the question is why normalization for p_c(X_{t-1} | X_t) is required in the first place. In our formulation, normalization is not necessary to ensure unity. However, normalization may become necessary if one introduces conditional distributions over the empty set of variables, as detailed above. Essentially, this corresponds to the case distinction of whether the marginal distribution for conditioning factorizes or not (cf. eq. (21)). Whether this is the case in turn depends on the particular system decomposition, i.e. whether or not there are variables for conditioning in both subsets Π_1^{(i)} and Π_2^{(i)} according to bipartition i.

A second question is whether φ itself needs to be normalized. The rationale behind this is that φ is thought to become small whenever one of the subsets in a bipartition is small. Since φ is defined over the factorization that makes the least difference compared to the unfactorized system (the minimum information partition), φ may be biased towards asymmetrical partitions. Indeed, in previous formulations of IIT, this issue was explicitly addressed by introducing a normalization factor favouring balanced partitions (Balduzzi and Tononi, 2008), while IIT 3.0 does not account for this. In fact, Tononi states in his response to Aaronson that "normalization is not necessary in IIT 3.0" (Tononi, 2014). However, it remains unclear why this should be the case. Since the normalization issue is tightly linked to the partition problem, we treat the latter in detail below.

The partition problem

The partition problem is of interest both on theoretical grounds and in the attempt to make the calculation of φ feasible for experimental data. As shown in Appendix B, the number of unique bipartitions of any set with cardinality n is given by an identity of the Stirling number of the second kind. The problem is that this number grows exponentially with n, thus seriously compromising computational tractability for sets containing more than about 15-20 elements. There have been several approaches to this problem. Tegmark (2016) limits himself to symmetrical bipartitions. While this has the advantage of excluding uneven partitions by definition, thus circumventing the issue of normalization, most partitions in a large set of elements will be symmetrical (e.g., Figure 6). This approach thus concedes generality but does not gain much computational tractability. It is furthermore unclear how this will fare in networks containing an odd number of elements, as these have no strictly symmetrical bipartitions. In the study discussed above, Oizumi et al. (2016) have used atomic partitions, i.e., the complete factorization of conditional distributions over all variables, to calculate a modified measure of integrated information in primate electrocorticogram data. The authors concede that atomic partitioning, and to a lesser extent also symmetrical partitioning, will overestimate information integration, because it tends to maximize rather than minimize the informational difference the partition makes to the system. Thus, the question is which partitioning approach is best when the theoretical analysis of all bipartitions is no longer feasible. Our formulation does take all theoretically possible partitions into account and is applicable to continuous data, but suffers from the same limitations of combinatorial explosion as previous endeavours. While we do not yet offer a definite remedy to this problem, there are a couple of outlooks that may be worth discussing:

First, consider Figure 6, in which we depict the subset sizes of all bipartitions of a set with cardinality 18. The bulk of the k = 131071 possible bipartitions comprises 24310 symmetrical bipartitions (i.e. |Π_{1/2}| = 9), while there are only 18 maximally asymmetrical bipartitions (|Π_{1/2}| = 1) and 153 submaximally asymmetrical bipartitions (|Π_{1/2}| = 2). If one is willing to accept the current definition of φ as the (unnormalized) minimum over all bipartitions, we may make a virtue of necessity and limit ourselves not to symmetrical bipartitions but rather to more asymmetrical bipartitions, as they will tend to result in the minimum information partition. We could simply threshold the cardinality of the subsets we wish to consider, thus governing how many of the more symmetrical partitions to take into account as well. Note that maximally asymmetrical partitions also reduce computations to linear time. As opposed to atomic partitions, they are bipartitions, however, and will tend to under- rather than overestimate integration.

However, such an approach raises the question to what extent the definition of φ then provides additional information compared to the information that is contained in a system's transition probability distribution. As an example, consider the definition of φ_e. φ_e is defined as the minimum divergence between the distribution p_e(X_t | X_{t-1} = X) of the original system and the distributions p_e^{(i)}(X_t | X_{t-1} = X) upon removal of possible stochastic dependencies of the original system. Naturally, this divergence becomes minimal if these distributions are most similar, which in turn is the case if the transition probability distribution of the original system and the transition probability distribution of the decomposed system variant are most similar. In other words, the minimization procedure primarily identifies the transition probability distribution of the original system. However, this was assumed to be known from the beginning. It may thus be more straightforward to define a measure of information integration directly on the (estimated) transition probability distribution of a system, rather than having to evaluate all possible decompositions of a system only to identify the original system transition probability distribution. This could also alleviate the problem of having to define a maximum uncertainty distribution over past states in the Gaussian case, for which we could simply take the marginal distribution. Note that a similar observation regarding the minimization procedure has been raised by Tegmark (2016).

A second approach is to use our formulation to start evaluating φ on a macroscopic scale, i.e. over merely a few brain regions of interest containing lots of neural elements, thus circumventing combinatorial explosion by scaling. Indeed, such large-scale "hot zones" for the neural correlates of consciousness have been identified in posterior cortical zones in recent years (e.g. Koch et al., 2016). While this disregards the recently developed concept of causal emergence on different spatio-temporal scales (Hoel et al., 2013; Tononi et al., 2016), it is a start, and similar approaches are successfully used in dynamic causal modelling (Friston et al., 2003) as well as in graph-theory and mean-field based measures (Deco et al., 2015) that try to capture the behaviour of large-scale brain networks.

A third approach could be to estimate which partitions are likely to make a great difference to the original system and then discard these. For example, in a complex system, one could analyse the network structure based on graph-theoretical measures and then reduce the number of partitions that have to be evaluated by discarding all partitions that cut through the connections (i.e., introduce stochastic independence) of a hub node. Since hub nodes are by definition strongly connected and can thus be assumed to "make a difference to the system" (Oizumi et al., 2014), such an approach could both substantially reduce the computational load (because many partitions would affect the hub) and serve as a theoretically and biologically plausible approximation, as sketched below.
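As a hypothetical illustration of this idea, the following sketch identifies a hub by node degree in an arbitrary example network and discards all bipartitions that sever one of the hub's edges; the adjacency matrix and the "hub = highest-degree node" criterion are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

# Hypothetical sketch: prune the partition search space by discarding all
# bipartitions that sever an edge of a hub node. The 6-node adjacency
# matrix below is an arbitrary illustrative example.
adj = np.array([[0, 1, 1, 1, 0, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 0, 0, 0, 1, 0],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0]])
n = adj.shape[0]
hub = int(np.argmax(adj.sum(axis=0)))  # node 0 in this example

def severs_hub_edge(side_a):
    """True if the bipartition (side_a, complement) cuts an edge at the hub."""
    return any(adj[hub, j] and ((hub in side_a) != (j in side_a))
               for j in range(n))

kept = []
for c in range(1, n // 2 + 1):
    for subset in combinations(range(n), c):
        if 2 * c == n and 0 not in subset:
            continue  # skip the mirror duplicate of a symmetric split
        if not severs_hub_edge(set(subset)):
            kept.append(set(subset))

print(f"{len(kept)} of {2 ** (n - 1) - 1} bipartitions spare the hub")  # 3 of 31
```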

[Figure 6 here; x-axis: subset cardinality (0–18), y-axis: occurrences (×10^4), annotation: k = 131071]

Figure 6: Frequency histogram of the subset sizes of all possible bipartitions of a set with cardinality 18. The distribution is completely symmetrical (each subset with cardinality 1 corresponds to another with cardinality 17, as they belong to the same partition, and so on until |Π_{1/2}| = 9). Symmetrical bipartitions make up the majority of all possible partitions, while the number of maximally asymmetrical partitions grows only linearly with n.
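The counts shown in Figure 6 follow directly from binomial coefficients; the small sketch below (assuming the combinatorial argument of Appendix B) reproduces them:

```python
from math import comb

# Sketch reproducing the counts behind Figure 6: for n = 18 elements, the
# number of bipartitions whose smaller subset has cardinality c is C(18, c)
# for c < 9 and C(18, 9) / 2 for the symmetric case c = 9.
n = 18
counts = {c: comb(n, c) for c in range(1, n // 2)}
counts[n // 2] = comb(n, n // 2) // 2

print(counts[1], counts[2], counts[9])  # 18, 153, 24310
print(sum(counts.values()))             # 131071 = 2**17 - 1
```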

State-dependency and temporal evolution of φ

Finally, we consider the state-dependency and the temporal dynamics of φ. Recent work by Virmani and Nagaraj (2016) aims at bridging the gap between IIT and the perturbational complexity index, a practical measure based on TMS stimulation with simultaneous EEG recordings that has been successfully applied in the clinical quantification of consciousness (Casali et al., 2013). To this end, the authors derive a measure of compression-complexity which is calculated by a maximum entropy perturbation of each node in an atomic partition. While this is certainly a promising approach, their measure shows minimal dependency on the current system state, similar to the measure proposed by Barrett and Seth (2011). Based on empirical studies (Koch et al., 2016), however, it seems likely that any measure that quantifies consciousness should indeed be state-dependent (cf. Tegmark, 2016). On theoretical grounds, state-independence would also violate the selectivity postulate that IIT proposes as a prerequisite for information (e.g., Figure 3 in Oizumi et al., 2014). Moreover, insensitivity to system states raises the question of whether a measure in fact represents integration or rather the capacity to integrate (Barrett and Seth, 2011). In the present work, state-dependency is preserved because the conditional distributions over X_{t−1} and X_{t+1} are always evaluated for X_t = X (cf. eqs. (22) and (23)). Furthermore, since it is the state at time t that defines the cause and effect repertoires, the decomposition of which in turn defines φ, it is not surprising that φ essentially mirrors the state evolution in simple discrete state systems like the one evaluated herein. In the exemplary linear Gaussian system, the temporal dynamics of φ are more complex. Here, φ shows a rich diversity over the oscillatory state trajectories of the system components, and it will be interesting to investigate under which circumstances φ assumes high values in such systems. In this regard, a recent study by Mediano et al. (2016) has related integrated information to metastability in a system of coupled Kuramoto oscillators (Kuramoto, 2012). Investigating the temporal dynamics of φ in such systems appears to us as a promising approach, since metastability has been suggested as a potential mechanistic explanation for the emergence of large-scale functional brain networks (Deco et al., 2015) and because it offers a potential link to the dynamic causal modelling framework, which has long been extended to the analysis of phase-coupled data (Penny et al., 2009).
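As a starting point for such investigations, the following exploratory sketch simulates a small Kuramoto system and tracks the temporal variance of the order parameter, a common proxy for metastability (cf. Mediano et al., 2016); the coupling strength, step size, and frequency spread are illustrative assumptions:

```python
import numpy as np

# Exploratory sketch: Euler integration of n coupled Kuramoto oscillators,
# dtheta_i/dt = omega_i + (K/n) * sum_j sin(theta_j - theta_i),
# tracking the order parameter R(t) = |mean_j exp(i * theta_j)|.
rng = np.random.default_rng(0)
n, K, dt, steps = 16, 1.5, 0.01, 20000
omega = rng.normal(1.0, 0.5, n)        # natural frequencies
theta = rng.uniform(0, 2 * np.pi, n)   # initial phases

R = np.empty(steps)
for t in range(steps):
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    theta = theta + dt * (omega + (K / n) * coupling)
    R[t] = np.abs(np.exp(1j * theta).mean())  # Kuramoto order parameter

print(f"mean synchrony {R.mean():.2f}, metastability index {R.var():.4f}")
```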

Conclusion

To conclude, with the general probabilistic formulation of φ presented herein, we hope to make a contribution to the tractability, parsimony, and generality of integrated information theory. Our formulation is readily applicable to discrete state systems and generalizes the application of integrated information theory to different probabilistic model classes, such as linear Gaussian systems, thereby enhancing the potential transfer of integrated information theory to the context of functional neuroimaging data analysis. We thus consider our approach a step towards the evaluation of integrated information from experimental data, with the prospect of investigating its temporal dynamics, and as a basis for the further development of the theory of integrated information in an explicit probabilistic framework.


Appendices

Appendix A

The re-conceptualization of integrated information theory as "integrated information theory 3.0" by Oizumi et al. (2014) has introduced various nuances and refinements to measuring integrated information. This has resulted in a variety of "phi" measures, which we summarize here and relate to the measure φ as discussed in the main text.

In Oizumi et al. (2014), ϕ denotes the minimum earth mover's distance (EMD) (Levina and Bickel, 2001; Mallows, 1972) over all possible bipartitions of a system, referred to as the "minimum information partition" (MIP). It is thus also denoted by ϕ^{MIP}. Since we are exclusively interested in the MIP, we omit the superscript henceforth. In the past temporal direction, ϕ corresponds to the minimum EMD over all factorizations of the "cause repertoire" and thus reflects "integrated cause information", denoted by ϕ_{cause}. Similarly, the minimum EMD over all factorizations of the "effect repertoire" represents integrated effect information, denoted by ϕ_{effect}. The minimum of ϕ_{cause} and ϕ_{effect} is denoted by ϕ_{cause−effect} in Oizumi et al. (2014) and corresponds to the measure we refer to as φ. Note that we chose this variant notation for unique reference to our formulation.
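For concreteness, the EMD between two discrete distributions can be posed as a linear transport problem (cf. Levina and Bickel, 2001). The following sketch solves it with scipy; the two-element binary state space and the Hamming ground metric are illustrative assumptions rather than a definitive implementation:

```python
import numpy as np
from scipy.optimize import linprog

# Minimal sketch of the earth mover's distance between two distributions
# p and q over the four states of a two-element binary system, with the
# Hamming distance between states as the ground metric.
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
D = (states[:, None, :] != states[None, :, :]).sum(-1)  # ground metric

def emd(p, q, D):
    """Solve min sum_ij f_ij D_ij s.t. row sums = p, column sums = q, f >= 0."""
    m = len(p)
    A_eq, b_eq = [], []
    for i in range(m):                  # row marginals: sum_j f_ij = p_i
        row = np.zeros((m, m)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(p[i])
    for j in range(m):                  # column marginals: sum_i f_ij = q_j
        col = np.zeros((m, m)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(q[j])
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(emd(p, q, D))  # 0.5: half the excess mass moves by one bit-flip each
```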

In addition to φ, Oizumi et al. (2014) introduce a number of additional "phi" measures, which we list here for completeness. Firstly, ϕ^{max}_{cause} and ϕ^{max}_{effect} are the respective maxima over all ϕ_{cause} and ϕ_{effect}, which are assessed over the powerset of all possible conditional distributions of a set of elements (see, for example, Figure 8 in Oizumi et al. (2014)). Together, these define ϕ^{max} as the minimum of ϕ^{max}_{cause} and ϕ^{max}_{effect}. Secondly, Φ (or, more precisely, Φ^{MIP}) is a weighted sum of all EMDs corresponding to ϕ^{max}_{cause} and ϕ^{max}_{effect} compared to the "unconstrained past and future repertoires". In Oizumi et al. (2014), this measure is also referred to as "integrated conceptual information". Finally, Φ^{max} is the maximum of all Φ^{MIP} over the powerset of all original sets to be considered. In a network containing n nodes, the greatest value for Φ^{MIP} may thus be found over a subset with cardinality c < n.

Appendix B

Finding the number of partitions of a set containing a finite number of elements is a well-known problem in combinatorics (Cameron, 1994). The number of all bipartitions of a finite set can be evaluated by the Stirling number of the 2nd kind,

S_{n,m} = \frac{1}{m!} \sum_{i=0}^{m} (-1)^{m-i} \binom{m}{i} i^n,   (58)

which counts the number of possible partitions of a set of cardinality n ∈ ℕ into m ∈ ℕ non-empty disjoint subsets. For the purposes of IIT, bipartitions are required. Thus, m = 2 and we have

S(n, 2) = \frac{1}{2!} \sum_{i=0}^{2} (-1)^{2-i} \binom{2}{i} i^n
        = \frac{1}{2} \left( (-1)^{2} \binom{2}{0} 0^n + (-1)^{1} \binom{2}{1} 1^n + (-1)^{0} \binom{2}{2} 2^n \right)
        = \frac{1}{2} \left( 0 - 2 + 2^n \right)
        = 2^{n-1} - 1.   (59)

In the main text, we use the notation k := 2^{2d−1} − 1 for a set of cardinality n = 2d.
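The closed form can be checked by brute force; the small sketch below (our own verification, not part of the derivation) enumerates all bipartitions directly:

```python
from itertools import product

# Sketch verifying eq. (59): count bipartitions of an n-set by brute force
# and compare against the closed form 2**(n - 1) - 1.
def count_bipartitions(n):
    seen = set()
    for labels in product((0, 1), repeat=n):  # assign each element to a block
        if 0 < sum(labels) < n:               # both blocks non-empty
            comp = tuple(1 - l for l in labels)
            seen.add(min(labels, comp))       # identify a labelling with its mirror
    return len(seen)

for n in range(2, 10):
    assert count_bipartitions(n) == 2 ** (n - 1) - 1
print("closed form confirmed for n = 2..9")
```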

Appendix C

In this Appendix, we collect a number of well-known results from the theory of multivariate Gaussian distributions. We state these results in theorem form. For proofs, see Bishop (2006), Barber (2012), and Murphy (2012).

Theorem: Gaussian joint, conditional, and marginal distributions.

Given a Gaussian marginal density

p(z) = N(z; \mu_z, \Sigma_z), where z, \mu_z \in \mathbb{R}^p, \Sigma_z \in \mathbb{R}^{p \times p},   (60)

and a Gaussian conditional density

p(y|z) = N(y; Az, \Sigma_{y|z}), where y \in \mathbb{R}^n, A \in \mathbb{R}^{n \times p}, \Sigma_{y|z} \in \mathbb{R}^{n \times n},   (61)

the joint density of y and z,

p(y, z) = p(y|z) p(z),   (62)

is given by

p(y, z) = N\left( \begin{pmatrix} y \\ z \end{pmatrix}; \mu_{y,z}, \Sigma_{y,z} \right), with (y, z)^T, \mu_{y,z} \in \mathbb{R}^{n+p}, \Sigma_{y,z} \in \mathbb{R}^{(n+p) \times (n+p)},   (63)

where

\mu_{y,z} = \begin{pmatrix} A\mu_z \\ \mu_z \end{pmatrix} and \Sigma_{y,z} = \begin{pmatrix} \Sigma_{y|z} + A \Sigma_z A^T & A \Sigma_z \\ \Sigma_z A^T & \Sigma_z \end{pmatrix}.   (64)

Further, the conditional distribution

p(z|y) = \frac{p(y, z)}{p(y)}   (65)

is given by the Gaussian probability density

p(z|y) = N(z; \mu_{z|y}, \Sigma_{z|y}),   (66)

where

\mu_{z|y} = \Sigma_{z|y} (\Sigma_z^{-1} \mu_z + A^T \Sigma_{y|z}^{-1} y) \in \mathbb{R}^p and \Sigma_{z|y} = (\Sigma_z^{-1} + A^T \Sigma_{y|z}^{-1} A)^{-1} \in \mathbb{R}^{p \times p},   (67)

and the marginal distribution

p(y) = \int p(y, z) \, dz   (68)

is given by the Gaussian probability density

p(y) = N(y; \mu_y, \Sigma_y),   (69)

where

\mu_y = A \mu_z \in \mathbb{R}^n and \Sigma_y = \Sigma_{y|z} + A \Sigma_z A^T \in \mathbb{R}^{n \times n}.   (70)
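These identities are easily checked numerically; the following sketch (with arbitrary illustrative values for A, μ_z, Σ_z, and Σ_{y|z}) verifies the marginal moments of eq. (70) by Monte Carlo sampling:

```python
import numpy as np

# Sketch checking eq. (70) by Monte Carlo: sample z ~ N(mu_z, Sigma_z),
# then y | z ~ N(Az, Sigma_y_z), and compare the empirical moments of y
# against mu_y = A mu_z and Sigma_y = Sigma_y|z + A Sigma_z A^T.
rng = np.random.default_rng(1)
p_dim, n_dim, N = 2, 3, 200000
mu_z = np.array([1.0, -0.5])
Sigma_z = np.array([[1.0, 0.3], [0.3, 0.5]])
A = rng.normal(size=(n_dim, p_dim))
Sigma_y_z = np.diag([0.2, 0.1, 0.3])

z = rng.multivariate_normal(mu_z, Sigma_z, size=N)
y = z @ A.T + rng.multivariate_normal(np.zeros(n_dim), Sigma_y_z, size=N)

print(np.allclose(y.mean(0), A @ mu_z, atol=0.05))                    # True
print(np.allclose(np.cov(y.T), Sigma_y_z + A @ Sigma_z @ A.T, atol=0.05))  # True
```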

References

Aaronson, S. (2014). Why I am not an integrated information theorist (or, the unconscious expander).

Arnold, L. (2013). Random Dynamical Systems. Springer Science & Business Media.

Balduzzi, D. and Tononi, G. (2008). Integrated information in discrete dynamical systems: Motivation and theoretical framework. PLoS Comput Biol, 4(6):e1000091.

Balduzzi, D. and Tononi, G. (2009). Qualia: the geometry of integrated information. PLoS Comput Biol, 5(8):e1000462.

Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

Barrett, A. B. (2015). Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E, 91:052802.

Barrett, A. B. and Seth, A. K. (2011). Practical measures of integrated information for time-series data. PLoS Comput Biol, 7(1):e1001052.

Billingsley, P. (2008). Probability and Measure. John Wiley & Sons.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Broemeling, L. D. (1984). Bayesian Analysis of Linear Models. Statistics: A Series of Textbooks and Monographs. Taylor & Francis.

Cameron, P. J. (1994). Combinatorics: Topics, Techniques, Algorithms. Cambridge University Press.

Casali, A. G., Gosseries, O., Rosanova, M., Boly, M., Sarasso, S., Casali, K. R., Casarotto, S., Bruno, M.-A., Laureys, S., Tononi, G., and Massimini, M. (2013). A theoretically based index of consciousness independent of sensory processing and behavior. Science Translational Medicine, 5(198):198ra105.

Cerullo, M. A. (2015). The problem with phi: A critique of integrated information theory. PLoS Comput Biol, 11(9):e1004286.

Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.

Cox, D. and Miller, H. (1977). The Theory of Stochastic Processes. Science Paperbacks. Taylor & Francis.

Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–31.

Deco, G., Tononi, G., Boly, M., and Kringelbach, M. L. (2015). Rethinking segregation and integration: contributions of whole-brain modelling. Nature Reviews Neuroscience, 16(7):430–439.

Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference, volume 5. Cambridge University Press.

Friston, K., Glaser, D., Henson, R. N. A., Kiebel, S., Phillips, C., and Ashburner, J. (2002). Classical and Bayesian inference in neuroimaging: applications. Neuroimage, 16(2):484–512.

Friston, K., Harrison, L., and Penny, W. (2003). Dynamic causal modelling. Neuroimage, 19(4):1273–1302.

Geiger, D., Verma, T., and Pearl, J. (1990). Identifying independence in Bayesian networks. Networks, 20(5):507–534.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2014). Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA.

Hirsch, M. W., Smale, S., and Devaney, R. L. (2012). Differential Equations, Dynamical Systems, and an Introduction to Chaos. Academic Press.

Hoel, E. P., Albantakis, L., and Tononi, G. (2013). Quantifying causal emergence shows that macro can beat micro. Proceedings of the National Academy of Sciences, 110(49):19790–19795.

Jordan, M. I. (1998). Learning in Graphical Models, volume 89. Springer Science & Business Media.

Koch, C., Massimini, M., Boly, M., and Tononi, G. (2016). Neural correlates of consciousness: progress and problems. Nat Rev Neurosci, 17(5):307–321.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Kuramoto, Y. (2012). Chemical Oscillations, Waves, and Turbulence, volume 19. Springer Science & Business Media.

Lauritzen, S. L. (1996). Graphical Models, volume 17. Clarendon Press.

Levina, E. and Bickel, P. (2001). The earth mover's distance is the Mallows distance: some insights from statistics. In Proc. Eighth IEEE Int. Conf. Computer Vision (ICCV 2001), volume 2, pages 251–256.

Mallows, C. (1972). A note on asymptotic joint normality. The Annals of Mathematical Statistics, pages 508–515.

Mediano, P. A. M., Farah, J. C., and Shanahan, M. (2016). Integrated information and metastability in systems of coupled oscillators. arXiv preprint arXiv:1606.08313.

Meiss, J. D. (2007). Differential Dynamical Systems, volume 14. SIAM.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Oizumi, M., Albantakis, L., and Tononi, G. (2014). From the phenomenology to the mechanisms of consciousness: Integrated information theory 3.0. PLoS Comput Biol, 10(5):e1003588.

Oizumi, M., Amari, S.-i., Yanagawa, T., Fujii, N., and Tsuchiya, N. (2016). Measuring integrated information from the decoding perspective. PLoS Comput Biol, 12(1):e1004654.

Ostwald, D., Kirilina, E., Starke, L., and Blankenburg, F. (2014). A tutorial on variational Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60:1–19.

Ostwald, D., Porcaro, C., and Bagshaw, A. P. (2010). An information theoretic approach to EEG-fMRI integration of visually evoked responses. Neuroimage, 49(1):498–516.

Ostwald, D. and Starke, L. (2016). Probabilistic delay differential equation modeling of event-related potentials. Neuroimage, 136:227–257.

Penny, W., Litvak, V., Fuentemilla, L., Duzel, E., and Friston, K. (2009). Dynamic causal models for phase coupling. J Neurosci Methods, 183(1):19–30.

Roweis, S. and Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345.

Tegmark, M. (2016). Improved measures of integrated information.

Tononi, G. (2004). An information integration theory of consciousness. BMC Neurosci, 5:42.

Tononi, G. (2005). Consciousness, information integration, and the brain. Prog Brain Res, 150:109–126.

Tononi, G. (2008). Consciousness as integrated information: a provisional manifesto. The Biological Bulletin, 215(3):216–242.

Tononi, G. (2012). Integrated information theory of consciousness: an updated account. Arch Ital Biol, 150(4):293–329.

Tononi, G. (2014). Why Scott should stare at a blank wall and reconsider (or, the conscious grid).

Tononi, G. (2015). Integrated information theory. Scholarpedia, 10(1):4164.

Tononi, G., Boly, M., Massimini, M., and Koch, C. (2016). Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience.

Virmani, M. and Nagaraj, N. (2016). A compression-complexity measure of integrated information. arXiv preprint arXiv:1608.08450.

Supporting Information

[Figure S1 here; columns: Partition, Graph, Decomposition Rule, Conditional, Gaussian; rows: the unpartitioned system p(1,2|3,4) and the seven bipartitions 1|{2,3,4}, 2|{1,3,4}, 3|{1,2,4}, 4|{1,2,3}, {1,3}|{2,4}, {1,4}|{2,3}, {1,2}|{3,4}]

Figure S1: Exhaustive system decomposition (effect repertoire). Visualization of all k = 2^{4−1} − 1 = 7 unique bipartitions of a two-dimensional system with X_t := {1, 2} and X_{t−1} := {3, 4}. For the effect repertoire (first row), the joint distribution over {1, 2, 3, 4} is conditioned on {3, 4}. The graphical model for every partition is depicted in the left column alongside the bipartition of {1, 2, 3, 4} in the remaining rows. The decomposition rule as presented in the main text is displayed along with the ensuing conditional distribution in the third and fourth columns. For the Gaussian case, the effects of the system decomposition on the covariance structure cov(X_{t−1}, X_t) of the joint distribution are shown, with black tiles indicating zero elements.


[Figure S2 here; columns: Partition, Graph, Decomposition Rule, Conditional, Gaussian; rows: the unpartitioned system p(3,4|1,2) and the same seven bipartitions as in Figure S1]

Figure S2: Exhaustive system decomposition (cause repertoire). Visualization of all k = 2^{4−1} − 1 = 7 unique bipartitions of a two-dimensional system with X_t := {1, 2} and X_{t−1} := {3, 4}. For the cause repertoire (first row), the joint distribution over {1, 2, 3, 4} is conditioned on {1, 2}. The graphical model for every partition is depicted in the left column alongside the bipartition of {1, 2, 3, 4} in the remaining rows. The decomposition rule as presented in the main text is displayed along with the ensuing conditional distribution in the third and fourth columns. For the Gaussian case, the effects of the system decomposition on the covariance structure cov(X_{t−1}, X_t) of the joint distribution are shown, with black tiles indicating zero elements.


[Figure S3 here; 31 subpanels, one per bipartition π(i) of (a_{t−1}, b_{t−1}, c_{t−1}, a_t, b_t, c_t), each showing the corresponding decomposed effect repertoire p_e^{(i)}(X_t|X_{t−1}); an asterisk marks the minimum-EMD partition]

Figure S3: Decomposed effect repertoire for an exemplary discrete-state system. For the system state X = (1, 0, 0), the figure visualizes the complete set of decomposed variants of the effect repertoire p(X_t|X_{t−1} = X). Each subpanel includes the corresponding partition of the set of random variables (X_{t−1}, X_t) = (a_{t−1}, b_{t−1}, c_{t−1}, a_t, b_t, c_t) that gives rise to these variants. The asterisk signifies the decomposed effect repertoire which results in the minimum EMD with respect to the original effect repertoire, which in turn corresponds to the effect information φ_e((1, 0, 0)).


[Figure S4 here; 31 subpanels, one per bipartition π(i) of (a_{t−1}, b_{t−1}, c_{t−1}, a_t, b_t, c_t), each showing the corresponding decomposed cause repertoire p_c^{(i)}(X_{t−1}|X_t); asterisks mark the minimum-EMD partitions]

Figure S4: Decomposed cause repertoire for an exemplary discrete-state system. For the system state X = (1, 0, 0), the figure visualizes the complete set of decomposed variants of the random variable set (X_{t−1}, X_t) giving rise to the cause repertoire variants. The asterisks signify the decomposed cause repertoires which result in a minimum EMD with respect to the original cause repertoire, which in turn corresponds to the cause information φ_c((1, 0, 0)).

34

Figure S5: System decomposition-induced covariance cov(X_{t−1}, X_t)^{(i)}. For a three-dimensional system, the panels depict the covariance structure induced by each possible bipartition of (X_{t−1}, X_t), here indicated by index sets (cf. eq. (50)). Black tiles indicate zero elements and white tiles indicate arbitrary elements of cov(X_{t−1}, X_t)^{(i)}.
