Full Bayesian Network Classifiers, by Jiang Su and Harry Zhang
Flemming Jensen
November 2008
Purpose
To introduce the full Bayesian network classifier (FBC).
Introduction
Bayesian networks are often used for the classification problem, where a learner attempts to construct a classifier from a given set of labeled training examples.
Since the number of possible network structures is extremely large, structure learning often has high computational complexity.
The idea behind the full Bayesian network classifier is to reduce the computational complexity of structure learning by using a full Bayesian network as the structure, and representing variable independence in the conditional probability tables instead of in the network structure.
We use decision trees to represent the conditional probability tables to keep the compact representation of the joint distribution.
Variable Independence
Definition - Conditional independence
Let X, Y, Z be subsets of the variable set W. The subsets X and Y are conditionally independent given Z if:
P(X | Y, Z) = P(X | Z)
Definition - Contextual independence
Let X, Y, Z, T be disjoint subsets of the variable set W. The subsets X and Y are contextually independent given Z and the context t, an assignment of the variables in T, if:
P(X | Y, Z, t) = P(X | Z, t)
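The conditional-independence definition can be checked numerically. The sketch below builds a hypothetical joint distribution over three binary variables that factorizes as P(z)P(x|z)P(y|z), so X and Y are conditionally independent given Z by construction, and then verifies P(X | Y, Z) = P(X | Z) from the joint table; all numbers here are made up for illustration.

```python
from itertools import product

# Hypothetical joint distribution over binary X, Y, Z, built so that
# X and Y are conditionally independent given Z:
# P(x, y, z) = P(z) * P(x | z) * P(y | z)
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def p_x_given_yz(x, y, z):
    """P(X=x | Y=y, Z=z) computed from the joint table."""
    return joint[(x, y, z)] / sum(joint[(xx, y, z)] for xx in [0, 1])

def p_x_given_z_marginal(x, z):
    """P(X=x | Z=z) computed from the joint table."""
    num = sum(joint[(x, yy, z)] for yy in [0, 1])
    den = sum(joint[(xx, yy, z)] for xx in [0, 1] for yy in [0, 1])
    return num / den

# The definition holds for every assignment: P(X | Y, Z) == P(X | Z).
for x, y, z in product([0, 1], repeat=3):
    assert abs(p_x_given_yz(x, y, z) - p_x_given_z_marginal(x, z)) < 1e-12
```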
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encode the same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted on the basis of a topological ordering.
Go through each node X in the topological ordering, and add arcs to all the nodes ranked after X.
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that any variable that is not in the parent set ΠX of X in B does not occur in the CPT-tree of X in FB.
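The structural part of this construction can be sketched in a few lines: topologically sort the DAG B and connect every node to all nodes ranked after it. The adjacency representation and function name below are illustrative choices, not from the paper.

```python
# A minimal sketch of the construction in the proof: given a DAG B as a
# map from each node to its set of parents, topologically sort the nodes
# and make every node a parent of all nodes ranked after it, yielding a
# full Bayesian network structure.
from graphlib import TopologicalSorter

def full_bn_from(b_parents):
    """b_parents maps each node to the set of its parents in B."""
    order = list(TopologicalSorter(b_parents).static_order())  # parents first
    # Arcs of the full network FB: every earlier node points at every later one.
    full_parents = {x: set(order[:i]) for i, x in enumerate(order)}
    return order, full_parents

# Example: a naive-Bayes-like DAG C -> X1, C -> X2.
order, fb = full_bn_from({"C": set(), "X1": {"C"}, "X2": {"C"}})
# Every pair of nodes is now connected by exactly one arc (3 arcs for 3 nodes).
```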
![Page 10: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/10.jpg)
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encodethe same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted onthe basis of the topological ordering.
Go through each node X in the topological ordering, and addarcs to all the nodes ranked after X .
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that anyvariable that is not in the parent set ΠX of X in B does notoccur in the CPT-tree of X in FB.
![Page 11: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/11.jpg)
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encodethe same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted onthe basis of the topological ordering.
Go through each node X in the topological ordering, and addarcs to all the nodes ranked after X .
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that anyvariable that is not in the parent set ΠX of X in B does notoccur in the CPT-tree of X in FB.
![Page 12: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/12.jpg)
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encodethe same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted onthe basis of the topological ordering.
Go through each node X in the topological ordering, and addarcs to all the nodes ranked after X .
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that anyvariable that is not in the parent set ΠX of X in B does notoccur in the CPT-tree of X in FB.
![Page 13: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/13.jpg)
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encodethe same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted onthe basis of the topological ordering.
Go through each node X in the topological ordering, and addarcs to all the nodes ranked after X .
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that anyvariable that is not in the parent set ΠX of X in B does notoccur in the CPT-tree of X in FB.
![Page 14: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/14.jpg)
Existence
Theorem - Existence
For any BN B, there exists an FBC FB, such that B and FB encodethe same variable independencies.
Proof:
Since B is an acyclic graph, the nodes of B can be sorted onthe basis of the topological ordering.
Go through each node X in the topological ordering, and addarcs to all the nodes ranked after X .
The resulting network FB is a full BN.
Build a CPT-tree for each node X in FB, such that anyvariable that is not in the parent set ΠX of X in B does notoccur in the CPT-tree of X in FB.
![Page 15: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/15.jpg)
Example - FBC for Naive Bayes
Example of a naive Bayes network: the class node C has an arc to each of the feature nodes X1, X2, X3, X4, and there are no arcs among the features.
Example - FBC for Naive Bayes
Example of an FBC for the naive Bayes: the network is full, with arcs among the features X1, X2, X3, X4 as well as from C, but the CPT-tree of each Xi tests only the class C, with leaf probabilities pi1, pi2, pi3, pi4. Since no feature occurs in another feature's CPT-tree, the extra arcs encode no dependence, and the FBC represents the same independencies as the naive Bayes.
Learning Full Bayesian Network Classifiers
Learning an FBC consists of two parts:
Construction of a full BN.
Learning of decision trees to represent the CPT of each variable.
The full BN is implemented using a Bayesian multinet.
Definition - Bayesian multinet
A Bayesian multinet is a set of Bayesian networks, each of which corresponds to a value c of the class variable C.
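The multinet idea can be sketched abstractly: one network per class value, with classification by argmax over c of P(c) times the likelihood under that class's network. The per-class likelihood functions below are toy placeholders, not real FBC likelihoods.

```python
# A minimal sketch of classification with a Bayesian multinet: the
# multinet maps each class value c to a likelihood function x -> P(x | B_c).
def classify(x, multinet, priors):
    """Return argmax over c of P(c) * P(x | B_c)."""
    return max(priors, key=lambda c: priors[c] * multinet[c](x))

# Toy example: two classes whose 'networks' are simple likelihoods over
# a single binary feature (hypothetical numbers).
multinet = {"c1": lambda x: 0.9 if x == 1 else 0.1,
            "c2": lambda x: 0.3 if x == 1 else 0.7}
priors = {"c1": 0.5, "c2": 0.5}
# classify(1, multinet, priors) -> "c1"
```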
Structure Learning
Learning the structure of a full BN actually means learning an order of the variables and then adding arcs from a variable to all the variables ranked after it.
A variable is ranked based on its total influence on other variables.
The influence (dependency) between two variables can be measured by mutual information.
Definition - Mutual information
Let X and Y be two variables in a Bayesian network. The mutual information is defined as:
M(X; Y) = Σ_{x∈X, y∈Y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]
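The definition translates directly into code when the probabilities are estimated from observed pairs. This sketch uses plain relative-frequency estimates (natural log); the function name is an illustrative choice.

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Estimate M(X; Y) from a list of observed (x, y) pairs,
    using relative frequencies as the probability estimates."""
    n = len(pairs)
    p_xy = {k: v / n for k, v in Counter(pairs).items()}
    p_x = {k: v / n for k, v in Counter(x for x, _ in pairs).items()}
    p_y = {k: v / n for k, v in Counter(y for _, y in pairs).items()}
    # Sum P(x, y) log [P(x, y) / (P(x) P(y))] over observed (x, y).
    return sum(p * log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# Perfectly dependent binary pairs give M = log 2; independent pairs give 0.
```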
![Page 23: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/23.jpg)
Structure Learning
Learning the structure of a full BN actually means learning anorder of variables and then adding arcs from a variable to all thevariables ranked after it.
A variable is ranked based on its total influence on other variables.
The influence (dependency) between two variables can bemeasured by mutual information.
Definition - Mutual information
Let X and Y be two variables in a Bayesian network. The mutualinformation is defined as:
M(X ; Y ) =∑
x∈X ,y∈Y
P(x , y)logP(x , y)
P(x)P(y)
![Page 24: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/24.jpg)
Structure Learning
Learning the structure of a full BN actually means learning anorder of variables and then adding arcs from a variable to all thevariables ranked after it.
A variable is ranked based on its total influence on other variables.
The influence (dependency) between two variables can bemeasured by mutual information.
Definition - Mutual information
Let X and Y be two variables in a Bayesian network. The mutualinformation is defined as:
M(X ; Y ) =∑
x∈X ,y∈Y
P(x , y)logP(x , y)
P(x)P(y)
![Page 25: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/25.jpg)
Structure Learning
Learning the structure of a full BN actually means learning anorder of variables and then adding arcs from a variable to all thevariables ranked after it.
A variable is ranked based on its total influence on other variables.
The influence (dependency) between two variables can bemeasured by mutual information.
Definition - Mutual information
Let X and Y be two variables in a Bayesian network. The mutualinformation is defined as:
M(X ; Y ) =∑
x∈X ,y∈Y
P(x , y)logP(x , y)
P(x)P(y)
![Page 26: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/26.jpg)
Structure Learning
It is possible that the dependency between two variables, measured by mutual information, is caused merely by noise.
Results by Friedman are used as a dependency threshold to filter out unreliable dependencies.
Definition - Dependency threshold
Let Xi and Xj be two variables in a Bayesian network, and let N be the number of training instances. The dependency threshold, denoted by φ, is defined as:
φ(Xi, Xj) = (log N / (2N)) × Tij, where Tij = |Xi| × |Xj|.
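The threshold is a one-liner once N and the variable cardinalities are known. Reading N as the number of training instances is an assumption here (the slides do not define N explicitly); note that the threshold shrinks as N grows, so more data lets weaker dependencies through.

```python
from math import log

def dependency_threshold(n_instances, card_i, card_j):
    """phi(Xi, Xj) = (log N / (2N)) * Tij, with Tij = |Xi| * |Xj|.
    N is assumed to be the number of training instances."""
    t_ij = card_i * card_j
    return log(n_instances) / (2 * n_instances) * t_ij

# e.g. two binary variables and 400 instances:
# dependency_threshold(400, 2, 2) == log(400) / 800 * 4
```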
![Page 27: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/27.jpg)
Structure Learning
It is possible that the dependency between two variables, measuredby mutual information, is caused merely by noise.
Results by Friedman are used as a dependency threshold to filterout unreliable dependencies.
Definition - Dependency threshold
Let Xi and Xj be two variables in a Bayesian network. Thedependency threshold, denoted by φ, is defined as:
φ(Xi ,Xj) = logN2N × Tij , where Tij = |Xi | × |Xj |.
![Page 28: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/28.jpg)
Structure Learning
It is possible that the dependency between two variables, measuredby mutual information, is caused merely by noise.
Results by Friedman are used as a dependency threshold to filterout unreliable dependencies.
Definition - Dependency threshold
Let Xi and Xj be two variables in a Bayesian network. Thedependency threshold, denoted by φ, is defined as:
φ(Xi ,Xj) = logN2N × Tij , where Tij = |Xi | × |Xj |.
![Page 29: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/29.jpg)
Structure Learning
The total influence of a variable on other variables can now be defined:
Definition - Total influence
Let Xi be a variable in a Bayesian network. The total influence of Xi on other variables, denoted by W(Xi), is defined as:
W(Xi) = Σ_{j ≠ i, M(Xi; Xj) > φ(Xi, Xj)} M(Xi; Xj)
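In code, W(Xi) is a filtered sum over precomputed mutual-information and threshold scores. The dictionary layout and the numbers in the toy example below are illustrative assumptions.

```python
# A minimal sketch: W(Xi) sums only those mutual-information scores that
# clear the dependency threshold, discarding noise-level dependencies.
def total_influence(i, variables, mi, phi):
    """W(Xi) = sum over j != i of M(Xi; Xj) where M(Xi; Xj) > phi(Xi, Xj)."""
    return sum(mi[(i, j)] for j in variables
               if j != i and mi[(i, j)] > phi[(i, j)])

# Toy scores for three variables; 0.01 falls below its threshold 0.05
# and is filtered out of W("A").
mi = {("A", "B"): 0.5, ("A", "D"): 0.01, ("B", "A"): 0.5,
      ("B", "D"): 0.2, ("D", "A"): 0.01, ("D", "B"): 0.2}
phi = {k: 0.05 for k in mi}
# total_influence("A", ["A", "B", "D"], mi, phi) -> 0.5
```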
![Page 30: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/30.jpg)
Structure Learning
The total influence of a variable on other variables can now bedefined:
Definition - Total influence
Let Xi be a variable in a Bayesian network. The total influence ofXi on other variables, denoted by W (Xi ), is defined as:
W (Xi ) =
M(Xi ;Xj )>φ(Xi ,Xj )∑j(j 6=i)
M(Xi ; Xj).
![Page 31: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/31.jpg)
Structure Learning Algorithm
Algorithm FBC-Structure(S, X)
1 B = empty.
2 Partition the training data S into |C| subsets Sc by the class value c.
3 For each training data set Sc:
   - Compute the mutual information M(Xi; Xj) and the dependency threshold φ(Xi, Xj) between each pair of variables Xi and Xj.
   - Compute W(Xi) for each variable Xi.
   - For each variable Xi in X:
     - Add all the variables Xj with W(Xj) > W(Xi) to the parent set ΠXi of Xi.
     - Add arcs from all the variables Xj in ΠXi to Xi.
   - Add the resulting network Bc to B.
4 Return B.
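The algorithm can be sketched end to end in Python. This is a simplified reading of the steps above, with relative-frequency estimates; the data layout (a list of (class, feature-dict) pairs) is an assumption, and ties in W leave both variables without an arc between them, so a tie-breaking rule would be needed for a strictly full network.

```python
from collections import Counter
from math import log

def fbc_structure(S, X):
    """Sketch of FBC-Structure. S: list of (class_value, features) pairs,
    features a dict variable -> value; X: the list of feature variables.
    Returns a dict mapping each class value c to the parent sets of Bc."""
    def mi_and_phi(rows, xi, xj):
        # Mutual information M(Xi; Xj) and threshold phi(Xi, Xj) on rows.
        n = len(rows)
        pairs = [(r[xi], r[xj]) for r in rows]
        p_xy = {k: v / n for k, v in Counter(pairs).items()}
        p_x = {k: v / n for k, v in Counter(a for a, _ in pairs).items()}
        p_y = {k: v / n for k, v in Counter(b for _, b in pairs).items()}
        m = sum(p * log(p / (p_x[a] * p_y[b])) for (a, b), p in p_xy.items())
        phi = log(n) / (2 * n) * len(p_x) * len(p_y)
        return m, phi

    B = {}
    # Step 2: partition S by class value.
    subsets = {}
    for c, features in S:
        subsets.setdefault(c, []).append(features)
    # Step 3: build one network per class.
    for c, rows in subsets.items():
        scores = {(xi, xj): mi_and_phi(rows, xi, xj)
                  for xi in X for xj in X if xi != xj}
        # W(Xi): total influence, counting only scores above the threshold.
        W = {xi: sum(m for (a, _), (m, phi) in scores.items()
                     if a == xi and m > phi) for xi in X}
        # Parents of Xi: all variables with strictly larger W.
        B[c] = {xi: {xj for xj in X if xj != xi and W[xj] > W[xi]}
                for xi in X}
    return B
```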
Example - Structure Learning Algorithm
Example using 1000 labeled instances, where C is the class variable and A, B, and D are feature variables. The # column gives the number of instances with each configuration.
The 400 instances with C = c1:        The 600 instances with C = c2:

C   A   B   D     #                   C   A   B   D     #
c1  a1  b1  d1   11                   c2  a1  b1  d1   36
c1  a1  b1  d2    5                   c2  a1  b1  d2   36
c1  a1  b2  d1    7                   c2  a1  b2  d1  259
c1  a1  b2  d2   17                   c2  a1  b2  d2   29
c1  a2  b1  d1  227                   c2  a2  b1  d1   96
c1  a2  b1  d2   97                   c2  a2  b1  d2   96
c1  a2  b2  d1   11                   c2  a2  b2  d1   43
c1  a2  b2  d2   25                   c2  a2  b2  d2    5
Example - Structure Learning Algorithm

The 400 data instances where C = c1 give the joint distribution over A and B:

P(A,B):
           b1               b2
a1     (11+5)/400       (7+17)/400
a2    (227+97)/400     (11+25)/400
Example - Structure Learning Algorithm

Evaluating the fractions gives:

P(A,B):
      b1     b2
a1   0.04   0.06
a2   0.81   0.09
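This joint distribution can be checked mechanically from the counts. A small Python sketch, with the count table copied from the slides:

```python
# Counts for the 400 instances with C = c1, keyed by (a, b, d).
counts = {
    ('a1', 'b1', 'd1'): 11,  ('a1', 'b1', 'd2'): 5,
    ('a1', 'b2', 'd1'): 7,   ('a1', 'b2', 'd2'): 17,
    ('a2', 'b1', 'd1'): 227, ('a2', 'b1', 'd2'): 97,
    ('a2', 'b2', 'd1'): 11,  ('a2', 'b2', 'd2'): 25,
}
total = sum(counts.values())  # 400

# Marginalise D away to obtain the joint distribution P(A, B).
joint = {}
for (a, b, d), n in counts.items():
    joint[(a, b)] = joint.get((a, b), 0) + n
p_ab = {ab: n / total for ab, n in joint.items()}
print(p_ab)
# {('a1', 'b1'): 0.04, ('a1', 'b2'): 0.06, ('a2', 'b1'): 0.81, ('a2', 'b2'): 0.09}
```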
Example - Structure Learning Algorithm

Multiplying the marginals gives the product distribution P(A)P(B):

            b1                            b2
a1   (0.04+0.06)·(0.04+0.81)    (0.04+0.06)·(0.06+0.09)
a2   (0.81+0.09)·(0.04+0.81)    (0.81+0.09)·(0.06+0.09)

The mutual information between two variables X and Y is

M(X;Y) = Σ_{x∈X, y∈Y} P(x,y) · log( P(x,y) / (P(x)P(y)) )
Example - Structure Learning Algorithm

P(A,B):                 P(A)P(B):
      b1     b2               b1      b2
a1   0.04   0.06        a1   0.085   0.015
a2   0.81   0.09        a2   0.765   0.135

M(A;B) = 0.04·log(0.04/0.085) + 0.81·log(0.81/0.765)
       + 0.06·log(0.06/0.015) + 0.09·log(0.09/0.135) = 0.027

(The value 0.027 corresponds to base-10 logarithms.)
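The result can be reproduced numerically. Note that the slides' value 0.027 is obtained with base-10 logarithms; natural logarithms would give about 0.063.

```python
import math

# Joint P(A,B) and the marginals P(A), P(B) for C = c1, from the slides.
p_ab = {('a1', 'b1'): 0.04, ('a1', 'b2'): 0.06,
        ('a2', 'b1'): 0.81, ('a2', 'b2'): 0.09}
p_a = {'a1': 0.10, 'a2': 0.90}
p_b = {'b1': 0.85, 'b2': 0.15}

# M(A;B) = sum over a, b of P(a,b) * log( P(a,b) / (P(a)P(b)) ).
m_ab = sum(p * math.log10(p / (p_a[a] * p_b[b]))
           for (a, b), p in p_ab.items())
print(round(m_ab, 3))  # 0.027
```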
Example - Structure Learning Algorithm

Mutual information:
M(A;B) = 0.027
M(A;D) = 0.004
M(B;D) = 0.018

Dependency threshold:
φ(Xi, Xj) = (log N / (2N)) · Tij

φ(A,B) = φ(A,D) = φ(B,D) = 4·log(400)/800 = 0.013

Total influence (summing only the pairs whose mutual information exceeds the threshold):

W(Xi) = Σ_{j≠i, M(Xi;Xj) > φ(Xi,Xj)} M(Xi; Xj)

W(A) = M(A;B) = 0.027
W(B) = M(A;B) + M(B;D) = 0.045
W(D) = M(B;D) = 0.018
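These threshold and total-influence values can be checked the same way. Here Tij = 4 is read off the slides' 4·log(400)/800 (presumably |Xi|·|Xj| for two binary variables); base-10 logarithms are used as before.

```python
import math

N = 400  # instances in the C = c1 subset
T = 4    # Tij, matching the slides' 4·log(400)/800
phi = T * math.log10(N) / (2 * N)  # dependency threshold

# Pairwise mutual information values from the slides.
m = {('A', 'B'): 0.027, ('A', 'D'): 0.004, ('B', 'D'): 0.018}

def w(x):
    # Total influence: sum the mutual information of the pairs involving x
    # whose value exceeds the dependency threshold.
    return sum(v for pair, v in m.items() if x in pair and v > phi)

print(round(phi, 3), round(w('A'), 3), round(w('B'), 3), round(w('D'), 3))
# 0.013 0.027 0.045 0.018
```

M(A; D) = 0.004 falls below the threshold, which is why it contributes to none of the W values.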
Example - Structure Learning Algorithm

We now construct a full Bayesian network with the variable order given by the total influence values:

W(A) = 0.027
W(B) = 0.045
W(D) = 0.018

W(B) > W(A) > W(D)

The resulting order is B, A, D, so the full network has the arcs B → A, B → D, and A → D.

We now have the full Bayesian network Bc1, which is the part of the multinet that corresponds to C = c1. We now repeat the process to construct Bc2 and thereby complete the FBC structure learning.
CPT-tree Learning

We now need to learn a CPT-tree for each variable in the full BN.

A traditional decision tree learning algorithm, such as C4.5, could be used to learn the CPT-trees. However, since its time complexity is typically O(n² · N), the resulting FBC learning algorithm would have a complexity of O(n³ · N).

Instead, a fast decision tree learning algorithm is proposed.

The algorithm uses the mutual information to determine a fixed ordering of the variables from root to leaves.

This predefined variable ordering makes the algorithm faster than traditional decision tree learning algorithms.
CPT-tree Learning Algorithm
Algorithm Fast-CPT-Tree(ΠXi, S)
1. Create an empty tree T.
2. If (S is pure or empty) or (ΠXi is empty), return T.
3. qualified = False.
4. While (qualified == False) and (ΠXi is not empty):
   - Choose the variable Xj with the highest M(Xj; Xi).
   - Remove Xj from ΠXi.
   - Compute the local mutual information MS(Xi; Xj) on S.
   - Compute the local dependency threshold φS(Xi, Xj) on S.
   - If MS(Xi; Xj) > φS(Xi, Xj), qualified = True.
5. If qualified == True:
   - Create a root Xj for T.
   - Partition S into disjoint subsets Sx, one for each value x of Xj.
   - For all values x of Xj:
     - Tx = Fast-CPT-Tree(ΠXi, Sx)
     - Add Tx as a child of Xj.
6. Return T.
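As a minimal sketch (not the authors' implementation), the pseudocode can be written in Python. The helpers local_mi and local_threshold stand in for MS and φS and are stubbed here with the numbers from the running example for D's CPT-tree; the purity check on S is omitted for brevity:

```python
def fast_cpt_tree(candidates, S, global_mi, local_mi, local_threshold, domains):
    """Sketch of Fast-CPT-Tree(Pi_Xi, S) from the slides.

    candidates: remaining parent variables Pi_Xi (list of names)
    S: an identifier for the current data subset
    global_mi[Xj]: M(Xj; Xi), fixing the root-to-leaf ordering
    local_mi(S, Xj) / local_threshold(S, Xj): M_S(Xi; Xj) and phi_S(Xi, Xj)
    Returns a nested dict {Xj: {value: subtree}}, or None for an empty tree.
    """
    remaining = list(candidates)            # steps 1-2 (purity test omitted)
    qualified = None
    while remaining and qualified is None:  # steps 3-4
        xj = max(remaining, key=lambda v: global_mi[v])
        remaining.remove(xj)
        if local_mi(S, xj) > local_threshold(S, xj):
            qualified = xj
    if qualified is None:                   # no variable passed the test
        return None
    # Step 5: make the qualified variable the root and recurse per value.
    return {qualified: {
        x: fast_cpt_tree(remaining, (S, qualified, x),
                         global_mi, local_mi, local_threshold, domains)
        for x in domains[qualified]}}

# Stub statistics from the running example (learning D's CPT-tree).
MI = {("S", "B"): 0.018, ("S", "A"): 0.004,
      (("S", "B", "b1"), "A"): 7e-6, (("S", "B", "b2"), "A"): 4e-5}
PHI = {("S", "B"): 0.013,
       (("S", "B", "b1"), "A"): 0.015, (("S", "B", "b2"), "A"): 0.059}
tree = fast_cpt_tree(["A", "B"], "S", {"A": 0.004, "B": 0.018},
                     lambda S, xj: MI[(S, xj)], lambda S, xj: PHI[(S, xj)],
                     {"A": ["a1", "a2"], "B": ["b1", "b2"]})
print(tree)  # {'B': {'b1': None, 'b2': None}} -- B splits, A never qualifies
```

The trace matches the worked example that follows: B qualifies on the full sample and becomes the root, while A fails the threshold test in both branches, so the children are empty trees.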
Example - CPT-tree Learning Algorithm
We construct the CPT-tree for the variable D first.
Fast-CPT-Tree(ΠD = {A,B}, S)
M(D; B) = 0.018 > M(D; A) = 0.004, so Xj = B.
MS(D; B) = M(D; B) = 0.018, φS(D, B) = φ(D, B) = 0.013.
MS(D; B) > φS(D, B), so qualified = True.
Since qualified == True, create a root for Xj = B and partition S into the subsets Sb1 and Sb2.
Recursively call Fast-CPT-Tree(ΠD = {A}, Sb1) and Fast-CPT-Tree(ΠD = {A}, Sb2), and add the resulting trees as children of Xj = B.

(Tree so far: root B with branches b1 and b2.)
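The root decision in this call can be checked numerically with the slide's values:

```python
# Candidate ranking and qualification test for D's CPT-tree root.
global_mi = {"B": 0.018, "A": 0.004}   # M(D; Xj) on the full sample S
best = max(global_mi, key=global_mi.get)
m_s, phi_s = 0.018, 0.013              # M_S(D; B) and phi_S(D, B)
print(best, m_s > phi_s)  # B True -> B becomes the root; S splits into Sb1, Sb2
```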
Example - CPT-tree Learning Algorithm
Fast-CPT-Tree(ΠD = {A}, Sb1)
Only one parent variable remains, so Xj = A.
MSb1(D; A) = 7 · 10⁻⁶, φSb1(D, A) = 0.015.
MSb1(D; A) ≯ φSb1(D, A), so qualified = False.
Since qualified == False, return the empty tree.
Fast-CPT-Tree(ΠD = {A}, Sb2)
Only one parent variable remains, so Xj = A.
MSb2(D; A) = 4 · 10⁻⁵, φSb2(D, A) = 0.059.
MSb2(D; A) ≯ φSb2(D, A), so qualified = False.
Since qualified == False, return the empty tree.
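Both recursive calls fail the same test; as a quick check with the slide's numbers:

```python
# Local MI vs. dependency threshold for the two recursive calls on A.
checks = {"S_b1": (7e-6, 0.015), "S_b2": (4e-5, 0.059)}
qualified = {s: mi > phi for s, (mi, phi) in checks.items()}
print(qualified)  # {'S_b1': False, 'S_b2': False} -> both return an empty tree
```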
![Page 123: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/123.jpg)
Example - CPT-tree Learning Algorithm
Fast-CPT-Tree(ΠD = {A}, Sb1)
Only one parent variable remains, so Xj = A.
MSb1 (D; A) = 7 · 10−6 , φSb1 (D,A) = 0.015MSb1 (D; A) ≯ φSb1 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
Fast-CPT-Tree(ΠD = {A}, Sb2)
Only one parent variable remains, so Xj = A.
MSb2 (D; A) = 4 · 10−5 , φSb2 (D,A) = 0.059MSb2 (D; A) ≯ φSb2 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
![Page 124: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/124.jpg)
Example - CPT-tree Learning Algorithm
Fast-CPT-Tree(ΠD = {A}, Sb1)
Only one parent variable remains, so Xj = A.
MSb1 (D; A) = 7 · 10−6
, φSb1 (D,A) = 0.015MSb1 (D; A) ≯ φSb1 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
Fast-CPT-Tree(ΠD = {A}, Sb2)
Only one parent variable remains, so Xj = A.
MSb2 (D; A) = 4 · 10−5 , φSb2 (D,A) = 0.059MSb2 (D; A) ≯ φSb2 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
![Page 125: Full Bayesian Network Classifiers by Jiang Su and Harry Zhangpeople.cs.aau.dk/~tdn/Teaching/MI2008/Slides/FJ_2.pdfBayesian network as the structure, and represent variable independence](https://reader034.fdocuments.us/reader034/viewer/2022051822/5fec590402f58b2ab30c60b3/html5/thumbnails/125.jpg)
Example - CPT-tree Learning Algorithm
Fast-CPT-Tree(ΠD = {A}, Sb1)
Only one parent variable remains, so Xj = A.
MSb1 (D; A) = 7 · 10−6 , φSb1 (D,A) = 0.015
MSb1 (D; A) ≯ φSb1 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
Fast-CPT-Tree(ΠD = {A}, Sb2)
Only one parent variable remains, so Xj = A.
MSb2 (D; A) = 4 · 10−5 , φSb2 (D,A) = 0.059MSb2 (D; A) ≯ φSb2 (D,A) so qualified = False.
Since qualified == False, return the empty tree.
Example - CPT-tree Learning Algorithm

We now only need to add Xi = D as a child of B and specify the probabilities, which are trivial to calculate:

B = b1:  P(d1 | b1) = (11+227)/340 = 0.7,  P(d2 | b1) = (5+97)/340 = 0.3
B = b2:  P(d1 | b2) = (7+11)/60 = 0.3,  P(d2 | b2) = (17+25)/60 = 0.7

We should repeat this process for each variable in each network.
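As a sketch of how such leaf probabilities are obtained: each leaf of the CPT-tree pools the counts of the instances reaching it and normalizes. The counts below are hard-coded from the slide (each leaf pools two counts, reflecting the rejected split below B), and `leaf_probabilities` is an illustrative helper, not from the paper.

```python
# Counts of D values among instances reaching each leaf (B = b1 or b2).
counts = {
    "b1": {"d1": 11 + 227, "d2": 5 + 97},   # 340 instances with B = b1
    "b2": {"d1": 7 + 11,   "d2": 17 + 25},  # 60 instances with B = b2
}

def leaf_probabilities(counts):
    """Maximum-likelihood estimate P(D = d | B = b) at each leaf."""
    probs = {}
    for b, by_d in counts.items():
        total = sum(by_d.values())
        probs[b] = {d: c / total for d, c in by_d.items()}
    return probs

print(leaf_probabilities(counts))
# → {'b1': {'d1': 0.7, 'd2': 0.3}, 'b2': {'d1': 0.3, 'd2': 0.7}}
```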
Complexity

Let n be the number of variables and N the number of data instances.

FBC-Structure has time complexity O(n² · N).

Fast-CPT-Tree has time complexity O(n · N). Fast-CPT-Tree is called once for each variable in each of the |C| multinet parts. Hence the time complexity: O(|C| · n² · N/|C|) = O(n² · N).

Thus, the FBC learning algorithm has time complexity O(n² · N).
Experiments - Results

33 UCI data sets, available in Weka, are used for the experiments.
The performance of each algorithm on each data set is measured via 10 runs of 10-fold cross-validation.
A two-tailed paired t-test at the 95% confidence level is conducted to compare each pair of algorithms on each data set.

Results on accuracy - classification (data sets won/draw/lost):

        AODE    HGC     TAN     NBT     C4.5     SMO
FBC     8/22/3  4/27/2  6/27/0  6/27/0  11/19/3  6/24/2

Results on AUC - ranking (data sets won/draw/lost):

        AODE    HGC     TAN     NBT     C4.5L    SMO
FBC     7/22/4  6/25/2  9/24/0  8/24/1  25/7/1   10/20/3
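The per-data-set win/draw/loss decision can be sketched as follows. This is an illustrative, self-contained version of the comparison procedure: 10 runs of 10-fold cross-validation give 100 paired scores per algorithm, and a two-tailed paired t-test at the 95% level declares the outcome. The critical value for roughly 99 degrees of freedom and the sample scores are assumptions for the sketch, not values from the paper.

```python
from math import sqrt

T_CRIT = 1.984  # two-tailed 95% critical value of Student's t, ~99 d.o.f.

def compare(scores_a, scores_b):
    """Return 'win', 'loss', or 'draw' for algorithm A vs B (paired t-test)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:  # identical differences everywhere: no variability to test
        return "draw" if mean == 0 else ("win" if mean > 0 else "loss")
    t = mean / sqrt(var / n)
    if abs(t) <= T_CRIT:
        return "draw"
    return "win" if t > 0 else "loss"

# Synthetic example: FBC consistently 2 accuracy points better on 100 folds.
fbc = [0.85 + 0.001 * (i % 10) for i in range(100)]
other = [s - 0.02 for s in fbc]
print(compare(fbc, other))  # → win
```

Counting these outcomes over the 33 data sets yields one won/draw/lost triple per competing algorithm, as in the tables above.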
Experiments - Complexity

Complexity of tested algorithms:

        Training     Classification
FBC     O(n² · N)    O(n)
AODE    O(n² · N)    O(n²)
HGC     O(n⁴ · N)    O(n)
TAN     O(n² · N)    O(n)
NBT     O(n³ · N)    O(n)
C4.5    O(n² · N)    O(n)
SMO     O(n^2.3)     O(n)
Experiments - Conclusion

FBC demonstrates good performance in both classification and ranking.

FBC is among the most efficient algorithms in both training and classification time.

Overall, the performance of FBC is the best among the compared algorithms.