LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND
APPLICATIONS
by
YI WANG
A Thesis Submitted to The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
in Computer Science and Engineering
August 2009, Hong Kong
Copyright © by Yi Wang 2009
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis
to other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce
the thesis by photocopying or by other means, in total or in part, at the request of other
institutions or individuals for the purpose of scholarly research.
YI WANG
LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND
APPLICATIONS
by
YI WANG
This is to certify that I have examined the above Ph.D. thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by
the thesis examination committee have been made.
PROF. NEVIN LIANWEN ZHANG, THESIS SUPERVISOR
PROF. MOUNIR HAMDI, HEAD OF DEPARTMENT
Department of Computer Science and Engineering
18 August 2009
ACKNOWLEDGMENTS
First of all, I would like to express my gratitude to my supervisor, Prof. Nevin Lianwen
Zhang. He guided me through all the phases of this thesis. In the past six years, I have
learned a lot from him: not only useful research skills, but also a positive attitude toward
research and life. It is my good fortune to be his student.
I thank the members of my proposal and thesis examination committees: Prof. Ning
Cai, Prof. Lei Chen, Prof. Marek J. Druzdzel, Prof. Brian Mak, and Prof. Dit-Yan
Yeung. They took time to read the earlier versions of this manuscript and provided
valuable comments. Those comments helped significantly improve the quality of this thesis.
I am grateful to my colleagues at HKUST: Tao Chen, Haipeng Guo, and Kin Man
Poon. We collaborated closely in many projects. Discussions with them inspired me a lot.
My thanks also go to my friends: James Cheng, Sheng Gao, Bingsheng He, Dan Hong,
Xing Jin, Wujun Li, An Lu, Jialin Pan, Pingzhong Tang, Ya Tang, Gang Wang, Yajun
Wang, Qiuyan Xia, Mingxuan Yuan, Jun Zhang, Li Zhao, Wenchen Zheng, Weihua Zhou,
and the list goes on. I had so much fun with you guys, which made my PhD life much
more enjoyable.
Finally, I want to thank my beloved family. I am deeply indebted to my parents Jie’er
Wang and Xueqiang Peng. They gave me a good education, encouraged me to pursue my
dreams, and always supported my big decisions. I also owe a lot to my wife Yiping Ke. She
accompanied me through the long journey towards this thesis, cheered me up when I was
down, and gave me courage to move forward. Without her love and tolerance, I could
never have made it this far.
I dedicate this thesis to my family.
TABLE OF CONTENTS

Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1 Introduction
1.1 Approaches to Density Estimation
1.2 Latent Tree Models for Density Estimation
1.3 Learning Latent Tree Models
1.4 Contributions
1.4.1 New Algorithms for Learning LTMs
1.4.2 Two Applications
1.5 Organization

Chapter 2 Background
2.1 Notations
2.2 Bayesian Networks
2.3 Latent Tree Models
2.3.1 Learning Latent Tree Models for Density Estimation
2.3.2 Model Inclusion and Equivalence
2.3.3 Root Walking and Unrooted LTMs
2.3.4 Regular LTMs

Chapter 3 Algorithm 1: EAST
3.1 Search Operators and Search Procedure
3.1.1 Search Operators
3.1.2 Brute-Force Search
3.1.3 EAST Search
3.2 Efficient Model Evaluation
3.2.1 Parameter Sharing among Models
3.2.2 Restricted Likelihood
3.2.3 Local EM
3.2.4 Avoiding Local Maxima
3.2.5 Two-Stage Model Evaluation
3.2.6 The PickModel Subroutine
3.3 Operation Granularity
3.4 Summary

Chapter 4 Algorithm 2: Hierarchical Clustering Learning
4.1 Heuristic Construction of Model Structure
4.1.1 Basic Ideas
4.1.2 MI Between A Latent Variable and Another Variable
4.2 Cardinalities of Latent Variables
4.2.1 Larger C for Better Estimation
4.2.2 Maximum Value of C Under Complexity Constraint
4.3 Model Simplification
4.3.1 Model Regularization
4.3.2 Redundant Variable Absorption
4.4 Parameter Optimization
4.5 The HCL Algorithm
4.6 Summary

Chapter 5 Algorithm 3: Pyramid
5.1 Basic Ideas
5.1.1 Bottom-up Construction of Model Structure
5.1.2 Unidimensionality Test
5.1.3 Subset Growing Termination
5.1.4 Sibling Cluster Determination
5.2 The Pyramid Algorithm
5.3 Mutual Information
5.3.1 MI Between Manifest Variables
5.3.2 MI Between a Latent Variable and a Manifest Variable
5.3.3 MI Between Two Latent Variables
5.4 Simple Model Learning
5.4.1 Exhaustive Search
5.4.2 Restricted Expansion
5.4.3 When S Contains Latent Variables
5.5 Cardinality and Parameter Refinement
5.6 Summary

Chapter 6 Empirical Evaluation
6.1 Data Sets
6.1.1 Synthetic Data Sets
6.1.2 Real-World Data Sets
6.2 Measures of Model Quality
6.3 Impact of Algorithmic Parameters
6.3.1 Experimental Settings
6.3.2 EAST
6.3.3 HCL
6.3.4 Pyramid
6.4 Comparison of EAST, HCL and Pyramid
6.4.1 Model Quality
6.4.2 Computational Efficiency
6.4.3 Latent Structure Discovery
6.5 Summary

Chapter 7 Application 1: Approximate Probabilistic Inference
7.1 Probabilistic Inference in Bayesian Networks
7.2 Basic Idea
7.2.1 User Specified Bound on Inferential Complexity
7.3 Approximating Bayesian Networks with Latent Tree Models
7.3.1 Two Computational Difficulties
7.3.2 Optimization via Density Estimation
7.3.3 Impact of Imax
7.4 LTM-based Approximate Inference
7.5 Empirical Results
7.5.1 Experimental Settings
7.5.2 Impact of N and Imax
7.5.3 Comparison with CTP
7.5.4 Comparison with LBP
7.5.5 Comparison with CL-based Method
7.5.6 Comparison with LCM-based Method
7.6 Related Work
7.7 Summary

Chapter 8 Application 2: Classification
8.1 Background
8.2 Build Classifiers via Density Estimation
8.2.1 The Generative Approach to Classification
8.2.2 Generative Classifiers Based on Latent Tree Models
8.3 Latent Tree Classifier
8.4 A Learning Algorithm for Latent Tree Classifier
8.4.1 Parameter Smoothing
8.5 Empirical Evaluation
8.5.1 Data Sets
8.5.2 Experimental Settings
8.5.3 Effect of Parameter Smoothing
8.5.4 LTC-E versus LTC-P
8.5.5 Comparison with the Other Algorithms
8.5.6 Appreciating Learned Models
8.6 Related Work
8.7 Summary

Chapter 9 Conclusions and Future Work
9.1 Summary of Contributions
9.2 Future Work
9.2.1 Other Applications
9.2.2 Handling Continuous Data
9.2.3 Generalization to Partially Observed Trees

Bibliography
LIST OF FIGURES

1.1 Example latent tree model and latent class model. X's denote manifest variables, Y's denote latent variables.
2.1 The Asia network.
2.2 Rooted latent tree models, latent tree model obtained by root walking, and unrooted latent tree model. The X's are manifest variables and the Y's are latent variables.
3.1 The NI and NR operators. The model m2 is obtained from m1 by introducing a new latent node Y3 to mediate between Y1 and two of its neighbors X1 and X2. The cardinality of Y3 is set to be the same as that of Y1. The model m3 is obtained from m2 by relocating X3 from Y1 to Y3.
3.2 A candidate model obtained by modifying the model in Figure 2.2. The two models share the parameters for describing the distributions P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), P(X3|Y2), P(X4|Y1), P(Y3|Y1), and P(X5|Y3). On the other hand, the parameters for describing P(Y4|Y3), P(X6|Y4), and P(X7|Y4) are peculiar to the candidate model.
4.1 An illustrative example. The numbers within the parentheses denote the cardinalities of the variables.
4.2 Redundant variable absorption. (a) A part of a model that contains two adjacent and saturated latent nodes Y1 and Y2, with Y2 subsuming Y1. (b) Simplified model with Y1 absorbed by Y2.
5.1 An example subset growing process. The numbers within the parentheses denote the cardinalities of the latent variables.
5.2 The model structure after adding one latent variable Y1.
5.3 An example for evaluating simple models over latent variables. The colors of the nodes in Figure 5.3c indicate where the parameters of the nodes come from.
6.1 The structures of the generative models of the 3 synthetic data sets.
6.2 The quality of the models produced by the three learning algorithms under various settings.
6.3 The running time of the three learning algorithms under various settings.
6.4 The structures of the best models learned by EAST from the 3 synthetic data sets.
6.5 The structures of the best models learned by Pyramid from the 3 synthetic data sets.
6.6 The structures of the best models learned by HCL from the 3 synthetic data sets.
7.1 Running time of HCL under different settings. Settings for which EM did not converge are indicated by arrows.
7.2 Approximation accuracy of the LTM-based method under different settings.
7.3 Running time of the online phase of the LTM-based method under different settings.
7.4 Approximation accuracy of various inference methods.
7.5 Running time of various inference methods.
8.1 NB, TAN, and LTC. C is the class variable, X1, X2, X3, and X4 are four attributes, Y1 and Y2 are latent variables.
8.2 The training time of LTC-E and LTC-P.
8.3 The classification time of different classifiers.
8.4 The structures of the LTMs for Corral data.
8.5 The attribute distributions in each latent class and the corresponding concept.
LIST OF TABLES

1.1 A taxonomy of representative density estimation approaches.
2.1 The parameters of the Asia network. Abbreviations: V — VisitAsia; S — Smoking; T — Tuberculosis; C — Cancer; B — Bronchitis; TC — TbOrCa; X — XRay; D — Dyspnea.
4.1 The empirical MI between the manifest variables.
4.2 The estimated MI between each latent variable and other variables.
6.1 The 3 settings on the algorithmic parameters of EAST and Pyramid that have been tested.
7.1 The networks used in the experiments and their characteristics.
8.1 The 37 data sets used in the experiments.
8.2 The classification accuracy of LTC-E and LTC-P with/without parameter smoothing. Boldface numbers denote higher accuracy. Small circles indicate significant wins.
8.3 Comparison of classification accuracy between LTC-E and LTC-P.
8.4 The classification accuracy of the tested algorithms. The 3 entries indicated by small circles become the best after taking out C4.5.
8.5 The number of times that LTC significantly won, tied with, and lost to the other algorithms.
LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND
APPLICATIONS
by
YI WANG
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
ABSTRACT
Multivariate density estimation is a fundamental problem in Applied Statistics and
Machine Learning. Given a collection of data sampled from an unknown distribution, the
task is to approximately reconstruct the generative distribution. There are two different
approaches to the problem: the parametric approach and the non-parametric approach.
In the parametric approach, the approximate distribution is represented by a model from
a predetermined family.
In this thesis, we adopt the parametric approach and investigate the use of a model
family called latent tree models for the task of density estimation. Latent tree models
are tree-structured Bayesian networks in which leaf nodes represent observed variables,
while internal nodes represent hidden variables. Such models can represent complex
relationships among observed variables and, at the same time, admit efficient inference
among them. Consequently, they are a desirable tool for density estimation.
While latent tree models are studied for the first time in this thesis for the purpose of
density estimation, they have been investigated earlier for clustering and latent structure
discovery. Several algorithms for learning latent tree models have been proposed. The
state-of-the-art is an algorithm called EAST. EAST determines model structures through
principled and systematic search, and determines model parameters using the EM
algorithm. It has been shown to be capable of achieving a good trade-off between fit to data
and model complexity. It is also capable of discovering latent structures behind data.
Unfortunately, it has a high computational complexity, which limits its applicability to
density estimation problems.
In this thesis, we propose two latent tree model learning algorithms specifically for
density estimation. The two algorithms have distinct characteristics and are suitable for
different applications. The first algorithm is called HCL. HCL assumes a predetermined
bound on model complexity and restricts itself to binary tree structures. It first builds a
binary tree structure based on mutual information and then runs the EM algorithm once
on the resulting structure to determine the parameters. As such, it is efficient and can
deal with large applications. The second algorithm is called Pyramid. Pyramid does not
assume predetermined bounds on model complexity and does not restrict itself to binary tree
structures. It builds model structures using heuristics based on mutual information and
local search. It is slower than HCL. However, it is faster than EAST and is only slightly
inferior to EAST in terms of the quality of the resulting models.
In this thesis, we also study two applications of the density estimation techniques that
we develop. The first application is to approximate probabilistic inference in Bayesian
networks. A Bayesian network represents a joint distribution over a set of random
variables. It often happens that the network structure is very complex and making inference
directly on the network is computationally intractable. We propose to approximate the
joint distribution using a latent tree model and exploit the latent tree model for faster
inference. The idea is to sample data from the Bayesian network, learn a latent tree
model from the data offline, and when online, make inference with the latent tree model
instead of the original Bayesian network. HCL is used here because the sample size needs
to be large to produce an accurate approximation and because it is possible to predetermine
a bound on the online running time. Empirical evidence shows that this method can achieve good
approximation accuracy at low online computational cost.
The second application is classification. A common approach to this task is to for-
mulate it as a density estimation problem: One constructs the class-conditional density
for each class and then uses the Bayes rule for classification. We propose to estimate
those class-conditional densities using either EAST or Pyramid. Empirical evidence
shows that this method yields good classification performances. Moreover, the latent tree
models built for the class-conditional densities are often meaningful, which is conducive to
user confidence. A comparison between EAST and Pyramid reveals that Pyramid is
significantly more efficient than EAST, while achieving more or less the same classification
performance.
CHAPTER 1
INTRODUCTION
Multivariate density estimation is a fundamental problem in Applied Statistics and Ma-
chine Learning. Suppose there is a collection of data that was drawn from an unknown
probability distribution. The task is to construct an estimate of the generative
distribution from the data (Silverman, 1986). The estimate can help domain experts understand
the properties of the population. It can also be used to make predictions: to calculate the
likelihood of new data cases, to classify new data cases, and to compute the
posterior distributions of some variables after observing others.
1.1 Approaches to Density Estimation
Density estimation approaches can be categorized along two dimensions: (1) whether or
not they are based on some parametric models and (2) whether they deal with contin-
uous data or discrete data. Parametric approaches assume that the generative model is
from a given parametric family, and pick one model from the family to approximate the
generative model. For continuous data, commonly used model families include Gaussian
distributions (Duda & Hart, 1973), mixtures of Gaussians (Fraley & Raftery, 2002), and
factor models (Bartholomew & Knott, 1999). For discrete data, Markov random fields
(Kindermann & Snell, 1980) and Bayesian networks (Pearl, 1988) are often used. Non-
parametric approaches do not restrict the form of the generative distribution. Examples
of non-parametric methods include histograms (Scott, 1992), nearest neighbors
(Loftsgaarden & Quesenberry, 1965), and Parzen windows (Parzen, 1962). Most non-parametric
approaches deal with only continuous data. Table 1.1 shows a taxonomy of representative
density estimation approaches. In this thesis, we are concerned with density estimation
problems with discrete variables and we focus on parametric approaches.
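As a concrete instance of the non-parametric family mentioned above, here is a minimal one-dimensional Parzen-window estimate with a Gaussian kernel, f̂(x) = (1/(nh)) Σᵢ K((x − xᵢ)/h); the sample points and the bandwidth are arbitrary choices for illustration:

```python
# One-dimensional Parzen-window (kernel) density estimate with a
# Gaussian kernel.  Sample points and bandwidth are invented.
from math import exp, pi, sqrt

samples = [0.1, 0.3, 0.35, 0.8]
h = 0.2                                   # bandwidth

def gaussian_kernel(u):
    return exp(-0.5 * u * u) / sqrt(2 * pi)

def f_hat(x):
    """Average of kernels centered on the sample points."""
    return sum(gaussian_kernel((x - xi) / h)
               for xi in samples) / (len(samples) * h)

# The density is higher near the cluster of points around 0.3
# than far away from all samples.
print(f_hat(0.3) > f_hat(1.5))  # -> True
```

The estimate is driven entirely by the data and the bandwidth; no parametric family is assumed, which is exactly what distinguishes this row of Table 1.1 from the parametric one.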
In the case of discrete variables, a natural class of models for density estimation is
Bayesian networks (Pearl, 1988; Heckerman, 1995; Jordan, 1998). A Bayesian network
(BN) is an annotated directed acyclic graph. Each node in the graph represents a random
variable, and is attached with a conditional probability distribution of the node given its
parent nodes. The Bayesian network as a whole represents a joint distribution over all
the variables. In theory, Bayesian networks can represent any joint distributions exactly.
Therefore, one can often obtain, from observed data, Bayesian networks that approximate
                   Continuous Variable       Discrete Variable
  Parametric       Gaussian distribution     Markov random field
                   Mixture of Gaussians      Bayesian network
                   Factor model              Latent tree model
  Non-parametric   Histogram                 –
                   Nearest neighbor
                   Parzen window

Table 1.1: A taxonomy of representative density estimation approaches.
the generative distribution well (Heckerman, 1995). However, the resulting Bayesian
networks might be complex and hence computationally hard to deal with. As a matter
of fact, making inference in general Bayesian networks is NP-hard (Cooper, 1990; Dagum
& Luby, 1993).
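For a two-node network A → B, the factorization behind a Bayesian network reads P(A, B) = P(A) P(B|A). A minimal sketch, with CPT values invented for illustration:

```python
# A two-node Bayesian network A -> B with binary variables.
# The joint factorizes as P(A, B) = P(A) * P(B | A).
P_A = {0: 0.6, 1: 0.4}                      # prior P(A)
P_B_given_A = {0: {0: 0.9, 1: 0.1},         # P(B | A=0)
               1: {0: 0.3, 1: 0.7}}         # P(B | A=1)

def joint(a, b):
    """P(A=a, B=b) as the product of the local conditionals."""
    return P_A[a] * P_B_given_A[a][b]

# The factorized joint is a proper distribution: it sums to 1.
total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
print(round(total, 10))  # -> 1.0
```

With many densely connected nodes the same product form still holds, but summing it over variables during inference is what becomes NP-hard.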
Another class of models that one can use for density estimation of discrete data is tree-
structured Bayesian networks. This method was first proposed by Chow and Liu (1968)
and is commonly known as Chow-Liu trees. Probabilistic inference in such models only
takes time linear in the number of nodes (Pearl, 1988). However, Chow-Liu trees capture
only second-order dependencies among the variables and ignore higher-order dependencies.
As such, the method might not yield a good approximation to the generative model when
that model contains complex relationships among the variables.
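The Chow-Liu procedure itself is simple enough to sketch: estimate the pairwise mutual information from data, then keep the maximum-weight spanning tree over those weights. The toy data set below is invented, with variables 0 and 1 made perfectly correlated so that the edge (0, 1) must appear in the tree:

```python
# Sketch of the Chow-Liu procedure: empirical pairwise MI followed by
# a maximum-weight spanning tree (Kruskal's algorithm).
from collections import Counter
from itertools import combinations
from math import log

data = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
n_vars, n = len(data[0]), len(data)

def mi(i, j):
    """Empirical mutual information (in nats) between variables i, j."""
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / n) * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# Kruskal on edges sorted by decreasing MI = maximum spanning tree.
parent = list(range(n_vars))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

tree = []
for i, j in sorted(combinations(range(n_vars), 2),
                   key=lambda e: mi(*e), reverse=True):
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j))

print(tree)  # n_vars - 1 edges, including (0, 1)
```

The tree keeps only pairwise (second-order) interactions, which is precisely the limitation the paragraph above points out.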
1.2 Latent Tree Models for Density Estimation
In this thesis, we study the use of a new class of models, called latent tree models, for
density estimation of discrete data. Like Chow-Liu trees, latent tree models (LTMs) are
also tree-structured BNs. Unlike Chow-Liu trees, LTMs contain latent variables that
are not observed, in addition to manifest variables, which are observed. Specifically, the
internal nodes in an LTM represent latent variables and the leaf nodes represent manifest
variables. An example LTM is shown in Figure 1.1a.
LTMs were first systematically studied by Zhang (2004), who called them hierarchical
latent class models. They had been identified as a class of potentially useful
models much earlier by Pearl (1988). LTMs are attractive for two reasons. First, they are
computationally simple to work with because they are tree-structured. Second, if viewed as
models for the manifest variables, they can represent complex relationships among those
variables. As a matter of fact, no conditional independence relationships hold among the
manifest variables in an LTM. Using LTMs, one can approximate any distribution over
the manifest variables arbitrarily well.

Figure 1.1: Example latent tree model (a) and latent class model (b). X's denote manifest variables, Y's denote latent variables.
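The mechanism can be seen in a fragment as small as one latent variable Y with two manifest children X1 and X2, where the distribution over the manifest variables is obtained by summing Y out of the tree factorization: P(X1, X2) = Σ_y P(Y=y) P(X1|y) P(X2|y). All numbers below are invented for illustration:

```python
# A minimal latent tree fragment: latent Y with manifest children
# X1 and X2; the manifest distribution sums Y out.
P_Y = [0.5, 0.5]
P_X1_given_Y = [[0.9, 0.1], [0.2, 0.8]]   # rows indexed by y
P_X2_given_Y = [[0.8, 0.2], [0.1, 0.9]]

def p_manifest(x1, x2):
    """P(X1=x1, X2=x2) = sum_y P(y) P(x1|y) P(x2|y)."""
    return sum(P_Y[y] * P_X1_given_Y[y][x1] * P_X2_given_Y[y][x2]
               for y in range(len(P_Y)))

print(round(p_manifest(0, 0), 3))  # -> 0.37

# Although X1 and X2 are not directly connected, summing out Y
# induces a dependence: the joint differs from the product of
# the marginals.
marg_x1 = p_manifest(0, 0) + p_manifest(0, 1)
marg_x2 = p_manifest(0, 0) + p_manifest(1, 0)
print(p_manifest(0, 0) != round(marg_x1 * marg_x2, 6))  # -> True
```

This is the sense in which no conditional independence relationships are imposed among the manifest variables: the latent variable couples them all.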
Learning LTMs from data is a challenging task. One needs to determine the number
of latent variables, the cardinality (i.e., number of states) of each latent variable, the tree
topology that connects the latent variables and manifest variables, and all the conditional
probability distributions. Those factors jointly lead to a huge model space. This is why
the potential of LTMs was not explored until recently.
Researchers have previously studied a subclass of LTMs, namely LTMs with a single
latent variable. Such models are called latent class models (LCMs). Lazarsfeld and Henry
(1968) use them for cluster analysis. Lowd and Domingos (2005) use them for density
estimation and call them naïve Bayes models with latent variables. An example LCM is
shown in Figure 1.1b. LCMs constitute a much smaller model space than LTMs. This
makes LCMs easier to learn than LTMs. On the other hand, LCMs might not be
able to approximate a generative model as well as LTMs can.
1.3 Learning Latent Tree Models
Only a few algorithms for constructing LTMs have been developed so far. The first algo-
rithm was proposed by Pearl (1988). The objective is to construct an LTM to represent a
generative distribution assuming that it can be represented exactly using an LTM. Sarkar
(1995) extended the work to the case where the generative distribution cannot be repre-
sented exactly by an LTM. In both cases, one is given the generative distribution instead
of data and all the manifest variables are binary.
The first algorithm that learns LTMs from data was proposed by Zhang (2004). The
algorithm is called double hill climbing (DHC). DHC is a search-based algorithm that
uses the BIC score for model selection. It uses two nested hill-climbing subroutines to
search for high-scoring LTMs. At the first level, DHC hill-climbs in the space of latent
tree structures. For each candidate model structure, DHC invokes a second hill-climbing
subroutine to optimize the cardinalities of latent variables. This strategy leads to a large
number of candidate models. To calculate the BIC score of each candidate model, DHC
runs EM to optimize the model parameters. EM is known to be time-consuming. As a
result, DHC is computationally expensive. It can only handle data sets with half a dozen
to a dozen manifest variables.
Zhang and Kocka (2004) proposed a more efficient algorithm called heuristic single
hill climbing (HSHC). HSHC improves upon DHC in two ways. First, it uses a single
hill-climbing routine that searches directly in the space of LTMs. At each step, modifications
to both structures and cardinalities of latent variables are considered at the same time.
Second, HSHC uses easy-to-compute heuristics to evaluate candidate models instead of
calculating the BIC scores exactly. HSHC was the first LTM learning algorithm that could
analyze non-trivial data sets.
The current state-of-the-art for learning LTMs is the EAST algorithm (Chen, 2008).
EAST differs from HSHC in two ways. First, EAST adopts the grow-restructure-thin
search strategy and divides the search process into three stages. At each stage, EAST
searches with only a subset of operators rather than all of them. This reduces the number
of candidate models at each search step. Second, instead of the heuristics of HSHC,
EAST adopts a more principled approach for efficient evaluation of candidate models.
The approach is based on an approximation to the maximized likelihood, called restricted
maximized likelihood. We will discuss EAST in detail in Chapter 3.
1.4 Contributions
EAST is a search-based algorithm that aims at finding the LTM with the highest BIC
score. It was designed for latent structure discovery and clustering (Chen, 2008). It can
also be used for density estimation. In this context, the AIC score should be used for
model selection instead of the BIC score (see Section 2.3.1). We empirically evaluate
EAST and find that it can yield good solutions to density estimation problems when
coupled with the AIC score. However, it has a serious drawback: its computational
complexity is high. It typically takes days to process data sets with dozens of manifest
variables and thousands of samples. As such, it is not suitable for large density estimation
problems.
Our contribution in this thesis is two-fold. First, we develop two new algorithms for
learning LTMs that are significantly more efficient than EAST. Second, we apply the
density estimation techniques that we develop to two problems, approximate inference in
complex Bayesian networks and model-based classification.
1.4.1 New Algorithms for Learning LTMs
Consider a given LTM. Suppose we remove from it all the conditional distributions and
all the information about the cardinalities of the latent variables. What remains is called
an LTM structure. In an LTM structure, we know the number of latent variables and
how they are connected with the manifest variables, but we do not know how many states
each latent variable takes.
Suppose there is a given LTM structure. Theoretically, one can represent any
distribution over the manifest variables exactly by setting the cardinalities of the latent
variables large enough (see Proposition 4.1). If the cardinalities are not large enough, what we
can get is an approximation of the distribution. If the structure is ‘good’, we can achieve
good approximation with low cardinalities. If the structure is ‘not good’, we need the
cardinalities to be large in order to achieve good approximation.
Finding a ‘good’ LTM structure takes time. This is why EAST is so inefficient. In
this thesis, we develop two new algorithms for learning LTMs. The idea is to spend less
time on finding LTM structure and compensate for sub-optimality in model structure by
increasing the cardinalities of latent variables. In this way, good approximation to the
generative model can still be achieved. The two new algorithms differ in how much effort
they spend on optimizing model structure. Hence they have different characteristics and
are suitable for different applications.
Our first new algorithm for learning LTMs is called hierarchical clustering learning
(HCL). Among all the algorithms, it spends the least effort on determining model structure.
It builds model structures based on the heuristic that variables exhibiting strong
correlation in data should share a common latent parent. More specifically, it constructs
a binary latent tree structure through hierarchical clustering of manifest variables. At
each step, two closely correlated sets of manifest variables are grouped, and a new latent
variable is introduced to account for the relationship between them. The cardinalities of
the latent variables are determined according to a predetermined threshold on the com-
plexity of the resulting model. Finally, model parameters are optimized using the EM
algorithm. HCL is drastically more efficient than EAST. It is suitable for applications
where there are a large number of manifest variables, a large number of data samples,¹
and a pre-specified constraint on the complexity of the resulting model. It can yield
a good approximation to the generative distribution under the complexity constraint.
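The structural part of HCL can be pictured with a small sketch: pairwise correlations drive a bottom-up merge, and every merge places a new latent parent over the merged clusters. Here the mutual-information table is given directly with invented values; the real algorithm estimates it from data and also chooses cardinalities and parameters, which this sketch omits:

```python
# Structural sketch of HCL's bottom-up construction: repeatedly merge
# the two most strongly correlated clusters of manifest variables and
# introduce a latent parent for them.  MI values are invented.
from itertools import combinations

mi = {frozenset(p): v for p, v in {
    ('X1', 'X2'): 0.9, ('X1', 'X3'): 0.2, ('X1', 'X4'): 0.1,
    ('X2', 'X3'): 0.3, ('X2', 'X4'): 0.1, ('X3', 'X4'): 0.8,
}.items()}

def cluster_sim(c1, c2):
    """Similarity of two clusters: max MI over cross pairs."""
    return max(mi[frozenset((a, b))] for a in c1 for b in c2)

clusters = [('X1',), ('X2',), ('X3',), ('X4',)]
structure = []                      # (latent, left child, right child)
while len(clusters) > 1:
    c1, c2 = max(combinations(clusters, 2),
                 key=lambda p: cluster_sim(*p))
    latent = f'Y{len(structure) + 1}'
    structure.append((latent, c1, c2))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]

for latent, left, right in structure:
    print(latent, '->', left, right)
```

Because every merge joins exactly two clusters, the result is always a binary latent tree, which is the structural restriction HCL accepts in exchange for speed.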
¹ Because the model structure is not very 'good', one might need to set the cardinalities of latent variables high. This implies a large number of model parameters, which in turn requires a large sample size.

The second new algorithm for learning LTMs is called Pyramid. It spends more effort
on determining model structure than HCL and less than EAST. Like HCL,
it builds model structures in a bottom-up fashion. At each step, it selects a subset of
strongly correlated variables and introduces a latent parent for them. To make the
selection, it starts with the subset consisting of the two most closely correlated variables. It
then adds other closely correlated variables to the subset one by one, until the so-called
unidimensionality test fails. Several variables in the final subset are selected according to
the outcomes of the test, and a new common latent parent is introduced for them. The
cardinality of the new latent variable is determined by considering the AIC score of a local
model that contains the variable. Pyramid is significantly more efficient than EAST, but
significantly less efficient than HCL. It can find 'good' model structures (almost as good
as those found by EAST) and hence can achieve a good approximation of the generative
model without using high cardinalities for the latent variables.
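A skeletal view of the subset-growing step: start from the most closely correlated pair and greedily add the variable with the strongest link to the subset. The real algorithm stops when the unidimensionality test fails; in this sketch a plain MI threshold stands in for that test, and all MI values are invented:

```python
# Skeleton of Pyramid's subset growing.  A fixed MI threshold is a
# hypothetical stand-in for the unidimensionality test described in
# Chapter 5; MI values are invented.
from itertools import combinations

mi = {frozenset(p): v for p, v in {
    ('X1', 'X2'): 0.9, ('X1', 'X3'): 0.6, ('X1', 'X4'): 0.05,
    ('X2', 'X3'): 0.5, ('X2', 'X4'): 0.05, ('X3', 'X4'): 0.05,
}.items()}
variables = ['X1', 'X2', 'X3', 'X4']
THRESHOLD = 0.1   # stand-in stopping rule, not the real test

def link(subset, v):
    """Strength of v's strongest link into the current subset."""
    return max(mi[frozenset((u, v))] for u in subset)

seed = max(combinations(variables, 2), key=lambda p: mi[frozenset(p)])
subset = list(seed)
while True:
    rest = [v for v in variables if v not in subset]
    if not rest:
        break
    best = max(rest, key=lambda v: link(subset, v))
    if link(subset, best) < THRESHOLD:
        break        # growing terminates; introduce a latent parent
    subset.append(best)

print(sorted(subset))  # -> ['X1', 'X2', 'X3']
```

Unlike HCL's pairwise merges, the grown subset can contain more than two variables, so the latent parent is not restricted to two children and the resulting tree need not be binary.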
1.4.2 Two Applications
In this thesis, we also apply the density estimation techniques that we develop to two im-
portant problems in Artificial Intelligence and Machine Learning and propose interesting
new solutions for those problems.
The first problem is probabilistic inference in Bayesian networks (BNs). When the
network structure of a BN is complex, exact inference is infeasible. We propose a novel
approximate inference method for such cases. Suppose there is a BN over a set X of
variables. It represents a joint distribution P(X) over the variables X. Our idea is to
sample data from the BN, and learn from the data an LTM with X as manifest variables.
The learning is done offline. The resultant LTM is also viewed as a model for X and
represents another distribution P′(X). When online, we make inference using the LTM
instead of the original BN. Obviously, inference in the LTM can be much more efficient
than in the original BN. Meanwhile, because LTMs can represent complex relationships
among manifest variables, P′(X) can be a good approximation of P(X). Hence, inference
results obtained in the LTM can be accurate approximations of those in the original BN.
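The offline phase starts by drawing samples from the BN. For a tiny chain A → B this is plain ancestral (forward) sampling: sample roots first, then each child given its sampled parents. The CPT values are invented:

```python
# Ancestral (forward) sampling from a two-node network A -> B.
import random

random.seed(0)
P_A = [0.6, 0.4]
P_B_given_A = [[0.9, 0.1], [0.2, 0.8]]

def draw(dist):
    """Sample an index from a discrete distribution."""
    r, acc = random.random(), 0.0
    for value, p in enumerate(dist):
        acc += p
        if r < acc:
            return value
    return len(dist) - 1

def sample():
    a = draw(P_A)                 # roots first,
    b = draw(P_B_given_A[a])      # then children given their parents
    return a, b

data = [sample() for _ in range(5000)]
# the empirical frequency of A = 0 should be close to 0.6
print(abs(sum(1 for a, _ in data if a == 0) / len(data) - 0.6) < 0.05)
```

The resulting data set plays the role of the training data from which HCL learns the LTM; the larger it is, the closer P′(X) tracks P(X).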
Among the aforementioned three LTM learning algorithms, HCL was designed
specifically for this application, which has three characteristics. First, the
number of manifest variables, i.e., the number of variables in X, can be large. Second,
the sample size needs to be large for accurate approximation. In practice, one wants to
set the sample size as large as possible. Third, it is desirable to impose a bound on the
complexity of the resulting LTM. In many real-world applications, such as embedded
systems, one would want to guarantee that inference can be done within a certain time limit.
HCL is ideal for such situations.
Empirical results on an array of example BNs show that our approximate inference method
can achieve high approximation accuracy with low online computational time. It is often
orders of magnitude faster than exact inference. It also consistently outperforms loopy
belief propagation (Pearl, 1988), a previous approximate inference algorithm that is widely
used in many domains (Frey & MacKay, 1997; Murphy et al., 1999).
The second application we consider is classification. Let C and X be the class variable
and the set of attributes respectively. From the probabilistic perspective, the task is to
determine P (C|X). The Bayes rule states that
P(C|X) = P(C) P(X|C) / P(X).
So, one approach to classification is to first estimate the class-conditional distribution
P (X|C = c) for each class C = c and then use the Bayes rule for classification.
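As a sketch of this approach with made-up numbers (in the thesis, the class-conditional tables P(X|C = c) would come from an LTM learned per class; here they are small hand-written dictionaries):

```python
# Hypothetical Bayes-rule classifier over two binary attributes.
P_C = {"pos": 0.4, "neg": 0.6}                      # class prior P(C)
P_X_given_C = {                                      # P(X | C = c), made up
    "pos": {("y", "y"): 0.5, ("y", "n"): 0.2, ("n", "y"): 0.2, ("n", "n"): 0.1},
    "neg": {("y", "y"): 0.1, ("y", "n"): 0.2, ("n", "y"): 0.2, ("n", "n"): 0.5},
}

def posterior(x):
    """P(C | X = x) via the Bayes rule; P(X) is the normalizing constant."""
    joint = {c: P_C[c] * P_X_given_C[c][x] for c in P_C}
    p_x = sum(joint.values())
    return {c: j / p_x for c, j in joint.items()}

def classify(x):
    """Pick the class with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)
```

Note that the denominator P(X) is the same for every class, so classification only requires comparing the products P(C = c) P(X|C = c).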
We propose to estimate the class-conditional distributions P (X|C = c) using LTMs.
This idea is interesting for three reasons. First, because LTMs can represent complex
relationships among manifest variables, the estimation can be accurate. This implies that
the classification accuracy can be high. Second, LTMs have low computational complexity.
This means that the online classification time would be short. Third, the LTMs might
reveal interesting latent structures behind the data. This is conducive to user confidence
in the classifiers.
In this application, quality of model structures is an important consideration. Good
model structures not only help to boost user confidence, but also allow us to find an
appropriate balance between fit to data and model complexity. In other words, they help
us to avoid overfitting. As such, the HCL algorithm is not suitable here. We consider
only the use of EAST and Pyramid.
Empirical results on a large number of data sets show that our method compares
favorably, in terms of classification accuracy, with related alternative methods such as
the naïve Bayes classifier, tree-augmented naïve Bayes (Friedman et al., 1997), averaged
one-dependence estimators (Webb et al., 2005), and decision tree (Quinlan, 1993). It
is fast in terms of online classification time. And it is unique in that it can discover
interesting latent structures. The empirical results also suggest that Pyramid achieves a
better tradeoff between classification accuracy and training time than EAST. When one
moves from EAST to Pyramid, the classification accuracy drops only slightly, while the
training time decreases drastically.
1.5 Organization
The rest of this thesis is structured as follows. In the next chapter, we review basic
concepts and facts about BNs and LTMs. In Chapters 3–5, we present the EAST, HCL,
and Pyramid algorithms for learning LTMs. In Chapter 6, we evaluate them empirically
on synthetic density estimation problems. In Chapter 7, we describe the use of LTMs
for approximate inference in BNs. In Chapter 8, we describe the application of LTMs to
classification. Finally, we conclude this thesis in Chapter 9 and discuss possible future
directions.
CHAPTER 2
BACKGROUND
In this chapter, we review basic concepts and facts about Bayesian networks and latent
tree models.
2.1 Notations
In this thesis, we deal only with categorical variables, i.e., variables that take a finite number
of values. We use capital letters X, Y , Z to denote random variables, and use lower-case
letters x, y, z to denote specific values that X, Y , Z can take. We use bold-face letters
X, Y, Z to represent sets of random variables, and x, y, z to represent their values.
2.2 Bayesian Networks
Bayesian networks (BNs) are a class of probabilistic models that use graphs to model con-
ditional independence among random variables. They provide a compact representation
for joint probability distributions by exploiting conditional independence relationships.
Formally, a BN N over a set of random variables X consists of two components: a
directed acyclic graph (DAG) G and a collection of conditional probability tables (CPTs)
θG. We will refer to the first component as the structure of the BN, and the second
component as the parameters of the BN.
The structure G encodes the conditional independence relationships among the random
variables X. Each node in G represents a random variable in X. In this thesis, we use
the terms ‘node’ and ‘variable’ interchangeably. Each edge in G represents a direct
dependency between two nodes, while the lack of an edge between two nodes implies that
they are conditionally independent given some other variables. In particular, a node is
conditionally independent of all its non-descendant nodes given its parent nodes.
The parameters θG quantify the strength of the dependencies along the edges in G.
They consist of a CPT P(X|pa(X)) for each node X given its parent nodes pa(X) in G. If
X is a root node, the set pa(X) is empty, and the conditional distribution P(X|pa(X))
reduces to the marginal distribution P(X). Semantically, the BN N represents a joint
Figure 2.1: The Asia network.
(a) Rooted LTM (b) LTM after root walking (c) Unrooted LTM
Figure 2.2: A rooted latent tree model (a), the latent tree model obtained by root walking (b), and the unrooted latent tree model (c). The X's are manifest variables and the Y's are latent variables.
probability distribution PN(X) over X that decomposes as follows:

PN(X) = ∏_{X∈X} P(X | pa(X), θG).
Figure 2.1 shows an example BN called Asia. It models the relationships among the
profile of a patient (whether he visited Asia and whether he smokes), the possibility that
the patient has the diseases of tuberculosis, cancer, and bronchitis, as well as the symptoms
(positive outcome of X-Ray and dyspnea) that the patient may have. The parameters of
this network are given in Table 2.1. Note that each row of the tables represents a conditional
distribution and sums up to one.
2.3 Latent Tree Models
A latent tree model (LTM) is a special Bayesian network. Its structure is a rooted tree.
The leaf nodes represent manifest variables that are observed, while the internal nodes
represent latent variables that are hidden. Figure 2.2a shows an example LTM. In this
model, X1–X7 represent manifest variables, while Y1–Y3 represent latent variables.
(a) P(V)
        V = y    V = n
        0.01     0.99

(b) P(S)
        S = y    S = n
        0.5      0.5

(c) P(T|V)
                 T = y    T = n
        V = y    0.05     0.95
        V = n    0.01     0.99

(d) P(C|S)
                 C = y    C = n
        S = y    0.1      0.9
        S = n    0.01     0.99

(e) P(B|S)
                 B = y    B = n
        S = y    0.6      0.4
        S = n    0.3      0.7

(f) P(X|TC)
                  X = y    X = n
        TC = y    0.98     0.02
        TC = n    0.05     0.95

(g) P(TC|T, C)
                          TC = y    TC = n
        T = y, C = y      1         0
        T = y, C = n      1         0
        T = n, C = y      1         0
        T = n, C = n      0         1

(h) P(D|TC, B)
                          D = y    D = n
        TC = y, B = y     0.9      0.1
        TC = y, B = n     0.7      0.3
        TC = n, B = y     0.8      0.2
        TC = n, B = n     0.1      0.9

Table 2.1: The parameters of the Asia network. Abbreviations: V — VisitAsia; S — Smoking; T — Tuberculosis; C — Cancer; B — Bronchitis; TC — TbOrCa; X — XRay; D — Dyspnea.
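The factorization can be checked numerically against Table 2.1: the probability of any complete configuration is the product of one entry per CPT, and the joint sums to one over all 2^8 configurations. A sketch (the variable names are ours; the numbers are exactly those of Table 2.1):

```python
from itertools import product

# CPTs of the Asia network, transcribed from Table 2.1.
Y, N = "y", "n"
P_V = {Y: 0.01, N: 0.99}
P_S = {Y: 0.5, N: 0.5}
P_T = {Y: {Y: 0.05, N: 0.95}, N: {Y: 0.01, N: 0.99}}          # P(T | V)
P_C = {Y: {Y: 0.1, N: 0.9}, N: {Y: 0.01, N: 0.99}}            # P(C | S)
P_B = {Y: {Y: 0.6, N: 0.4}, N: {Y: 0.3, N: 0.7}}              # P(B | S)
P_X = {Y: {Y: 0.98, N: 0.02}, N: {Y: 0.05, N: 0.95}}          # P(X | TC)
P_TC = {(Y, Y): {Y: 1.0, N: 0.0}, (Y, N): {Y: 1.0, N: 0.0},   # P(TC | T, C)
        (N, Y): {Y: 1.0, N: 0.0}, (N, N): {Y: 0.0, N: 1.0}}
P_D = {(Y, Y): {Y: 0.9, N: 0.1}, (Y, N): {Y: 0.7, N: 0.3},    # P(D | TC, B)
       (N, Y): {Y: 0.8, N: 0.2}, (N, N): {Y: 0.1, N: 0.9}}

def joint(v, s, t, c, b, tc, x, d):
    """P_N(X): product of one CPT entry per node of the Asia network."""
    return (P_V[v] * P_S[s] * P_T[v][t] * P_C[s][c] * P_B[s][b]
            * P_TC[(t, c)][tc] * P_X[tc][x] * P_D[(tc, b)][d])

# The joint distribution sums to one over all 2^8 configurations.
total = sum(joint(*cfg) for cfg in product((Y, N), repeat=8))
```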
Following the notation for Bayesian networks, we write an LTM as a pair M = (m, θm).
The second component θm is the same as in a BN. It is the collection of parameters of the
LTM and consists of a conditional distribution P (Z|pa(Z)) for each node Z.1 The first
component m denotes the rest of the LTM. It consists of the variables, the cardinalities
of the variables, and the topology of the rooted tree. We sometimes refer to m also as an
LTM.
2.3.1 Learning Latent Tree Models for Density Estimation
Let P (X) be an unknown distribution over a set X of variables. Suppose D is a collection
of data drawn from P (X). There are infinitely many possible LTMs with X as manifest
nodes. Each LTM M represents a distribution PM(X) over X. In this thesis, we are
concerned with the problem of learning LTMs for density estimation, i.e., to find the
1In an LTM, each node has at most one parent node. We thus write the parent of a node Z as the single variable pa(Z) rather than as a set.
LTM that is as close to the generative distribution P (X) as possible.
A commonly used quantity for measuring discrepancy between two distributions is
KL divergence (Cover & Thomas, 1991). The KL divergence of an LTM m from the
generative distribution P (X) is defined as follows:
D(P‖Pm) = Σ_X P(X) log [ P(X) / Pm(X) ].   (2.1)

The smaller the KL divergence, the closer Pm(X) is to P(X), and the better the LTM m.
Conceptually, our goal is to find the LTM m⋆ that minimizes the KL divergence.
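For discrete distributions stored as dictionaries, Equation (2.1) translates directly into code (a sketch with made-up example distributions, using the natural logarithm):

```python
from math import log

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log [P(x)/Q(x)] for discrete distributions
    given as {outcome: probability} dictionaries. Terms with P(x) = 0
    contribute nothing, by the usual convention 0 log 0 = 0."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
# D(P || P) = 0; the divergence is nonnegative and asymmetric in general.
```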
How do we find m⋆? One way is to view D(P‖Pm) as a scoring function on the model m
and hill-climb in the space of LTMs to minimize it. The problem with this method
is that the score D(P‖Pm) cannot be computed because the generative distribution P (X)
is unknown. Fortunately, we do have a data set D that was sampled from P (X). We can
obtain an approximation of D(P‖Pm) based on D. As a matter of fact, D(P‖Pm) can
be approximated using the AIC score (Akaike, 1974) when the sample size is large. The
AIC score of an LTM m is defined as follows:
AIC(m|D) = −2[maxθm
log P (D|m, θm)− d(m)], (2.2)
where d(m) is dimension of model m, i.e., the number of independent parameters of the
model. In machine learning community, however, researchers usually use the negation of
Equation 2.2, i.e.,
AIC(m|D) = maxθm
log P (D|m, θm)− d(m).
The first term is known as the maximized log-likelihood of m. It measures how well model
m fits the data D. The second term is a penalty term for model complexity. The density
estimation problem is thus transformed into the problem of finding the LTM with the
highest AIC score.
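In code, the machine-learning form of the score is a one-liner; the numbers below are hypothetical and only illustrate how the penalty term can reverse a ranking by fit alone:

```python
def aic(max_loglik, dim):
    """AIC in the machine-learning convention used in this thesis:
    maximized log-likelihood minus the number of independent parameters.
    Higher is better."""
    return max_loglik - dim

# Hypothetical numbers: the richer model fits slightly better but pays a
# larger complexity penalty, so the simpler model wins on AIC.
simple = aic(-1000.0, 10)   # fits worse, fewer parameters
rich = aic(-998.0, 20)      # fits better, more parameters
```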
There exist other approximations to the KL divergence. For example, leave-one-out
cross-validation (LOOCV) is asymptotically equivalent to the AIC score (Shao, 1997).
However, using LOOCV is computationally much more demanding than using the AIC
score. Calculating LOOCV for an LTM requires optimizing the model parameters |D| times,
where |D| denotes the sample size of D. In contrast, calculating the AIC score requires
only one pass of parameter optimization. Researchers have also proposed variants of the AIC
score, e.g., the AICc score with a correction for handling small sample sizes. However,
the original AIC score is still the most widely used approximation.
Previous learning algorithms for LTMs all use the BIC score (Schwarz, 1978) for model
selection (Zhang, 2004; Zhang & Kocka, 2004; Chen, 2008). In contrast to AIC, BIC is
based on a different philosophy. The underlying assumption is that the model space
contains the generative model. Given training data, the objective is thus to find the
most probable model in the space. The probability of a model is known as its marginal
likelihood. The BIC score is a large-sample approximation to the marginal likelihood.
Clearly, the assumption does not hold in our setting. Therefore, BIC is not suitable
for the density estimation problem.
2.3.2 Model Inclusion and Equivalence
Consider two LTMs m and m′ that share the same set of manifest variables X. We say
that m includes m′ if for any parameter value θm′ of m′, there exists a parameter value
θm of m such that
P (X|m, θm) = P (X|m′, θm′).
When this is the case, m can represent any distributions over the manifest variables that
m′ can. As such, the maximized log-likelihood of m is larger than or equal to that of m′:

max_θm log P(D|m, θm) ≥ max_θm′ log P(D|m′, θm′).
If m includes m′ and vice versa, we say that m and m′ are marginally equivalent.
Marginally equivalent models are equivalent if they have the same number of independent
parameters. It is impossible to distinguish between equivalent models based on data if
the AIC score, or any other penalized likelihood score (Green, 1999), is used for model
selection.
2.3.3 Root Walking and Unrooted LTMs
Let Y1 be the root of a latent tree model m. Suppose Y2 is a child of Y1 and it is also
a latent node. Define another latent tree model m′ by reversing the arrow Y1 → Y2.
Variable Y2 becomes the root in the new model. The operation is called root walking: The
root has walked from Y1 to Y2. The model m′ in Figure 2.2b is the model obtained by
walking the root from Y1 to Y2 in model m.
It has been shown that root walking leads to equivalent models (Zhang, 2004). There-
fore, the root and edge orientations of an LTM cannot be determined from data. We can
only learn unrooted LTMs, which are LTMs with all directions on the edges dropped. An
example of an unrooted LTM is given in Figure 2.2c.
An unrooted LTM represents an equivalence class of LTMs. Members of the class are
obtained by rooting the model at various nodes. Semantically it is a Markov random field
over an undirected tree. The leaf nodes are observed while the interior nodes are latent.
Model inclusion and equivalence can be defined for unrooted LTMs in the same way as
for rooted models. In the rest of this thesis, LTMs always mean unrooted LTMs unless it
is explicitly stated otherwise.
2.3.4 Regular LTMs
For a latent variable Y in an LTM, enumerate its neighbors as Z1, Z2, . . . , Zk. An LTM
is regular if, for any latent variable Y,

|Y| ≤ (∏_{i=1}^k |Zi|) / (max_{i=1}^k |Zi|),   (2.3)

and, when Y has only two neighbors, strict inequality holds and one of the neighbors is
a latent node.
For any irregular model m, there always exists a regular model m′ that is marginally
equivalent to m and has fewer independent parameters (Zhang, 2004). The model m′ can
be obtained from m through the following regularization process:
1. For each latent variable Y in m,

   (a) If it violates inequality (2.3), reduce the cardinality of Y to
       (∏_{i=1}^k |Zi|) / (max_{i=1}^k |Zi|).

   (b) If it has only two neighbors, one of which is a latent node, and it violates the
       strict version of inequality (2.3), remove Y from m and connect the two neighbors
       of Y.
2. Repeat Step 1 until no further changes.
The regular model m′ has a higher AIC score than m itself. Therefore, we can restrict
our attention to the space of regular models when searching for the LTM with the highest
AIC score. For a given set of manifest variables, there are only finitely many regular
LTMs (Zhang, 2004).
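Step 1(a) of the regularization process can be sketched as a small helper (a sketch only; the helper names are ours, and step 1(b), node removal, is omitted):

```python
from math import prod

def regularity_bound(neighbor_cards):
    """Upper bound on |Y| from inequality (2.3): the product of the neighbor
    cardinalities divided by the largest one. The quotient is always an
    integer, since it equals the product of all cardinalities but the largest."""
    return prod(neighbor_cards) // max(neighbor_cards)

def regularize_cardinality(card_y, neighbor_cards):
    """Step 1(a): reduce |Y| to the bound when inequality (2.3) is violated."""
    return min(card_y, regularity_bound(neighbor_cards))
```

For example, a latent variable with neighbors of cardinalities 2, 2, and 3 is bounded by (2 · 2 · 3)/3 = 4, so a cardinality of 5 would be reduced to 4.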
CHAPTER 3
ALGORITHM 1: EAST
In this chapter, we present a search-based algorithm, called EAST, for learning LTMs.
We start with the operators and the search procedure (Section 3.1). Then, we discuss two
issues that are critical to the performance of EAST, namely efficient model evaluation
(Section 3.2) and operation granularity (Section 3.3).
3.1 Search Operators and Search Procedure
EAST hill-climbs in the space of regular LTMs under the guidance of the AIC score. It
uses 5 search operators borrowed from (Zhang & Kocka, 2004) with minor adaptations.
And it adopts a search strategy known as grow-restructure-thin, which originated from
the literature on learning Bayesian networks without latent variables (e.g., Chickering,
2002).
3.1.1 Search Operators
The search operators are: state introduction (SI), node introduction (NI), node relocation
(NR), state deletion (SD), and node deletion (ND). We describe them one by one in the
following.
Given an LTM and a latent variable in the model, the state introduction (SI) operator
creates a new model by adding a state to the domain of the variable. The state deletion
(SD) operator does the opposite. Applying SI on a model m results in another model
that includes m. Applying SD on a model m results in another model that is included by
m.
Node introduction (NI) involves one latent node Y and two of its neighbors. It creates
a new model by introducing a new latent node Z to mediate between Y and the two
neighbors. The cardinality of Z is set to be the same as that of Y . In the model m1 of
Figure 3.1, introducing a new latent node Y3 to mediate between Y1 and its neighbors X1 and X2
results in m2. Applying NI on a model m results in another model that includes m. For the
sake of computational efficiency, we do not consider introducing a new node to mediate
Y and more than two of its neighbors. This restriction will be compensated for in search
control.
(a) m1 (b) m2 (c) m3
Figure 3.1: The NI and NR operators. The model m2 is obtained from m1 by introducing a new latent node Y3 to mediate between Y1 and two of its neighbors, X1 and X2. The cardinality of Y3 is set to be the same as that of Y1. The model m3 is obtained from m2 by relocating X3 from Y1 to Y3.
Node deletion (ND) is the opposite of NI. It involves two neighboring latent nodes Y
and Z. It creates a new model by deleting Z and making all neighbors of Z other than
Y neighbors of Y . We refer to Y as the anchor variable of the deletion and say that Z
is deleted with respect to Y . In the model m2 of Figure 3.1, deleting Y3 with respect to
Y1 leads us back to the model m1. Applying ND on a model m results in another model
that is included by m if the deleted node has at least as many states as the anchor node.
Node relocation (NR) involves a node W , one of its latent neighbors Y , and another
latent node Z. It creates a new model by relocating W to Z, i.e., removing the link
between W and Y and adding a link between W and Z. In m2 of Figure 3.1, relocating
X3 from Y1 to Y3 results in m3. Unlike in (Zhang & Kocka, 2004), we now do not require
the two latent nodes Y and Z to be neighbors.
There are some boundary conditions on the search operators. The SD operator cannot
be applied to latent variables with only two possible states. The NI and NR operators
cannot be applied if they make some latent nodes leaves. To ensure regularity, a regular-
ization step is applied to every candidate model right after its creation.
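As an illustration of the operators' mechanics, the NR operator can be sketched on an unrooted tree stored as an adjacency dictionary (our own toy representation, not the thesis implementation; node names loosely follow Figure 3.1):

```python
def relocate(tree, w, y, z):
    """Node relocation (NR): move W from its latent neighbor Y to latent
    node Z, i.e., remove the edge W-Y and add the edge W-Z."""
    tree[w].remove(y); tree[y].remove(w)
    tree[w].add(z);    tree[z].add(w)

m2 = {  # a model like m2: Y3 mediates between Y1 and {X1, X2}
    "Y1": {"Y3", "X3", "X4"},
    "Y3": {"Y1", "X1", "X2"},
    "X1": {"Y3"}, "X2": {"Y3"}, "X3": {"Y1"}, "X4": {"Y1"},
}
relocate(m2, "X3", "Y1", "Y3")  # yields a model like m3: X3 now under Y3
```

The operation preserves the tree property: one edge is removed and one is added, so the edge count stays at (number of nodes) − 1.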
3.1.2 Brute-Force Search
Let m be an LTM. In the following we use NI(m), SI(m), NR(m), ND(m), and SD(m)
to respectively denote the sets of candidate models that one can obtain by applying the
five search operators on m. The models are sometimes referred to as NI, SI, NR, ND, SD
candidate models, respectively. The union of the five sets is denoted by ALL(m).
Suppose we are given a data set D and an initial model m. Algorithm 3.1 gives a
brute-force search algorithm for learning an LTM.
Algorithm 3.1 BruteForce(m, D)

1: while true do
2:   m1 ← arg max_{m′ ∈ ALL(m)} AIC(m′|D)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else
6:     m ← m1
7:   end if
8: end while
Brute-force search is inefficient for two reasons. First, it evaluates a large number
of candidate models at each step. Let n, l, and r be the number of manifest nodes,
the number of latent nodes, and the maximum number of neighbors that any latent
node has in the current model, respectively. The numbers of candidate models that the
five operators SI, SD, NI, ND and NR generate are O(l), O(l), O(lr(r − 1)/2), O(lr)
and O(l(l + n)) respectively. So the brute-force algorithm evaluates a total of
O(l(2 + r/2 + r^2/2 + l + n)) candidate models at each step. Most of the candidate models
are generated by the NI and NR operators.
Second, one needs to compute the maximized log-likelihood of each candidate model
m′ in order to calculate its AIC score. This requires the expectation-maximization (EM)
algorithm due to the presence of latent variables. EM is known to be time-consuming.
We will next describe a search procedure that generates fewer candidate models than
brute-force search. In Section 3.2, we will present an efficient way to evaluate candidate
models.
3.1.3 EAST Search
The five operators can be classified into three groups. The NI and SI operators produce
candidate models that include the current model. They are hence expansion operators.
The ND and SD operators produce candidate models that are included by the current
model. They are hence simplification operators. NR does not alter nodes in the current
model. It only changes the connections between the nodes. Hence we call it an adjustment
operator.
The EAST algorithm is given in Algorithm 3.2. At each step of search, EAST uses
only a subset of the search operators instead of all of them. More specifically, it divides
search into three stages: expansion, adjustment and simplification. At each stage, it uses
only the operators from the corresponding group. For example, it searches only with the
expansion operators at the expansion stage. If the model score is improved in any of the
three stages, the algorithm continues search by repeating the loop. This is why it is called
Algorithm 3.2 EAST(m, D)

1: while true do
2:   m1 ← Expand(m, D)
3:   m2 ← Adjust(m1, D)
4:   m3 ← Simplify(m2, D)
5:   if AIC(m3|D) ≤ AIC(m|D) then
6:     return m
7:   else
8:     m ← m3
9:   end if
10: end while
Algorithm 3.3 Expand(m, D)

1: while true do
2:   m1 ← PickModel-IR(NI(m) ∪ SI(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else if m1 ∈ NI(m) then
6:     m ← EnhanceNI(m1, m, D)
7:   else
8:     m ← m1
9:   end if
10: end while
‘EAST’ — Expansion, Adjustment, Simplification until Termination.
At the expansion stage, EAST searches with the expansion operators until the AIC
score ceases to increase. See Algorithm 3.3. To understand the intuition, recall that the
AIC score consists of a term that measures model fit and another term that penalizes for
model complexity. If we start with a model that fits data poorly, which is usually the
case, then improving model fit is the first priority. Model fit can be improved by searching
with the expansion operators. This is exactly what EAST does at the expansion stage.
The pseudo code for the expansion stage contains two subroutines. The subroutine
PickModel-IR(NI(m) ∪ SI(m), m) selects one model from all the candidate models generated
from m by the NI and SI operators. It will be discussed in detail in the next two
sections.
The second subroutine EnhanceNI is called after each application of the NI operator.
This is to compensate for the constraint imposed on NI. Consider the model m1 in Figure
3.1. We can introduce a new latent node Y3 to mediate Y1 and two of its neighbors, say
X1 and X2, and thereby obtain the model m2. However, we are not allowed to introduce
a latent node to mediate Y1 and more than two of its neighbors, say X1, X2, and X3, and
thereby obtain m3. As a remedy we consider, after each application of the NI operator,
enhancements to the operation. As an example, suppose we have just applied NI to m1
and have obtained m2. What we do next is to consider relocating the other neighbors of
Algorithm 3.4 EnhanceNI(m′, m, D)

1: while L ≠ ∅ do
2:   m′_{W1→Z} ← PickModel({m′_{W→Z} | W ∈ L}, m)
3:   if AIC(m′_{W1→Z}|D) ≤ AIC(m′|D) then
4:     return m′
5:   else
6:     m′ ← m′_{W1→Z}
7:     L ← L \ {W1}
8:   end if
9: end while
Algorithm 3.5 Adjust(m, D)

1: while true do
2:   m1 ← PickModel(NR(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else
6:     m ← m1
7:   end if
8: end while
Y1 in m1, i.e., X3, X4, X5 and Y2, to the new latent variable Y3. If it turns out to be
beneficial to relocate X3 but not the other three nodes, then we obtain the model m3.
In general, suppose we have just introduced a new node Z into the current model m to
mediate a latent node Y and two of its neighbors, and obtained a candidate model m′. Let
L be the list of all the other neighbors of Y in m. For any W ∈ L, use m′W→Z to denote
the model obtained from m′ by relocating W to Z. What we do next is to enhance the NI
operation using the subroutine described in Algorithm 3.4. The subroutine PickModel
selects and returns one model from a list of candidate models. It will be given in the next
section.
After model expansion ceases to increase the AIC score, EAST enters the adjustment
stage (Algorithm 3.5). At this stage, EAST repeatedly relocates nodes in the current
model until it is no longer beneficial to do so, and there is no restriction on how far away
a node can be relocated. Node relocation is necessary because multiple latent nodes are
usually introduced during model expansion and two nodes that should be together might
end up at different parts of the model at the end of the expansion process.
The adjustment stage is followed by the simplification stage (Algorithm 3.6). At this
stage EAST first repeatedly applies ND to the current model until the AIC score ceases
to increase and then it does the same with SD. We choose not to consider ND and SD
simultaneously because that would be computationally more expensive and it is not clear
whether that would be helpful in avoiding local maxima.
Algorithm 3.6 Simplify(m, D)

1: while true do
2:   m1 ← PickModel(ND(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     break
5:   else
6:     m ← m1
7:   end if
8: end while
9: while true do
10:   m1 ← PickModel(SD(m), m)
11:   if AIC(m1|D) ≤ AIC(m|D) then
12:     return m
13:   else
14:     m ← m1
15:   end if
16: end while
At each step in the expansion stage, EAST generates O(l + lr(r − 1)/2) candidate
models. At each step in the adjustment stage, EAST generates O(l(l + n)) candidate
models. The simplification stage consists of two sub-stages. At the first sub-stage, EAST
searches with the ND operator and generates O(lr) candidate models at each step. At
the second sub-stage, EAST searches with the SD operator and generates O(l) candidate
models at each step. So EAST generates fewer candidate models than the brute-force
algorithm at each step of search.
3.2 Efficient Model Evaluation
The PickModel subroutine is supposed to find, from a list of candidate models, the model
with the highest AIC score. A straightforward way to do so is to calculate the AIC
score of each candidate model and then pick the best one. Calculating the AIC scores
of a large number of models exactly is computationally prohibitive. So, we propose to
use approximations of the AIC score for model selection. In this section, we present
one approximation of the AIC score that is easy to compute. The idea is to replace the
likelihood term with what we call restricted likelihood. We begin by discussing parameter
sharing between a candidate model and the current model.
3.2.1 Parameter Sharing among Models
Conceptually, we work with unrooted LTMs. In implementation, however, we represent
unrooted models as rooted models. Rooted LTMs are Bayesian networks and their
(a) m′ (rooted) (b) m′ (unrooted)
Figure 3.2: A candidate model obtained by modifying the model in Figure 2.2. The two models share the parameters for describing the distributions P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), P(X3|Y2), P(X4|Y1), P(Y3|Y1), and P(X5|Y3). On the other hand, the parameters for describing P(Y4|Y3), P(X6|Y4), and P(X7|Y4) are peculiar to the candidate model.
parameters are defined without ambiguity. This makes it easy to see how the parameter
composition of a candidate model is related to that of the current model.
Consider the model m in Figure 2.2. Let m′ be the model obtained from m by
introducing a new latent node Y4 to mediate Y3 and two of its neighbors X6 and X7, as
shown in Figure 3.2. If both m and m′ are represented as rooted models, their parameter
compositions are clear. The two models share parameters for describing the distributions
P (Y1), P (Y2|Y1), P (X1|Y2), P (X2|Y2), P (X3|Y2), P (X4|Y1), P (Y3|Y1), and P (X5|Y3). On
the other hand, the parameters for describing P (Y4|Y3), P (X6|Y4), and P (X7|Y4) are
peculiar to m′ while those for describing P (X6|Y3) and P (X7|Y3) are peculiar to m.
We write the parameters of a candidate model m′ as a pair (θ′1, θ′2), where θ′1 is the
collection of parameters that m′ shares with the current model m. The other parameters
θ′2 are peculiar to m′ and are called the new parameters of m′. Similarly, we write the
parameters of the current model m as a pair (θ1, θ2), where θ1 is the collection of
parameters that m shares with m′.
One unrooted LTM can be represented by multiple rooted LTMs. In the aforemen-
tioned example, if the representation of m′ is rooted at Y3 instead of Y1, then we would
have P (Y3) and P (Y1|Y3) instead of P (Y1) and P (Y3|Y1). The parameters describing P (Y3)
and P (Y1|Y3) would be peculiar to m′. However, this is due to the choice of representation
rather than search operation. Hence those parameters are not genuinely new parameters.
In implementation, one needs to coordinate the representations of the current model and
the candidate models so as to avoid such fake new parameters.
3.2.2 Restricted Likelihood
Suppose we have computed the MLE (θ⋆1, θ⋆2) of the parameters of the current model m.
For a given value of θ′2, (m′, θ⋆1, θ′2) is a fully specified Bayesian network. In this
network, we can compute

P(D|m′, θ⋆1, θ′2) = ∏_{d∈D} P(d|m′, θ⋆1, θ′2).
As a function of θ′2, this is referred to as the restricted likelihood function of m′. The
maximum restricted log-likelihood, or simply the maximum RL, of the candidate model m′
is defined to be

max_{θ′2} log P(D|m′, θ⋆1, θ′2).

Replacing the likelihood term in the AIC score of m′ with its maximum RL, we get the
following approximate score:

AICRL(m′|D) = max_{θ′2} log P(D|m′, θ⋆1, θ′2) − d(m′).   (3.1)
We propose that PickModel uses the AICRL score for model selection instead of the
AIC score. It should be noted that the idea of optimizing only some parameters of a
model while freezing the others is used in, among others, phylogenetic tree reconstruction
(Guindon & Gascuel, 2003) and learning of continuous Bayesian networks (Nachman
et al., 2004).
Next we describe an efficient method for approximately calculating the AICRL score.
The method is called local EM.
3.2.3 Local EM
Local EM works in the same way as EM except that the value of θ′1 is fixed at θ⋆1. It
starts with an initial value δ2^(0) for θ′2 and iterates. After t − 1 iterations, it
obtains δ2^(t−1). At iteration t, it completes the data D using the Bayesian network
(m′, θ⋆1, δ2^(t−1)), calculates some sufficient statistics, and therefrom obtains δ2^(t).
Suppose the parameters θ′2 of m′ describe the distributions P(Zj|Wj) (j = 1, . . . , ρ).1
The distributions P(Zj|Wj, δ2^(t)) that make up δ2^(t) can be obtained in two steps:

• E-Step: For each data case d ∈ D, make inference in the Bayesian network (m′, θ⋆1, δ2^(t−1))
to compute

P(Zj, Wj|d, m′, θ⋆1, δ2^(t−1)) (j = 1, . . . , ρ).
1When Zj is the root, Wj is to be regarded as a vacuous variable and P (Zj|Wj) is simply P (Zj).
• M-Step: Obtain

P(Zj|Wj, δ2^(t)) = f(Zj, Wj) / Σ_{Zj} f(Zj, Wj) (j = 1, . . . , ρ),

where the sufficient statistic f(Zj, Wj) = Σ_{d∈D} P(Zj, Wj|d, m′, θ⋆1, δ2^(t−1)).

Local EM converges. That is, the series of log-likelihoods {log P(D|m′, θ⋆1, δ2^(t)) | t =
0, 1, . . .} increases monotonically with t and is upper-bounded by 0.
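To make the E- and M-steps concrete, here is a minimal, hypothetical instance of local EM: a model Y → X in which P(Y) plays the role of the frozen shared parameters θ⋆1 and P(X|Y) the role of the new parameters θ′2. The data set and starting values are made up for illustration:

```python
from math import log

P_Y = {"y": 0.4, "n": 0.6}            # shared parameters, held fixed
data = ["a"] * 60 + ["b"] * 40        # observations of the manifest X

def log_likelihood(p_x_given_y):
    """log P(D) with Y marginalized out: sum_d log sum_y P(y) P(x_d|y)."""
    return sum(log(sum(P_Y[y] * p_x_given_y[y][x] for y in P_Y))
               for x in data)

def local_em_step(p_x_given_y):
    """One E+M iteration that updates only P(X|Y)."""
    # E-step: accumulate sufficient statistics f(x, y) from the posterior
    # P(y | x_d) computed under the current parameters.
    f = {y: {"a": 0.0, "b": 0.0} for y in P_Y}
    for x in data:
        norm = sum(P_Y[y] * p_x_given_y[y][x] for y in P_Y)
        for y in P_Y:
            f[y][x] += P_Y[y] * p_x_given_y[y][x] / norm
    # M-step: normalize f over the values of X to get the new P(X|Y).
    return {y: {x: f[y][x] / sum(f[y].values()) for x in f[y]} for y in P_Y}

theta = {"y": {"a": 0.7, "b": 0.3}, "n": {"a": 0.5, "b": 0.5}}
scores = [log_likelihood(theta)]
for _ in range(5):
    theta = local_em_step(theta)
    scores.append(log_likelihood(theta))
# The log-likelihood increases monotonically and stays below 0.
```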
Unlike local EM, standard EM optimizes all parameters. To avoid potential confusion,
we call it full EM. The M-step of local EM is computationally much cheaper than that of
full EM because local EM updates fewer parameters. For the candidate model shown
in Figure 3.2, we need to update only the parameters that describe P(Y4|Y3), P(X6|Y4),
and P(X7|Y4). Besides reducing the computation per iteration, this also means that local EM
typically takes fewer steps to converge than full EM.
3.2.4 Avoiding Local Maxima
Like full EM, local EM might get stuck at local maxima. To avoid the local maxima,
we adopt the scheme proposed by Chickering and Heckerman (1997a) and call it the
pyramid scheme. The idea is to randomly generate a number µ of initial values for the
new parameters θ′2, resulting in µ initial models. One local EM iteration is run on all the
models and afterwards the bottom µ/2 models with the lowest log-likelihood are discarded.
Then two local EM iterations are run on the remaining models and afterwards the bottom
µ/4 models are discarded. Then four local EM iterations are run on the remaining models,
and so on. The process continues until there is only one model left. After that, some more
local EM iterations are run on that model, until the total number of iterations
reaches a predetermined number ν. Hence, there are two algorithmic parameters, µ and ν.
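Under one reading of this schedule, the population sizes and per-round iteration counts can be laid out as follows (a sketch; `pyramid_schedule` and its output format are our own illustration, assuming µ is a power of two and counting ν against the iterations accumulated by the surviving model):

```python
def pyramid_schedule(mu, nu):
    """Return a list of (models_in_round, local_em_iterations_in_round)
    pairs: 1, 2, 4, ... iterations per round, halving the population after
    each round, then spending the remaining budget on the single survivor."""
    rounds, pop, iters, done = [], mu, 1, 0
    while pop > 1:
        rounds.append((pop, iters))
        done += iters          # iterations accumulated by each survivor
        pop //= 2
        iters *= 2
    rounds.append((1, max(nu - done, 0)))  # finish on the last model
    return rounds

# For mu = 8, nu = 20: rounds of 1, 2, and 4 iterations on 8, 4, and 2
# models, then 13 more iterations on the single remaining model.
schedule = pyramid_schedule(8, 20)
```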
Suppose m′ is a candidate model obtained from the current model m. Use
LocalEM(m, m′, µ, ν) to denote the procedure described in the previous paragraph. Its
output is an estimate of the new parameters θ′2; denote it by θ̂2. The PickModel
subroutine evaluates m′ using the following quantity:

AIC(m′, θ⋆1, θ̂2|D) = log P(D|m′, θ⋆1, θ̂2) − d(m′).   (3.2)

Note that the AIC score given here is for a model m′ together with a set of parameter
values θ⋆1 and θ̂2 for the model. In contrast, the AIC score given by Equation (2.2) is
for a model only.
3.2.5 Two-Stage Model Evaluation
Local EM is faster than full EM. To achieve further speedup, we propose to divide model
evaluation into two stages, a screening stage and an evaluation stage. In the screening
stage, we screen out most of the candidate models by running local EM at a low setting,
while in the evaluation stage we evaluate the remaining models by running local EM at a
high setting. We call this approach two-stage model evaluation.
In local EM, the parameter µ controls the number of initial points and the parameter
ν controls the number of iterations. For the screening stage, we fix the first parameter at 1
and we allow only the second parameter to vary. To distinguish it from the corresponding
parameter at the evaluation stage, we denote it by νs.
Because local EM starts from only one initial point at the screening stage, no
explicit effort is made to avoid local maxima. We argue that this does not cause serious
problems because there is an implicit local-maximum-avoidance mechanism built in. A
particular application of a search operator is called a search operation. It corresponds to
one candidate model. So, evaluation of candidate models can also be viewed as evaluation
of search operations. Suppose local EM picks a poor initial point at one step when
evaluating an operation and consequently the operation is screened out. Chances are
that the same operation is also applicable at the next few steps. In that case local EM
would be called to evaluate the operation again and again, each time from a different
starting point. So in the end local EM is run from multiple starting points to evaluate
the operation. If the operation is a good one, there is high probability for it to be picked
at one of those steps.
3.2.6 The PickModel Subroutine
Finally, the pseudo code for PickModel is given in Algorithm 3.7. It has four algorithmic
parameters. The parameters νs and k control the screening stage, while µ and ν control
the evaluation stage.
PickModel involves only local EM. In EAST (Algorithm 3.2), full EM is run on the
model m1 returned by PickModel in order to compute its AIC score. This also facilitates
the next call to PickModel. In the next step, m1 will be the current model. So, PickModel
will need the MLE of the parameters of m1 at the next step. To kick start the whole
search process, full EM needs to be run on the initial model. So, EAST runs full EM once
at each step of search.
Algorithm 3.7 PickModel(L, m)
1: for each m′ ∈ L do
2:   Run LocalEM(m, m′, 1, νs) to estimate the parameters of m′
3: end for
4: L′ ← Top k models from L with the highest AIC scores as given by Equation 3.2
5: for each m′ ∈ L′ do
6:   Run LocalEM(m, m′, µ, ν) to estimate the parameters of m′
7: end for
8: return the model in L′ with the highest AIC score as given by Equation 3.2
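The two-stage evaluation can be sketched in a few lines of Python; `local_em` and `aic` stand in for the LocalEM procedure and the AIC score of Equation (3.2), both assumed to be supplied by the caller.

```python
def pick_model(candidates, local_em, aic, nu_s=5, k=3, mu=16, nu=40):
    """Two-stage model evaluation: cheap screening, then thorough evaluation.

    local_em(m, mu, nu) -> params; aic(m, params) -> AIC score.
    """
    # Screening stage: local EM from a single starting point, nu_s iterations.
    screened = [(m, local_em(m, 1, nu_s)) for m in candidates]
    screened.sort(key=lambda pair: aic(*pair), reverse=True)
    shortlist = screened[:k]
    # Evaluation stage: local EM at a high setting (pyramid scheme).
    evaluated = [(m, local_em(m, mu, nu)) for m, _ in shortlist]
    best, params = max(evaluated, key=lambda pair: aic(*pair))
    return best, params
```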
3.3 Operation Granularity
At the expansion stage, EAST does not select models using the subroutine PickModel.
Rather it uses another subroutine called PickModel-IR. This is to deal with the issue of
operation granularity.
Operation granularity refers to the phenomenon where some operations might increase
the complexity of the current model much more than other operations. As an example,
consider the situation where there are 100 binary manifest variables. Suppose the search
starts with the LC model with one binary latent node Y . Applying the SI operator to the
model would introduce 101 additional model parameters, while applying the NI operator
to the model would increase the number of model parameters by only 2. The latter
operation is clearly of much finer-grain than the former.
It has been observed that operation granularity often leads to local maxima when BIC
score is used to guide the search (Zhang & Kocka, 2004). The reason is that, at the early
stage of search, SI operations are usually of larger grain than NI operations and often
have higher BIC scores. So, SI operations tend to be applied early, which sometimes leads
to fat latent variables, i.e., latent variables that have excessive numbers of states. Fat
latent variables tend to attract excessive numbers of neighbors. This makes it difficult to
thin fat variables despite the SD operator. Local maxima are consequently produced
(Chen, 2008). Though the phenomenon is observed when BIC score is used to guide the
search, we believe that it will also happen in the case of AIC score.
One might suggest that we deal with fat latent variables by introducing an additional
search operator that simultaneously reduces the number of states and the number of
neighbors of a latent variable. However, this would complicate algorithm design and
would increase the complexity of the search process. We adopt a simple and effective
strategy called the cost-effectiveness principle (Zhang & Kocka, 2004).
Let m be the current model and m′ be a candidate model. Define the improvement
Algorithm 3.8 PickModel-IR(L, m)
1: for each m′ ∈ L do
2:   Run LocalEM(m, m′, 1, νs) to estimate the parameters of m′
3: end for
4: L′ ← Top k models from L with the highest IR scores as given by Equation 3.4
5: for each m′ ∈ L′ do
6:   Run LocalEM(m, m′, µ, ν) to estimate the parameters of m′
7: end for
8: return the model in L′ with the highest IR score as given by Equation 3.4
ratio of m′ over m given data D to be
IR(m′, m | D) = [AIC(m′ | D) − AIC(m | D)] / [d(m′) − d(m)]. (3.3)
It is the increase in model score per unit increase in model complexity. The cost-
effectiveness principle stipulates that one chooses, among a list of candidate models, the
one with the highest improvement ratio.
The principle is applied only to candidate models generated by the SI and NI opera-
tors. The other operators do not, or do not necessarily, increase model complexity, so
for them the term d(m′) − d(m) is, or might be, negative.
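As a sketch, the principle amounts to comparing candidates by AIC gain per added parameter (the function and argument names below are illustrative):

```python
def improvement_ratio(aic_new, aic_old, d_new, d_old):
    """IR score of Equation (3.3): AIC gain per unit increase in model
    complexity. Intended for SI and NI candidates, where d_new > d_old."""
    return (aic_new - aic_old) / (d_new - d_old)
```

A coarse-grained SI operation with a large AIC gain can thus lose to a finer-grained NI operation whose gain per added parameter is higher.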
Like PickModel, PickModel-IR does not run full EM to optimize the parameters of the
candidate models. Instead, it inherits the values of the old parameters from the current
model and runs local EM to optimize only the new parameters. Let m be the current
model and m′ be a candidate model obtained from m. Suppose the MLE (θ⋆1, θ⋆2) of the
parameters of m has been computed. Let θ2 be the estimate of the new parameters of m′
obtained by local EM. The subroutine PickModel-IR evaluates the candidate model m′
using the following IR score:

IR(m′, m, θ⋆1, θ⋆2, θ2 | D) = [AIC(m′, θ⋆1, θ2 | D) − AIC(m, θ⋆1, θ⋆2 | D)] / [d(m′) − d(m)]. (3.4)
The pseudo code of PickModel-IR is given in Algorithm 3.8. It is the same as
PickModel except IR scores, rather than AIC scores, are used to evaluate candidate
models.
3.4 Summary
EAST is the state-of-the-art algorithm for learning LTMs. It systematically searches for
LTMs with the highest AIC score in a principled manner. It adopts the grow-restructure-
thin strategy and structures the search into three stages. It has an efficient method for
model evaluation that is based on an approximation to AIC score. EAST can induce high
quality models from data and can discover underlying latent structures. We will provide
empirical evidence for those points in Chapter 6.
CHAPTER 4
ALGORITHM 2: HIERARCHICAL CLUSTERING LEARNING
EAST is a principled search-based learning algorithm. It can induce high quality LTMs,
and hence yield good solutions to density estimation problems. Moreover, it is capable
of revealing interesting latent structures. All those merits come at a cost, namely high
computational complexity. This drawback limits the applicability of EAST to large-scale
problems.
In this chapter, we develop a heuristic learning algorithm that is much more efficient
than EAST. The new algorithm is called HCL, a shorthand for Hierarchical Clustering
Learning of LTMs. It targets density estimation problems with large training samples,
and assumes that there is a predetermined constraint on the inferential complexity of
the resulting LTM. By inferential complexity, we mean the computational cost of
performing inference in the model. In Chapter 7, the reader will see an application that suits HCL.
4.1 Heuristic Construction of Model Structure
We first present the heuristic that HCL uses to determine the model structure.
4.1.1 Basic Ideas
We start with a definition.
Definition 4.1 In an LTM, two manifest variables are called siblings if they share the
same parent.
Our heuristic is based on three ideas:
1. In an LTM M, siblings are generally more closely correlated than variables that are
located far apart.
2. If M is a good estimation of the generative distribution P (X), then two variables
X and X ′ are closely correlated in M if and only if they are closely correlated in
P (X).
      X1     X2     X3     X4     X5
X1    -      -      -      -      -
X2  0.0000   -      -      -      -
X3  0.0003 0.0971   -      -      -
X4  0.0015 0.0654 0.0196   -      -
X5  0.0017 0.0311 0.0086 0.1264   -
X6  0.0102 0.0252 0.0080 0.1817 0.0486

Table 4.1: The empirical MI between the manifest variables.
3. Given a large data set D drawn from the generative distribution P (X), if two vari-
ables X and X ′ are closely correlated in P (X), then they should also reflect strong
interactions in D.
Therefore, we can examine each pair of variables in data D, pick the two variables that
are most closely correlated, and introduce a latent variable as their parent in M.
Denote by P(X) the empirical distribution over X induced by the data D. We
measure the strength of correlation between a pair of variables X and X ′ by the empirical
mutual information (MI) (Cover & Thomas, 1991)
I(X; X′) = ∑_{X,X′} P(X, X′) log [P(X, X′) / (P(X)P(X′))].
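For discrete data, the empirical MI can be computed directly from frequency counts. A minimal sketch, in which records are tuples of discrete values and `i`, `j` are column indices (both conventions are illustrative):

```python
from collections import Counter
from math import log

def empirical_mi(data, i, j):
    """Empirical mutual information I(X_i; X_j) from a list of records."""
    n = len(data)
    joint = Counter((row[i], row[j]) for row in data)   # counts of (x, x')
    marg_i = Counter(row[i] for row in data)            # counts of x
    marg_j = Counter(row[j] for row in data)            # counts of x'
    mi = 0.0
    for (x, y), c in joint.items():
        # P(x,y) log [P(x,y) / (P(x)P(y))], with probabilities c/n etc.
        mi += (c / n) * log(c * n / (marg_i[x] * marg_j[y]))
    return mi
```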
Example 4.1 Consider a data set D containing 6 binary variables X1, X2, . . ., X6.
Suppose the empirical MI based on D is as presented in Table 4.1. We find that X4 and
X6 are the pair with the largest MI. Therefore, we create a latent variable Y1 and make it
the parent of X4 and X6. See the lower-left corner of the model m in Figure 4.1a.
The next step is to find, among Y1, X1, X2, X3, and X5, the pair of variables with the
largest MI. The next subsection explains how.
4.1.2 MI Between A Latent Variable and Another Variable
To pick the best pair among Y1, X1, X2, X3, and X5, we need to calculate the MI between
all possible pairs. There is one difficulty: Y1 is not in the generative distribution and thus
not observed in the data set. Consequently, the MI between Y1 and the other variables
cannot be computed directly. We hence seek an approximation.
In the final model M, Y1 would d-separate X4 and X6 from the other variables.
Therefore, for any X ∈ {X1, X2, X3, X5}, we have
IM(Y1; X) ≥ IM(X4; X), IM(Y1; X) ≥ IM(X6; X).
(a) Latent tree model m (b) Regularized model m′
(c) Simplified model m′′
Figure 4.1: An illustrative example. The numbers within the parentheses denote thecardinalities of the variables.
We hence approximate IM(Y1; X) using the lower bound
max{IM(X4; X), IM(X6; X)}.
In general, suppose we need to calculate the MI between a latent variable Y and
another variable Z in M. Denote the two children of Y by W and W′. We estimate the
MI as follows:
IM(Y ; Z) ≈ max{IM(W ; Z), IM(W ′; Z)}.
Example 4.2 Back to our running example, the estimated MI between Y1 and X1, X2,
X3, X5 is as presented in Table 4.2a. We see that the next pair to pick is Y1 and X5. We
introduce a latent variable Y2 as the parent of Y1 and X5. The process continues.
The next step is to find the best pair among Y2, X1, X2, and X3. The estimated MI
between Y2 and X1, X2, X3 is as shown in Table 4.2b. It is clear that the pair X2 and X3
have the largest MI. We hence introduce a latent variable Y3 as the parent of X2 and X3.
The process moves on.
Next, we pick the best pair among Y3, X1, and Y2. The estimated MI for Y3 is given
X1 X2 X3 X5
Y1 0.0102 0.0654 0.0196 0.1264
(a) Y1
X1 X2 X3
Y2 0.0102 0.0654 0.0196
(b) Y2
X1 Y2
Y3 0.0003 0.0654
(c) Y3
Table 4.2: The estimated MI between each latent variable and other variables.
in Table 4.2c. The pair Y2 and Y3 has the largest MI. We thus add a latent variable Y4
and make it the parent of Y2 and Y3.
Lastly, we introduce a latent variable Y5 for the only two remaining variables Y4 and
X1. The final model structure is a binary tree as shown in Figure 4.1a.
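The whole construction of Examples 4.1 and 4.2 can be sketched as follows. The `pairwise_mi` dictionary holds the empirical MI between manifest variables, and the max-over-children approximation of Section 4.1.2 supplies MI estimates for latent variables; the latent-variable naming is illustrative.

```python
import itertools

def build_structure(variables, pairwise_mi):
    """Bottom-up structure construction: repeatedly join the pair with the
    largest (estimated) MI under a new latent parent."""
    mi = dict(pairwise_mi)          # frozenset({u, v}) -> MI estimate
    active = list(variables)
    parent = {}
    counter = itertools.count(1)
    while len(active) > 1:
        u, v = max(itertools.combinations(active, 2),
                   key=lambda p: mi[frozenset(p)])
        latent = 'Y%d' % next(counter)
        parent[u] = parent[v] = latent
        active.remove(u)
        active.remove(v)
        # Approximate the MI between the new latent variable and each
        # remaining variable by the maximum over its two children.
        for w in active:
            mi[frozenset({latent, w})] = max(mi[frozenset({u, w})],
                                             mi[frozenset({v, w})])
        active.append(latent)
    return parent
```

On the MI values of Table 4.1, this reproduces the structure of Figure 4.1a: X4 and X6 are joined first under Y1, then Y1 and X5 under Y2, and so on.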
4.2 Cardinalities of Latent Variables
After obtaining a model structure, the next step is to determine the cardinalities of the
latent variables. We set the cardinalities of all the latent variables at a certain value C.
In the following we show that, under the assumption of large training samples, one can
achieve better estimation by choosing larger C. We then derive the maximum possible
value of C subject to the presumed constraint on the inferential complexity of the resultant
model.
4.2.1 Larger C for Better Estimation
We first discuss the impact of the value of C on the estimation quality. We start by
considering the case when C equals ∏_{X∈X} |X|, i.e., the product of the cardinalities
of all the manifest variables. In this case, each latent variable can be viewed as a joint
variable of all the manifest variables. Intuitively, for an arbitrary distribution P (X), we
can set the parameters θm so that the obtained LTM represents P (X) exactly. That is, m
can capture all the interactions among the manifest variables. This intuition is justified
by the following proposition.
Proposition 4.1 Let m be an LTM over a set X of manifest variables. If the cardinalities
of the latent variables in m are all equal to ∏_{X∈X} |X|, then for any joint distribution
P (X), there exists a parameter value θm of m such that
P (X|m, θm) = P (X). (4.1)
Proof: We enumerate the variables in X as X1, X2, . . . , Xn. Let Y be a latent variable
in m. Note that the cardinality of Y equals ∏_{X∈X} |X|. Hence, there are two ways
to represent a state of Y. The first is to index a state using a natural number
i ∈ {1, 2, . . . , ∏_{X∈X} |X|}. The second is to write a state as a vector <x1, x2, . . . , xn>, where
xi denotes a state of Xi, ∀i = 1, 2, . . . , n. We will use both representations.
Given a joint distribution P (X), we define a parameter value θm of m as follows:
• The prior distribution for root node W of m:
P (W =< x1, x2, . . . , xn > |m, θm) = P (X1 = x1, X2 = x2, . . . , Xn = xn),
• The conditional distribution for each non-root latent node Y:

P(Y = i | pa(Y) = j, m, θm) = 1 if i = j, and 0 otherwise.

• The conditional distribution for each manifest node Xi, ∀i = 1, 2, . . . , n:

P(Xi = x′i | pa(Xi) = <x1, x2, . . . , xi, . . . , xn>, m, θm) = 1 if xi = x′i, and 0 otherwise.
It is easy to verify that Equation 4.1 holds for the parameter value θm defined as
above.
Q.E.D.
What happens if we decrease C? It can be shown that the estimation quality will
degrade. Let m be a model obtained with value C and m′ be another model obtained
with a smaller value C ′. It is easy to see that m includes m′. The following lemma states
that the estimation quality of m′ is no better than that of m.
Lemma 4.1 Let P (X) be the generative distribution underlying the training data. Let m
and m′ be two models with manifest variables X. If m includes m′, then
min_{θm} D[P(X)‖P(X|m, θm)] ≤ min_{θm′} D[P(X)‖P(X|m′, θm′)].
Proof: Define

θ⋆m′ = arg min_{θm′} D[P(X)‖P(X|m′, θm′)].
Because m includes m′, there must exist parameters θ⋆m of m such that

P(X|m, θ⋆m) = P(X|m′, θ⋆m′).

Therefore,

min_{θm} D[P(X)‖P(X|m, θm)] ≤ D[P(X)‖P(X|m, θ⋆m)]
                            = D[P(X)‖P(X|m′, θ⋆m′)]
                            = min_{θm′} D[P(X)‖P(X|m′, θm′)].
Q.E.D.
As mentioned earlier, when C is large enough, model m can capture all the interactions
among the manifest variables and hence can represent the generative distribution P (X)
exactly. If C is not large enough, we can only represent P (X) approximately. According
to the previous discussion, as C decreases, the estimation accuracy (in terms of KL
divergence) will gradually degrade, indicating that model m can capture fewer and fewer
interactions among the manifest variables. The worst case occurs when C = 1. In this
case, all the interactions are lost. The estimation is the poorest.
4.2.2 Maximum Value of C Under Complexity Constraint
HCL assumes that there is a predetermined bound on the inferential complexity of the
resultant model. So we need to define the inferential complexity of a model first. We
use the clique tree propagation (CTP) algorithm for inference. Therefore, we define the
inferential complexity to be the sum of the clique sizes in the clique tree of m.
Let m be a model obtained by using the technique described in Section 4.1 and setting
the cardinalities of latent variables at C. The following proposition states the relationship
between the inferential complexity of m and the value of C.
Proposition 4.2 The inferential complexity of m is
(|X| − 2) · C² + ∑_{X∈X} |X| · C. (4.2)
Note that |X| in the first term denotes the number of manifest variables, while |X| in the
summation denotes the cardinality of a manifest variable X.
Proof: m is a tree model. Therefore, the cliques in the clique tree of m one-to-one
correspond to the edges in m. Moreover, the clique corresponding to an edge Z—Z ′
consists of {Z, Z′}, and its size equals |Z| · |Z′|. So, in order to calculate the inferential
complexity of m, we only need to enumerate the edges in m.
Recall that m has a binary tree structure over manifest variables X. Therefore, it
contains (2|X| − 2) edges. Among all the edges, |X| edges connect manifest variables to
latent variables, one for each manifest variable. The other (|X| − 2) edges are between
latent variables. Note that all the latent variables have cardinality C. Therefore, the
inferential complexity of m is
(|X| − 2) · C² + ∑_{X∈X} |X| · C.
Q.E.D.
As a corollary of Proposition 4.2, the maximum achievable value of C subject to a
predetermined constraint on the inferential complexity is given as follows.
Corollary 4.1 Given a bound Imax on the inferential complexity, the maximum possible
value of C is
Cmax = ⌊(√(b² + 4aImax) − b) / (2a)⌋, (4.3)

where a = |X| − 2 and b = ∑_{X∈X} |X|.
Proof: The solution can be easily obtained by solving the inequality
(|X| − 2) · C² + ∑_{X∈X} |X| · C ≤ Imax
and applying the condition that C takes value from natural numbers.
Q.E.D.
In HCL, we set the cardinalities of the latent variables at Cmax.
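Equation (4.2) and Corollary 4.1 translate directly into code. A sketch, where `cardinalities` lists |X| for each manifest variable (the guard loops merely protect the closed-form solution against floating-point rounding at the boundary):

```python
from math import sqrt

def inferential_complexity(cardinalities, C):
    """Sum of clique sizes for the binary-tree model of Section 4.1 (Eq. 4.2)."""
    n = len(cardinalities)
    return (n - 2) * C * C + sum(cardinalities) * C

def max_cardinality(cardinalities, I_max):
    """Largest integer C whose inferential complexity is at most I_max (Eq. 4.3)."""
    a = len(cardinalities) - 2
    b = sum(cardinalities)
    C = int((sqrt(b * b + 4 * a * I_max) - b) / (2 * a))
    while inferential_complexity(cardinalities, C + 1) <= I_max:
        C += 1
    while C > 1 and inferential_complexity(cardinalities, C) > I_max:
        C -= 1
    return C
```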
4.3 Model Simplification
Suppose we have obtained a model m using the techniques described in Sections 4.1 and
4.2. In the following two subsections, we will show that it is sometimes possible to simplify
m without compromising the approximation quality.
4.3.1 Model Regularization
We first notice that m could be irregular. As an example, let us consider the model
in Figure 4.1a. Its structure was constructed in Example 4.2 and the cardinalities of
its latent variables were set at 8 to satisfy a certain inferential complexity constraint.
By checking the latent variables, we find that Y5 violates the regularity condition. It
has only two neighbors and |Y5| ≥ |X1|·|Y4|/ max{|X1|, |Y4|}. Y1 and Y3 also violate
the regularity condition because |Y1| > |X4|·|X6|·|Y2|/ max{|X4|, |X6|, |Y2|} and |Y3| >
|X2|·|X3|·|Y4|/ max{|X2|, |X3|, |Y4|}. The following proposition suggests that irregular
models should always be simplified until they become regular.
Proposition 4.3 If m is an irregular model, then there must exist a model m′ with lower
inferential complexity such that
AIC(m′|D) ≥ AIC(m|D). (4.4)
Proof: Let Y be a latent variable in m which violates the regularity condition. Denote
its neighbors by Z1, Z2, . . . , Zk. We obtain another model m′ from m as follows:
1. If Y has only two neighbors, then remove Y from m and connect Z1 with Z2.
2. Otherwise, replace Y with a new latent variable Y ′, where
|Y′| = (∏_{i=1}^k |Zi|) / max_{i=1}^k |Zi|.
As shown by Zhang (2004), m′ is marginally equivalent to m and has fewer independent
parameters. As a direct corollary, inequality 4.4 holds.
To show that the inferential complexity of m′ is lower than that of m, we compare the
clique trees of m and m′. Consider the aforementioned two cases:
1. Y has only two neighbors. In this case, cliques {Y, Z1} and {Y, Z2} in the clique
tree of m are replaced with {Z1, Z2} in the clique tree of m′. Assume |Z2| ≥ |Z1|.
The difference in the sum of clique sizes is

sum(m) − sum(m′) = |Y||Z1| + |Y||Z2| − |Z1||Z2|
                 ≥ |Z1||Z1| + |Z1||Z2| − |Z1||Z2|
                 = |Z1||Z1|
                 > 0.
2. Y has more than two neighbors. In this case, for all i = 1, 2, . . . , k, clique {Y, Zi}
in the clique tree of m is replaced with a smaller clique {Y′, Zi} in the clique tree
of m′.
In both cases, the inferential complexity of m′ is lower than that of m.
Q.E.D.
To simplify an irregular model, we put it through the regularization process described
in Section 2.3.4. In particular, we handle the latent variables in the order by which they
are created in Section 4.1. In the following, we use the irregular model m in Figure 4.1a
to demonstrate the process.
Example 4.3 We begin with latent variable Y1. It has three neighbors and violates the reg-
ularity condition. So we decrease its cardinality to |X4|·|X6|·|Y2|/ max{|X4|, |X6|, |Y2|} =
4. Then we consider Y2. It satisfies the regularity condition and hence no changes are
made. The next latent variable to examine is Y3. It violates the regularity condition. So
we decrease its cardinality to |X2|·|X3|·|Y4|/ max{|X2|, |X3|, |Y4|} = 4. We do not change
Y4 because it does not cause irregularity. At last, we remove Y5, which has only two neigh-
bors and violates the regularity condition, and connect Y4 with X1. There is no latent
variable violating the regularity condition any more. So we end up with the regular model
m′ as shown in Figure 4.1b.
4.3.2 Redundant Variable Absorption
After regularization, there are sometimes still opportunities for further model simpli-
fication. To facilitate our discussion, we first introduce the notions of saturation and
subsumption.
Definition 4.2 Let Y be a latent variable in a regular model m. Enumerate its neighbors
as Z1, Z2, . . ., Zk. Then Y is saturated if
|Y| = (∏_{i=1}^k |Zi|) / max_{i=1}^k |Zi|.
Definition 4.3 If a latent variable is saturated, we say it subsumes all its neighbors
except the one with the largest cardinality.
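These two definitions can be phrased as small predicates. A sketch working on plain cardinality numbers (the function names and the dictionary representation of neighbors are illustrative):

```python
def is_saturated(card_y, neighbor_cards):
    """Definition 4.2: |Y| equals the product of the neighbors'
    cardinalities divided by the largest of them."""
    prod = 1
    for c in neighbor_cards:
        prod *= c
    # The largest cardinality is a factor of the product, so // is exact.
    return card_y == prod // max(neighbor_cards)

def subsumed_neighbors(card_y, neighbors):
    """Definition 4.3: a saturated Y subsumes every neighbor except the one
    with the largest cardinality. neighbors: name -> cardinality."""
    if not is_saturated(card_y, list(neighbors.values())):
        return []
    largest = max(neighbors, key=neighbors.get)
    return [z for z in neighbors if z != largest]
```

For instance, a latent variable of cardinality 4 whose neighbors have cardinalities 2, 2, and 8 is saturated and subsumes the two binary neighbors.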
Saturated latent variables are potentially redundant. Take the model m′ in Figure
4.1b as an example. It contains two adjacent latent variables Y1 and Y2. Both variables
are saturated. Y1 subsumes X4 and X6, and Y2 subsumes Y1 and X5. Conceptually, Y2
can be viewed as a joint variable of Y1 and X5, while Y1 can be in turn viewed as a joint
variable of X4 and X6. Intuitively, we can eliminate Y1 and directly make Y2 the joint
variable of X4, X5, and X6. This intuition is formalized by the following proposition.
(a) (b)
Figure 4.2: Redundant variable absorption. (a) A part of a model that contains twoadjacent and saturated latent nodes Y1 and Y2, with Y2 subsuming Y1. (b) Simplifiedmodel with Y1 absorbed by Y2.
Proposition 4.4 Let m be a model with more than one latent node. Let Y1 and Y2 be two
adjacent latent nodes. If both Y1 and Y2 are saturated while Y2 subsumes Y1, then there
exists another model m′ with lower inferential complexity such that
AIC(m′|D) ≥ AIC(m|D). (4.5)
Proof: We enumerate the neighbors of Y1 as Y2, Z11, Z12, . . ., Z1k, and the neighbors of Y2
as Y1, Z21, Z22, . . ., Z2l. Define another model m′ by removing Y1 from m and connecting
Z11, Z12, . . . , Z1k to Y2. We refer to this operation as variable absorption: Latent variable
Y1 is absorbed by latent variable Y2. See Figure 4.2.
We first prove that inequality 4.5 holds for models m and m′ obtained above. It
is sufficient to show that m′ is marginally equivalent to m and has fewer independent
parameters.
We start by proving the marginal equivalence. For technical convenience, we will work
with unrooted models and show that the unrooted versions of m and m′ are marginally
equivalent. For simplicity, we abuse notation and let m and m′ denote the unrooted models.
Recall that an unrooted model is semantically a Markov random field over an undi-
rected tree. Its parameters include a potential for each edge in the model. The potential
is a non-negative function of the two variables that are connected by the edge. We use
f(·) to denote a potential in the parameters θm of m, and g(·) to denote a potential in
the parameters θm′ of m′.
Note that Y1 and Y2 are saturated, while Y2 subsumes Y1. When all variables have no
fewer than two states, this implies that:
1. Y1 subsumes Z11, Z12, . . . , Z1k.
2. Suppose that |Z2l| = max_{j=1}^l |Z2j|. Then Y2 subsumes Z21, Z22, . . . , Z2,l−1.
Therefore, a state of Y1 can be written as y1 =< z11, z12, . . . , z1k >, while a state of Y2
can be written as y2 =< y1, z21, z22, . . . , z2l−1 >. The latter can be further expanded as
y2 =< z11, z12, . . . , z1k, z21, z22, . . . , z2l−1 >.
We first show that m′ includes m. Let θm be parameters of m. We define parameters
θm′ of m′ as follows:
• Potential for edge Y2 — Z2l:

g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2,l−1>, Z2l = z2l)
    = ∑_{Y1,Y2} f(Y1, Y2) ∏_{i=1}^k f(Y1, Z1i = z1i) ∏_{j=1}^l f(Y2, Z2j = z2j).
• Potential for edge Y2 — Z1i, ∀i = 1, 2, . . . , k:

g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2,l−1>, Z1i = z′1i) = 1 if z1i = z′1i, and 0 otherwise.

• Potential for edge Y2 — Z2j, ∀j = 1, 2, . . . , l − 1:

g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2,l−1>, Z2j = z′2j) = 1 if z2j = z′2j, and 0 otherwise.
• Set the other potentials in θm′ the same as those in θm.
It is easy to verify that

∑_{Y1,Y2} f(Y1, Y2) ∏_{i=1}^k f(Y1, Z1i) ∏_{j=1}^l f(Y2, Z2j)
    = ∑_{Y2} ∏_{i=1}^k g(Y2, Z1i) ∏_{j=1}^l g(Y2, Z2j). (4.6)
Therefore,
P (X|m, θm) = P (X|m′, θm′). (4.7)
Next, we prove that m includes m′. Given parameters θm′ of m′, we define parameters
θm of m as follows:
• Potential for edge Y1 — Y2:

f(Y1 = <z11, z12, . . . , z1k>, Y2 = y2) = ∏_{i=1}^k g(Y2 = y2, Z1i = z1i).
• Potential for edge Y1 — Z1i, ∀i = 1, 2, . . . , k:

f(Y1 = <z11, z12, . . . , z1k>, Z1i = z′1i) = 1 if z1i = z′1i, and 0 otherwise.
• Set the other potentials in θm the same as those in θm′ .
It can be verified that Equation 4.6 and 4.7 also hold. Therefore, m and m′ are marginally
equivalent.
We now prove that m′ contains fewer independent parameters than m. To show this
point, we return to rooted models. The parameters θm and θm′ consist of a CPT for each
node in models m and m′. There are only two differences between θm and θm′ :
1. The CPTs P (Z1i|Y1) for all i = 1, 2, . . . , k and P (Y1|Y2) are peculiar to θm;
2. The CPTs P (Z1i|Y2) for all i = 1, 2, . . . , k are peculiar to θm′ .
Therefore, the difference between the numbers of free parameters in m and m′ is

d(m) − d(m′) = ∑_i (|Z1i| − 1)|Y1| + (|Y1| − 1)|Y2| − ∑_i (|Z1i| − 1)|Y2|
             = ∑_i (|Z1i| − 1)|Y1| + (∏_i |Z1i| − 1)|Y2| − ∑_i (|Z1i| − 1)|Y2|
             = ∑_i (|Z1i| − 1)|Y1| + (∏_i |Z1i| − ∑_i |Z1i| + k − 1)|Y2|
             ≥ ∑_i (|Z1i| − 1)|Y1|.
The second equality holds because Y1 is saturated and subsumes Z11, Z12, . . . , Z1k. The
last inequality holds because ∏_i |Z1i| ≥ ∑_i |Z1i| when |Z1i| ≥ 2 for all i = 1, 2, . . . , k.
It is clear that the difference is strictly positive when |Y1| ≥ 2. The model m hence has
more parameters than m′ when Y1 and the Z1i are non-trivial.
Finally, we compare the inferential complexity of m and m′. According to the con-
struction of m′, the clique tree of m′ is different from the clique tree of m in that it contains
one less clique {Y1, Y2} and replaces clique {Y1, Z1i} with {Y2, Z1i} for all i = 1, 2, . . . , k.
Algorithm 4.1 RedundantVariableAbsorption(m)
1: repeat
2:   for each pair of adjacent latent variables Y and Y′ in m do
3:     if Y and Y′ are saturated and Y subsumes Y′ then
4:       Denote the set of neighbors of Y′ by nb(Y′)
5:       Connect nb(Y′) \ {Y} to Y
6:       Remove Y′ from m
7:     end if
8:   end for
9: until no further changes
Therefore, the difference between the sums of clique sizes is

sum(m) − sum(m′) = |Y1||Y2| + ∑_i |Y1||Z1i| − ∑_i |Y2||Z1i|
                 = |Y2| ∏_i |Z1i| + ∑_i |Y1||Z1i| − ∑_i |Y2||Z1i|
                 = |Y2| (∏_i |Z1i| − ∑_i |Z1i|) + ∑_i |Y1||Z1i|
                 ≥ ∑_i |Y1||Z1i|.
The last inequality holds when |Z1i| ≥ 2 for all i = 1, 2, . . . , k. Therefore, the inferential
complexity of m′ is always lower than that of m when the Z1i are nontrivial.
Q.E.D.
Based on Proposition 4.4, we design a procedure called RedundantVariableAbsorption
for simplifying regular models. Algorithm 4.1 shows the pseudo code of this procedure. It
checks each pair of adjacent latent variables in the input model m. If one latent variable
of the pair is redundant, it is absorbed by the other. This process repeats until no further
changes. We use an example to demonstrate this procedure.
Example 4.4 Consider simplifying the regular model m′ in Figure 4.1b. We first check
the pair Y1 and Y2. Both of them are saturated while Y2 subsumes Y1. We thus remove Y1
and connect Y2 to X4 and X6. We then check Y3 and Y4. It turns out that Y3 is redundant.
Therefore, we remove it and connect Y4 to X2 and X3. The last pair to check is Y2 and
Y4. They are both saturated, but neither of them subsumes the other. Hence, they cannot
be removed. The final model m′′ is shown in Figure 4.1c.
4.4 Parameter Optimization
After we obtain a simplified model m, we run EM to optimize its parameters. Throughout
the learning process, we run EM only once. Therefore, we can afford a relatively high
setting. This can help avoid poor local maxima in the parameter space and usually leads
to better estimation of the generative distribution.
4.5 The HCL Algorithm
To wrap up, we now summarize the HCL algorithm for learning LTMs. HCL has two inputs: a
data set D and a predetermined bound Imax on the inferential complexity of the resultant
model. The output of HCL is an LTM that approximates the generative distribution
underlying D and satisfies the inferential complexity constraint. HCL is briefly described
as follows.
1. Obtain an LTM structure by performing hierarchical clustering of variables, using
empirical MI based on D as the similarity measure. (Section 4.1)
2. Set cardinalities of latent variables at Cmax according to Equation 4.3. (Section 4.2)
3. Simplify the model through regularization and redundant variable absorption. (Sec-
tion 4.3)
4. Optimize the parameters by running EM. (Section 4.4)
5. Return the resultant LTM.
4.6 Summary
We have presented a heuristic algorithm called HCL for learning LTMs. HCL deals with
density estimation problems with large samples, and assumes a predetermined bound
on the complexity of the resultant LTM. It constructs model structure by hierarchically
clustering manifest variables, and determines the cardinalities of latent variables based
on the given complexity constraint. In Chapter 6, we will empirically compare HCL with
EAST. In particular, we will show that HCL is about two to three orders of magnitude
faster than EAST. In Chapter 7, we will present an application of HCL in approximate
inference in Bayesian networks. We will use HCL to learn an LTM to approximate a given
Bayesian network, and utilize the approximate LTM to answer probabilistic queries.
CHAPTER 5
ALGORITHM 3: PYRAMID
We have so far presented two algorithms for learning LTMs, namely EAST and HCL.
EAST is a general-purpose algorithm. It finds high quality models and hence can yield
good solutions to density estimation problems. Moreover, it can discover interesting latent
structures behind data. However, it has a high computational cost. HCL is much faster
than EAST. However, it is a special-purpose algorithm, designed only for density estima-
tion problems where there is a predetermined bound on model complexity. Moreover, it
always produces binary latent tree structures, which are typically not meaningful.
In this chapter, we present a third algorithm called Pyramid. It is also a general-
purpose algorithm and is much faster than EAST. It usually finds high quality models
and discovers interesting latent structures as EAST does. As will be seen in Chapter 8,
Pyramid offers a better tradeoff between model quality and computational efficiency in
classification problems than EAST.
5.1 Basic Ideas
In this section, we explain the basic ideas of the Pyramid algorithm.
5.1.1 Bottom-up Construction of Model Structure
The Pyramid algorithm is an extension to HCL. The two algorithms both exploit the
same intuition, i.e., sibling variables in LTMs are in general more closely correlated than
variables that are located far apart. To utilize this intuition in LTM construction, one can
first somehow identify variables that are closely correlated and then make them siblings
in the LTM under construction.
Here is a general bottom-up procedure for constructing LTMs that adopts the above
idea. The procedure keeps a working list of variables that initially consists of all manifest
variables. It operates as follows:
1. Choose from the working list a closely correlated subset of variables;

2. Introduce a new latent variable as the parent of each variable in the subset;

3. Remove from the working list the variables in the subset and add to it the new latent variable;

4. Repeat until there is only one variable in the working list.
Apparently, step 1 is the critical step in the procedure. The subset of variables that it
chooses become siblings after step 2. So, we say that the task of step 1 is sibling cluster
determination.
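The four-step procedure above can be sketched in a few lines of Python. This is an illustrative skeleton, not the thesis's implementation; the `choose_sibling_cluster` argument stands in for step 1, and the placeholder policy below simply groups the first two variables in the list:

```python
def build_structure(manifest_vars, choose_sibling_cluster):
    """Generic bottom-up LTM structure construction (steps 1-4 above)."""
    working = list(manifest_vars)   # the working list
    parent = {}                     # child variable -> latent parent
    next_id = 1
    while len(working) > 1:
        cluster = choose_sibling_cluster(working)  # step 1
        latent = f"Y{next_id}"                     # step 2
        next_id += 1
        for v in cluster:                          # step 3
            parent[v] = latent
            working.remove(v)
        working.append(latent)
    return parent, working[0]                      # step 4: single root left

# Placeholder policy for step 1: make the first two variables siblings.
parent, root = build_structure(["X1", "X2", "X3", "X4"], lambda w: w[:2])
print(root)  # Y3
```

In this skeleton, HCL corresponds to always returning the highest-MI pair, while Pyramid's cluster choice is developed in the remainder of this chapter.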
HCL takes a simple approach to sibling cluster determination. It simply chooses, from the working list, the two variables with the highest mutual information as the members of the sibling cluster. So, at each iteration, it introduces a new latent variable as the parent of two existing variables. The final model structure is hence a binary tree.
To overcome this limitation of HCL, we allow Pyramid to create sibling clusters with
more than two variables. Pyramid starts with a subset consisting of the two variables
with the highest mutual information. The other variables from the working list are then
added to the subset one by one until a certain condition is met. A sibling cluster is then
chosen from the subset.
There are three questions. First, at each step, which variable do we choose to add
to the current subset? Second, when do we stop growing the subset? Third, how do we
obtain a sibling cluster from the final subset?
Here is the answer to the first question: At each step, we add to the current subset the
variable from the working list that has the highest mutual information with the current
subset. A definition of mutual information between a variable and a set of variables will
be given in Section 5.2.
The other two questions will be answered in Sections 5.1.3 and 5.1.4, respectively. In preparation, we next introduce the unidimensionality test.
5.1.2 Unidimensionality Test
We begin with some definitions.
Definition 5.1 An LTM is called a single-latent tree (1-LT) model if it contains only
one latent variable.
Definition 5.2 An LTM is called a twin-latent tree (2-LT) model if it contains exactly
two latent variables.
Definition 5.3 An LTM is simple if it is either a 1-LT or a 2-LT model.
Suppose we are given a data set D over a set of variables X. Let S be a subset of X with more than 2 variables, and let D↓S be the projection of D onto S. Among all simple models over S, denote the model with the highest AIC score with respect to D↓S by M⋆_S.

Definition 5.4 A subset S of variables is unidimensional if M⋆_S is a 1-LT model.
The unidimensionality test checks whether a subset S of variables is unidimensional. This is done in two steps:

1. Find the optimal simple LTM M⋆_S over S;

2. Return true if M⋆_S is a 1-LT model; return false, otherwise.
The first step is a non-trivial task. We will address it in Section 5.4.
When it is necessary to distinguish between simple models over S and the LTM over
X under construction, we will refer to the former as local models and the latter as the
global model.
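Abstracting away the model search of step 1 (addressed in Section 5.4), the test reduces to an AIC comparison between the best 1-LT and best 2-LT local models. A minimal sketch, assuming the two models are summarized by hypothetical (log-likelihood, dimension) pairs and using the thesis's AIC score of log-likelihood minus model dimension:

```python
def aic(loglik, dim):
    """AIC score as used in this thesis: log-likelihood minus model
    dimension, so higher is better."""
    return loglik - dim

def unidimensional(best_1lt, best_2lt):
    """True iff the best simple model over S is the 1-LT model.

    best_1lt, best_2lt: hypothetical (log-likelihood, dimension) pairs
    for the best 1-LT and 2-LT models on the projected data."""
    return aic(*best_1lt) >= aic(*best_2lt)

# The extra latent variable of the 2-LT model gains little likelihood
# but pays for its extra parameters, so the subset counts as unidimensional.
print(unidimensional(best_1lt=(-5000.0, 20), best_2lt=(-4995.0, 32)))  # True
```

Breaking ties in favor of the 1-LT model is a choice made here for illustration; the thesis only requires picking the model with the highest score.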
5.1.3 Subset Growing Termination
Let us return to the second question raised in Section 5.1.1, i.e., when to stop growing the subset. Our answer is to conduct a unidimensionality test. After adding a new variable to the current subset S, we test S for unidimensionality. If S passes the test, we continue to grow it. Otherwise, we stop.
Let M⋆ be the optimal LTM over X with respect to D, i.e.,

M⋆ = arg max_M AIC(M|D).
The intuition is that, if a subset S of X is not unidimensional, then the variables in S are
probably not from the same sibling cluster in M⋆. Growing it further would not change
the fact. So, we stop.
Example 5.1 Consider m manifest variables X1, X2, . . . , Xm. Assume that the pair X1 and X2 has the highest MI among all possible pairs; X3 has the highest MI with {X1, X2} among X3, . . . , Xm; X4 has the highest MI with {X1, X2, X3} among X4, . . . , Xm; and so on.
To determine a sibling cluster, Pyramid starts with the subset S = {X1, X2} and
grows it gradually. First, it adds X3 to S and runs a unidimensionality test. Suppose the
optimal simple model over X1–X3 is a 1-LT, as shown in Figure 5.1a. Then the new
subset S = {X1, X2, X3} passes the test and the subset growing process continues.
Figure 5.1: An example subset growing process: (a) Step 1; (b) Step 2; (c) Step 3. The numbers within the parentheses denote the cardinalities of the latent variables.
Pyramid next adds X4 to S. Suppose the optimal simple model over X1–X4 is still a 1-LT,
as shown in Figure 5.1b. Then the new subset again passes the unidimensionality test and
the subset growing process moves on.
Pyramid then adds X5 to S. Suppose the optimal simple model over X1–X5 is a 2-LT,
as shown in Figure 5.1c. Then S fails the unidimensionality test and the subset growing
process terminates.
Pyramid can now obtain a sibling cluster from S. The next subsection explains how.
5.1.4 Sibling Cluster Determination
When the subset growing process stops, we have a subset S of variables from the working list. We have put S through the unidimensionality test. We use the intermediate results computed during the test to determine a sibling cluster.
Let M⋆_S be the optimal simple model over S that was found during the unidimensionality test. It is a 2-LT model. So it has two latent variables and hence two sibling clusters. These clusters should not be confused with sibling clusters in the global model under construction. Denote the two sibling clusters in M⋆_S by S1 and S2, where S = S1 ∪ S2. Let X1 and X2 be the pair of variables with which we started the subset growing process. Recall that, among all possible pairs of variables in the working list, X1 and X2 have the highest MI.
Pyramid chooses one of S1 and S2 as a sibling cluster for the global model. Specifically, it picks the one that contains both X1 and X2. In case X1 and X2 lie in different clusters, Pyramid chooses between S1 and S2 arbitrarily. However, this seldom happens in practice.
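This selection rule can be sketched as follows; `pick_sibling_cluster` here is an illustrative helper operating directly on the two clusters, not the thesis's subroutine of the same purpose given later in Algorithm 5.2:

```python
def pick_sibling_cluster(s1, s2, x1, x2):
    """Pick, from the two clusters of the 2-LT model, the one containing
    both seed variables; fall back to an arbitrary cluster when the seeds
    are split (which, as noted above, seldom happens in practice)."""
    for cluster in (s1, s2):
        if x1 in cluster and x2 in cluster:
            return cluster
    return s1  # arbitrary choice

chosen = pick_sibling_cluster({"X1", "X2", "X3"}, {"X4", "X5"}, "X1", "X2")
print(sorted(chosen))  # ['X1', 'X2', 'X3']
```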
Without loss of generality, suppose S1 is chosen as the sibling cluster. The next step is to introduce a new latent variable Y to the global model and make it the parent of each variable in S1. An immediate question is what the cardinality of Y should be. In M⋆_S, the variables in S1 share one common parent, which we denote by Z. We set the cardinality of Y to be the same as that of Z.

Figure 5.2: The model structure after adding one latent variable Y1.
Example 5.2 Consider the subset S that we obtained in Example 5.1. It consists of X1–
X5. The optimal simple model over S is a 2-LT, as shown in Figure 5.1c. It contains two
sibling clusters {X1, X2, X3} and {X4, X5}.
Since {X1, X2, X3} contains X1 and X2, it is picked as a sibling cluster for the global
model under construction. A new latent variable Y1 is introduced as the parent of the three
variables. This gives the model structure shown in Figure 5.2. The cardinality of Y1 is set to be the same as that of the latent variable Y in Figure 5.1c.
Going back to the general procedure outlined in Section 5.1.1, we have illustrated
steps 1–3. In step 4, the variables X1, X2, X3 are removed from the working list and
Y1 is added to the working list. The process then repeats itself. In the next iteration,
a new latent variable might be introduced for several other manifest variables, or for Y1
and some other manifest variables.
5.2 The Pyramid Algorithm
Algorithm 5.1 shows the pseudo code for Pyramid. It takes a data set D over a set of variables X as the input, and outputs an LTM M with manifest variables X.

The algorithm begins with a BN M that consists of the nodes from X and that contains no edges (line 2). Latent nodes will be added to M one by one. By the end, M will become an LTM.

To decide what latent nodes to add to M and how to add them, the algorithm maintains a working list W. Initially, it consists of a copy of all the variables from X (line 3). Variables will then be added to and removed from W. The newly added variables will be copies of the latent nodes introduced to M. By the end, W will become a singleton set, containing only a copy of the root of the final model.
In each pass through the main while loop (lines 4–19), the algorithm first picks the
pair of variables from W that has the highest MI among all possible pairs (line 5). The
Algorithm 5.1 Pyramid(D)
1: Let X be the set of variables in D
2: M ← BN with nodes X and no edges
3: W ← a copy of X
4: while |W| > 1 do
5:     {W1, W2} ← arg max_{W,W′ ∈ W} I(W; W′)
6:     S ← {W1, W2}
7:     while |S| < |W| do
8:         {W3, W4} ← arg max_{S∈S, W∈W\S} I(S; W)
9:         S ← S ∪ {W4}
10:        M_S ← RestrictedExpand(S, W3, W4, M, D)
11:        if M_S is a 2-LT model then
12:            break
13:        end if
14:    end while
15:    M_S ← PickSiblingCluster(M_S, W1)
16:    M ← AddNode(M, M_S, D)
17:    Let Y be the new node just added to M
18:    W ← W \ {the children of Y in M} ∪ {a copy of Y}
19: end while
20: return Refine(M, D)
Algorithm 5.2 PickSiblingCluster(M_S, W1)
1: Let Y be the latent node in M_S that is not the parent of W1
2: Remove Y and all its manifest children from M_S
3: return M_S
MI between two manifest variables is estimated based on data. The estimation of the MI
between two latent variables or between a latent variable and a manifest variable is not
trivial. This issue will be discussed in Section 5.3.
In lines 6–14, the algorithm determines a sibling cluster. It does so by iteratively growing a set S that initially contains the pair of variables picked at line 5. In each pass through the inner while loop, the algorithm first computes the MI between each variable W from W \ S and the subset S. In particular, we adopt the minimum linkage definition (Duda & Hart, 1973). That is, the MI between a variable W and the subset S of variables is defined to be the maximum of the MI values between W and each variable S in S:

I(W; S) = max_{S∈S} I(W; S).
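Given precomputed pairwise MI estimates, this linkage is a one-line maximization. A small sketch, assuming the pairwise values are stored in a dict keyed by unordered pairs (the numbers are hypothetical):

```python
def linkage_mi(w, subset, mi):
    """MI between variable w and a subset of variables: the maximum of
    the pairwise MI values, per the definition above."""
    return max(mi[frozenset((w, s))] for s in subset)

# Hypothetical pairwise MI estimates, keyed by unordered pairs.
mi = {frozenset(p): v for p, v in [
    (("X1", "X3"), 0.20), (("X2", "X3"), 0.35),
    (("X1", "X4"), 0.10), (("X2", "X4"), 0.05)]}

print(linkage_mi("X3", ["X1", "X2"], mi))  # 0.35 -> X3 joins S before X4
print(linkage_mi("X4", ["X1", "X2"], mi))  # 0.1
```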
The variable W4 from W \ S that is most closely related to S is then picked at line 8. It is added to S at line 9. A unidimensionality test is then carried out on S in lines 10–13. The key step is the search for the optimal simple model over S. This is done using a subroutine called RestrictedExpand. This subroutine uses the variable W3 in S that has the highest MI with W4. Section 5.4 will explain how the subroutine works.
When S fails the unidimensionality test, the algorithm proceeds to obtain a sibling
Algorithm 5.3 AddNode(M, M_S, D)
1: Let L be the set of nodes in M that correspond to the leaf nodes of M_S
2: Create a new latent node Y and set its cardinality to be that of the root of M_S
3: Add Y to M and make it the parent of each node in L
4: Let M_Y be the submodel of M rooted at Y
5: return EM(M_Y, D)
cluster using a subroutine called PickSiblingCluster (line 15). The pseudo code of the subroutine is given in Algorithm 5.2. Because S fails the unidimensionality test, the simple model M_S contains two latent nodes. PickSiblingCluster prunes from M_S all the nodes that are not the parent or siblings of W1, and returns the resulting 1-LT model. The sibling cluster obtained is the set of leaf nodes remaining in M_S after the pruning. Note that the subroutine returns the 1-LT model rather than just the sibling cluster itself. This is because another aspect of the model will also be needed in the next step.
After line 15 of Pyramid, M_S is a 1-LT model. Let L be the set of the nodes in M that correspond to the leaf nodes of M_S. It is the sibling cluster obtained. At line 16, the subroutine AddNode (see Algorithm 5.3) is called to create a new latent node Y and add it to the BN M. It is made the parent of each node in L. The cardinality of Y is set to be the same as that of the root of M_S. The parameters pertaining to the new node Y and to the subtree rooted at Y are optimized using EM, so that future MI estimates will be accurate.
At line 18 of Pyramid, the nodes in L are removed from the working list W because
they already have a parent Y . A copy of Y is then added to W so that Y can be connected
to future latent nodes.
After line 19, M becomes an LTM. At line 20, a subroutine Refine is called to optimize the cardinalities of the latent variables in M, as well as the model parameters. This subroutine will be discussed in Section 5.5. The output of the subroutine is the output of the entire algorithm.
In the next three sections, we present more implementation details about the Pyramid algorithm.
5.3 Mutual Information
At lines 5 and 8, Pyramid needs to compute MI between variables. In this section, we
discuss the issues related to the calculation.
5.3.1 MI Between Manifest Variables
Theoretically, exact computation of MI is possible only when there is a probabilistic model
over the variables. However, in density estimation problems, we only have a data set at
hand. Therefore, we can only estimate MI based on data.
Suppose D is a data set over a set of manifest variables X. Let P(X) be the empirical distribution over X induced by D. The empirical MI between two manifest variables X and X′ is defined as

I(X; X′) = Σ_{X,X′} P(X, X′) log [ P(X, X′) / ( P(X) P(X′) ) ].

We use this empirical quantity to estimate the MI I(X; X′).
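The empirical MI can be computed directly from co-occurrence counts. A minimal sketch (natural logarithm, so the result is in nats):

```python
from collections import Counter
from math import log

def empirical_mi(pairs):
    """Empirical MI between two manifest variables, computed from a list
    of jointly observed value pairs (x, x')."""
    n = len(pairs)
    joint = Counter(pairs)                # empirical P(X, X')
    px = Counter(x for x, _ in pairs)     # empirical P(X)
    py = Counter(y for _, y in pairs)     # empirical P(X')
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Perfectly correlated binary variables: MI = log 2 nats.
data = [(0, 0), (1, 1)] * 50
print(round(empirical_mi(data), 4))  # 0.6931
```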
5.3.2 MI Between a Latent Variable and a Manifest Variable
Pyramid sometimes needs the MI between two latent variables or between a latent variable and a manifest variable. As an example, consider the situation shown in Figure 5.2. The latent variable Y1 has just been added to the global model. Consequently, X1, X2, and X3 have just been removed from the working list W, and Y1 has just been added to it. In the next iteration of the main while loop, Pyramid will need the MI between Y1 and X4–Xm at lines 5 and 8.
MI cannot be estimated solely from the empirical distribution when latent variables are involved. This is because the values of latent variables are not available in data. We address this issue in the remainder of this section. We begin with the task of estimating the MI I(Y; X) between a latent variable Y and a manifest variable X.

Pyramid builds an LTM by adding latent nodes to a BN M that initially contains no latent nodes and no edges. At the time when it needs the MI I(Y; X), there is already a sub-model in M rooted at Y. We denote it by M_Y. In Figure 5.2, for example, when Pyramid needs to calculate the MI between the latent variable Y1 and the manifest variables X4–Xm, there is a sub-model rooted at Y1, and it happens to be a 1-LT model.
The values of Y are not available in data. Conceptually, Pyramid exploits the sub-model M_Y to complete the values of Y in data, and then estimates the MI I(Y; X) based on the completed data. Let X_Y be the set of manifest variables in M_Y. Technically, Pyramid first obtains a joint distribution over Y and X from the data D and the sub-model M_Y as follows:

P_{D,M}(Y, X) = (1/|D|) Σ_{d∈D} P(Y | d↓X_Y, M_Y) 1_X(d↓X),

where |D| is the sample size of D, 1_X(·) is the indicator function, and d↓X_Y and d↓X denote the values of X_Y and X in data case d, respectively. Then, Pyramid estimates the MI I(Y; X) using the following quantity:

I_{D,M}(Y; X) = Σ_{Y,X} P_{D,M}(Y, X) log [ P_{D,M}(Y, X) / ( P_{D,M}(Y) P_{D,M}(X) ) ].
In order for the estimation to be accurate, the parameters in the sub-model M_Y should be optimized. This is done at the end of the subroutine AddNode, after the introduction of each latent node.
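The construction of P_{D,M}(Y, X) and the resulting MI estimate can be sketched as follows. The `posterior` argument is a stand-in for inference in the sub-model M_Y; the toy posterior and data below are hypothetical:

```python
from collections import defaultdict
from math import log

def latent_manifest_mi(cases, posterior):
    """Estimate I(Y; X) via the completed-data joint P_{D,M}(Y, X) above.

    cases: list of (x_y_values, x_value) pairs extracted from D;
    posterior(x_y_values) returns {y: P(y | x_y_values, M_Y)} and stands
    in for inference in the sub-model M_Y."""
    n = len(cases)
    joint = defaultdict(float)
    for xy, x in cases:
        for y, p in posterior(xy).items():
            joint[(y, x)] += p / n            # soft count for (Y=y, X=x)
    py, px = defaultdict(float), defaultdict(float)
    for (y, x), p in joint.items():
        py[y] += p
        px[x] += p
    return sum(p * log(p / (py[y] * px[x]))
               for (y, x), p in joint.items() if p > 0)

# Toy posterior: Y mirrors the first value of X_Y with probability 0.9.
post = lambda xy: {xy[0]: 0.9, 1 - xy[0]: 0.1}
cases = [((0, 0), 0), ((1, 1), 1)] * 50       # X copies that value as well
mi = latent_manifest_mi(cases, post)
print(round(mi, 3))  # 0.368
```

The two-latent-variable case of Section 5.3.3 is analogous, with the indicator on X replaced by a second posterior.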
5.3.3 MI Between Two Latent Variables
We next consider the MI I(Y; Y′) between two latent variables Y and Y′. When Pyramid needs this MI, there is a sub-model rooted at Y and another rooted at Y′ in M. We denote them by M_Y and M_Y′, respectively. Let X_Y and X_Y′ be the sets of manifest variables in the two sub-models. Pyramid first obtains a joint distribution over Y and Y′ from the data D and the two sub-models M_Y and M_Y′ as follows:

P_{D,M}(Y, Y′) = (1/|D|) Σ_{d∈D} P(Y | d↓X_Y, M_Y) P(Y′ | d↓X_Y′, M_Y′).

Then, it estimates the MI I(Y; Y′) using the following quantity:

I_{D,M}(Y; Y′) = Σ_{Y,Y′} P_{D,M}(Y, Y′) log [ P_{D,M}(Y, Y′) / ( P_{D,M}(Y) P_{D,M}(Y′) ) ].
5.4 Simple Model Learning
At line 10, Pyramid tries to find the optimal simple LTM over a set S. It does so by calling
a subroutine named RestrictedExpand. In this section, we explain how the subroutine
works. Because it is called many times, efficiency was an important consideration when
designing the subroutine.
The set S consists of variables from the working list W. It might contain only manifest
variables, or it might contain latent variables as well as manifest variables. We will focus
on the first case in Sections 5.4.1 and 5.4.2, and deal with the second case in Section 5.4.3.
5.4.1 Exhaustive Search
Suppose S consists of only manifest variables. A conceptually straightforward way to find
the optimal simple model over S is exhaustive search. That is, we enumerate all simple
Algorithm 5.4 RestrictedExpand(S, W, W′, M, D)
1: M_S ← 1-LT model over S with a binary latent variable Y
2: while true do
3:     if M_S is a 1-LT model then
4:         M1 ← SI(M_S, Y)
5:         M2 ← NI(M_S, Y, W, W′)
6:         M′_S ← PickModel1-IR({M1, M2}, M_S)
7:     else
8:         Let Y and Y′ be the two latent variables in M_S
9:         M1 ← SI(M_S, Y)
10:        M2 ← SI(M_S, Y′)
11:        M′_S ← PickModel1({M1, M2}, M_S)
12:    end if
13:    if AIC(M′_S | M, D) ≤ AIC(M_S | M, D) then
14:        return M_S
15:    else if M′_S is a 2-LT model then
16:        M_S ← EnhanceNI(M′_S, M_S, D)
17:    else
18:        M_S ← M′_S
19:    end if
20: end while
regular LTMs over S, calculate their AIC scores, and return the one with the highest
score. This solution is computationally intractable because there are exponentially many
regular 2-LT structures over S, and hence exponentially many simple regular LTMs.
Proposition 5.1 There are 2^(m−1) − 1 different regular 2-LT structures over m manifest variables.
Proof: A regular 2-LT structure is uniquely determined by a partition of the m manifest variables into two non-empty sibling clusters. Therefore, the number of regular 2-LT structures equals the number of such partitions, i.e.,

(1/2) Σ_{k=1}^{m−1} C(m, k) = 2^(m−1) − 1.

Q.E.D.
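The count in Proposition 5.1 is easy to confirm by brute-force enumeration for small m, since each regular 2-LT structure corresponds to an unordered pair of non-empty clusters:

```python
from itertools import combinations

def count_2lt_structures(m):
    """Count bipartitions of m variables into two non-empty, unordered
    sibling clusters by direct enumeration."""
    variables = range(m)
    seen = set()
    for k in range(1, m):
        for cluster in combinations(variables, k):
            rest = tuple(v for v in variables if v not in cluster)
            seen.add(frozenset((cluster, rest)))  # unordered pair of clusters
    return len(seen)

for m in range(2, 9):
    assert count_2lt_structures(m) == 2 ** (m - 1) - 1
print("verified for m = 2..8")
```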
5.4.2 Restricted Expansion
Instead of exhaustive search, RestrictedExpand finds a simple LTM over S through hill-
climbing. The pseudo code of the subroutine is given in Algorithm 5.4. It is similar to the
expansion phase of the EAST algorithm discussed in Chapter 3, except that a number of
differences are introduced to make it as efficient as possible. As mentioned earlier, the
subroutine is called many times, and hence efficiency is critical.
Like EAST, RestrictedExpand starts the search from the 1-LT model over S that
has a single binary latent variable (line 1).
At each step of the search, RestrictedExpand generates a number of candidate models (lines 4, 5, 9, 10). Here, it differs from EAST in two aspects. First, it no longer considers applying the NI operator once the current model M_S has two latent nodes (lines 9 and 10). This is because the objective here is to find the optimal simple LTM, i.e., the best model among LTMs with 1 or 2 latent nodes. There is hence no need to consider models with more than 2 latent nodes.
The second difference is that RestrictedExpand does not consider all possible ways
to apply the NI operator when there is only one latent node. Rather, it considers only
the NI operation that introduces a new latent node to mediate the existing latent node
Y and two of its children W and W ′ (line 5). The two nodes W and W ′ are provided as
arguments by the call to the subroutine.
This second modification clearly reduces the number of candidate models. However,
does it significantly reduce the chance to find a good simple model, and hence compromise
the quality of unidimensionality test? To answer this question, we need to go back to line
10 of the Pyramid algorithm where the subroutine is called. There, a new variable W4
has just been added to S. Let S′ = S \ {W4}. The variable W3 is the one in S′ that has
the highest MI with W4. In the subroutine call, W is W3 and W ′ is W4.
In EAST, |S|(|S| − 1)/2 possible ways to apply the NI operator are considered, one for each pair of nodes in S. Divide those operations into two groups: (1) those for pairs of nodes in S′, and (2) those for pairs that consist of W4 and a node in S′. The operations in the first group have already been considered in previous unidimensionality tests on the subsets created while growing the initial subset {W1, W2} up to S′. They were found to be inferior to the corresponding SI operations on the previous simple models, or not to improve the AIC scores of those models. As a heuristic, we believe that the same would be true if they were applied to the current simple model. So, we do not consider them.
Among the operations in the second group, we believe that, as another heuristic, the
operation that introduces a new latent variable for the pair W3 and W4 would bring about
the most (if any) improvement over the current model. This is because W3 is the variable
in S′ that has the highest MI with W4. So, we choose to consider this operation only.
After generating the candidate models, RestrictedExpand evaluates them one by one
and picks the best model from the list. This is done by invoking subroutines PickModel1-IR
and PickModel1 at lines 6 and 11. Here comes the third difference between
RestrictedExpand and EAST. EAST needs to deal with a large number of candidate
models. Therefore, it uses PickModel and PickModel-IR (Algorithms 3.7 and 3.8) for
efficient model evaluation, which adopts a two-stage strategy. In the first screening stage, PickModel and PickModel-IR optimize the parameters of the candidate models by running local EM at a low setting, calculate the AIC/IR scores of the candidate models based on the estimated parameters, and prune most candidate models with low AIC/IR scores from consideration. Then, in the second evaluation stage, they refine the parameters of the remaining candidate models by running local EM at a relatively high setting, and pick the best one based on the parameter estimates obtained.

Algorithm 5.5 PickModel1(L, M)
1: for each M′ ∈ L do
2:     Run LocalEM(M, M′, µ, ν) to estimate the parameters of M′
3: end for
4: return the model in L with the highest AIC score as given by Equation 3.2

Algorithm 5.6 PickModel1-IR(L, M)
1: for each M′ ∈ L do
2:     Run LocalEM(M, M′, µ, ν) to estimate the parameters of M′
3: end for
4: return the model in L with the highest IR score as given by Equation 3.4
In contrast, RestrictedExpand only needs to handle a small number of candidate models. Therefore, PickModel1 and PickModel1-IR accomplish model evaluation in one stage, as shown in Algorithms 5.5 and 5.6.
PickModel1 and PickModel1-IR have two algorithmic parameters, µ and ν. They control the local EM in the same way as in PickModel and PickModel-IR. In practice, we use a high setting for µ and ν to obtain accurate parameter estimates. Since PickModel1 and PickModel1-IR only deal with two candidate models, which are usually small, we believe that running local EM at a high setting will not significantly slow down the model evaluation process.
After the model evaluation/selection phase, one candidate model is chosen.
RestrictedExpand compares the AIC score of the picked candidate model with that
of the current model (line 13). If the score does not increase, the search terminates and
the current model is returned as the output of RestrictedExpand (line 14). Otherwise,
the search continues (lines 15–19). If a new latent node has just been introduced to the
best candidate model, the subroutine EnhanceNI (Algorithm 3.4) is called to adjust con-
nections among nodes so that potentially more manifest nodes can be connected with the
new latent node (line 17). This is the same as in EAST.
5.4.3 When S Contains Latent Variables
We now deal with the case when S contains latent variables. We begin with an example. Suppose the current global model M is as shown in Figure 5.2. To introduce the next latent node, Pyramid might call RestrictedExpand to find a simple model over Y1 and some manifest variables, say X4–X6. In that process, it might need to evaluate candidate models such as the model M_Y shown in Figure 5.3a. An issue with the evaluation of this model is that the node Y1 is latent and its values are absent from the data D. Therefore, we cannot directly calculate the AIC score of M_Y.

Figure 5.3: An example for evaluating simple models over latent variables: (a) candidate model M_Y; (b) sub-model M_Y1; (c) concatenated model M_YY1. The colors of the nodes in Figure 5.3c indicate where the parameters of the nodes come from.
To solve this problem, we exploit the same idea that was used earlier in Section 5.3 to estimate the MI between latent variables. We notice that, in the current global model M, there is a sub-model rooted at Y1. We refer to this sub-model as M_Y1. For convenience, we reproduce it here in Figure 5.3b. The idea is to use the model M_Y1 to complete the values for Y1 in the data, and then use the completed data to evaluate the candidate model M_Y.

Technically, to evaluate M_Y, we first concatenate it with M_Y1. The result is a two-layer LTM M_YY1, as shown in Figure 5.3c. In particular, M_YY1 borrows the parameters of Y, Y′, Y1, X4–X6 from M_Y, and the parameters of X1–X3 from M_Y1. Note that the leaf nodes in M_YY1 are all manifest variables. We then calculate the AIC score of M_YY1, and return it as the AIC score of M_Y.
In general, suppose we need to evaluate a candidate model M_S over S, where S contains a set Y of latent variables. For each Y ∈ Y, there is a sub-model rooted at Y in the current global model M. Denote it by M_Y.¹ To estimate the AIC score of M_S, we first concatenate M_S with the sub-models M_Y for all Y ∈ Y. Refer to the resultant model as M_SY. We then calculate the AIC score of M_SY, and return it as the AIC score of M_S. This estimation is formalized as follows:

AIC(M_S|D) ≈ Σ_{d∈D} log Σ_Y P(d↓S\Y, Y | M_S) Π_{Y∈Y} P(d↓X_Y | Y, M_Y) − d(M_SY),

where d(M_SY) is the dimension of the concatenated model M_SY.
¹This is not to be confused with the model M_Y as shown in Figure 5.3a.
Algorithm 5.7 Refine(M, D)
1: while true do
2:     M′ ← arg max_{M′ ∈ SI(M)} AIC(M′|D)
3:     if AIC(M′|D) < AIC(M|D) then
4:         break
5:     else
6:         M ← M′
7:     end if
8: end while
9: M ← EM(M, D)
10: return M
5.5 Cardinality and Parameter Refinement
Pyramid starts with a BN M that consists of only manifest nodes and adds latent nodes
to it one by one. Latent nodes are added by the subroutine AddNode. When a latent
node is added, its cardinality is also determined (line 2 of AddNode). The cardinality is
determined locally using information from a subset of manifest variables. In the final
model, however, the latent node is connected to all manifest variables via other latent
nodes. As such, the cardinality might need to be increased so as to achieve better model
fit.
Consider the latent variable Y1 that was introduced in Example 5.2. As shown in Figure 5.1c, the cardinality of Y1 was determined based on the interactions among the manifest variables X1–X5. Specifically, Y1 was introduced to capture the interactions between two subsets of manifest variables, namely {X1, X2, X3} and {X4, X5}. The tighter the interactions, the higher the cardinality of Y1 needs to be. In the final model, on the other hand, Y1 needs to capture not only the interactions between {X1, X2, X3} and {X4, X5}, but also those between {X1, X2, X3} and all the other manifest variables. Hence, its cardinality might need to be increased.
After the completion of model structure construction at line 19, Pyramid invokes the
subroutine Refine to determine whether the cardinalities of latent variables should be
increased and to increase them if necessary. The pseudo code for Refine is shown in
Algorithm 5.7. It is a hill-climbing process. At each step, the possibility of increasing the
cardinality of each latent variable by one is considered. This results in a list of candidate
models, with one corresponding to each latent variable. The candidate model with the
highest AIC score is then picked. If the chosen candidate model improves over the current
model, the search moves on to the next step. Otherwise, the search process terminates.
During the hill-climbing process, local EM instead of full EM is run on the candidate
models for the sake of efficiency. At the end of the search, full EM is run on the final
model to optimize the parameters.
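The hill-climbing loop of Refine can be sketched with the EM and scoring machinery abstracted away; a model is represented only by its tuple of latent cardinalities, and `score` below is a hypothetical stand-in for AIC(M′|D) after local EM:

```python
def refine_cardinalities(cards, score):
    """Greedy loop of Refine with EM abstracted away: repeatedly raise one
    latent cardinality by one while the best such move improves the score."""
    cards = list(cards)
    while True:
        candidates = [cards[:i] + [c + 1] + cards[i + 1:]
                      for i, c in enumerate(cards)]
        best = max(candidates, key=score)
        if score(best) < score(cards):   # no candidate improves: stop
            break
        cards = best
    return cards

# Hypothetical score: likelihood gain saturates at cardinality 3, while
# every extra state costs one unit of penalty.
score = lambda cs: sum(min(c, 3) * 10 - c for c in cs)
print(refine_cardinalities([2, 2], score))  # [3, 3]
```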
5.6 Summary
In this chapter, we have developed a third algorithm, called Pyramid, for learning LTMs.
It is designed to be a general-purpose algorithm (1) that is much faster than EAST and (2)
that can find good quality models and discover interesting latent structures. In the next
chapter, we will provide empirical evidence that Pyramid indeed meets the objectives.
Furthermore, in Chapter 8, we will apply both Pyramid and EAST to classification. The
reader will see that Pyramid represents a better tradeoff between computational time and
model quality/classification performance than EAST.
CHAPTER 6
EMPIRICAL EVALUATION
We have presented three algorithms for learning LTMs for the task of density estima-
tion, namely EAST, HCL, and Pyramid. In this chapter, we empirically evaluate the
performance of those algorithms on both synthetic and real-world data.
6.1 Data Sets
We used 3 synthetic data sets and 2 real-world data sets in our experiments. They are
detailed in the following 2 subsections.
6.1.1 Synthetic Data Sets
The synthetic data sets were generated from 3 LTMs. The generative models contain 7, 12, and 18 manifest variables. We denote them by M^G_7, M^G_12, and M^G_18, respectively.
The structures of the models are shown in Figure 6.1. All the manifest variables in the
models take 3 values. The cardinalities of the latent variables are denoted by the numbers
within the parentheses in the latent nodes. The parameters of the models were randomly
generated such that correlations along the edges are strong.
From each model, we sampled a training set with 5k instances over the manifest variables. We also sampled a separate 5k test set. In the experiments, we ran the learning algorithms on the training sets and evaluated the quality of the resulting LTMs using the test sets. Henceforth, we use D7, D12, and D18 to denote the 3 training/test pairs sampled from M^G_7, M^G_12, and M^G_18, respectively.
6.1.2 Real-World Data Sets
CoIL Challenge 2000 Data
The first real-world data set comes from the contest of CoIL Challenge 2000 (van der
Putten & van Someren, 2000). We refer to it as the CoIL data for short. This data
contains information on customers of a Dutch insurance company. Its training set and
test set consist of 5,822 and 4,000 customer records, respectively. Each record consists of
86 attributes, containing socio-demographic information (Attributes 1–43) and insurance
Figure 6.1: The structures of the generative models of the 3 synthetic data sets: (a) M^G_7; (b) M^G_12; (c) M^G_18.
product ownerships (Attributes 44–86). The socio-demographic data are derived from
zip codes. In previous analyses, these variables were found to be more or less useless. In
our experiments, we included only three of them, namely Attributes 4 (average age),
5 (customer main type), and 43 (purchasing power class). All the product ownership
attributes were included in the experiments.
The data was preprocessed as follows: First, similar attribute values were merged so
that there are at least 30 records in the training set for each value. In the resulting
training set, there are fewer than 10 records where Attributes 50, 60, 71, and 81 take
nonzero values. Those attributes were excluded from the experiments. The final data set
consists of 42 attributes, each with 2 to 9 values.
Kidney Deficiency Data
The second real-world data set is survey data from the domain of traditional Chinese
medicine (TCM) (Zhang et al., 2008a, 2008b). The data set contains symptom information
on seniors at or above the age of 60 years from several regions in China. It consists
of 35 symptom variables and 2,600 patient records. The symptom variables are the
most important factors that a TCM doctor would consider when determining whether a
patient has an illness condition called Kidney Deficiency and if so, which subtype. Hence,
we refer to the data set as the Kidney data. Each symptom variable has four possible
values, namely ‘no’, ‘light’, ‘medium’, and ‘severe’, representing four severity levels. In
our experiments, we split the data into a training set with 2,000 cases and a test set with
600 cases.
6.2 Measures of Model Quality
In density estimation problems, an estimate is of high quality if it is close to the generative
distribution. Hence, for LTMs learned from the synthetic data, we measure their quality
using the empirical KL divergence from the generative models. Given a generative model
M_G and test data D, the empirical KL divergence of a model M from M_G is defined as

    D(M_G ‖ M) = [ log P(D | M_G) − log P(D | M) ] / |D|,        (6.1)
where |D| denotes the sample size of D. It is an approximation to the true KL divergence

    D(M_G ‖ M) = Σ_X P(X | M_G) log [ P(X | M_G) / P(X | M) ],

where X denotes the set of manifest variables. The smaller the empirical KL divergence,
the better the model M.
For real-world data, the generative models are unknown. Therefore, we cannot calculate
the empirical KL divergence. We use the log-loss to measure model quality. Given
test data D, the log-loss of a model M is defined as

    log-loss(M | D) = − log P(D | M) / |D|.
It differs from the empirical KL divergence (Equation 6.1) in that it leaves out the
first term, which is independent of the model M. Hence, the smaller the log-loss, the
closer M is to the unknown generative model, and the higher the quality of M.
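Both measures reduce to simple arithmetic once the log-likelihoods of the test data under each model are available. The sketch below (plain Python with made-up per-case log-likelihoods; the function names are ours, not from this thesis) illustrates Equation 6.1 and the log-loss:

```python
def empirical_kl(loglik_gen, loglik_model, n):
    """Empirical KL divergence (Equation 6.1):
    [log P(D|M_G) - log P(D|M)] / |D|."""
    return (loglik_gen - loglik_model) / n

def log_loss(logliks, n):
    """Log-loss: -log P(D|M) / |D|."""
    return -sum(logliks) / n

# Made-up per-case log-likelihoods for a 3-case test set.
ll_gen = [-2.0, -2.1, -1.9]   # log P(d|M_G) for each case
ll_mod = [-2.3, -2.4, -2.2]   # log P(d|M) for each case

kl = empirical_kl(sum(ll_gen), sum(ll_mod), len(ll_gen))  # about 0.3
loss = log_loss(ll_mod, len(ll_mod))                      # about 2.3
```

Note that both measures divide by the number of test cases, so they are comparable across test sets of different sizes.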
6.3 Impact of Algorithmic Parameters
Each of the three learning algorithms has a set of algorithmic parameters for the users
to set. In this section, we examine the impact of those parameters on the performance of
the algorithms. The empirical results provide guidance for choosing appropriate parameter
values in practice.
6.3.1 Experimental Settings
We begin with some experimental settings that are common to the three algorithms.
Taking a training data set as input, each of the three algorithms outputs an LTM. The
parameters of the output models are optimized using full EM. To avoid local maxima,
we adopted the pyramid scheme proposed by Chickering and Heckerman (1997a) and set
the number of starting points of EM at 32. Moreover, we ran EM until the difference in
log-likelihood between two consecutive iterations fell below 0.1.
In EM, initial parameter values are randomly picked. Hence, there is inherent ran-
domness in the three algorithms. Consequently, we ran each algorithm 10 times on each
training set. We report the average performance, along with the standard deviation.
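The restart-and-threshold control flow is easy to sketch. The toy Python below runs EM with several random restarts on a two-component Bernoulli mixture (a stand-in for an LTM; plain restarts replace the actual pyramid scheme, and all names are our own), stopping each run when the log-likelihood improves by less than 0.1:

```python
import math
import random

def em_with_restarts(data, restarts=4, tol=0.1, seed=0):
    """EM for a 2-component Bernoulli mixture, keeping the best of
    several random restarts; stops when the log-likelihood gain < tol."""
    rng = random.Random(seed)
    best_ll, best_params = -math.inf, None
    for _ in range(restarts):
        w, p = rng.random(), [rng.random(), rng.random()]  # random start
        prev_ll = -math.inf
        while True:
            # E-step: responsibility of component 0 for each case x.
            r, ll = [], 0.0
            for x in data:
                l0 = w * (p[0] if x else 1 - p[0])
                l1 = (1 - w) * (p[1] if x else 1 - p[1])
                r.append(l0 / (l0 + l1))
                ll += math.log(l0 + l1)
            if ll - prev_ll < tol:   # the 0.1 stopping rule from the text
                break
            prev_ll = ll
            # M-step: re-estimate mixing weight and success probabilities.
            w = sum(r) / len(data)
            p[0] = sum(ri * x for ri, x in zip(r, data)) / sum(r)
            p[1] = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - sum(r))
        if ll > best_ll:
            best_ll, best_params = ll, (w, p[0], p[1])
    return best_ll, best_params
```

The experiments in this chapter use 32 starting points under the pyramid scheme and full EM over LTMs; this sketch only illustrates the restart-and-threshold control flow.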
All the experiments in this chapter were run on a Linux server with an Intel Core2
Duo CPU at 2.4GHz and 4GB of main memory.
6.3.2 EAST
As described in Chapter 3, EAST has six algorithmic parameters. Among those parame-
ters, four are used in the subroutines PickModel and PickModel-IR (Algorithms 3.7 and
3.8), i.e.,
1. νs, the number of iterations of local EM in the screening phase;
2. k, the number of candidate models that are selected to enter the evaluation phase;
3. µ, the number of starting points of local EM in the evaluation phase;
4. ν, the number of iterations of local EM in the evaluation phase.
The other two parameters are used by full EM, which is run on the models returned
by PickModel and PickModel-IR in order to compute their AIC scores. Those two
parameters are
1. µf , the number of starting points of full EM;
2. νf , the maximum number of iterations of full EM.
In our experiments, we fixed the values of k and νs at 50 and 10, respectively. We
then tested three different settings of the other four parameters. The details are given
in Table 6.1. In the following, we examine the performance of EAST under the different
settings.
We first look at the impact of the algorithmic parameters on model quality. The
quality of the models induced by EAST is indicated by the red curves in Figure 6.2. In
general, the model quality remains relatively stable as the setting changes. The quality
Setting   µ    ν    µf   νf
Coarse    4    10    8    20
Mild      8    20   16    50
Fine     16    50   32   100

Table 6.1: The 3 settings of the algorithmic parameters of EAST and Pyramid that have been tested.
at the mild and fine settings is roughly the same, and is slightly higher than that at the
coarse setting.
We then inspect how the computational efficiency of EAST changes with the
algorithmic parameters. The running time of EAST is shown in Figure 6.3. Again, we
focus on the red curves for now. It is clear that the running time almost always increases
as the parameter setting becomes finer. However, the increase from the coarse setting to
the mild setting is not significant.
In summary, by increasing the algorithmic parameters of EAST, one can gain a slight
improvement in model quality. However, the running time will also increase. In practice,
users can tune the parameters to achieve an appropriate tradeoff between model quality
and computational efficiency. In Chapter 8, we will apply EAST to build classifiers. In
that application, we found the mild setting a good choice.
6.3.3 HCL
The HCL algorithm has one parameter for users to set, namely the bound Imax on the
inferential complexity of the resulting LTM. Equivalently, one can set the bound Cmax
on the cardinalities of the latent variables according to Equation 4.3. In this subsection,
we test three different values of Cmax, namely 2, 4, and 16, and examine the performance
of HCL under those settings.
The quality of the models produced by HCL is indicated by the green curves in Figure
6.2. It is clear that HCL achieved better model fit with larger Cmax values. This confirms
our analysis in Section 4.2: the larger the cardinalities of the latent variables, the stronger
the expressive power of the resulting model, and thus the closer the resulting model is to
the generative distribution.
On the other hand, larger Cmax values also lead to longer training times. This can be
seen from Figure 6.3.
6.3.4 Pyramid
As described in Chapter 5, Pyramid has 4 algorithmic parameters: (1) µ and ν for local
EM used in PickModel1 and PickModel1-IR, and (2) µf and νf for full EM used at line 5
of AddNode and line 13 of RestrictedExpand. In the experiments, we ran Pyramid using
the same three parameter settings as in the case of EAST (Table 6.1). We now examine
the impact of the parameters on the performance of Pyramid.
The quality of the models learned by Pyramid is denoted by the blue curves in Figure
6.2. In general, when we change to finer settings, the average model quality becomes
better. The only exception occurs for D7, where the empirical KL slightly increases when
we move from the mild setting to the fine setting. Moreover, the variance of model quality
decreases as we move from the coarse setting to the fine setting. This implies that Pyramid
with finer settings is more robust.
We then look into the impact of the algorithmic parameters on the time efficiency
of Pyramid. By examining Figure 6.3, we can see that the running time of Pyramid
consistently increases as we move from the coarse setting to the fine setting.
In summary, one can achieve a tradeoff between the model quality and the time
efficiency of Pyramid by tuning the algorithmic parameters. In practice, users can set the
parameters according to the requirements of the application. In our classification work
reported in Chapter 8, we found the mild setting appropriate.
6.4 Comparison of EAST, HCL and Pyramid
We now compare the performances of the three learning algorithms. We focus the
comparison on three aspects, namely model quality, computational efficiency, and latent
structure discovery capability. The comparison will give us information on when to use
which algorithm.
6.4.1 Model Quality
We first compare the quality of the models induced by the 3 learning algorithms. By
examining Figure 6.2, we can see that EAST is the clear winner among the three algo-
rithms. It produced the best models in almost all cases. The only two exceptions occur in
the coarse and mild settings on D7, where EAST performed slightly worse than Pyramid.
Pyramid comes in second place. On D7, the empirical KL for Pyramid is comparable
with that for EAST. On the other four data sets, Pyramid is not as good as
EAST. This is expected. A number of heuristics are introduced in Pyramid for the sake
[Figure 6.2: The quality of the models produced by the three learning algorithms under various settings. Panels (a) D7, (b) D12, (c) D18 report empirical KL divergence; panels (d) CoIL and (e) Kidney report log-loss, each at the Coarse, Mild, and Fine settings, for HCL, EAST, and Pyramid.]
[Figure 6.3: The running time (seconds, log scale) of the three learning algorithms under various settings. Panels: (a) D7, (b) D12, (c) D18, (d) CoIL, (e) Kidney.]
of efficiency. Some decrease in the quality of the models it can find is therefore expected. On the
other hand, Pyramid also achieved good model fit. As we move from the coarse setting
to the fine setting, the performance of Pyramid gradually approaches that of EAST. In
particular, at the fine setting, the difference between Pyramid and EAST in empirical KL
divergence or log-loss is only 0.02–0.04.
HCL produced the worst models among the three algorithms. In comparison with the
EAST and Pyramid models, the quality of the HCL models is so poor that we
have to plot the curves for HCL in a separate figure. We argue, however, that HCL can
yield solutions good enough for some density estimation problems in practice, especially
those with large sample sizes. The reader will see one such example in Chapter 7, where
we show that the LTMs induced by HCL provide accurate approximations to complex
Bayesian networks.
6.4.2 Computational Efficiency
In terms of computational efficiency, the ordering of the three algorithms is reversed.
EAST is the slowest among the three algorithms. As shown in Figure 6.3, on the smallest
synthetic data set D7, EAST took several minutes. On the other two synthetic data sets D12 and
D18, it took several hours to produce a model. On the two real-world data sets, it took
up to one day to finish. Therefore, EAST is only suitable for small-scale problems.
Pyramid is much more efficient than EAST. On the smallest data set D7, Pyramid was
already 4–5 times faster. On the other 4 data sets, the difference was even larger: Pyramid
was more efficient than EAST by at least an order of magnitude. However, for real-world
data like CoIL and Kidney, Pyramid still ran for hours. Hence, it can be used to deal
with moderate-size problems.
HCL is clearly the most efficient among the three algorithms. It is significantly faster
than Pyramid (and hence even faster than EAST). The gap can be up to two orders of
magnitude. See Figure 6.3d for example. Furthermore, on all the 5 data sets that we
tested, the learning process took at most several minutes. Therefore, HCL is the right
algorithm to choose when handling large problems.
6.4.3 Latent Structure Discovery
Besides model quality and computational efficiency, we are also concerned with the ca-
pability of discovering latent structures behind data. The latent structures can reveal
underlying regularities and give us insights into the domain. In this subsection, we com-
pare the performance of the three algorithms on this aspect.
[Figure 6.4: The structures of the best models learned by EAST from the 3 synthetic data sets: (a) M^E_7, (b) M^E_12, (c) M^E_18.]
To evaluate the performance of an algorithm, we examine the structures of the models
that it learned from the 3 synthetic data sets, and compare them with the structures of
the generative models. On each synthetic data set, we ran the algorithm 10 times using
3 different settings. This results in 30 models in total. Among those models, we pick
only the one with the highest quality and compare it with the generative model. We did
examine the other 29 models, though, and found their structures more or less the same
as that of the best model.
We start with EAST. Denote by M^E_7, M^E_12, M^E_18 the best models that EAST induced
from D7, D12, D18, respectively. Their structures are shown in Figure 6.4. We now
compare them with the generative models M^G_7, M^G_12, M^G_18. We first notice that EAST
perfectly recovered the structure of the generative model M^G_12: the structure of M^E_12 is
exactly the same as that of M^G_12. We also find that EAST almost perfectly recovered
the structures of the generative models M^G_7 and M^G_18. There are only two differences
between M^E_7 and M^G_7: the latent variable Y1 in M^G_7 is missing from M^E_7, and X4 is
wrongly connected to latent variable Y2. The model M^E_18 differs from M^G_18 only in
that it wrongly connects X5 to Y2 rather than Y6.
EAST also performed well in determining the cardinalities of the latent variables. By
comparing with M^G_7, M^G_12, M^G_18, we can see that the latent variable cardinalities in
M^E_7, M^E_12, M^E_18 are always correct or close to the true values.
We next examine the 3 best models learned by Pyramid. Denote them by M^P_7, M^P_12,
M^P_18, respectively. Their structures are given in Figure 6.5. In general, Pyramid was
[Figure 6.5: The structures of the best models learned by Pyramid from the 3 synthetic data sets: (a) M^P_7, (b) M^P_12, (c) M^P_18.]
slightly inferior to EAST, but still did well in latent structure discovery. For D7, it
yielded the same model structure as EAST. See Figures 6.4a and 6.5a. For D12 and D18,
Pyramid only made a few more minor mistakes than EAST did. In M^P_12, Pyramid wrongly
connected X11 to Y5 rather than Y6. In M^P_18, Pyramid wrongly connected X11 to Y7 and
X14 to Y1, and missed the latent variable Y3, which leads to the incorrect connections
among Y2, Y6, Y7, and Y8. Due to those structural errors, Pyramid had to increase the
cardinalities of the latent variables in M^P_12 and M^P_18 in order to achieve good model fit.
This leads to over-estimation of the cardinalities.
Finally, we examine the models produced by HCL. Their structures are shown in
Figure 6.6. The structures are all binary trees and very different from the structures of
the generative models M^G_7, M^G_12, M^G_18. In fact, we can hardly map any latent variable
in M^H_7, M^H_12, M^H_18 to a latent variable in M^G_7, M^G_12, M^G_18. This partly explains why
the quality of the HCL models is significantly inferior to the quality of the EAST and
Pyramid models on synthetic data.
6.5 Summary
We have empirically compared EAST, HCL, and Pyramid. Among the three algorithms,
EAST yields the best models. It also outperforms the other two algorithms in terms
of latent structure discovery. In fact, on the 3 synthetic data sets we tested, EAST almost
[Figure 6.6: The structures of the best models learned by HCL from the 3 synthetic data sets: (a) M^H_7, (b) M^H_12, (c) M^H_18.]
perfectly recovered the generative latent structures. However, EAST is computationally
expensive to use, which limits its applicability to only small scale problems. Pyramid is
more efficient than EAST. It produces models that are slightly worse than those produced
by EAST. It also makes a few more mistakes in discovering latent structures than EAST
does.
HCL is the most efficient among the three algorithms. It is faster than the other two by
orders of magnitude and can be used on large scale problems. However, the models induced
by HCL are significantly inferior to those induced by EAST and Pyramid. Moreover, HCL
always yields binary latent structures, which are very different from the ground truth and
not very meaningful.
In the next two chapters, we apply the three algorithms to two applications of different
natures. In the first application, the training data is usually large. Therefore, we use the
HCL algorithm in this case. In the second application, the training data is of moderate
size. Learning good models and discovering interesting latent structures are more critical.
Therefore, we apply EAST and Pyramid.
CHAPTER 7
APPLICATION 1: APPROXIMATE PROBABILISTIC INFERENCE
In the previous chapters, we have been focusing on developing algorithms for density
estimation using LTMs. Henceforth, we will turn to the applications of those techniques.
In this chapter, we study the problem of probabilistic inference in Bayesian networks and
propose an approximate method by utilizing the HCL algorithm. In the next chapter, we
will apply EAST and Pyramid to classification.
7.1 Probabilistic Inference in Bayesian Networks
We start by introducing the problem of probabilistic inference. Let N be a BN over a
set of nodes X. Denote by P_N(X) the joint distribution that N represents. Given a set
of query nodes Q and a piece of evidence E = e, the task of probabilistic inference is
to calculate the posterior distribution P_N(Q | E = e). The set E can be empty; in this
case, there is no evidence observed and the quantity of interest reduces to the marginal
distribution P_N(Q). For simplicity, we assume that Q contains only a single node Q.
However, the following discussion applies to the general case as well.
Probabilistic inference in general Bayesian networks is computationally intractable. As
shown by Cooper (1990), it is an NP-hard problem. As a matter of fact, all exact inference
algorithms, including clique tree propagation (Lauritzen & Spiegelhalter, 1988; Jensen
et al., 1990; Shenoy & Shafer, 1990), recursive conditioning (Darwiche, 2001), and variable
elimination (Zhang & Poole, 1994; Dechter, 1996), share an exponential complexity in the
network treewidth, which is defined to be one less than the size of the largest clique in
the optimal triangulation of the moral graph (Robertson & Seymour, 1984). Densely
connected Bayesian networks usually have large treewidths. To speed up inference in
such networks, researchers often resort to approximations. However, one cannot expect
a universally accurate yet efficient approximate method because approximate inference
with guaranteed accuracy is also NP-hard (Dagum & Luby, 1993).
Despite the negative complexity results, there is a special class of Bayesian networks
amenable to fast inference, namely tree-structured Bayesian networks. This class
of networks has treewidth 1. Exact inference in them takes time only linear in the number
of nodes (Pearl, 1988). In the following, we exploit this fact to develop an efficient
approximate inference method for general Bayesian networks.
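To make the linear-time claim concrete, here is a minimal Python sketch of upward sum-product message passing in a tree-structured BN (our own formulation, not code from this thesis; for simplicity the query node is taken to be the root, so no downward pass is needed):

```python
def message(node, parent_val, cpts, children, evidence):
    """Upward message from `node` to its parent, given the parent's value.
    cpts[node][pv][x] = P(node = x | parent = pv)."""
    total = 0.0
    for x, px in enumerate(cpts[node][parent_val]):
        if node in evidence and evidence[node] != x:
            continue  # clamp observed nodes to their evidence value
        prod = px
        for c in children.get(node, []):
            prod *= message(c, x, cpts, children, evidence)
        total += prod
    return total

def query_root(root, prior, cpts, children, evidence):
    """P(root | evidence) in a tree-structured BN. Each node sends one
    message per parent value, so the cost is linear in the number of
    nodes (for fixed variable cardinalities)."""
    scores = []
    for x, px in enumerate(prior):
        prod = px
        for c in children.get(root, []):
            prod *= message(c, x, cpts, children, evidence)
        scores.append(prod)
    z = sum(scores)
    return [s / z for s in scores]

# Tiny example: binary root R with two children A and B, evidence A = 1.
cpts = {'A': [[0.9, 0.1], [0.2, 0.8]], 'B': [[0.7, 0.3], [0.5, 0.5]]}
posterior = query_root('R', [0.6, 0.4], cpts, {'R': ['A', 'B']}, {'A': 1})
```

In the example, the unobserved child B contributes a message of 1 for every root value, so only the evidence on A shifts the posterior on R.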
7.2 Basic Idea
Our approximate method is based on LTMs. The idea is as follows:
• Offline: Construct an LTM M that approximates a BN N in the sense that the
joint distribution of the manifest variables in M approximately equals the joint
distribution of the variables in N .
• Online: Use M instead of N to compute answers to probabilistic queries.
Intuitively, our method can produce high quality approximations at a low online cost.
On one hand, LTMs are tree-structured. Hence, the online phase only takes time linear
in the number of nodes. On the other hand, due to the introduction of latent variables,
LTMs can capture complicated relationships among manifest variables. Therefore, they
can approximate complex BNs well and thus provide accurate answers to probabilistic
queries.
7.2.1 User Specified Bound on Inferential Complexity
The cardinalities of the latent variables play a crucial role in the approximation scheme.
They determine inferential complexity and influence approximation accuracy. At one
extreme, we can represent a BN exactly using an LTM by setting the cardinalities of the
latent variables large enough (Proposition 4.1). In this case, the inferential complexity
is very high. At the other extreme, we can set the cardinalities of the latent variables
at 1. In this case, the manifest variables become mutually independent. The inferential
complexity is the lowest and the approximation quality is the poorest.
In our approximation scheme, we seek an appropriate middle ground between those two
extremes. In particular, we require the user to specify a bound Imax on the inferential
complexity. In the next section, we discuss how to construct an LTM approximation that
satisfies this bound.
7.3 Approximating Bayesian Networks with Latent
Tree Models
Given a BN N and an inferential complexity constraint Imax, we now study the problem
of approximating N with an LTM M. Let X be the set of variables in N and P_N(X)
be the joint distribution represented by N. For an LTM M to be an approximation of
N, it should use X as its manifest variables. Moreover, an approximation M is of high
quality if P_M(X) is close to P_N(X). We measure the quality of the approximation by the
KL divergence (Cover & Thomas, 1991)

    D[P_N(X) ‖ P_M(X)] = Σ_X P_N(X) log [ P_N(X) / P_M(X) ].
Our objective is thus to find an LTM that minimizes the KL divergence,

    M* = argmin_M D[P_N(X) ‖ P_M(X)],

subject to the complexity constraint Imax.
7.3.1 Two Computational Difficulties
The optimization problem is computationally intractable. There are two difficulties. The
first is that, given an LTM M, it is hard to compute the KL divergence D[P_N(X) ‖ P_M(X)]
due to the presence of latent variables in M. This can be seen by expanding the KL
divergence as follows,
    D[P_N(X) ‖ P_M(X)] = Σ_X P_N(X) log [ P_N(X) / P_M(X) ]
                       = Σ_X P_N(X) log P_N(X) − Σ_X P_N(X) log P_M(X)
                       = Σ_X P_N(X) log P_N(X) − Σ_X P_N(X) log Σ_Y P_M(X, Y).
The first term on the last line can be neglected because it is independent of M. The
difficulty lies in computing the second term. The summation over latent variables Y
appearing inside the logarithm makes this term indecomposable. Therefore, one has to
sum over all possible values of X in the outer summation. This takes time exponential in
|X|, i.e., the number of variables in X.
The second difficulty concerns how to find a good LTM efficiently. Given a set of manifest
variables X, there are super-exponentially many LTMs over X. Exhaustive search over
such a large model space is apparently prohibitive. Even greedy search is infeasible, given
that evaluating the quality of a single LTM is already computationally hard.
7.3.2 Optimization via Density Estimation
Instead of directly tackling the optimization problem, we transform it into a density
estimation problem. The idea is as follows:
1. Sample a data set D with N i.i.d. cases from P_N(X).

2. Learn an LTM M'* from D that maximizes the AIC score and satisfies the complexity constraint Imax.
It is well known that the KL divergence (and thus the approximation quality) of M'*
converges almost surely to that of M* as the sample size N approaches infinity (Akaike,
1974). In practice, we use a large N to achieve a good approximation.
We now discuss the implementation of this solution. We start by generating D from
P_N(X). Since P_N(X) is represented by the BN N, we use forward sampling (Henrion, 1988)
for this task. Specifically, to generate one sample from P_N(X), we process the nodes
in a topological ordering1. When handling node X, we sample its value according to the
conditional distribution P(X | pa(X) = j), where pa(X) denotes the set of parents of X
and j denotes their values, which have been sampled earlier. To obtain D, we repeat the
procedure N times.
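The sampling step can be sketched as follows (generic Python with our own data structures, not the thesis implementation; `cpts[X]` maps a tuple of sampled parent values to a distribution over X):

```python
import random

def forward_sample(order, parents, cpts, rng):
    """Draw one case by forward sampling: visit nodes in topological
    order and sample each given its parents' already-sampled values."""
    case = {}
    for x in order:
        pv = tuple(case[p] for p in parents.get(x, ()))
        dist = cpts[x][pv]  # P(x | pa(x) = pv)
        case[x] = rng.choices(range(len(dist)), weights=dist)[0]
    return case

def sample_data(order, parents, cpts, n, seed=0):
    """Repeat the procedure n times to obtain the data set D."""
    rng = random.Random(seed)
    return [forward_sample(order, parents, cpts, rng) for _ in range(n)]

# Two-node chain A -> B, where B tends to copy A.
cpts = {'A': {(): [0.5, 0.5]},
        'B': {(0,): [0.9, 0.1], (1,): [0.1, 0.9]}}
data = sample_data(['A', 'B'], {'B': ('A',)}, cpts, 1000)
```

Because nodes are visited in topological order, every parent value needed by `cpts[x][pv]` has already been sampled when node `x` is processed.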
In the second step of this solution, we need to learn from D an LTM that has a high
AIC score and that satisfies the complexity constraint. We consider the three algorithms
developed in Chapters 3–5, namely EAST, HCL, and Pyramid, for this task. Recall that,
in order to achieve good approximation, we set the sample size N at a large value. Hence,
it would be computationally unaffordable to run EAST and Pyramid on D. We therefore
choose HCL to learn an LTM from D under the inferential complexity constraint Imax.
7.3.3 Impact of Imax
Given the inferential complexity constraint Imax, HCL sets the cardinalities of the latent
variables at Cmax, as given by Equation 4.3. It is clear that Cmax is a monotonically increasing
function of Imax. The larger the value of Imax, the larger the value of Cmax, and, according
to the discussion in Section 4.2, the better the approximation that our method can achieve.
Therefore, the user can obtain more accurate approximation at the cost of longer online
running time.
7.4 LTM-based Approximate Inference
The focus of this chapter is approximate inference in Bayesian networks. We propose the
following two-phase method:
1A topological ordering sorts the nodes in a DAG such that a node always precedes its children.
1. Offline: Given a BN N and a bound Imax on the inferential complexity, sample a
data set with N samples from N and use HCL to learn an approximate LTM M
from the data. The sample size N should be set as large as possible.
2. Online: Make inference in M instead of N. More specifically, given a piece of
evidence E = e and a query variable Q, return P_M(Q | E = e) as an approximation
to P_N(Q | E = e).
7.5 Empirical Results
In this section, we empirically evaluate our approximate inference method. We first ex-
amine the impact of sample size N and inferential complexity constraint Imax on the
performance of our method. Then we compare our method with clique tree propagation
(CTP), which is the state-of-the-art exact inference algorithm, and loopy belief propaga-
tion (LBP), which is a standard approximate inference method that has been successfully
used in many real world domains (Frey & MacKay, 1997; Murphy et al., 1999). We also
compare our method with another two approximate methods that perform exact inference
in approximate models, one based on Chow-Liu (CL) tree, and the other based on latent
class model (LCM).
7.5.1 Experimental Settings
We used 8 networks in our experiments. They are listed in Table 7.1. Cpcs54 is a subset
of the Cpcs network (Pradhan, Provan, Middleton, & Henrion, 1994). The other net-
works are available at http://www.cs.huji.ac.il/labs/compbio/Repository/. Table
7.1 also reports the characteristics of the networks, including the number of nodes, the
average/max indegree and cardinality of the nodes, and the inferential complexity (i.e.,
the sum of the clique sizes in the clique tree). The networks are sorted in ascending order
with respect to the inferential complexity.
For each network, we simulated 500 pieces of evidence. Each piece of evidence was set
on all the leaf nodes by sampling based on the joint probability distribution. Then we
used the CTP algorithm and the approximate inference methods to compute the posterior
distribution of each non-leaf node conditioned on each piece of evidence. The accuracy
of an approximate method is measured by the average KL divergence between the exact
and the approximate posterior distributions over all the query nodes and evidence.
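This accuracy measure can be written down directly. The small Python sketch below (helper names are ours) computes the KL divergence between an exact and an approximate posterior and averages it over all query/evidence pairs:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions;
    q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_kl(exact_posteriors, approx_posteriors):
    """Average KL over all (query node, evidence) pairs."""
    pairs = list(zip(exact_posteriors, approx_posteriors))
    return sum(kl(p, q) for p, q in pairs) / len(pairs)
```

Identical posteriors contribute 0, so the measure is 0 exactly when the approximate method reproduces every exact posterior.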
All the algorithms in the experiments were implemented in Java and run on a machine
with an Intel Pentium IV 3.2GHz CPU and 1GB RAM.
Network      Number     Average/Max   Average/Max   Inferential
             of Nodes   Indegree      Cardinality   Complexity
Alarm        37         1.24/4        2.84/4        1,038
Win95pts     76         1.47/7        2/2           2,684
Hailfinder   56         1.18/4        3.98/11       9,706
Insurance    27         1.93/3        3.3/5         29,352
Cpcs54       54         2/9           2/2           109,208
Water        32         2.06/5        3.62/4        3,028,305
Mildew       35         1.31/3        17.6/100      3,400,464
Barley       48         1.75/4        8.77/67       17,140,796

Table 7.1: The networks used in the experiments and their characteristics.
7.5.2 Impact of N and Imax
We discussed the impact of N and Imax on the performance of our method in Sections
7.2 and 7.3.3. This subsection empirically verifies the claims.
Three sample sizes were chosen in the experiments: 1k, 10k, and 100k. For each
network, we also chose a set of Imax values. LTMs were then learned using HCL with different
combinations of the values of N and Imax. For parameter learning, we terminated EM
either when the improvement in log-likelihood was smaller than 0.1, or when the algorithm
had run for two months. The pyramid scheme by Chickering and Heckerman (1997b) was used
to avoid local maxima. The number of starting points was set at 16.
The running time of HCL is plotted in Figure 7.1. The y-axes denote the time in hours,
while the x-axes denote the value of Cmax in HCL for different choices of Imax. Recall
that Cmax is a monotonically increasing function of Imax. Larger Cmax value implies larger
Imax value. The three curves correspond to different values of N . In general, the running
time increases with N and Cmax, ranging from seconds to weeks. For some settings, EM
failed to converge in two months. Those settings are indicated by arrows in the plots. We
emphasize that HCL is executed offline and its running time should not be confused with
the time for online inference, which will be reported next.
After obtaining the LTMs, we used clique tree propagation to perform inference. The
approximation accuracy is shown in Figure 7.2. The y-axes denote the average KL
divergence, while the x-axes again denote the value of Cmax for HCL. The three curves in
each plot correspond to the three sample sizes we used.
We first examine the impact of sample size by comparing the corresponding curves in
each plot. We find that, in general, the curves for larger samples are located below those
for smaller ones. This shows that the approximation accuracy increases with the size of
[Figure 7.1: Running time of HCL (in hours, log scale) under different settings, plotted against Cmax for N = 1k, 10k, and 100k. Panels: (a) Alarm, (b) Win95pts, (c) Hailfinder, (d) Insurance. Settings for which EM did not converge are indicated by arrows.]
[Figure 7.1 (continued): Running time of HCL under different settings. Panels: (e) Cpcs54, (f) Water, (g) Mildew, (h) Barley. Settings for which EM did not converge are indicated by arrows.]
[Figure 7.2: Approximation accuracy (average KL divergence, log scale) of the LTM-based method under different settings, plotted against Cmax for N = 1k, 10k, and 100k. Panels: (a) Alarm, (b) Win95pts, (c) Hailfinder, (d) Insurance.]
[Figure 7.2 (continued): Approximation accuracy of the LTM-based method under different settings. Panels: (e) Cpcs54, (f) Water, (g) Mildew, (h) Barley.]
the training data.
To see the impact of Cmax, we examine each individual curve from left to right. Ac-
cording to our discussion, the curve is expected to drop monotonically as Cmax increases.
This is generally true for the results with sample size 100k. For sample sizes 1k and 10k,
however, there are cases in which the approximation becomes poorer as Cmax increases;
see Figures 7.2e and 7.2f. This phenomenon does not conflict with our claims. As Cmax
increases, the expressive power of the learned LTM increases, so it tends to overfit the
data. Moreover, the empirical distribution of a small data set may significantly
deviate from the joint distribution of the BN. This also suggests that the sample
size should be set as large as possible.
Finally, let us examine the impact of N and Cmax on the inferential complexity. Figure
7.3 plots the running time for calculating answers to all the queries using the learned
LTMs. It can be seen that the three curves for different sample sizes overlap in all
plots. This implies that the running time is independent of the sample size N . On the
other hand, all the curves are monotonically increasing. This confirms our claim that the
inferential complexity is positively dependent on Cmax.
In the following subsections, we will only consider the results for N = 100k. Under this
setting, our method achieves the highest accuracy. For clarity, we reproduce the average
KL divergence and the online running time of our method with N = 100k in Figures 7.4
and 7.5, respectively. See the blue curve in each plot.
7.5.3 Comparison with CTP
We now compare our method with CTP, the state-of-the-art exact inference algorithm.
The first question is how accurate our method is. By examining Figure 7.4, we argue
that our method always achieves good approximation accuracy: for Hailfinder, Cpcs54, and
Water, the average KL divergence of our method is around or less than 10^-3; for the other
networks, the average KL divergence is around or less than 10^-2.
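The average-KL-divergence measure behind these comparisons can be sketched as follows. This is a minimal illustration, not the thesis code: the function names and toy posteriors are hypothetical stand-ins for the actual pipeline, which averages over all query variables and pieces of evidence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def average_kl(exact_posteriors, approx_posteriors):
    """Average KL divergence over a collection of queries."""
    kls = [kl_divergence(p, q) for p, q in zip(exact_posteriors, approx_posteriors)]
    return sum(kls) / len(kls)

# Toy example: exact posteriors (e.g., from CTP) vs. approximate ones (e.g., from an LTM).
exact = [[0.7, 0.3], [0.1, 0.9]]
approx = [[0.68, 0.32], [0.12, 0.88]]
print(average_kl(exact, approx))
```

A divergence of zero means the approximate posteriors match the exact ones; values around 10^-3 or below correspond to the "good accuracy" regime discussed in the text.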
We next compare the inferential efficiency of our method and the CTP algorithm. The
running time of CTP is denoted by dashed horizontal lines in the plots of Figure 7.5. It
can be seen that our method is more efficient than the CTP algorithm. In particular, for
the five networks with the highest inferential complexity, our method is faster than CTP
by two to three orders of magnitude.
To summarize, the results suggest that our method can achieve good approximation
accuracy at low computational cost.
[Plots omitted: running time in seconds versus Cmax on log scales; one curve per sample size (N = 1k, 10k, 100k); panels (a) Alarm, (b) Win95pts, (c) Hailfinder, (d) Insurance.]

Figure 7.3: Running time of the online phase of the LTM-based method under different settings.
[Plots omitted: running time in seconds versus Cmax on log scales; panels (e) Cpcs54, (f) Water, (g) Mildew, (h) Barley.]

Figure 7.3: Running time of the online phase of the LTM-based method under different settings (continued).
[Plots omitted: average KL divergence versus Cmax on log scales; one curve per method (LTM, LBP, CL, LCM); panels (a) Alarm, (b) Win95pts, (c) Hailfinder, (d) Insurance.]

Figure 7.4: Approximation accuracy of various inference methods.
[Plots omitted: average KL divergence versus Cmax on log scales; panels (e) Cpcs54, (f) Water, (g) Mildew, (h) Barley.]

Figure 7.4: Approximation accuracy of various inference methods (continued).
[Plots omitted: running time in seconds versus Cmax on log scales; one curve per method (LTM, CTP, LBP, CL, LCM); panels (a) Alarm, (b) Win95pts, (c) Hailfinder, (d) Insurance.]

Figure 7.5: Running time of various inference methods.
[Plots omitted: running time in seconds versus Cmax on log scales; panels (e) Cpcs54, (f) Water, (g) Mildew, (h) Barley.]

Figure 7.5: Running time of various inference methods (continued).
7.5.4 Comparison with LBP
We now compare our method with LBP. The latter is an iterative algorithm. It can be used
as an anytime inference method by running a specific number of iterations. In our first set
of experiments, we let LBP run as long as our method and compare their approximation
accuracy. We did this for each network and each value of Cmax. The accuracy of LBP is
denoted by the curves labeled LBP in Figure 7.4. By comparing those curves with the
LTM curves for N = 100k, we see that our method achieves significantly higher accuracy
than LBP in most cases: For Water, the difference in average KL divergence is up to
three orders of magnitude; For the other networks, the difference is up to one order of
magnitude. For Hailfinder with Cmax = 32, LBP is two times more accurate than our
method. However, our method also achieves good approximation accuracy in this case.
The average KL divergence is smaller than 10^-3. Finally, we noticed that the LBP curves
are horizontal lines for Cpcs54, Mildew, and Barley. Further investigation on those cases
shows that LBP finished only one iteration in the given time period.
We next examine how much time it takes for LBP to achieve the same level of accuracy
as our method. For each piece of evidence, we ran LBP until its average KL divergence
is comparable with that of our method or the number of iterations exceeds 100. The
running time of LBP is denoted by the curves labeled LBP in Figure 7.5. Comparing
those curves with the LTM curves, we found that LBP takes much more time than our
method: For Mildew, LBP is slower than our method by three orders of magnitude; For
the other networks except Hailfinder, LBP is slower by one to two orders of magnitude;
For Hailfinder with Cmax = 32, the running time of the two methods are similar. The
results show that our method compares favorably with LBP on the networks that we
examined.
7.5.5 Comparison with CL-based Method
Our inference method is fast because LTM is tree-structured. One can also construct a
Chow-Liu tree (Chow & Liu, 1968) to approximate the original BN and use it for inference.
We refer to this approach as the CL-based method. In this subsection, we empirically
compare our method with the CL-based method. More specifically, for each network, we
learn a tree model from the 100k samples using the maximum spanning tree algorithm
developed by Chow and Liu (1968). We then use the learned tree model to answer the
queries.
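The Chow-Liu construction used by the CL-based method pairs empirical mutual information with a maximum-weight spanning tree. A minimal sketch under the assumption that samples are given as tuples of discrete values (function names are illustrative, not from the thesis implementation):

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(Xi; Xj) from a list of sample tuples."""
    n = len(data)
    ci, cj, cij = Counter(), Counter(), Counter()
    for row in data:
        ci[row[i]] += 1
        cj[row[j]] += 1
        cij[(row[i], row[j])] += 1
    # sum_ab p(a,b) * log( p(a,b) / (p(a) p(b)) ), rewritten in raw counts
    return sum((nab / n) * math.log(nab * n / (ci[a] * cj[b]))
               for (a, b), nab in cij.items())

def chow_liu_edges(data, num_vars):
    """Maximum-weight spanning tree over variables, weighted by mutual information
    (Kruskal's algorithm with union-find)."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(num_vars), 2)), reverse=True)
    parent = list(range(num_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy usage: X0 and X1 are perfectly correlated, X2 is independent.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
print(chow_liu_edges(data, 3))  # → [(0, 1), (1, 2)]
```

The resulting edge set defines the tree-structured BN; its parameters are then the corresponding empirical conditional distributions.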
The approximation accuracy of the CL-based method is shown as solid horizontal
lines in the plots in Figure 7.4. Compared with the CL-based method, our method
achieves higher accuracy in all the networks except for Mildew. For Insurance, Water,
and Barley, the differences are significant. For Mildew, our method is competitive with
the CL-based method. In the meantime, we notice that the CL-based method achieves
good approximations in all the networks except for Barley. The average KL divergence
is around or less than 10^-2.
An obvious advantage of the CL-based method is its high efficiency. This can be seen
from the plots in Figure 7.5. In most of the plots, the CL line lies below the second
data point on the LTM curve. The exception is Mildew, for which the running time of
the CL-based method is as long as that of our method with Cmax = 16.
In summary, the results suggest that the CL-based method is a good choice for ap-
proximate inference if the online inference time is very limited. Otherwise, our method
is more attractive because it is able to produce more accurate results when more time is
allowed.
7.5.6 Comparison with LCM-based Method
Lowd and Domingos (2005) have previously investigated the use of LCM for density
estimation. Given a data set, they determine the cardinality of the latent variable using
hold-out validation, and optimize the parameters using EM. It is shown that the learned
LCM achieves good model fit on a separate testing set. The LCM was also used to answer
simulated probabilistic queries and the results turn out to be good.
Inspired by their work, we also learned a set of LCMs from the 100k samples and
compared them with LTMs on the approximate inference task. Our learning strategy
is slightly different. Since LCM is a special case of LTM, its inferential complexity can
also be controlled by changing the cardinality of the latent variable. In our experiments,
we set the cardinality such that the sum of the clique sizes in the clique tree of the
LCM is roughly the same as that for the LTM learned with a chosen Cmax. In this way,
the inferential complexity of the two models is comparable. This can be verified by
examining the LCM curves in Figure 7.5. We then optimize the parameters of the LCM
using EM with the same setting as in the case of LTM.
As shown in Figure 7.4, for Alarm, Win95pts, Cpcs54, Water, and Barley, the LCM
curves are located above the LTM curves. That is, our method consistently outperforms
the LCM-based method for all Cmax. For Hailfinder and Mildew, our method is worse
than the LCM-based method when Cmax is small. But when Cmax becomes large, our
method begins to win. For Insurance, the performance of the two methods is very
close. The results suggest that unrestricted LTMs are more suitable for approximate
inference than LCMs.
7.6 Related Work
The idea of approximating complex BNs by simple models and using the latter to make
inference has been investigated previously. The existing work mainly falls into two cat-
egories. The work in the first category approximates the joint distributions of the BNs
and uses the approximation to answer all probabilistic queries. In contrast, the work
in the second category is query-specific. It assumes the evidence is known and directly
approximates the posterior distribution of the querying nodes.
Our method falls in the first category. We investigate the use of LTMs under this
framework. This possibility has also been studied by Pearl (1988) and Sarkar (1995).
Pearl (1988) develops an algorithm for constructing an LTM that is marginally equivalent
to a joint distribution P (X), assuming such an LTM exists. Sarkar (1995) studies how to
build good LTMs when only approximations are attainable. Their methods, however, can
only deal with binary variables.
Researchers have also explored the use of other models. Chow and Liu (1968) consider
tree-structured BNs without latent variables. They develop a maximum spanning tree
algorithm to efficiently construct the tree model that is closest to the original BN in
terms of KL divergence. Lowd and Domingos (2005) learn an LCM to summarize a data
set. The cardinality of the latent variable is determined so that the logscore on a hold-out
set is maximized. They show that the learned model achieves good model fit on a separate
testing set, and can provide accurate answers to simulated probabilistic queries. In both
work, the approximation quality and the inferential complexity of the learned model are
fixed. Our method, on the other hand, provides a parameter Cmax to let users make the
tradeoff between approximation quality and inferential complexity.
Our method and the methods mentioned above build approximate models from
scratch. Alternatively, one can start with the given Bayesian network and simplify it
to obtain an approximation. This idea is realized by van Engelen (1997). The author
proposes to simplify a Bayesian network by removing a set of edges from it. The selection
of edges is made so as to achieve a tradeoff between the computational efficiency and the
accuracy of the resultant approximate network.
Rather than simplifying a Bayesian network itself, researchers have also considered
simplifying its clique tree when clique tree propagation is used for inference. Jensen and
Andersen (1990) propose to reduce clique sizes by annihilating configurations with small
probabilities from potential functions. Kjærulff (1994) develops a complementary method
that removes weak dependencies among variables within cliques. Removal of dependencies
causes cliques to split into smaller ones and thus reduces the computational cost of running
propagation on the resultant clique tree.
The work in the second category is mainly carried out under the variational framework.
The mean field method (Saul, Jaakkola, & Jordan, 1996) assumes that the querying
nodes are mutually independent. It constructs an independent model that is close to
the posterior distribution. As an improvement to the mean field method, the structured
mean field method (Saul & Jordan, 1996) preserves a tractable substructure among the
querying nodes, rather than neglecting all interactions. Bishop et al. (1997) consider
another improvement, i.e., mixtures of mean field distributions. It essentially fits an
LCM to the posterior distribution.
As a different variational method, Choi and Darwiche (2006) simplify the given Bayesian
network to obtain an approximation of the posterior distribution. The idea is to remove
a set of edges from the original network, and optimize the parameters of the simplified
network such that the divergence between the approximate and the true posterior distri-
butions is minimized. The more edges deleted, the faster the inference, and the worse
the approximation accuracy. At one extreme, when enough edges are removed to yield a
polytree, the proposed method reduces to LBP.
All those methods from the second category directly approximate posterior distribu-
tions. Therefore, they might be more accurate than our method when used to make
inference. However, these methods are evidence-specific and construct approximations
online. Moreover, they involve an iterative process for optimizing the variational pa-
rameters. Consequently, the online running time is unpredictable. With our method, in
contrast, one can determine the inferential complexity beforehand.
7.7 Summary
We have shown one application of the HCL algorithm described in Chapter 4, using it
to develop a novel scheme for approximate inference in BNs. With our scheme, one can
trade off approximation accuracy against inferential complexity. Our scheme
achieves good accuracy at low costs in all the networks that we examined. In particular,
it outperforms LBP. Given the same amount of time, our method achieves significantly
higher accuracy than LBP in most cases. To achieve the same accuracy, LBP needs one
to three orders of magnitude more time than our method. We also show that LTMs are
superior to Chow-Liu trees and LCMs when used for approximate inference.
CHAPTER 8
APPLICATION 2: CLASSIFICATION
In this chapter, we describe an application of LTMs in classification. The idea is to
estimate the class-conditional distributions of each class using LTMs and then use the
Bayes rule for classification. The resulting classifiers are called latent tree classifiers (LTCs).
Both EAST and Pyramid are considered for the density estimation problem. Empirical
results are provided to compare EAST and Pyramid in this setting and to compare LTC
with a number of related alternative methods.
8.1 Background
Classification is one of the subjects that have received the most attention in the machine
learning literature. It has numerous applications in many areas such as computer vision,
speech recognition, and text categorization, among others. Given training data D =
{(x1, c1), (x2, c2), . . . , (xn, cn)} where each instance is described by a set X of attributes
and has a class label c, the problem is to build a classifier f : X→ C that can accurately
predict the class labels of future instances based on their attribute values.
Assume there is a generative distribution P (X, C) underlying the data. Given a
classifier f , we measure its classification accuracy using the probability of error :
err(f) = ∑_X P(X) (1 − P(f(X)|X)).
The lower the error, the better the classifier.
Bayes decision theory (Duda & Hart, 1973) states that the minimum achievable prob-
ability of error is
min_f err(f) = ∑_X P(X) (1 − max_C P(C|X)).
This quantity is well known as the Bayes error rate. It is realized by the optimal classifier
f ⋆(X), where
f⋆(X) = arg max_C P(C|X). (8.1)
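Equation 8.1 and the Bayes error rate can be made concrete on a toy joint distribution. All probabilities below are hypothetical, chosen only so the arithmetic is easy to follow:

```python
# Toy joint distribution P(X, C) over one binary attribute X and two classes C ∈ {1, 2}.
P = {
    (0, 1): 0.30, (0, 2): 0.10,
    (1, 1): 0.15, (1, 2): 0.45,
}

def posterior(x):
    """P(C | X = x), obtained by normalizing the joint over classes."""
    joint = {c: P[(x, c)] for c in (1, 2)}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def f_star(x):
    """Bayes-optimal classifier: arg max_C P(C | X = x)."""
    post = posterior(x)
    return max(post, key=post.get)

# Bayes error rate: sum_X P(X) * (1 - max_C P(C | X)).
bayes_error = sum(
    (P[(x, 1)] + P[(x, 2)]) * (1 - max(posterior(x).values()))
    for x in (0, 1)
)
print(f_star(0), f_star(1))  # class 1 for x = 0, class 2 for x = 1
print(bayes_error)           # ≈ 0.25: no classifier can do better on this P
```

Here f⋆ predicts class 1 when X = 0 (posterior 0.75 vs. 0.25) and class 2 when X = 1; the residual 0.25 is the irreducible error of the distribution itself.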
8.2 Build Classifiers via Density Estimation
If we knew the posterior class distribution P (C|X), we could easily obtain the optimal
classifier f ⋆(X) according to Equation 8.1. However, in practice, we only have a data
set D that was drawn from P(X, C). A common approach in machine learning is to
construct an estimate P̂(C|X) of P(C|X) from D, and then build a classifier based on
P̂(C|X). In general, the more accurate the estimate, the better the resulting classifier.
8.2.1 The Generative Approach to Classification
There are two different ways to estimate P (C|X). The generative approach first constructs
an estimate P̂(X, C) of P(X, C), and then computes the posterior distribution P̂(C|X)
based on P̂(X, C) using Bayes rule. In contrast, the discriminative approach directly
estimates P̂(C|X) from data.
The generative and discriminative approaches each have their own advantages and
drawbacks. On one hand, the generative approach requires an intermediate step to model the
joint distribution P (X, C) over both X and C. However, what we need for classification
is only the conditional distribution P (C|X). In this sense, the discriminative approach is
more direct than the generative approach. It usually yields more accurate classifiers than
the latter. On the other hand, modeling the joint distribution enables the generative ap-
proach to handle missing values in a principled manner and reveal interesting structures
underlying the data. Learning is also computationally more efficient in the generative
approach than in the discriminative approach. For detailed comparisons between the
generative and discriminative approaches, we refer the readers to (Rubinstein & Hastie,
1997; Ng & Jordan, 2001; Jebara, 2004).
8.2.2 Generative Classifiers Based on Latent Tree Models
In this chapter, we focus on the generative approach. Different generative methods make
different assumptions about the form of the true distribution P (X, C). The simplest one
is the naïve Bayes (NB) classifier (Duda & Hart, 1973). It assumes that all the attributes in
a data set are mutually independent given the class label. Under this assumption, the
generative distribution decomposes as follows:
P(X, C) = P(C) ∏_{X∈X} P(X|C).
All dependencies among attributes are ignored. An example NB is shown in Figure 8.1a.
Despite its simplicity, NB has been shown to be surprisingly accurate in a number of
domains (Domingos & Pazzani, 1997).
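The NB factorization above translates directly into counting. A minimal maximum-likelihood sketch (no smoothing here; the correction of Section 8.4.1 would be applied on top, and the function names are illustrative):

```python
from collections import Counter, defaultdict

def train_nb(data):
    """Fit P(C) and each P(X_i | C) by maximum likelihood.
    `data` is a list of (attribute_tuple, class_label) pairs."""
    prior = Counter(label for _, label in data)
    cond = defaultdict(Counter)  # cond[(i, c)][value] -> count of X_i = value given C = c
    for attrs, label in data:
        for i, v in enumerate(attrs):
            cond[(i, label)][v] += 1
    return prior, cond, len(data)

def predict_nb(attrs, prior, cond, n):
    """arg max_c P(c) * prod_i P(x_i | c), using the factorization above."""
    best, best_score = None, -1.0
    for c, nc in prior.items():
        score = nc / n
        for i, v in enumerate(attrs):
            score *= cond[(i, c)][v] / nc
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage with two binary attributes and two classes.
data = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 1), 'b'), ((1, 0), 'b')]
prior, cond, n = train_nb(data)
print(predict_nb((0, 0), prior, cond, n))  # → a
```

Because the class-conditional factors are fully independent, training reduces to per-attribute counts, which is exactly why NB is so cheap to fit.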
[Diagrams omitted: (a) NB, (b) TAN, (c) LTC.]

Figure 8.1: NB, TAN, and LTC. C is the class variable; X1, X2, X3, and X4 are four attributes; Y1 and Y2 are latent variables.
The conditional independence assumption underlying NB is rarely true in practice.
Violating this assumption could lead to poor approximation to P (X, C) and thus er-
roneous classification. The past decade has seen a large body of work on relaxing this
assumption. One such work is tree-augmented naïve Bayes (TAN) (Friedman et al., 1997).
It builds a Chow-Liu tree (Chow & Liu, 1968) to model the attribute dependencies (Fig-
ure 8.1b). Another work is averaged one-dependence estimators (AODE) (Webb et al.,
2005). It constructs a set of tree models over the attributes and averages them to make
classification.
We propose a novel generative approach based on LTMs. In our approach, we treat
attributes as manifest variables and build LTMs to model the relationships among them.
The relationships could be different for different classes. Therefore, we build one LTM
for each class. The LTM for class c is an estimate of the class-conditional distribution
P(X|C = c). We refer to the collection of LTMs plus the prior class distribution as a latent
tree classifier (LTC). Figure 8.1c shows an example LTC. Each rectangle in the figure
contains the LTM for a class. Since LTMs can model complex relationships among
attributes, we expect LTC to approximate the true distribution P(X, C) well and thus to
achieve good classification accuracy. Moreover, the latent structure induced for each class
may reveal the underlying generative mechanism. We empirically verify those hypotheses in
the experiments.
As mentioned in Chapter 1, researchers have considered a special class of LTMs called
latent class models (LCMs) for density estimation and have obtained some promising
results (Lowd & Domingos, 2005). One can also build classifiers using LCMs instead of
LTMs. By restricting to LCMs, we obtain the latent class classifier (LCC). In our experi-
ments, we empirically compare LTC with LCC. The results show that LTC is superior to
LCC. We attribute this to the flexibility of LTMs.
In the following two sections, we formally define latent tree classifier and present a
learning algorithm.
8.3 Latent Tree Classifier
We consider the classification problem where each instance is described using n attributes
X = {X1, X2, . . . , Xn}, and belongs to one of r classes, C ∈ {1, 2, . . . , r}. A latent
tree classifier (LTC) consists of a prior distribution P (C) on C and a collection of r
LTMs over the attributes X. We denote the c-th LTM by Mc = (mc, θc) and the set of
latent variables in Mc by Yc. The LTC represents a joint distribution over C and X,
∀c = 1, 2, . . . , r,
P(C = c, X) = P(C = c) P_Mc(X) = P(C = c) ∑_Yc P_Mc(X, Yc). (8.2)
Given an LTC, we classify an instance X = x to the class c⋆, where
c⋆ = arg max_C P(C|X = x) = arg max_C P(C, X = x). (8.3)
Making prediction with LTC requires marginalizing out all the latent variables from
each LTM. Since LTM is tree-structured, the marginalization can be done in time linear
in the number of latent variables. Moreover, regular LTMs contain strictly fewer latent
variables than manifest variables. Therefore, the time complexity of making prediction
with LTC is O(|X| · r).
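Equations 8.2 and 8.3 amount to: for each class, sum out that model's latent variables, weight by the class prior, and take the arg max. The sketch below enumerates latent configurations by brute force rather than using the linear-time tree propagation the text describes, and the two tiny single-latent models with all their parameters are hypothetical:

```python
from itertools import product

def ltm_marginal(x, factor, latent_card):
    """P_Mc(X = x) = sum over latent configurations y of the model's joint factor."""
    return sum(factor(x, y) for y in product(*[range(k) for k in latent_card]))

def ltc_predict(x, prior, models):
    """Equation (8.3): arg max_c P(C = c) * P_Mc(X = x)."""
    scores = {c: prior[c] * ltm_marginal(x, factor, card)
              for c, (factor, card) in models.items()}
    return max(scores, key=scores.get)

def make_lcm(py, px1, px2):
    """Joint factor of a toy LTM with one binary latent Y and two attributes:
    P(y) * P(x1 | y) * P(x2 | y). All numbers used below are made up."""
    def factor(x, y):
        return py[y[0]] * px1[y[0]][x[0]] * px2[y[0]][x[1]]
    return factor

models = {
    1: (make_lcm([0.8, 0.2], [[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.3, 0.7]]), [2]),
    2: (make_lcm([0.3, 0.7], [[0.4, 0.6], [0.1, 0.9]], [[0.5, 0.5], [0.1, 0.9]]), [2]),
}
prior = {1: 0.5, 2: 0.5}
print(ltc_predict((0, 0), prior, models))  # → 1
print(ltc_predict((1, 1), prior, models))  # → 2
```

In a real LTC the sum over Yc is computed by message passing on the tree, which is what makes the O(|X| · r) prediction cost possible.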
8.4 A Learning Algorithm for Latent Tree Classifier
Given a labeled training set D, we now outline an algorithm for learning an LTC from D.
The algorithm consists of four steps:
1. Calculate the maximum likelihood estimate (MLE) P̂(C) of P(C) from D.
2. Split data D according to class label into r subsets, one for each class. Denote the
subset for class c by Dc.
3. For each class c = 1, 2, . . . , r, learn an LTM Mc from Dc to estimate the class-
conditional density P (X|C = c).
4. Smooth the obtained estimate P̂(C) and the parameters of each LTM Mc.
In the first step, we calculate the MLE P̂(C). This can be easily done by counting
the number of instances belonging to each class in D. More specifically, we calculate

P̂(C = c) = |Dc| / |D|.
In the third step, we learn an LTM Mc for each class c. One can use either EAST
or Pyramid for this task. This results in two variants of the learning algorithm. We
refer to them as LTC-E and LTC-P, respectively. We will evaluate both variants in the
experiments.
In the last step, we smooth the parameters of the obtained LTC. This is detailed in
the next subsection.
8.4.1 Parameter Smoothing
We estimate the prior P (C) and the LTM parameters θc using maximum likelihood es-
timation. As noticed in previous work (e.g., Friedman et al., 1997), when the size of the
training data is small, this could lead to unreliable estimates and thus poor classification
accuracy. One common way to address this issue is to smooth the parameters using
Laplace correction (Niblett, 1987). Let N be the total number of instances in training
data and Nc be the number of instances labeled as class c. Let α be a predefined smoothing
factor. We smooth the class prior distribution as follows:
P̂(C = c) = (Nc + α) / (N + rα).
We also smooth the parameters for each LTM Mc. Let θ_cijk = P(Zi = j | π(Zi) = k, mc)
be the parameter estimate produced by EM. We calculate the smoothed parameter θ^s_cijk
for all i, j, k as

θ^s_cijk = (Nc θ_cijk + α) / (Nc + |Zi| α).
In Section 8.5.3, we will empirically show that the smoothing technique leads to
significant improvement in classification accuracy.
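The two smoothing formulas above can be sketched directly; `smooth_prior` and `smooth_params` are illustrative names, not from the thesis implementation:

```python
def smooth_prior(counts, alpha=1.0):
    """Smoothed class prior: P(C = c) = (N_c + alpha) / (N + r * alpha),
    where counts maps class -> N_c and r is the number of classes."""
    n, r = sum(counts.values()), len(counts)
    return {c: (nc + alpha) / (n + r * alpha) for c, nc in counts.items()}

def smooth_params(theta, n_c, card, alpha=1.0):
    """Smoothed CPT column: (N_c * theta + alpha) / (N_c + |Z| * alpha),
    applied entrywise to one conditional distribution of cardinality `card`."""
    return [(n_c * t + alpha) / (n_c + card * alpha) for t in theta]

# Toy usage: 9 instances of class 1 and 1 of class 2, Laplace factor alpha = 1.
prior = smooth_prior({1: 9, 2: 1})                  # pulls both classes toward 1/r
column = smooth_params([1.0, 0.0], n_c=10, card=2)  # a degenerate EM estimate, de-sharpened
print(prior, column)
```

Note that both operations preserve normalization: the smoothed prior and each smoothed column still sum to one, while extreme (0 or 1) estimates from small samples are pulled toward the uniform distribution.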
8.5 Empirical Evaluation
In this section, we empirically evaluate LTC on an array of data sets. We first demonstrate the
necessity of parameter smoothing. We then contrast LTC-E with LTC-P, and compare
the latter with several mainstream classification algorithms including NB, TAN, AODE,
and C4.5 (Quinlan, 1993). We also include a restricted version of LTC, namely LCC, in
the comparison. Finally, we use an example to show that one can reveal meaningful latent
structures with LTC.
8.5.1 Data Sets
To get data for our experiments, we started with all the 47 data sets that are used by
Friedman et al. (1997) and recommended by Weka (Witten & Frank, 2005). Most of the
data sets are from the UCI machine learning repository (Asuncion & Newman, 2007).
We preprocessed the data as follows. The learning algorithms of TAN and AODE
proposed by Friedman et al. (1997) and Webb et al. (2005), as well as Pyramid, do
not handle missing values. Thus, we removed incomplete instances from the data sets.
Among the 47 data sets, there are 10 data sets in which every instance contains missing
values. Those data sets were excluded from our experiments. TAN, AODE, LTC, and
LCC require discrete attributes. Therefore, we discretized the remaining data sets using
the supervised discretization method proposed by Fayyad and Irani (1993).
After preprocessing, we got 37 data sets. The data are from various domains such
as medical diagnosis, handwriting recognition, biology, and chemistry. The number of
attributes ranges from 4 to 61; the number of classes ranges from 2 to 26; and the sample
size ranges from 80 to 20,000. Table 8.1 summarizes the characteristics of the data.
8.5.2 Experimental Settings
We implemented LTC-E, LTC-P, NB, TAN, AODE, and LCC in Java. For C4.5, we used
the J48 implementation in WEKA. The detailed settings are as follows.
• LTC: For EAST and Pyramid, we used the mild setting as described in Section 6.3.
For parameter smoothing, we set the smoothing factor α = 1.
• NB: We smoothed the parameters in the same way as for LTC. This strategy is also
recommended by WEKA.
• TAN: We followed the parameter smoothing strategy suggested by Friedman et al.
(1997). In particular, we set the smoothing factor at 5.
Name            # Attributes   # Classes   Sample Size
Anneal          38             6           898
Australian      14             2           690
Autos           25             7           159
Balance-scale   4              3           625
Breast-cancer   9              2           277
Breast-w        9              2           683
Corral          6              2           128
Credit-a        15             2           653
Credit-g        20             2           1,000
Diabetes        8              2           768
Flare           10             2           1,066
Glass           9              7           214
Glass2          9              2           163
Heart-c         13             5           296
Heart-statlog   13             2           270
Hepatitis       19             2           80
Ionosphere      34             2           351
Iris            4              3           150
Kr-vs-kp        36             2           3,196
Letter          16             26          20,000
Lymph           18             4           148
Mofn-3-7-10     10             2           1,324
Mushroom        22             2           5,644
Pima            8              2           768
Primary-tumor   17             22          132
Satimage        36             6           6,435
Segment         19             7           2,310
Shuttle-small   9              7           5,800
Sonar           60             2           208
Soybean         35             19          562
Splice          61             3           3,190
Vehicle         18             4           846
Vote            16             2           232
Vowel           13             11          990
Waveform-21     21             3           5,000
Waveform-5000   40             3           5,000
Zoo             17             7           101

Table 8.1: The 37 data sets used in the experiments.
• AODE: As suggested by Webb et al. (2005), we set the frequency limit on super
parent at 30. We also smoothed the parameters in the same way as for LTC.
• C4.5: We used the default setting suggested by WEKA.
• LCC: Similar to LTC, we partitioned the data by class and learned a latent class
model for each class. Starting with the latent class model whose latent variable has
cardinality 2, we greedily increased the cardinality until the AIC score
of the model ceased to increase. EM was run with the same settings as for EAST
and Pyramid when learning LTC. The parameters were smoothed in the same way as
for LTC.
Given a data set, we estimated the classification accuracy of an algorithm using strat-
ified 10-fold cross validation (Kohavi, 1995). All the algorithms were run on the same
training/test splits.
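Stratified 10-fold cross validation splits each class separately so that every fold preserves the overall class proportions. A minimal sketch of the fold assignment (a simple round-robin within each class; the actual experiments presumably also shuffle the instances first):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each instance index to one of k folds, keeping class
    proportions roughly equal across folds."""
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # round-robin the instances of this class across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Toy usage: 20 'a' instances and 10 'b' instances, k = 10.
labels = ['a'] * 20 + ['b'] * 10
folds = stratified_folds(labels, k=10)
# Every fold holds 2 'a' instances and 1 'b' instance.
```

Each fold then serves once as the test set while the remaining k − 1 folds form the training set, and the k accuracies are averaged.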
All the classifiers were trained on a server with two dual core AMD Opteron 2.4GHz
CPUs and tested on a machine with an Intel Pentium IV 3.4GHz CPU.
8.5.3 Effect of Parameter Smoothing
We start by investigating the effect of parameter smoothing. In the experiments, we first
ran LTC-E and LTC-P on all the data sets. We then turned off the parameter smoothing
module and re-ran the two algorithms. In the following, we compare the performance of
the smoothed and unsmoothed versions of each algorithm.
Table 8.2 shows the mean and the standard deviation of the classification accuracy of
LTC-E and LTC-P with/without parameter smoothing. We first look at the two columns
for LTC-E. For each data set, the entry with the higher accuracy is highlighted in boldface.
For each variant of LTC-E, Table 8.2 also reports the average accuracy over all the data
sets and the number of wins, i.e., the number of data sets on which it achieved higher
accuracy. Note that the sum of the number of wins of the two variants is larger than 37,
the total number of data sets. This is because the two variants achieved the same level
of accuracy on 3 data sets.
By examining Table 8.2, we can see that parameter smoothing leads to better perfor-
mance in general. The smoothed version achieved higher overall classification accuracy
and won on 35 of the 37 data sets.
To compare the two variants of LTC-E, we also conducted a one-tailed paired t-test with
p = 0.05. As shown in the last row of Table 8.2, the smoothed version significantly beat
the unsmoothed version on 16 data sets. Those data sets are indicated by small circles
Data Set        LTC-E Smoothed   LTC-E Non-smoothed   LTC-P Smoothed   LTC-P Non-smoothed
Anneal          98.44±1.41 ◦     96.66±1.57           98.66±1.02       97.77±1.17
Australian      85.51±2.90 ◦     84.20±3.51           85.07±3.21       84.20±4.07
Autos           84.83±10.89 ◦    67.96±7.33           82.29±10.86 ◦    68.54±9.79
Balance-scale   70.88±4.09       70.88±4.09           71.03±3.85       71.03±3.85
Breast-cancer   74.01±7.33 ◦     70.77±5.15           74.37±5.71 ◦     70.04±5.04
Breast-w        97.37±1.50       96.78±1.66           97.37±2.04 ◦     96.93±2.52
Corral          100.00±0.00      100.00±0.00          100.00±0.00      100.00±0.00
Credit-a        86.53±4.09 ◦     85.01±4.47           85.61±3.67 ◦     84.23±3.77
Credit-g        73.30±4.57       72.80±4.92           73.00±4.92 ◦     72.30±4.85
Diabetes        77.48±3.82       77.35±3.83           76.96±3.05       76.83±2.97
Flare           82.65±2.58       82.37±2.39           83.49±1.79       83.40±1.68
Glass           75.69±6.96       74.72±6.93           75.69±8.28       74.24±7.67
Glass2          85.18±9.44       84.60±9.82           84.56±9.48       83.97±9.81
Heart-c         82.79±4.84       81.41±4.33           82.45±5.71       81.08±5.84
Heart-statlog   82.59±9.57       82.22±9.85           83.33±8.05       83.33±8.05
Hepatitis       91.25±11.86      88.75±9.22           90.00±12.91      91.25±8.44
Ionosphere      93.17±2.74       92.31±2.34           93.45±2.69       92.32±2.66
Iris            96.00±4.66       95.33±4.50           94.67±5.26       94.67±4.22
Kr-vs-kp        96.90±1.22       97.00±1.19           95.49±1.24       95.62±1.17
Letter          92.30±0.44 ◦     91.10±0.62           91.80±0.58 ◦     91.04±0.83
Lymph           85.76±6.78       81.24±11.63          87.19±4.88 ◦     77.71±10.93
Mofn-3-7-10     93.73±2.00       93.80±2.15           92.60±1.29       92.60±1.29
Mushroom        100.00±0.00      100.00±0.00          100.00±0.00      100.00±0.00
Pima            76.57±4.32       76.44±4.30           77.09±4.06       77.09±3.72
Primary-tumor   40.99±11.09 ◦    28.85±12.10          45.55±13.48 ◦    30.44±11.42
Satimage        89.65±1.69 ◦     87.74±1.42           89.87±1.07 ◦     88.31±1.16
Segment         95.58±1.55 ◦     93.94±1.40           96.06±1.28 ◦     94.50±1.34
Shuttle-small   99.88±0.12 ◦     99.43±0.26           99.86±0.14 ◦     99.47±0.19
Sonar           85.64±8.34       84.67±7.63           83.74±8.68       82.76±8.98
Soybean         93.42±2.50 ◦     74.04±4.17           93.78±2.82 ◦     76.87±4.09
Splice          94.67±1.33       94.17±2.21           92.29±1.42 ◦     90.85±1.73
Vehicle         73.63±4.53 ◦     71.62±5.51           75.17±5.26 ◦     72.21±6.60
Vote            95.86±3.22       94.25±3.27           94.94±3.57       93.57±3.02
Vowel           79.60±3.15 ◦     78.28±2.79           79.80±4.74       80.00±3.22
Waveform-21     85.90±1.71 ◦     85.72±1.67           85.96±1.61 ◦     85.76±1.70
Waveform-5000   86.16±1.51 ◦     85.92±1.43           86.06±1.47       85.98±1.52
Zoo             94.09±5.09 ◦     82.18±6.29           94.09±5.09 ◦     82.18±6.29

Mean            86.43±4.16       83.91±4.21           86.31±4.19       83.87±4.21
# Wins          35               5                    34               10
# Sig. Wins     16               0                    16               0

Table 8.2: The classification accuracy of LTC-E and LTC-P with/without parameter smoothing. Boldface numbers denote higher accuracy. Small circles indicate significant wins.
in the table. Moreover, the smoothed version never significantly lost to the unsmoothed
version.
The conclusions for LTC-E carry over to LTC-P. With parameter smoothing turned on,
LTC-P achieved higher overall accuracy and won on 34 out of the 37 data sets. The
smoothed version of LTC-P also significantly outperformed the unsmoothed version (16
wins, 0 losses).
In summary, the smoothing technique often leads to improvement in classification
accuracy and never significantly degrades the accuracy. In the following experiments, we
conduct parameter smoothing by default.
8.5.4 LTC-E versus LTC-P
We next compare LTC-E with LTC-P. We consider their performance in two respects,
namely classification accuracy and computational efficiency.
We first compare the classification accuracy. For clarity, we reproduce the accuracy
of LTC-E and LTC-P in Table 8.3. Again, boldface numbers denote higher accuracy, while small circles indicate significant wins according to a t-test with p = 0.05. From Table 8.3, we can see that LTC-E performs slightly better than LTC-P. In particular, LTC-E achieved slightly higher overall classification accuracy. It beat LTC-P on 21 data sets but lost on 20 data sets. However, the difference between the two algorithms on most of the
data sets was statistically insignificant. See the last line in Table 8.3 for the number of
significant wins.
In terms of computational efficiency, LTC-P compares more favorably to LTC-E. Fig-
ure 8.2 plots the training time of LTC-E and LTC-P on the 37 data sets. Note that the
data sets are sorted in an ascending order with respect to the time that LTC-E spent on
them. From this figure, we can see that LTC-P ran consistently faster than LTC-E. On
10 of the 37 data sets, LTC-P was faster than LTC-E by at least an order of magnitude.
The largest difference occurred on Kr-vs-kp where LTC-P was more than 30 times faster
than LTC-E.
Taking both aspects into consideration, we argue that LTC-P achieves a better trade-off between classification performance and computational efficiency. Therefore, we only consider this algorithm in what follows. We will refer to it as LTC for short.
8.5.5 Comparison with the Other Algorithms
We now compare LTC with NB, TAN, AODE, C4.5, and LCC. The classification accuracies of these algorithms are shown in Table 8.4. From this table, we can see that LTC achieved
Data Set LTC-E LTC-P
Anneal 98.44±1.41 98.66±1.02
Australian 85.51±2.90 85.07±3.21
Autos 84.83±10.89 82.29±10.86
Balance-scale 70.88±4.09 71.03±3.85
Breast-cancer 74.01±7.33 74.37±5.71
Breast-w 97.37±1.50 97.37±2.04
Corral 100.00±0.00 100.00±0.00
Credit-a 86.53±4.09 85.61±3.67
Credit-g 73.30±4.57 73.00±4.92
Diabetes 77.48±3.82 76.96±3.05
Flare 82.65±2.58 83.49±1.79
Glass 75.69±6.96 75.69±8.28
Glass2 85.18±9.44 84.56±9.48
Heart-c 82.79±4.84 82.45±5.71
Heart-statlog 82.59±9.57 83.33±8.05
Hepatitis 91.25±11.86 90.00±12.91
Ionosphere 93.17±2.74 93.45±2.69
Iris 96.00±4.66 94.67±5.26
Kr-vs-kp 96.90±1.22 ◦ 95.49±1.24
Letter 92.30±0.44 ◦ 91.80±0.58
Lymph 85.76±6.78 87.19±4.88
Mofn-3-7-10 93.73±2.00 ◦ 92.60±1.29
Mushroom 100.00±0.00 100.00±0.00
Pima 76.57±4.32 77.09±4.06
Primary-tumor 40.99±11.09 45.55±13.48 ◦
Satimage 89.65±1.69 89.87±1.07
Segment 95.58±1.55 96.06±1.28 ◦
Shuttle-small 99.88±0.12 99.86±0.14
Sonar 85.64±8.34 83.74±8.68
Soybean 93.42±2.50 93.78±2.82
Splice 94.67±1.33 ◦ 92.29±1.42
Vehicle 73.63±4.53 75.17±5.26
Vote 95.86±3.22 94.94±3.57
Vowel 79.60±3.15 79.80±4.74
Waveform-21 85.90±1.71 85.96±1.61
Waveform-5000 86.16±1.51 86.06±1.47
Zoo 94.09±5.09 94.09±5.09

Mean 86.43±4.16 86.31±4.19
# Wins 21 20
# Sig. Wins 4 2
Table 8.3: Comparison of classification accuracy between LTC-E and LTC-P.
[Figure: training time in seconds (log scale, 10^0 to 10^6) against data set index, with one curve for LTC-E and one for LTC-P.]

Figure 8.2: The training time of LTC-E and LTC-P.
the best overall accuracy, followed by AODE, TAN, LCC, C4.5, and NB, in that order. In
terms of the number of wins, LTC was also the best (13 wins), with AODE (10 wins) and
TAN (7 wins) being the two runners-up. If we consider only generative approaches and
ignore C4.5, the difference is even larger. LTC achieves 3 additional wins on Kr-vs-kp,
Mofn-3-7-10, and Vote, i.e., 16 wins in total. The other algorithms remain unaffected.
We also conducted one-tailed paired t-test to compare LTC with each of the other
algorithms. Again, we set the p-value at 0.05. The numbers of significant wins, ties, and losses are given in Table 8.5. It shows that LTC significantly outperformed NB (17 wins/3 losses) and C4.5 (12/3). LTC was also better than TAN (8/4), AODE (7/5), and LCC
(7/2).
Besides accuracy, classification efficiency is also a concern. As mentioned in Section
8.3, theoretically, making a prediction with LTC takes time O(|X| · r), where |X| is the
number of attributes and r is the number of classes. Figure 8.3 reports the average time
that it takes for LTC to classify an instance in each data set used in our experiments. The
data sets are sorted in ascending order with respect to the running time. The running
time ranges from 1 ms to 38 ms. We argue that LTC is efficient enough for practical use.
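The prediction step behind these timings can be sketched generically: one generative model per class, combined by Bayes rule, so prediction costs one likelihood evaluation per class. The per-class log-likelihood below is a toy stand-in (independent Bernoulli attributes); in LTC it would be computed from the learned latent tree model of that class.

```python
import math

# Sketch of LTC-style prediction: pick the class c maximizing
# log P(c) + log P(x | c). With one model per class and a tree evaluation
# linear in the number of attributes |X|, prediction is O(|X| * r).

def predict(x, class_models, priors):
    scores = {c: math.log(priors[c]) + loglik(x)
              for c, loglik in class_models.items()}
    return max(scores, key=scores.get)

# Toy stand-in models: two classes with different per-attribute rates.
def make_loglik(p):
    return lambda x: sum(math.log(p if xi else 1 - p) for xi in x)

models = {"true": make_loglik(0.9), "false": make_loglik(0.2)}
priors = {"true": 0.5, "false": 0.5}
print(predict([1, 1, 1, 0], models, priors))   # "true"
```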
Figure 8.3 also shows the running time of TAN, AODE, and LCC. LTC was consistently
slower than TAN. Depending on the number of attributes and the cardinalities of latent
variables, LTC was slower than AODE on some data sets and faster on the others. In
most cases, the difference was small. We also see that LTC was almost as efficient as
LCC.
Domain LTC NB TAN AODE C4.5 LCC
Anneal 98.66±1.02 96.10±2.54 98.22±1.68 98.44±1.41 98.77±0.98 98.78±1.43
Australian 85.07±3.21 85.51±2.65 85.22±5.19 86.09±3.50 85.65±4.07 84.93±3.82
Autos 82.29±10.86 72.88±10.12 83.67±7.29 80.42±8.40 78.58±8.54 77.92±13.76
Balance-scale 71.03±3.85 70.71±4.08 70.39±4.17 69.59±4.01 69.59±4.27 69.28±4.10
Breast-cancer 74.37±5.71 75.41±6.44 67.53±5.27 76.85±8.48 74.39±7.34 71.51±9.51
Breast-w 97.37±2.04 97.51±2.19 96.78±2.16 97.36±2.04 95.76±2.61 97.52±1.53
Corral 100.00±0.00 85.96±7.05 100.00±0.00 89.10±8.98 94.62±8.92 97.63±3.82
Credit-a 85.61±3.67 87.29±3.53 86.38±3.27 87.59±3.51 86.99±4.48 86.68±3.30
Credit-g 73.00±4.92 75.80±4.32 73.50±3.63 77.10±4.38 72.10±4.46 72.90±3.90
Diabetes 76.96±3.05 77.87±3.50 78.77±3.32 78.52±4.11 78.26±3.97 76.44±4.29
Flare 83.49±1.79 80.30±3.42 82.84±2.26 82.46±2.31 82.09±1.80 83.31±2.53
Glass 75.69±8.28 74.37±8.97 77.60±7.85 76.19±7.41 73.94±9.76 75.71±8.50
Glass2 84.56±9.48 83.97±8.99 87.06±8.03 83.97±9.91 84.01±7.32 85.18±9.44
Heart-c 82.45±5.71 84.11±7.85 82.79±5.54 83.77±6.80 74.66±6.49 80.38±3.27
Heart-statlog 83.33±8.05 83.33±6.36 81.85±7.08 81.85±6.86 81.85±5.91 79.63±8.24
Hepatitis 90.00±12.91 85.00±15.37 88.75±13.76 85.00±12.91 90.00±14.19 86.25±12.43
Ionosphere 93.45±2.69 90.60±3.83 92.60±4.27 92.31±2.34 89.17±5.35 92.31±2.70
Iris 94.67±5.26 94.00±5.84 94.00±5.84 93.33±5.44 94.00±4.92 94.00±5.84
Kr-vs-kp 95.49±1.24 ◦ 87.89±1.81 92.24±2.24 91.18±0.83 99.44±0.48 92.49±1.72
Lymph 87.19±4.88 83.67±6.91 82.38±7.41 85.62±8.66 78.33±10.44 82.95±9.07
Mofn-3-7-10 92.60±1.29 ◦ 85.35±1.53 91.31±2.00 88.97±2.60 100.00±0.00 92.38±1.85
Mushroom 100.00±0.00 97.41±0.72 99.98±0.06 100.00±0.00 100.00±0.00 100.00±0.00
Pima 77.09±4.06 78.13±4.24 78.78±4.50 78.65±3.81 78.38±2.90 76.83±3.53
Primary-tumor 45.55±13.48 47.14±11.59 42.53±11.57 47.14±11.00 43.24±10.55 45.60±11.04
Segment 96.06±1.28 91.52±1.60 95.89±1.37 95.63±1.23 95.32±1.63 95.41±1.73
Shuttle-small 99.86±0.14 99.34±0.27 99.79±0.11 99.83±0.11 99.59±0.19 99.86±0.14
Sonar 83.74±8.68 85.62±5.41 85.64±8.70 87.07±6.31 79.81±8.14 84.14±11.24
Soybean 93.78±2.82 91.64±4.44 93.06±3.31 91.99±4.22 91.82±3.75 91.28±3.70
Splice 92.29±1.42 95.36±1.00 95.33±1.39 96.21±1.07 94.36±1.58 96.11±1.60
Vehicle 75.17±5.26 62.65±4.15 73.04±4.52 73.06±4.65 71.99±3.45 73.76±4.64
Vote 94.94±3.57 ◦ 89.91±4.45 93.36±4.84 94.03±4.07 95.18±4.48 94.71±2.67
Vowel 79.80±4.74 67.07±6.14 88.59±2.61 81.92±4.11 80.91±2.31 79.49±3.44
Waveform-21 85.96±1.61 81.76±1.49 82.92±1.45 86.60±1.26 75.44±2.10 86.06±1.71
Waveform-5000 86.06±1.47 80.74±1.38 82.04±1.25 86.36±1.65 76.48±1.47 86.10±1.79
Zoo 94.09±5.09 93.18±7.93 94.09±6.94 94.18±6.60 92.18±8.94 94.18±6.60
Letter 91.80±0.58 74.04±1.04 86.43±0.67 88.91±0.50 78.63±0.62 92.47±0.65
Satimage 89.87±1.07 82.42±1.51 88.16±0.99 89.26±0.59 84.37±1.34 89.20±1.22

Mean 86.31±4.19 83.12±4.72 85.77±4.23 85.85±4.49 84.32±4.59 85.50±4.62
# Best 13 (16) 3 7 10 5 4
Table 8.4: The classification accuracy of the tested algorithms. The 3 entries indicated by small circles become the best when C4.5 is excluded.
NB TAN AODE C4.5 LCC
# Wins 17 8 7 12 7
# Ties 17 25 25 22 28
# Losses 3 4 5 3 2
Table 8.5: The number of times that LTC significantly won, tied with, and lost to the other algorithms.
[Figure: classification time per instance in seconds (log scale, 10^-4 to 10^-1) against data set index, with curves for LTC, TAN, AODE, and LCC.]

Figure 8.3: The classification time of different classifiers.
8.5.6 Appreciating Learned Models
One advantage of LTC is that it can capture concepts underlying a domain and automatically discover interesting subgroups within each class. In this section, we present one such example.
The example involves the Corral data set. It contains two classes true and
false, and six boolean attributes A0, A1, B0, B1, Irrelevant, and Correlated. The
target concept is (A0∧A1)∨ (B0∧B1). Irrelevant is an irrelevant random attribute, and
Correlated is loosely correlated to the class variable.
We learned an LTC from the Corral data and obtained two LTMs, one for each class.
We denote the LTMs by Mt and Mf , respectively. Their structures are shown in Figure 8.4. The numbers in parentheses denote the cardinalities of latent variables. The width of an edge denotes the mutual information between the incident nodes.
The model Mt contains one binary latent variable Yt. Yt partitions the samples of
Figure 8.4: The structures of the LTMs for Corral data: (a) Mt; (b) Mf.
the true class into two groups, each corresponding to one of its two states. We call those
groups latent classes. Similarly, the model Mf contains two binary latent variables Yf1
and Yf2. Each latent variable partitions the samples of the false class into two latent
classes in a peculiar way. We look into each latent class and obtain some interesting
findings: (1) the latent classes Yt = 1 and Yt = 2 correspond to the two components of
the concept, A0 ∧A1 and B0∧B1, respectively; (2) the latent classes Yf1 = 1 and Yf1 = 2
correspond to ¬A0 and ¬A1, while the latent classes Yf2 = 1 and Yf2 = 2 correspond to
¬B0 and ¬B1; (3) the latent variables Yf1 and Yf2 jointly enumerate the four cases in which the target concept (A0 ∧ A1) ∨ (B0 ∧ B1) is not satisfied. The details are presented in
the following.
We first notice that in both models, the four attributes A0, A1, B0, and B1 are closely
correlated to their parents. In contrast, Irrelevant and Correlated are almost independent of their parents. This is interesting, as both models correctly picked out the four attributes relevant to the target concept. Henceforth, we focus only on those four attributes.
To understand the characteristics of each latent class, we examine the conditional distribution of each attribute, i.e., P(X|Y = 1) and P(X|Y = 2) for all X ∈ {A0, A1, B0, B1} and Y ∈ {Yt, Yf1, Yf2}. Those distributions are plotted in Figure 8.5. The height of a bar indicates the corresponding probability value.
We start with the latent classes associated with Yt. In latent class Yt = 1, A0 and A1 always take the value true, while B0 and B1 vary at random. Clearly, this group of instances belongs to class true because they satisfy A0 ∧ A1. In contrast, in latent class Yt = 2, B0 and B1 always take the value true, while A0 and A1 vary at random. Clearly,
this group corresponds to the concept B0 ∧ B1.
We next examine the two latent variables in Mf . It is clear that A0 never occurs
in latent class Yf1 = 1, while A1 never occurs in latent class Yf1 = 2. Therefore, the
two latent classes correspond to ¬A0 and ¬A1, respectively. Yf1 thus enumerates the two cases in which A0 ∧ A1 is not satisfied. Similarly, we find that B0 never occurs in latent
class Yf2 = 1, while B1 never occurs in latent class Yf2 = 2. Therefore, the two latent
[Figure: bar charts of the probability of each attribute A0, A1, B0, B1 within each latent class: (a) Yt = 1: A0 ∧ A1; (b) Yt = 2: B0 ∧ B1; (c) Yf1 = 1: ¬A0; (d) Yf1 = 2: ¬A1; (e) Yf2 = 1: ¬B0; (f) Yf2 = 2: ¬B1.]

Figure 8.5: The attribute distributions in each latent class and the corresponding concept.
classes correspond to ¬B0 and ¬B1, respectively. Yf2 thus enumerates the two cases in which B0 ∧ B1 is not satisfied. Consequently, Yf1 and Yf2 jointly represent the four cases in which the target concept (A0 ∧ A1) ∨ (B0 ∧ B1) is not satisfied.
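This decomposition of the negated concept is exactly what De Morgan's laws predict, and it can be checked mechanically by enumerating all 16 attribute assignments:

```python
from itertools import product

# The target concept is (A0 and A1) or (B0 and B1). By De Morgan, it is
# false exactly when one of {not A0, not A1} holds together with one of
# {not B0, not B1} -- the four combinations that the latent classes of
# Yf1 and Yf2 were found to enumerate.

def concept(a0, a1, b0, b1):
    return (a0 and a1) or (b0 and b1)

for a0, a1, b0, b1 in product([False, True], repeat=4):
    falsified = (not a0 or not a1) and (not b0 or not b1)
    assert concept(a0, a1, b0, b1) == (not falsified)
print("negation check passed")
```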
8.6 Related Work
There is a large body of literature on generative classifiers that attempts to improve classification accuracy by modeling attribute dependencies. These classifiers mainly divide into two categories: those that directly model relationships among attributes, and those that attribute such relationships to latent variables. Besides TAN and AODE, examples from the first category include general Bayesian network classifiers and Bayesian multinets (Friedman et al.,
1997). The latter learns a Bayesian network for each class and uses them jointly to make predictions. Our method is based on a similar idea, but we learn an LTM to represent
the joint distribution of each class.
Our method falls into the second category. In this category, various latent variable
models have been tested for continuous data. To give two examples, Monti and Cooper (1995) combine a finite mixture model with the naive Bayes classifier. The resultant model is a continuous counterpart of LCC. Langseth and Nielsen (2005) propose the latent classification model. It uses a mixture of factor analyzers to represent attribute dependencies.
In contrast, we are aware of much less work on categorical data. The work most closely related to ours is the hierarchical naive Bayes model (HNB) proposed by Zhang et al. (2004). HNB also exploits LTMs to model the relationships among attributes.
It differs from LTC in two aspects. First, HNB assumes that the LTM structures are identical for all classes, while LTC describes different classes using different LTMs. Therefore, HNB cannot reveal diversity across classes for domains like Corral. Second, HNB models the attribute dependencies using a forest of LTMs, each over a disjoint subset of attributes. In contrast, LTC connects all attributes using one single tree. Recently, Hinton et al. (2006) proposed the notion of the deep belief net (DBN). DBN models the attribute dependencies using multiple layers of densely connected latent variables. It is designed for image recognition problems and handles only binary attributes.
8.7 Summary
We propose a novel generative classifier, namely, the latent tree classifier. It exploits LTMs to estimate the class-conditional distribution of each class, and applies Bayes rule to make predictions. We considered both EAST and Pyramid for learning LTC. Empirical
results suggest that Pyramid makes a better tradeoff between classification accuracy and
computational efficiency than EAST.
LTMs can capture complex relationships among attributes within each class. Therefore,
LTCs usually approximate generative distributions well and often yield good classification
accuracy. In particular, we empirically show that LTC compares favorably to mainstream
classification algorithms including NB, TAN, AODE, and C4.5. In terms of classification
efficiency, LTC is good enough for real-world applications. Though it is slower than NB
and TAN, it is as efficient as AODE. We further demonstrate an advantage of LTCs, i.e.,
they can reveal underlying concepts and discover interesting subgroups within each class.
As far as we know, this advantage is unique to our method.
CHAPTER 9
CONCLUSIONS AND FUTURE WORK
We conclude this thesis by providing a summary of contributions and discussing several
possible future directions.
9.1 Summary of Contributions
We study the use of latent tree models for density estimation of discrete random vari-
ables. LTMs can represent complex relationships among manifest variables and yet are
computationally simple to work with. They were recognized as potentially good models
for density estimation two decades ago. However, this potential had not been realized before this thesis due to the lack of efficient algorithms for LTMs. Only special LTMs,
such as latent class models, were used for density estimation previously.
This thesis is the first to investigate the practical use of unrestricted LTMs for density
estimation. Our contributions lie in two aspects, one in developing efficient learning
algorithms for LTMs, the other in exploring novel applications of the density estimation
techniques that we develop.
On the algorithm side, we first test EAST, the previous state-of-the-art algorithm for
learning LTMs, for the task of density estimation. The results show that EAST can yield
good estimates and reveal interesting latent structures. However, it is computationally
expensive and can only be applied to small scale problems.
We then develop two algorithms that are more efficient than EAST. The first algorithm
is HCL. It is a special-purpose algorithm that requires a predetermined bound on the
complexity of the resulting LTM. It is faster than EAST by orders of magnitude but
yields significantly poorer estimates than EAST does. The second algorithm that we
develop is Pyramid. In contrast to HCL, it is a general-purpose algorithm. It is slower
than HCL but more efficient than EAST. The quality of the estimates produced by Pyramid is only slightly lower than that of the estimates produced by EAST. As such, Pyramid provides
a better tradeoff than EAST between estimation quality and computational efficiency.
On the application side, we study the use of LTMs for approximate BN inference and statistical classification. In the first application, we propose a bounded-complexity
approximate method for BN inference. The idea is to build an LTM to approximate the
distribution represented by a BN using the HCL algorithm, and make inference in the LTM
instead of the original BN. With our scheme one can trade off between the approximation
accuracy and the inferential complexity. Our scheme achieves good accuracy at low costs
in all the networks that we examined. In particular, it consistently outperforms LBP.
In the second application, we use LTMs to estimate class-conditional distributions,
and apply Bayes rule to make prediction. This leads to a novel method for classification
called latent tree classifier (LTC). We empirically show that LTC achieves comparable or
higher classification accuracy than mainstream algorithms including NB, TAN, AODE,
and C4.5. In terms of the speed of online classification, LTC is slower than NB and
TAN, and is as efficient as AODE. It is fast enough for real-world applications. We also
demonstrate that LTC can reveal underlying concepts and discover interesting subgroups
within each class. As far as we know, the second feature is unique to our method.
9.2 Future Work
Possible future work falls in three directions. We discuss them in detail in the following subsections.
9.2.1 Other Applications
The first direction is to explore other applications of the developed density estimation
techniques. An ongoing research effort is to investigate their usefulness in the ranking problem.
Given a set of labeled training data, the problem is to build a ranker to sort the test
data so that the positive samples appear before the negative samples. It has extensive
applications in marketing research, social science, information retrieval, etc.
One approach to the ranking problem is to construct an estimate of the generative distribution P (X, C), and then sort the test data in descending order with respect to the
posterior probability P (C = +|X = x) that a sample x belongs to the positive class.
Intuitively, the more accurate the estimate, the better the ranking. A natural idea is
thus to apply LTC to this problem. Since LTC can approximate the true distribution
underlying the data well, we expect it to be a good ranker also. We are testing this idea
on direct marketing data such as the CoIL Challenge 2000 (van der Putten et al., 2000),
where the task is to rank new customers so that those who would buy a particular product
appear before those who would not. We are also working with marketing researchers to
see whether LTC can discover interesting subgroups of the customers.
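The ranking scheme just described reduces to sorting by an estimated posterior. A minimal sketch with placeholder scores follows; in the proposed method, the posterior P(C = +|X = x) would come from the LTC estimate of the joint P(X, C).

```python
# Rank test samples by a (placeholder) posterior score, descending, so that
# samples believed to be positive appear first.

def rank_by_posterior(samples, posterior):
    return sorted(samples, key=posterior, reverse=True)

# Toy data: (id, true_label); scores stand in for P(C=+ | x).
samples = [("s1", 0), ("s2", 1), ("s3", 1), ("s4", 0)]
scores = {"s1": 0.2, "s2": 0.9, "s3": 0.6, "s4": 0.4}
ranked = rank_by_posterior(samples, lambda s: scores[s[0]])
print([s[0] for s in ranked])   # ['s2', 's3', 's4', 's1']
```

With this ordering, both positive samples precede both negative ones, which is the ideal outcome a ranking metric such as AUC would reward.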
9.2.2 Handling Continuous Data
LTMs assume categorical random variables and deal with only discrete data. It would be
interesting to adapt LTM to handle continuous data. One possible solution is as follows.
We use leaf nodes to represent continuous manifest variables, and still use internal nodes
to represent discrete latent variables. We still model the direct dependencies between
latent variables using conditional probability tables. In contrast, we now model the direct
dependencies between manifest variables and latent variables using conditional Gaussian
distribution. Formally, let Y be a latent variable and X be the set of manifest variables
connected to Y . The conditional distribution of X given Y is defined as a Gaussian
distribution
P (X|Y = j) = N (µj, Σj), j = 1, 2, . . . , |Y |,

with mean µj and covariance matrix Σj. One can make certain assumptions about the structure of the covariance matrix Σj, e.g., that it is the identity or a diagonal matrix. One can also consider inferring such structures from data automatically.
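Under the diagonal-covariance assumption mentioned above, the conditional density of the manifest variables attached to a latent state factorizes over dimensions and its log-density is cheap to evaluate. A minimal sketch with hypothetical parameters:

```python
import math

# Log-density of a diagonal-covariance Gaussian leaf P(X | Y = j), where
# mu and var hold the per-dimension mean and variance for latent state j.
# Working in log space avoids underflow when many leaves are multiplied.

def diag_gaussian_logpdf(x, mu, var):
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

# One latent state with hypothetical parameters.
mu_j = [0.0, 1.0]
var_j = [1.0, 4.0]
lp = diag_gaussian_logpdf([0.0, 1.0], mu_j, var_j)
# At the mean, the quadratic terms vanish and only the normalizers remain.
print(math.isclose(lp, -0.5 * (math.log(2 * math.pi * 1.0)
                               + math.log(2 * math.pi * 4.0))))   # True
```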
We refer to the models defined in this way as Gaussian latent tree models (GLTMs). A GLTM can be viewed as a generalization of the finite Gaussian mixture model (GMM) (McLachlan & Peel, 2000). The latter contains only one single latent variable and is commonly used in the literature to model density functions over continuous variables. GLTM is to GMM as LTM is to LCM. Given the fact that LTM consistently outperforms LCM in density estimation tasks with discrete data, we expect GLTM to improve over GMM in the continuous case as well.
Like in the case of LTM, to learn a GLTM from data, we need to determine the latent
variables, the tree structure that connects latent variables and manifest variables, and the
parameters, including the CPTs for the latent variables and the means µj and covariance matrices Σj for the manifest variables. We are currently considering adapting the EAST and Pyramid algorithms for this purpose.
9.2.3 Generalization to Partially Observed Trees
A restriction imposed on LTMs is that manifest variables must reside at leaf nodes.
As a consequence, interactions between manifest variables are not directly modeled but
via the latent variables. On one hand, this property enables LTMs to capture high-
order dependencies among manifest variables using a tree structure. On the other hand,
however, it makes LTMs inefficient at modeling low-order dependencies.
Consider the case where the generative distribution can be well approximated using
a tree model without latent variables. In this case, the random variables exhibit only
second-order dependencies. Now imagine constructing an LTM to estimate the generative
distribution. Since the dependencies between manifest variables are indirectly modeled by
latent variables, we usually need to set the cardinalities of latent variables at large values
in order to achieve good estimation. The resulting LTM can be much more complex than
the plain tree model. To give a concrete example, consider the results for the Mildew network in Section 7.5. In this example, the Chow-Liu tree yields higher approximation accuracy than the LTM, while the former is also much simpler than the latter.
One way to relax the restriction imposed on LTMs is to allow manifest variables to
enter the interior of models. This results in a class of models that we refer to as the
partially observed trees, or POTs for short. POTs are more flexible than LTMs. As a
matter of fact, POTs subsume both LTMs and Chow-Liu trees as special cases. Therefore, POTs have stronger expressive power than either and could lead to better solutions to the density estimation problem.
The model space of POTs is larger than that of LTMs. Therefore, the problem of
learning POTs should be at least as hard as that of learning LTMs. One possible solution
is to adapt the EAST algorithm. We can enhance the EAST search by introducing additional search operators that (1) push manifest variables at leaf nodes into the interior of the current model, and (2) pull manifest variables at interior nodes out. Since more search
operators are involved, we expect the resulting algorithm to be slower than EAST.
A potentially more efficient way to learn POTs is to start with the optimal Chow-
Liu tree and then introduce latent variables to the interior of the model. To realize this
solution, one needs to develop heuristics to determine whether it is beneficial to introduce new latent variables and how the latent variables should be added to the current model.
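For reference, the Chow-Liu starting structure mentioned above can be computed by estimating pairwise mutual information from data and extracting a maximum-weight spanning tree. A compact sketch on toy data (Kruskal's algorithm with union-find):

```python
from collections import Counter
from itertools import combinations
import math

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between variables i and j."""
    n = len(data)
    cij = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    return sum((c / n) * math.log((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in cij.items())

def chow_liu_edges(data, num_vars):
    """Maximum-weight spanning tree over pairwise mutual information."""
    parent = list(range(num_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u
    weighted = sorted(((mutual_information(data, i, j), i, j)
                       for i, j in combinations(range(num_vars), 2)),
                      reverse=True)
    edges = []
    for w, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                        # keep edge only if it joins trees
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Toy data: X1 copies X0, X2 is independent of both.
data = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
print(chow_liu_edges(data, 3))   # the perfectly correlated pair (0, 1) is kept
```

The pairwise pass costs O(n²) mutual-information estimates, which is why this plain-tree construction is cheap compared with a search over latent structures.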
Bibliography
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19 (6), 716–723.
Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository. http://www.
ics.uci.edu/~mlearn/MLRepository.html.
Bartholomew, D., & Knott, M. (1999). Latent variable models and factor analysis (2nd
edition). Arnold, London.
Bishop, C. M., Lawrence, N., Jaakkola, T., & Jordan, M. I. (1997). Approximating
posterior distributions in belief networks using mixtures. In Advances in Neural
Information Processing Systems 10, pp. 416–422.
Chen, T. (2008). Search-Based Learning of Latent Tree Models. Ph.D. thesis, The Hong
Kong University of Science and Technology.
Chen, T., Zhang, N. L., & Wang, Y. (2008). Efficient model evaluation in the search-
based approach to latent structure discovery. In Proceedings of the 4th European
Workshop on Probabilistic Graphical Models, pp. 57–64.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal
of Machine Learning Research, 3, 507–554.
Chickering, D. M., & Heckerman, D. (1997a). Efficient approximations for the marginal
likelihood of Bayesian networks with hidden variables. Machine Learning, 29 (2-3),
181–212.
Chickering, D. M., & Heckerman, D. (1997b). Efficient approximations for the marginal
likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–
212.
Choi, A., & Darwiche, A. (2006). A variational approach for approximating Bayesian
networks by edge deletion. In Proceedings of the 22nd Conference on Uncertainty
in Artificial Intelligence, pp. 80–89.
Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with
dependence trees. IEEE Transactions on Information Theory, 14 (3), 462–467.
Cooper, F. G. (1990). The computational complexity of probabilistic inference using
Bayesian belief networks. Artificial Intelligence, 42, 393–405.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley-
Interscience, New York.
Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief
networks is NP-hard. Artificial Intelligence, 60, 141–153.
Darwiche, A. (2001). Recursive conditioning. Artificial Intelligence, 125 (1–2), 5–41.
Dechter, R. (1996). Bucket elimination: A unifying framework for probabilistic inference.
In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp.
211–219.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29, 103–130.
Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley,
New York.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued
attributes for classification learning. In Proceedings of the 13th International Joint
Conference on Artificial Intelligence, pp. 1022–1027.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association, 97 (458), 611–
631.
Frey, B. J., & MacKay, D. J. C. (1997). A revolution: Belief propagation in graphs with
cycles. In Advances in Neural Information Processing Systems 10.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine
Learning, 29 (2-3), 131–163.
Green, P. (1999). Penalized likelihood. Encyclopaedia of Statistical Science, Update Volume 3, 578–586.
Guindon, S., & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate
large phylogenies by maximum likelihood. Systematic Biology, 52 (5), 696–704.
Heckerman, D. (1995). A tutorial on learning with Bayesian networks. Tech. rep. MSR-
TR-95-06, Microsoft Research.
Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic
sampling. In Uncertainty in Artificial Intelligence 2, pp. 317–324.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18 (7), 1527–1554.
Jebara, T. (2004). Machine learning : discriminative and generative. Kluwer Academic
Publishers, Boston.
Jensen, F., & Andersen, S. K. (1990). Approximations in Bayesian belief universes for
knowledge based systems. In Proceedings of the 6th Conference on Uncertainty in
Artificial Intelligence, pp. 162–169.
Jensen, F. V., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical
models by local computation. Computational Statistics Quarterly, 4, 269–282.
Jordan, M. I. (Ed.). (1998). Learning in graphical models. Kluwer Academic Publishers,
Boston.
Kindermann, R., & Snell, J. L. (1980). Markov Random Fields and Their Applications.
American Mathematical Society, Providence, RI.
Kjærulff, U. (1994). Reduction of computational complexity in Bayesian networks through
removal of weak dependences. In Proceedings of the 10th Conference on Uncertainty
in Artificial Intelligence, pp. 374–382.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation
and model selection. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence, pp. 1137–1145.
Langseth, H., & Nielsen, T. D. (2005). Latent classification models. Machine Learning,
59 (3), 237–265.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities
on graphical structures and their application to expert systems. Journal of Royal
Statistics Society (Series B), 50 (2), 157–224.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent Structure Analysis. Houghton Mifflin,
Boston.
Loftsgaarden, D. O., & Quesenberry, C. P. (1965). A nonparametric estimate of a multi-
variate density function. Annals of Mathematical Statistics, 36, 1049–1051.
Lowd, D., & Domingos, P. (2005). Naive Bayes models for probability estimation. In
Proceedings of the 22nd International Conference on Machine Learning, pp. 529–
536.
McLachlan, G., & Peel, D. (2000). Finite Mixture Models. John Wiley and Sons.
Monti, S., & Cooper, G. F. (1995). A Bayesian network classifier that combines a finite
mixture model and a naive Bayes model. In Proceedings of the 11th Conference on
Uncertainty in Artificial Intelligence, pp. 447–456.
Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for ap-
proximate inference: An empirical study. In Proceedings of the 15th Conference on
Uncertainty in Artificial Intelligence, pp. 467–475.
Nachman, I., Elidan, G., & Friedman, N. (2004). “Ideal parent” structural learning for continuous variable networks. In Proceedings of the 20th Conference on Uncertainty
in Artificial Intelligence, pp. 400–409.
Ng, A., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison
of logistic regression and naive Bayes. In Advances in Neural Information Processing
Systems, Vol. 14.
Niblett, T. (1987). Constructing decision trees in noisy domains. In Proceedings of the
Second European Working Session on Learning, pp. 67–78.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals
of Mathematical Statistics, 33 (3), 1065–1076.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers, San Mateo, CA.
Pradhan, M., Provan, G., Middleton, B., & Henrion, M. (1994). Knowledge engineering
for large belief networks. In Proceedings of the 10th Conference on Uncertainty in
Artificial Intelligence, pp. 484–490.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Robertson, N., & Seymour, P. D. (1984). Graph minors III: Planar tree-width. Journal
of Combinatorial Theory (Series B), 36, 49–64.
Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs informative learning. In Proceed-
ings of the 3rd International Conference on Knowledge Discovery and Data Mining,
pp. 49–53.
Sarkar, S. (1995). Modeling uncertainty using enhanced tree structures in expert systems.
IEEE Transactions on Systems, Man, and Cybernetics, 25 (4), 592–604.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable
networks. In Advances in Neural Information Processing Systems 8, pp. 486–492.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6 (2),
461–464.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualiza-
tion. Wiley, New York.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7,
221–264.
Shenoy, P., & Shafer, G. (1990). Axioms for probability and belief-function propagation.
In Uncertainty in AI, Vol. 4, pp. 169–198.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman
and Hall, New York.
van der Putten, P., & van Someren, M. (Eds.). (2000). CoIL Challenge 2000: The Insur-
ance Company Case. Sentient Machine Research, Amsterdam.
van der Putten, P., de Ruiter, M., & van Someren, M. (2000). CoIL challenge 2000 tasks
and results: Predicting and explaining caravan policy ownership.
van Engelen, R. A. (1997). Approximating Bayesian belief networks by arc removal. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19 (8), 916–920.
Wang, Y., & Zhang, N. L. (2006). Severity of local maxima for the EM algorithm:
Experiences with hierarchical latent class models. In Proceedings of the 3rd European
Workshop on Probabilistic Graphical Models, pp. 301–308.
Wang, Y., Zhang, N. L., & Chen, T. (2008a). Latent tree models and approximate
inference in Bayesian networks. In Proceedings of the 23rd National Conference on
Artificial Intelligence, pp. 1112–1118.
Wang, Y., Zhang, N. L., & Chen, T. (2008b). Latent tree models and approximate
inference in Bayesian networks. Journal of Artificial Intelligence Research, 32, 879–
900.
Webb, G. I., Boughton, J. R., & Wang, Z. (2005). Not so naive Bayes: Aggregating
one-dependence estimators. Machine Learning, 58, 5–24.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and
Techniques (2nd edition). Morgan Kaufmann, San Francisco.
Zhang, N. L. (2004). Hierarchical latent class models for cluster analysis. Journal of
Machine Learning Research, 5 (6), 697–723.
Zhang, N. L., & Kocka, T. (2004). Efficient learning of hierarchical latent class models.
In Proceedings of the 16th IEEE International Conference on Tools with Artificial
Intelligence, pp. 585–593.
Zhang, N. L., Nielsen, T. D., & Jensen, F. V. (2004). Latent variable discovery in
classification models. Artificial Intelligence in Medicine, 30 (3), 283–299.
Zhang, N. L., & Poole, D. (1994). A simple approach to Bayesian network computations.
In Proceedings of the 10th Canadian Conference on Artificial Intelligence, pp. 171–
178.
Zhang, N. L., Wang, Y., & Chen, T. (2008). Discovery of latent structures: Experience
with the CoIL challenge 2000 data set. Journal of Systems Science and Complexity,
21 (2), 172–183.
Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2007). Hierarchical latent class models
and statistical foundation for traditional Chinese medicine. In Proceedings of the
11th Conference on Artificial Intelligence in Medicine, pp. 139–143.
Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2008a). Latent tree models and diagnosis
in traditional Chinese medicine. Artificial Intelligence in Medicine, 42 (3), 229–245.
Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2008b). Statistical validation of traditional
Chinese medicine theories. Journal of Alternative and Complementary Medicine,
14 (5), 583–587.