Statistical Methods for Genome-wide Association Studies and Personalized Medicine
by
Jie Liu
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Sciences)
at the UNIVERSITY OF WISCONSIN-MADISON
2014
Date of final oral examination: 05/16/14 (9am)
Room for final oral examination: CS 4310
Committee in charge:
C. David Page Jr., Professor, Biostatistics and Medical Informatics
Xiaojin Zhu, Associate Professor, Computer Sciences
Jude Shavlik, Professor, Computer Sciences
Elizabeth Burnside, Associate Professor, Radiology
Chunming Zhang, Professor, Statistics
Abstract
In genome-wide association studies (GWAS), researchers analyze the genetic variation across
the entire human genome, searching for variations that are associated with observable traits or
certain diseases. There are several inference challenges in GWAS, including the huge number
of genetic markers to test, the weak association between truly associated markers and the traits,
and the correlation structure between the genetic markers. This thesis mainly develops statistical
methods that are suitable for genome-wide association studies and their clinical translation for
personalized medicine.
After we introduce more background and related work in Chapters 1 and 2, we further discuss
the problem of high dimensional statistical inference, especially capturing the dependence among
multiple hypotheses, which has been under-utilized in classical multiple testing procedures. Chap-
ter 3 proposes a feature selection approach based on a unique graphical model which can leverage
correlation structure among the markers. This graphical model-based feature selection approach
significantly outperforms the conventional feature selection methods used in GWAS. Chapter 4
reformulates this feature selection approach as a multiple testing procedure that has many elegant
properties, including controlling false discovery rate at a specified level and significantly improving the power of the tests by leveraging dependence. In order to relax the parametric assumption within the graphical model, Chapter 5 further proposes a semiparametric graphical model for multiple testing under dependence, which estimates f1 adaptively. This semiparametric approach still effectively captures the dependence among multiple hypotheses and no longer requires us to specify the parametric form of f1. It exactly generalizes the local FDR procedure [38] and connects with the BH procedure [12].
These statistical inference methods are based on graphical models, and their parameter learn-
ing is difficult due to the intractable normalization constant. Capturing the hidden patterns and
heterogeneity within the parameters is even harder. Chapters 6 and 7 discuss the problem of learn-
ing large-scale graphical models, especially dealing with issues of heterogeneous parameters and
latently-grouped parameters. Chapter 6 proposes a nonparametric approach which can adaptively
integrate, during parameter learning, background knowledge about how the different parts of the
graph can vary. For learning latently-grouped parameters in undirected graphical models, Chapter
7 imposes Dirichlet process priors over the parameters and estimates the parameters in a Bayesian
framework. The estimated model generalizes significantly better than standard maximum likeli-
hood estimation.
Chapter 8 explores the potential translation of GWAS discoveries to clinical breast cancer
diagnosis. With support from the Wisconsin Genomics Initiative, we genotyped a breast cancer
cohort at Marshfield Clinic and collected corresponding diagnostic mammograms. We discovered
that, using SNPs known to be associated with breast cancer, we can better stratify patients and
thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk of
overdiagnosis. This result suggests that when radiologists are making medical decisions from
mammograms (such as suggesting follow-up biopsies), they can consider these risky SNPs for
more accurate decisions if the patients’ genotype data are available.
Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Contributions
  1.3 Thesis Statement

2 Related Work
  2.1 Hypothesis Testing for Case-control Association Studies
    2.1.1 Single-marker Methods
    2.1.2 Parametric Multiple-marker Methods
    2.1.3 Nonparametric Multiple-marker Methods
  2.2 Multiple Testing
    2.2.1 Error Criteria
    2.2.2 P-value Thresholding Methods
    2.2.3 Local False Discovery Rate Methods
    2.2.4 Local Significance Index Methods
  2.3 Graphical Models
    2.3.1 Maximum Likelihood Parameter Learning
    2.3.2 Bayesian Parameter Learning
    2.3.3 Inference Algorithms
  2.4 Feature and Variable Selection

3 High-Dimensional Structured Feature Screening Using Markov Random Fields
  3.1 Introduction
  3.2 Method
    3.2.1 Feature Relevance Network
    3.2.2 The Construction Step
    3.2.3 The Inference Step
    3.2.4 Related Methods
  3.3 Simulation Experiments
  3.4 Real-world Application
    3.4.1 Background
    3.4.2 Experiments on CGEMS Data
    3.4.3 Validating Findings on Marshfield Data
  3.5 Discussion

4 Multiple Testing under Dependence via Parametric Graphical Models
  4.1 Introduction
  4.2 Method
    4.2.1 Terminology and Previous Work
    4.2.2 The Multiple Testing Procedure
    4.2.3 Posterior Inference
    4.2.4 Parameters and Parameter Learning
  4.3 Basic Simulations
  4.4 Simulations on Genetic Data
  4.5 Real-world Application
  4.6 Discussion

5 Multiple Testing under Dependence via Semiparametric Graphical Models
  5.1 Introduction
  5.2 Preliminaries
  5.3 Methods
    5.3.1 Graphical Models for Multiple Testing
    5.3.2 Nonparametric Estimation of f1
    5.3.3 Parametric Estimation of φ and π
    5.3.4 Inference of θ and FDR Control
  5.4 Connections with Classical Multiple Testing Procedures
  5.5 Simulations
  5.6 Application
  5.7 Discussion

6 Learning Heterogeneous Hidden Markov Random Fields
  6.1 Introduction
  6.2 Models
    6.2.1 HMRFs and the Homogeneity Assumption
    6.2.2 Heterogeneous HMRFs
  6.3 Parameter Learning Methods
    6.3.1 Contrastive Divergence for MRFs
    6.3.2 Expectation-Maximization for Learning Conventional HMRFs
    6.3.3 Learning Heterogeneous HMRFs
    6.3.4 Geometric Interpretation
  6.4 Simulations
  6.5 Real-world Application
  6.6 Discussion

7 Bayesian Estimation of Latently-grouped Parameters in Graphical Models
  7.1 Introduction
  7.2 Maximum Likelihood Estimation and Bayesian Estimation for MRFs
  7.3 Bayesian Parameter Estimation for MRFs with Dirichlet Process Prior
    7.3.1 Metropolis-Hastings (MH) with Auxiliary Variables
    7.3.2 Gibbs Sampling with Stripped Beta Approximation
  7.4 Simulations
    7.4.1 Simulations on Tree-structure MRFs
    7.4.2 Simulations on Small Grid-MRFs
    7.4.3 Simulations on Large Grid-MRFs
  7.5 Real-world Application
  7.6 Discussion

8 Genetic Variants Improve Personalized Breast Cancer Diagnosis
  8.1 Introduction
  8.2 Materials and Methods
    8.2.1 Data
    8.2.2 Model
  8.3 Results
    8.3.1 Performance of Combined Models
    8.3.2 Performance of Genetic Models
    8.3.3 Comparing Breast Imaging Model and Genetic Model
  8.4 Discussion

9 Future Work
Chapter 1
Introduction
1.1 Background
The Human Genome Project, completed in 2003, made it possible for the first
time to read the complete genetic blueprint of human beings. Since then, researchers have been
investigating the germline genetic variants that are associated with heritable diseases and
traits in humans, an effort known as genome-wide association studies (GWAS). GWAS analyze the
genetic variation across the entire human genome, searching for variations that are associated
with observable traits or certain diseases. In machine learning terminology, typically an example
in GWAS is a human, the response variable is a disease such as breast cancer, and the features
(or variables) are the single positions in the entire genome where individuals can vary, known as
single-nucleotide polymorphisms (SNPs). The primary goal in GWAS is to identify all the SNPs
that are relevant to the diseases or the observable traits.
GWAS are characterized by high dimensionality. The human genome has roughly 3 billion positions, roughly 3 million of which are SNPs. State-of-the-art technology enables measurement
of a million SNPs in one experiment for a cost of hundreds of US dollars. Although this means
the full set of known SNPs cannot be measured in one experiment at present, SNPs that are close
together on the genome are often highly correlated. Hence the omission of some SNPs is not as
much of a problem as one might first think. Instead, we have the problem of strong correlation
among our features: most SNPs are very highly correlated with one or more nearby SNPs, with
squared Pearson correlation coefficients well above 0.8.
Another problem making GWAS especially challenging is weak association, namely that the truly
relevant markers are very rare and only weakly associated with the response variable. The first
reason is that most diseases have both a genetic and environmental component. Because of the
environmental component, we cannot expect to achieve anywhere near 100% accuracy in GWAS.
For example, it is estimated that genetics accounts for only about 27% of breast cancer risk [102].
Therefore, given equal numbers of breast cancer patients and controls without breast cancer, the
highest predictive accuracy we can reasonably expect from genetic features alone is about 63.5%,
obtainable by correctly predicting the controls and correctly recognizing 27% of the cancer cases
based on genetics. Furthermore, breast cancer and many other diseases are polygenic, and there-
fore the genetic component is spread over multiple genes. Based on these two observations, we
expect the contribution from any one feature (SNP) toward predicting disease to be quite small.¹
Indeed, one published study [82] identified only 4 SNPs associated with breast cancer.
the most strongly associated SNP (rs1219648) is tested for its predictive accuracy on this same
training set from which it was identified (almost certainly yielding an overly-optimistic accuracy
estimate), the model based on this SNP is only 53% accurate, where majority-class or uniform
random guessing is 50% accurate. Lending further credibility, another published study [33] on breast
cancer identified 11 SNPs from a different dataset. The reported individual odds ratios
for the 11 SNPs are estimated to be around 0.95-1.26, and most of them were not found to be
significant in the former study [82]. Therefore, for breast cancer and other diseases, we expect the
signal from each relevant feature to be very weak.
¹Rare alleles for a few SNPs, such as those in the BRCA1 and BRCA2 genes, have a large effect but are very rare. Others that are common have only a weak effect.

The combination of high dimensionality and weak association makes it extremely difficult to
detect the truly associated genetic markers. Suppose a truly relevant genetic marker is weakly
associated with the class variable. If its odds ratio is around 1.2, given one thousand cancer cases and
one thousand controls, this marker will not look significantly different between cases and controls,
that is, among examples of different classes. At the same time, if we have an extremely large num-
ber of features, and relatively little data, many irrelevant markers may look better than this relevant
marker by chance alone, especially given even a modest level of noise as occurs in GWAS. Related
work [187] provides a formula to assess the false positive report probability (FPRP), the proba-
bility of no true association between a genetic variant and disease given a statistically significant
finding. If we assume there are around 1,000 truly associated SNPs out of the total 500,000 and
keep the significance level at 0.05, the FPRP will be around 99%. This means almost all the
selected features in this case are false positives.
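The FPRP calculation above can be sketched in a few lines. The formula below follows the standard form FPRP = απ₀/(απ₀ + (1−β)π₁), where π₁ is the prior fraction of truly associated markers and 1−β is the power; the power values used here are hypothetical illustrations, not figures from [187]:

```python
def fprp(alpha, power, pi1):
    """False positive report probability: P(no true association | significant result),
    where pi1 is the prior fraction of truly associated markers."""
    pi0 = 1.0 - pi1
    return (alpha * pi0) / (alpha * pi0 + power * pi1)

# 1,000 truly associated SNPs out of 500,000, significance level 0.05.
# With a hypothetical power of 0.1, FPRP is around 99%, as in the text.
print(round(fprp(alpha=0.05, power=0.1, pi1=1000 / 500000), 3))  # → 0.996
```

Even with a much more optimistic (and still hypothetical) power of 0.5, the FPRP remains around 98%, illustrating how the tiny prior π₁ dominates the result.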
Hypothesis testing is one important statistical inference method for genetic association analy-
sis, since one can simply test the significance of association between one genetic marker and the
response variable. However in GWAS, there are usually hundreds of thousands of genetic markers
to test at the same time. Suppose that we have genotyped a total number of m SNPs, and we
have performed m tests simultaneously with each test applying to one genetic marker. In such
a multiple testing situation, we can categorize the results from the m tests as in Table 1.1. One
important criterion, the false discovery rate (FDR), defined as $E(N_{10}/R \mid R > 0)\,P(R > 0)$, depicts
the expected proportion of incorrectly rejected null hypotheses (i.e., type I errors). Another criterion, the false non-discovery rate (FNR), defined as $E(N_{01}/S \mid S > 0)\,P(S > 0)$, depicts the expected
proportion of incorrectly non-rejected non-null hypotheses (i.e., type II errors).
            H0 not rejected   H0 rejected   Total
H0 true          N00              N10         m0
H0 false         N01              N11         m1
Total             S                R           m

Table 1.1: The classification of tested hypotheses.
A multiple testing procedure is termed valid if it controls FDR at the prespecified level α,
and is termed optimal if it also has the smallest FNR among all the valid procedures at level α.
Most FDR-controlling procedures focus on the validity issue and assume the tests are independent. In GWAS, however, the tests for highly correlated SNPs are dependent due to the linkage
disequilibrium between them.
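As a concrete reference point, the classical valid procedure under independence is the Benjamini-Hochberg (BH) step-up procedure [12], reviewed in Section 2.2.2. A minimal illustrative sketch (not code from this thesis):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of rejected hypotheses, controlling FDR at level alpha.
    Valid when the tests are independent."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha, then reject the k smallest.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Toy example: the two strong signals survive, the borderline and null ones do not.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.60], alpha=0.05))  # → [0, 1]
```

Note that this procedure treats the p-values as exchangeable and ignores any correlation structure among the tests, which is exactly the limitation the graphical-model-based procedures of Chapters 4 and 5 address.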
On the clinical translation side, high hopes for using genetic profiling for personalized medicine
have been, in part, driven by the rapid progress of genome-wide association studies, which con-
tinue identifying more common genetic variants associated with diseases with high population
prevalence. At the same time, large multi-relational databases containing variables that are infor-
mative of disease risk are increasingly available, providing the opportunity for informatics tools
to better stratify individuals for appropriate healthcare decisions and explore disease mechanism
and behavior. Coincident with this, policy-makers have recommended that interventions, like breast
cancer screening with mammography, be increasingly based on individualized risk and shared
decision-making [132, 158]. The opportunity to use this data to interpret genetic/phenotype asso-
ciation, explain family aggregation of heritable diseases, and shed light on disease mechanism or
natural history is just becoming possible.
1.2 Contributions
The first contribution of this thesis is in the area of high dimensional statistical inference, es-
pecially dealing with the dependence among multiple hypotheses, which has been ignored or
under-utilized in classical multiple testing procedures. This line of work is motivated by a real-
world genome-wide association study (GWAS) on breast cancer. With NCI’s CGEMS dataset
[82], which contains 528,173 genetic markers (single-nucleotide polymorphisms, or SNPs) for
1,145 patients and 1,142 controls, the goal is to identify the genetic markers that are associated
with breast cancer. We propose a feature selection approach based on a unique graphical model
which can leverage correlation structure among the markers. This graphical model-based feature
selection approach significantly outperforms the conventional feature selection methods used in
GWAS. The method can be further reformulated as a multiple testing procedure that has many ele-
gant properties, including controlling false discovery rate at a specified level, significantly improv-
ing the power of the tests by leveraging dependence, and generalizing classical multiple testing
procedures such as the Benjamini-Hochberg procedure [12] and the local FDR procedure [38].
The second contribution of this thesis is in the area of learning large-scale graphical models,
especially dealing with issues of latently-grouped parameters and heterogeneous parameters. This
contribution is motivated by the need for efficient, effective parameter learning in our aforemen-
tioned graphical model-based inference approaches. Parameter learning of undirected graphical
models is difficult due to the intractable normalization constant, and capturing the hidden patterns
and heterogeneity within the parameters is even harder. For learning latently-grouped parameters
in undirected graphical models, we impose Dirichlet process priors over the parameters and es-
timate the parameters in a Bayesian framework. The estimated model generalizes significantly
better than standard maximum likelihood estimation. We also propose a nonparametric approach
which can adaptively integrate, during parameter learning, background knowledge about how the
different parts of the graph can vary.
Last but not least, the thesis also explores the potential translation of GWAS discoveries to clinical breast cancer diagnosis. With support from the Wisconsin Genomics Initiative, we genotyped
a breast cancer cohort at Marshfield Clinic and collected corresponding diagnostic mammograms.
We discovered that, using SNPs known to be associated with breast cancer, we can better stratify
patients and thereby significantly reduce false positives during breast cancer diagnosis, alleviating
the risk of overdiagnosis. This result suggests that when radiologists are making medical decisions
from mammograms (such as suggesting follow-up biopsies), they can consider these risky SNPs
for more accurate decisions if the patients’ genotype data are available.
1.3 Thesis Statement
The dependence in multiple testing can be effectively captured by a Markov-random-field-coupled
mixture model (a.k.a. hidden Markov random field), with FDR controlled at a nominal level and
FNR reduced significantly. The hidden pattern among the Markov random fields can be recovered
during parameter learning with a Bayesian estimation approach. The heterogeneity in the hidden
Markov random fields can also be captured by a nonparametric method. Using SNPs known to be
associated with breast cancer, we can stratify breast cancer patients at the time of mammograms,
and thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk
of overdiagnosis.
Chapter 2
Related Work
This thesis covers many topics that are related to multiple hypothesis testing, graphical models and
variable selection. This chapter summarizes the related work as follows. Section 2.1 reviews a va-
riety of hypothesis testing procedures used in the GWAS community, including the single-marker
methods in Subsection 2.1.1, the parametric multiple-marker methods in Subsection 2.1.2, and
the nonparametric multiple-marker methods in Subsection 2.1.3. Section 2.2 further summarizes
many aspects of multiple testing procedures, including the evaluation criteria in Subsection 2.2.1
and different types of procedures in Subsections 2.2.2, 2.2.3, and 2.2.4. Section 2.3 summarizes
related work for graphical models, including maximum likelihood estimation in Subsection 2.3.1,
Bayesian estimation in Subsection 2.3.2 and inference algorithms in Subsection 2.3.3. Since the
proposed method is also related to variable selection, relevant approaches are also summarized in
Section 2.4.
2.1 Hypothesis Testing for Case-control Association Studies
2.1.1 Single-marker Methods
In a case-control genetic association study, single-marker analysis, which tests the association
between the response variable and an individual SNP, is often used. In such a hypothesis test,
the null hypothesis is that the SNP is not associated with the response variable. The alternative
hypothesis is that the SNP is associated. Assume that there are r cases and s controls in a case-control genetic association study, and that there are two alleles, G and g, at a given SNP locus with
three possible genotypes, namely gg, Gg and GG. Further assume that there are no missing values
and we can observe the genotype counts as in Table 2.1. Approaches that perform hypothesis
testing on genotype counts are called genotype-based methods. The following subsections discuss
several typical genotype-based methods and the connections between them. Those genotype-based
methods include
• Genotype-based Pearson’s χ2 test
• Cochran-Armitage’s trend test
• Likelihood-ratio test, Wald test, and score test with logistic regression
Genotypes    gg    Gg    GG    Total
Case         r0    r1    r2      r
Control      s0    s1    s2      s
Total        n0    n1    n2      n

Table 2.1: Genotype counts at a given SNP in a case-control genetic association study.
From Table 2.1, we can easily get the counts for the two alleles at the given SNP locus, as
shown in Table 2.2. Therefore, we can carry out the hypothesis test on the allele level. Hypothesis
test methods on the allele level are referred to as allele-based methods. The next subsections dis-
cuss several typical allele-based methods and the connections between them. Those allele-based
methods include
• Two-proportion z-test
• Allele-based Pearson’s χ2 test
Alleles    g                   G                   Total
Case       u0 (= 2r0 + r1)     u1 (= 2r2 + r1)     u (= 2r)
Control    v0 (= 2s0 + s1)     v1 (= 2s2 + s1)     v (= 2s)
Total      m0 (= 2n0 + n1)     m1 (= 2n2 + n1)     m (= 2n)

Table 2.2: Allele counts at a given SNP in a case-control genetic association study.
Two-proportion z-test
We assume that the alleles are Bernoulli distributed. Further we assume that we have r i.i.d. cases and s i.i.d. controls, with allele counts as in Table 2.2. Let $F_A^+$ denote the random variable of the alleles in the positive samples and $F_A^-$ the random variable of the alleles in the negative samples:

$$F_A^+ \sim \mathrm{Bernoulli}(p_A^+), \quad F_A^- \sim \mathrm{Bernoulli}(p_A^-). \qquad (2.1)$$

Here $p_A^+$ and $p_A^-$ are the population probabilities that $F_A$ is 1 (corresponding to the allele G) in the positive and negative populations, respectively, and $p_A$ is the population probability that $F_A$ is 1 in the whole population. Accordingly, $\hat{p}_A^+$, $\hat{p}_A^-$ and $\hat{p}_A$ are the sample-based versions of $p_A^+$, $p_A^-$ and $p_A$. We can calculate them from Table 2.2 as

$$\hat{p}_A^+ = \frac{u_1}{u}, \quad \hat{p}_A^- = \frac{v_1}{v}, \quad \hat{p}_A = \frac{m_1}{m}. \qquad (2.2)$$

Following [40], if we approximate

$$\hat{p}_A^+(1-\hat{p}_A^+)/u + \hat{p}_A^-(1-\hat{p}_A^-)/v \approx \hat{p}_A(1-\hat{p}_A)(u+v)/(uv), \qquad (2.3)$$

then the test statistic for $F_A$ is

$$S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{\frac{u+v}{uv}}\sqrt{\hat{p}_A(1-\hat{p}_A)}}. \qquad (2.4)$$

$S_A$ is approximately normally distributed with variance 1 and mean $\lambda_A\sqrt{\frac{2uv}{u+v}}$, where

$$\lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2}\,\sqrt{p_A(1-p_A)}}. \qquad (2.5)$$

$\lambda_A\sqrt{\frac{2uv}{u+v}}$ is termed the non-centrality parameter. Under $H_0$, $S_A$ is approximately standard normally distributed. Under $H_1$, $S_A$ is approximately normally distributed with variance 1 and mean $\lambda_A\sqrt{\frac{2uv}{u+v}}$. The power of the test, the probability of identifying the associated feature $F_A$ at some significance level $\alpha$, is

$$P\!\left(\alpha, \lambda_A\sqrt{\tfrac{2uv}{u+v}}\right) = 1 - \frac{1}{\sqrt{2\pi}} \int_{\Phi^{-1}(\alpha/2)+\lambda_A\sqrt{2uv/(u+v)}}^{\Phi^{-1}(1-\alpha/2)+\lambda_A\sqrt{2uv/(u+v)}} e^{-\frac{x^2}{2}}\, dx, \qquad (2.6)$$

where $\Phi^{-1}$ is the quantile function of the standard normal distribution [40]. For any given significance level $\alpha$, the power of the test is entirely determined by the non-centrality parameter: for a given sample set, the larger $\lambda_A$ is, the larger the power of the test.
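The power formula (2.6) amounts to evaluating the standard normal CDF at two shifted quantiles, which is easy to sketch with the standard library (`NormalDist` requires Python 3.8+; the counts below are hypothetical):

```python
from statistics import NormalDist

def ztest_power(alpha, lam, u, v):
    """Power of the two-proportion z-test, formula (2.6):
    1 minus the probability that S_A falls in the acceptance region under H1."""
    nd = NormalDist()
    nc = lam * (2 * u * v / (u + v)) ** 0.5   # non-centrality parameter
    lower = nd.inv_cdf(alpha / 2) + nc
    upper = nd.inv_cdf(1 - alpha / 2) + nc
    return 1 - (nd.cdf(upper) - nd.cdf(lower))

# Sanity check: under H0 (lambda_A = 0) the power equals the significance level.
print(round(ztest_power(0.05, 0.0, 2000, 2000), 3))   # → 0.05
# Power grows with the non-centrality parameter.
print(round(ztest_power(0.05, 0.05, 2000, 2000), 3))
```

This mirrors the remark in the text: for fixed α and sample sizes, the power is a monotone function of the non-centrality parameter alone.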
Allele-based Pearson’s χ2 test
Pearson’s χ2 test can be used to test whether or not an observed frequency distribution differs
from a theoretical distribution. In the context of allele-based association analysis, it tests whether
or not the observed frequency distribution of the minor allele in cases differs from that in controls.
Based on the counts from the Table 2.2, the test statistic is
SABP =1∑i=0
[(ui − umi/m)2
umi/m+
(vi − vmi/m)2
vmi/m]. (2.7)
Under the null hypothesis, SABP has an asymptotic χ2 distribution with 1 degree of free-
11
dom. Under the alternative hypothesis, SABP has an asymptotic non-central χ2 distribution with
1 degree of freedom, and the non-centrality parameter δABP is
δABP = uv(p+A − p
−A)2[
1
up+A + vp−A
+1
u(1− p+A) + v(1− p−A)
]. (2.8)
The power of the test is only determined by the non-centrality parameter δABP . In fact, δABP
is the square of the non-centrality parameter λA√
2uvu+v in two-proportion z-test under the approx-
imation (2.3) in the two-proportion z-test.
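This relationship can be checked numerically: with the pooled variance of (2.3), the allele-based χ² statistic (2.7) equals the square of the z statistic (2.4) exactly for any 2×2 allele table. A small sketch on hypothetical allele counts:

```python
def ztest_stat(u0, u1, v0, v1):
    """Two-proportion z statistic S_A of (2.4), with pooled variance as in (2.3)."""
    u, v = u0 + u1, v0 + v1
    m1, m = u1 + v1, u + v
    p_case, p_ctrl, p_pool = u1 / u, v1 / v, m1 / m
    return (p_case - p_ctrl) / (((u + v) / (u * v)) ** 0.5 * (p_pool * (1 - p_pool)) ** 0.5)

def allele_chi2(u0, u1, v0, v1):
    """Allele-based Pearson chi-square statistic S_ABP of (2.7)."""
    u, v, m = u0 + u1, v0 + v1, u0 + u1 + v0 + v1
    stat = 0.0
    for ui, vi in ((u0, v0), (u1, v1)):
        mi = ui + vi
        stat += (ui - u * mi / m) ** 2 / (u * mi / m) + (vi - v * mi / m) ** 2 / (v * mi / m)
    return stat

# Hypothetical allele counts: cases (u0, u1) and controls (v0, v1).
z = ztest_stat(1200, 800, 1300, 700)
chi2 = allele_chi2(1200, 800, 1300, 700)
print(abs(chi2 - z ** 2) < 1e-9)   # → True
```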
Genotype-based Pearson’s χ2 test
Pearson’s χ2 test can also be used in the context of genotype-based association analysis. It tests
whether or not the observed frequency distribution of the three genotypes in cases differs from
that in controls. Based on the counts from Table 2.1, the test statistic is
SGBP =2∑i=0
[(ri − rni/n)2
rni/n+
(si − sni/n)2
sni/n]. (2.9)
Under the null hypothesis, SGBP has an asymptotic χ2 distribution with 2 degrees of free-
dom. Under the alternative hypothesis, SGBP has an asymptotic non-central χ2 distribution with
2 degrees of freedom, and the non-centrality parameter δGBP [121],
δGBP = rs[(p+gg − p−gg)2
rp+gg + sp−gg
+(p+gG − p
−gG)2
rp+gG + sp−gG
+(p+GG − p
−GG)2
rp+GG + sp−GG
], (2.10)
where the distribution of the genotypes is multinomial with the parameter vector (r; p+gg, p
+gG, p
+GG)
for cases and with the parameter vector (s; p−gg, p−gG, p
−GG) for controls. The power of the test is
only determined by the non-centrality parameter δGBP .
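As an illustration, (2.9) can be computed directly from genotype counts; for 2 degrees of freedom the χ² survival function simplifies to exp(−x/2), so no special library is needed for the P-value. The counts below are hypothetical:

```python
import math

def genotype_chi2(case, ctrl):
    """Genotype-based Pearson chi-square statistic S_GBP of (2.9).
    case, ctrl: genotype counts (gg, Gg, GG) for cases and controls."""
    r, s = sum(case), sum(ctrl)
    n = r + s
    stat = 0.0
    for ri, si in zip(case, ctrl):
        ni = ri + si
        stat += (ri - r * ni / n) ** 2 / (r * ni / n) + (si - s * ni / n) ** 2 / (s * ni / n)
    return stat

# Hypothetical genotype counts; the 2-df chi-square P-value is exp(-stat/2).
stat = genotype_chi2((500, 380, 120), (540, 360, 100))
pvalue = math.exp(-stat / 2)
print(round(stat, 3), round(pvalue, 3))
```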
Cochran-Armitage’s trend test
The Cochran-Armitage trend test [31, 3] is usually used in categorical data analysis to test for the presence of an association between a binary variable and a variable with k categories, where k is usually greater than 2. It modifies Pearson's χ² test to incorporate a suspected trend in the effects of the k levels of the second variable. Therefore, in genotype-based association analysis, we need to associate a vector of scores $(x_0, x_1, x_2)$ with the genotypes to specify the trend we want to test. The scores $(x_0, x_1, x_2)$ are equivalent to the scores $(0, x, 1)$ by a linear transformation ($x = (x_1 - x_0)/(x_2 - x_0)$).

For the scores, researchers usually use the penetrances of the genotypes or equivalent scores via a linear transformation. Denote the penetrances of gg, gG and GG by $f_0$, $f_1$ and $f_2$, respectively. Then the relative risks are defined to be $\gamma_i = f_i/f_0$ for $i = 0, 1, 2$. Similarly, we define $\delta_i = (1 - f_i)/(1 - f_0)$, which can be regarded as the relative resistance to the disease. Further denote the population genotype probabilities by $g_0 = \Pr(gg)$, $g_1 = \Pr(gG)$ and $g_2 = \Pr(GG)$. Then by Bayes' rule, we can express the genotype probabilities in cases as $p_i = \gamma_i g_i / \sum_i \gamma_i g_i$ and the genotype probabilities in controls as $q_i = \delta_i g_i / \sum_i \delta_i g_i$. The null hypothesis is $p_i = q_i$ for $i = 0, 1, 2$, which is equivalent to $\gamma_1 = \gamma_2 = 1$. The alternative hypothesis can be either $\gamma_2 > \gamma_1 \geq 1$ or $\gamma_2 \geq \gamma_1 > 1$.

When the scores are $(0, 1/2, 1)$, the trend we test is from an additive model with $\gamma_1 = (1 + \gamma_2)/2$. When the scores are $(0, 0, 1)$, the trend we test is from a recessive model with $\gamma_1 = 1$. When the scores are $(0, 1, 1)$, the trend we test is from a dominant model with $\gamma_1 = \gamma_2$. The multiplicative model ($\gamma_2 = \gamma_1^2$) and the additive model are asymptotically equivalent as $(\gamma_1, \gamma_2)$ approaches the null value $(1, 1)$.
With the scores (x0, x1, x2), the Cochran-Armitage test statistic [163, 52] is
ZCATT =U√
V ar(U), (2.11)
where
13
U =1
n
2∑i=0
xi(s× ri − r × si). (2.12)
Under the null hypothesis, the expectation of U is 0, and the variance of U is
V arH0(U) = nσ20 =
rs
n[
2∑i=0
x2i qi − (
2∑i=0
xiqi)2]. (2.13)
Therefore, ZCATT is asymptotic normally distributed and Z2CATT has an asymptotic distri-
bution of χ21. In applications, one may use qi = ni/n to estimate σ2
0 when qi is unknown. This
gives
ˆV arH0(U) = nσ20 =
rs
n3[n
2∑i=0
x2ini − (
2∑i=0
xini)2]. (2.14)
Under the alternative hypothesis, the expectation of U is E_{H1}(U) = nμ1, and the variance is Var_{H1}(U) = nσ1², where
\[ \mu_1 = \frac{rs}{n^2} \sum_{i=0}^{2} x_i (p_i - q_i), \tag{2.15} \]
and
\[ \sigma_1^2 = \frac{rs^2}{n^3} \Big[ \sum_{i=0}^{2} x_i^2 p_i - \Big( \sum_{i=0}^{2} x_i p_i \Big)^{2} \Big] + \frac{r^2 s}{n^3} \Big[ \sum_{i=0}^{2} x_i^2 q_i - \Big( \sum_{i=0}^{2} x_i q_i \Big)^{2} \Big]. \tag{2.16} \]
Therefore, under the alternative hypothesis, Z_CATT is asymptotically normally distributed with variance 1 and mean λ_CATT, where
\[ \lambda_{\mathrm{CATT}} = \frac{n\mu_1}{\sqrt{n\sigma_1^2}} = \frac{rs \sum_{i=0}^{2} x_i (p_i - q_i)}{\sqrt{rs^2 \big[ \sum_{i=0}^{2} x_i^2 p_i - ( \sum_{i=0}^{2} x_i p_i )^{2} \big] + r^2 s \big[ \sum_{i=0}^{2} x_i^2 q_i - ( \sum_{i=0}^{2} x_i q_i )^{2} \big]}}. \tag{2.17} \]
In other words, under the alternative hypothesis, Z_CATT² is asymptotically 1-df χ² distributed with the non-centrality parameter δ_CATT, where
\[ \delta_{\mathrm{CATT}} = \frac{rs \big[ \sum_{i=0}^{2} x_i (p_i - q_i) \big]^{2}}{s \big[ \sum_{i=0}^{2} x_i^2 p_i - ( \sum_{i=0}^{2} x_i p_i )^{2} \big] + r \big[ \sum_{i=0}^{2} x_i^2 q_i - ( \sum_{i=0}^{2} x_i q_i )^{2} \big]}. \tag{2.18} \]
The trend test has higher power than Pearson's χ² test when the suspected trend is correct, but it sacrifices the ability to detect unsuspected trends.
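To make the computation concrete, the trend statistic in (2.11) with the estimated variance (2.14) can be sketched as follows. This is an illustrative implementation, not code from the thesis; the function name `catt` and the default additive scores are my own choices.

```python
import numpy as np

def catt(cases, controls, scores=(0, 0.5, 1)):
    """Cochran-Armitage trend test for a 2x3 genotype table.

    cases, controls: counts (r0, r1, r2) and (s0, s1, s2) for gg, gG, GG.
    scores: genotype scores (x0, x1, x2), e.g. (0, 0.5, 1) for an additive model.
    Returns the Z statistic; Z**2 is asymptotically chi-square with 1 df under H0.
    """
    r_i = np.asarray(cases, dtype=float)
    s_i = np.asarray(controls, dtype=float)
    x = np.asarray(scores, dtype=float)
    r, s = r_i.sum(), s_i.sum()
    n_i = r_i + s_i
    n = r + s
    u = np.sum(x * (s * r_i - r * s_i)) / n                               # (2.12)
    var_u = r * s / n**3 * (n * np.sum(x**2 * n_i) - np.sum(x * n_i)**2)  # (2.14)
    return u / np.sqrt(var_u)
```

A genotype table with equal case and control counts gives Z = 0, as expected under the null.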
Tests with Logistic Regression
Many GWAS applications such as [82] employ logistic regression followed by a hypothesis test
to identify associated SNPs. A first step builds a logistic regression model in formula (2.19) to
predict disease from each SNP individually; in such a model the SNP is coded by two indicator
variables, one for heterozygous carrier of the minor allele (X1) and one for homozygous carrier
of the minor allele (X2). In other words, we convert AA into “X1=0, X2=0”, AB into “X1=1,
X2=0”, and BB into “X1=0, X2=1” where A stands for the common allele at this locus and B
stands for the minor allele. The dichotomous response variable Y is coded as 1 for cases and 0 for
controls.
\[ \log \frac{P(Y=1 \mid X_1, X_2)}{1 - P(Y=1 \mid X_1, X_2)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2. \tag{2.19} \]
In the second step, a hypothesis test is performed to test the fit of each logistic model and to return a P-value for each SNP. The null hypothesis H0 is that the SNP is not associated, namely β1 = β2 = 0. The alternative hypothesis H1 is that the SNP is associated, namely at least one of β1 and β2 is nonzero. The likelihood ratio test is the most commonly used method, and its test statistic is
\[ S = 2(\log L_1 - \log L_0), \tag{2.20} \]
where log L1 and log L0 are the maximized log-likelihoods under H1 and H0, respectively. Under H0, the test
statistic has an asymptotic χ2 distribution with 2 degrees of freedom. Under H1, the test statistic
has an asymptotic non-central χ2 distribution with 2 degrees of freedom. The score test and Wald
test are similar test procedures that are sometimes used.
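The two-step procedure above can be sketched as follows. This is a minimal illustration, not code from the thesis: the Newton-Raphson fitter, the function names, and the tiny ridge term added to the Hessian for numerical stability are all my own choices.

```python
import numpy as np
from scipy.stats import chi2

def logistic_loglik(X, y, n_iter=50):
    """Fit logistic regression by Newton-Raphson and return its log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])        # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (W[:, None] * X) + 1e-8 * np.eye(X.shape[1])  # ridge for stability
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def snp_lrt(x1, x2, y):
    """Likelihood ratio test of H0: beta1 = beta2 = 0 in model (2.19)."""
    ll1 = logistic_loglik(np.column_stack([x1, x2]), y)   # full model, log L1
    ll0 = logistic_loglik(np.empty((len(y), 0)), y)       # intercept only, log L0
    s = 2 * (ll1 - ll0)                                   # statistic (2.20)
    return s, chi2.sf(s, df=2)                            # 2-df chi-square p-value
```

For a strongly associated SNP the statistic S is large and the returned P-value is small; for a null SNP, S is approximately χ² with 2 degrees of freedom.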
2.1.2 Parametric Multiple-marker Methods
Analyzing the genetic association of disease with one individual marker at a time (such as the single-marker methods in Section 2.1.1) can have limited power due to the relatively small genetic effects and the neglect of interactions between SNPs. Therefore, it is of interest to test multiple SNPs (e.g., all the SNPs in a gene or a pathway) at a time, namely to test whether any of the SNPs in the set are associated with the disease. The first class of multiple-marker methods is based on individual-marker methods. In particular, a universal procedure is to apply one
individual-marker method first to test the individual significance of each SNP, and then to correct for multiple testing via Bonferroni correction, via Monte Carlo [103], or via estimating the effective number of tests [30, 134, 124]. For instance, one possible test for joint association
of multiple SNPs in a gene is the maximum of the single SNP χ2 statistic, which is known as
“max-single” [157]. The max-single test is likely to be powerful if there is only a single marker
strongly associated with disease [157]. Still, this class of multiple-marker methods relies heav-
ily on single-marker methods, and cannot accommodate complex genetic effects and interaction
effects, resulting in limited power in certain circumstances.
Another class of multiple-marker methods is based on multivariate regression, which allows for simultaneous analysis of multiple markers. One well-known procedure is the multivariate Hotelling's T² test [45]. However, these methods often offer little benefit over the multiple-marker methods based on individual-marker methods [27, 149] because of their large number of degrees of freedom.
2.1.3 Nonparametric Multiple-marker Methods
To improve power over the above standard (parametric) multiple-marker methods (either based on individual-marker methods or on multivariate methods), a class of nonparametric or semiparametric methods has been proposed. These methods include the Zglobal test [157], the pseudo F test [199], the kernel-based association test (KBAT) [125] and the kernel-machine test [201, 202].
The Zglobal test [157] is based on the motivation that the genetic similarity measured on asso-
ciated SNPs should yield higher similarity scores for cases than for controls, whereas the genetic
similarity measured on non-associated SNPs should yield comparable similarity scores for cases
and controls. Therefore, the Zglobal test essentially measures the average genetic score for all pairs
of cases and compares this to the average genetic score for all pairs of controls. This approach
uses the U -statistic to measure genetic similarity within a group. The key steps of deriving the test
statistic are as follows. First, we calculate the contrast vector
\[ \delta = U_d - U_c, \tag{2.21} \]
where Ud and Uc are the similarity vectors for the cases and controls, respectively. The lengths of δ, Ud and Uc all equal the number of markers considered in the test. Finally,
the test statistic is
\[ Z_{\mathrm{global}} = \frac{w'\delta}{\sqrt{w' V_o w}}, \tag{2.22} \]
where w is the optimal weight vector and Vo is the covariance matrix of δ under the null hypothesis. Zglobal is asymptotically standard normally distributed under the null hypothesis.
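Given the ingredients of (2.21) and (2.22), the final statistic is a direct computation. The sketch below is my own illustration: computing the similarity vectors Ud and Uc, the weights w, and the null covariance Vo from genotype data is the substantial part and is omitted here.

```python
import numpy as np

def z_global(u_cases, u_controls, w, v0):
    """Z_global statistic (2.22): weighted contrast of within-group similarity.

    u_cases, u_controls: U-statistic similarity vectors (one entry per marker).
    w: weight vector; v0: covariance matrix of the contrast delta under H0.
    """
    delta = np.asarray(u_cases, dtype=float) - np.asarray(u_controls, dtype=float)  # (2.21)
    w = np.asarray(w, dtype=float)
    return (w @ delta) / np.sqrt(w @ np.asarray(v0, dtype=float) @ w)
```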
Wessel and Schork [199] summarize seven different measures for evaluating the genetic simi-
larity (or distance) between any pair of people based on a prespecified number of genetic markers.
Suppose we have complete data on L genetic markers and M phenotype variables (such as disease status, age, blood pressure) for a group of N people. We use an N × N matrix D to denote the distance matrix for the group under any of these measures. Let H = X(X'X)⁻¹X' be the projection matrix (essentially a similarity of phenotypes), where X is the N × M covariate matrix for the phenotypes. Compute the matrix A = (a_ij) = (−d²_ij/2) and its centered matrix G. The
test statistic of the pseudo F test is
\[ F = \frac{\mathrm{tr}(HGH)}{\mathrm{tr}[(I - H)G(I - H)]}. \tag{2.23} \]
Generally, permutation tests can be used to determine the significance of the test.
Wu and his colleagues [201] propose the kernel-machine test to test the relevance of a SNP set under the semiparametric logistic kernel-machine regression model. The test statistic Q, which involves the similarity matrix K, follows a scaled χ² distribution under the null hypothesis, with scale parameter κ and degrees of freedom ν. Procedures for estimating κ and ν are also provided in the paper [201].
2.2 Multiple Testing
2.2.1 Error Criteria
When performing multiple tests, researchers in many applications tend to focus on the most significant results and use them to support their conclusions. Such unguarded selection over multiple comparisons inflates the number of false rejections of null hypotheses. Many classical multiple testing (correction) procedures (a.k.a. multiple-comparison procedures, or MCPs), such as the well-known Bonferroni correction, are designed to control the probability of making any Type I error among the multiple tests, known as the familywise error rate (FWER). Suppose we carry out m tests whose results can be categorized as in Table 2.3. The familywise error rate is defined as P(N10 ≥ 1). By contrast, the uncorrected error rate, namely the per-comparison error rate (PCER), is E(N10/m) [12].
In large-scale multiple testing problems (when m is large), the existence of false rejections
is quite common and the FWER is no longer very informative. Under such circumstances, we
may fail to reject many false null hypotheses if we still want to control the FWER at a certain
                H0 not rejected   H0 rejected   Total
H0 true         N00               N10           m0
H0 false        N01               N11           m1
Total           S                 R             m

Table 2.3: The classification of the m tested hypotheses
level. Furthermore, we should consider not only whether any error is made, but also the number of incorrect rejections [12]. Accordingly, one important criterion is the false discovery rate (FDR), which is the expected proportion of incorrectly rejected null hypotheses (type I errors) [12]. In terms of random variables, FDR is defined as E(N10/R | R > 0)P(R > 0), the expected value of the false discovery proportion (FDP), where FDP = N10/R.
Another criterion, the false non-discovery rate (FNR), defined as E(N01/S | S > 0)P(S > 0), is the expected proportion of non-rejected hypotheses that are actually non-null (type II errors) [62]. Marginal versions of FDR and FNR have also been proposed [62]. The marginal false discovery rate (mFDR), defined as E(N10)/E(R), is asymptotically equivalent to the FDR [62], namely
\[ \mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2}). \tag{2.24} \]
Similarly, the marginal false non-discovery rate (mFNR) is defined as E(N01)/E(S). Note that FNR and mFNR relate to the efficiency of multiple testing procedures, whereas the other criteria mentioned above relate to their validity.
In different settings, there exist other versions of FDR, such as the weighted FDR in weighted multiple testing [13] and the FDR for clusters (FDRcluster), which can be used when the m tests can be partitioned into homogeneous clusters [11].
2.2.2 P-value Thresholding Methods
A common class of multiple testing procedures is based on P-value thresholding. The Benjamini and Hochberg procedure (usually referred to as the BH procedure) [12] rejects individual null hypotheses by thresholding the P-values, with the objective of maximizing the number of true positives while controlling the proportion of false positives among all rejections. The BH procedure is a distribution-free, finite-sample procedure. Let P(1) ≤ ... ≤ P(m) be the ordered P-values from the m tests and P(0) = 0. The BH procedure rejects any null hypothesis whose P-value satisfies P ≤ T with
\[ T = \max \Big\{ P_{(i)} \,\Big|\, P_{(i)} \le \frac{\alpha i}{m} \Big\}, \tag{2.25} \]
which controls the false discovery rate at the level αm0/m. The threshold T for the BH procedure can also be written as
\[ T = \sup \Big\{ t \,\Big|\, \frac{t}{G_m(t)} \le \alpha \Big\}, \tag{2.26} \]
where Gm(t) is the empirical cumulative distribution function of the P-values [12, 166, 62].
Note that the FDR is controlled by the threshold T at the level αm0/m, which is more stringent than the nominal level α. Therefore, the efficiency of the BH procedure can be slightly improved by setting the threshold at
\[ T' = \sup \Big\{ t \,\Big|\, \frac{m_0 t}{m\, G_m(t)} \le \alpha \Big\}, \tag{2.27} \]
at the cost of having to accurately estimate the number of true null hypotheses m0 [14]. Other P-value thresholding methods have also been proposed, including the adaptive procedures [166, 15], the plug-in procedure [63] and the augmentation procedure [183].
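The BH step-up rule in (2.25) can be sketched directly. This is an illustrative implementation; the function name is my own.

```python
import numpy as np

def bh_procedure(pvalues, alpha=0.05):
    """Benjamini-Hochberg procedure: reject H_i with P_i <= T, T as in (2.25).

    Returns a boolean rejection mask; controls FDR at level alpha * m0 / m.
    """
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)
    k = np.max(np.nonzero(below)[0])       # largest i with P_(i) <= alpha * i / m
    return p <= p[order][k]                # reject everything at or below T
```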
2.2.3 Local False Discovery Rate Methods
Local false discovery rate methods were first introduced in [38]. Suppose that we perform m tests simultaneously with null hypotheses H1, ..., Hm and corresponding test statistics s1, ..., sm. Consider the Bayesian two-class model, in which the m hypotheses are divided into two classes (null or non-null) occurring with probabilities p0 = P(null) and p1 = P(non-null). Further assume that the density function of a test statistic is f0 if the corresponding hypothesis is null and f1 if non-null. The Bayes posterior probability that a hypothesis is null given its test statistic s is then defined to be the local false discovery rate, namely
\[ \mathrm{fdr}(s) = P(\mathrm{null} \mid s) = \frac{p_0 f_0(s)}{p_0 f_0(s) + p_1 f_1(s)}. \tag{2.28} \]
If we use tail areas (such as P-values) to derive the Bayes posterior probability, we end up with the Benjamini-Hochberg false discovery rate. Let F0 and F1 denote the cumulative distribution functions (cdfs) corresponding to f0 and f1. The posterior probability that a hypothesis is null given that its test statistic S is less than some value s is
\[ \mathrm{FDR}(s) = P(\mathrm{null} \mid S \le s) = \frac{p_0 F_0(s)}{p_0 F_0(s) + p_1 F_1(s)}. \tag{2.29} \]
[166] and [37] establish the connection between the frequentist FDR control [12] and the Bayesian FDR in formula (2.29). The key difference between the false discovery rate and the local false discovery rate is that FDR is based on tail distributions, whereas the local false discovery rate is based on densities. For example, suppose we use a P-value thresholding method to reject the 10 null hypotheses with the most extreme test statistics s(1), ..., s(10). FDR(s(10)) gives the probability of false rejection among the 10 rejected hypotheses as a group, whereas fdr(s(i)) (for i = 1, ..., 10) gives the probability of false rejection for each of the 10 rejected hypotheses individually. FDR is a conditional expectation of fdr; namely, FDR(s) is the average of fdr(S) over all S ≤ s [37]. Therefore, we can regard FDR as an error criterion for P-value thresholding methods, whereas the local false discovery rate works more like a P-value, with which we can make inference decisions. Generally, the local false discovery rate methods are expected to work better than the P-value thresholding methods, because when determining the significance of a single hypothesis, P-value thresholding methods consider the hypotheses separately, whereas local false discovery rate methods consider the m hypotheses simultaneously and incorporate the distributional information of the m test statistics.
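As a numerical illustration of (2.28), the sketch below assumes a toy two-class model with a standard normal null density and an N(2, 1) non-null density; these densities and p0 = 0.9 are hypothetical choices for demonstration only, since in practice p0 and f1 must be estimated from the data.

```python
import numpy as np
from scipy.stats import norm

def local_fdr(s, p0=0.9, null=norm(0, 1), nonnull=norm(2, 1)):
    """Local false discovery rate (2.28) under an assumed two-class model.

    The null density f0, non-null density f1, and prior p0 are illustrative
    choices here; they are not prescribed by the method itself.
    """
    f0 = null.pdf(s)
    f1 = nonnull.pdf(s)
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)
```

Near the null mode (s = 0) the local fdr is close to 1; far in the tail favored by the non-null density it drops toward 0.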
2.2.4 Local Significance Index Methods
Although local false discovery rate methods can incorporate the distributional information of the m test statistics, they still use the individual test statistic si to determine the significance level of the null hypothesis Hi. Local significance index methods [172] generalize the local false discovery rate methods by considering all m test statistics (especially the informative ones) when determining the significance of a single hypothesis. This makes them extremely useful when a dependency structure exists among the m hypotheses, such as when null or non-null hypotheses occur in clumps, chains, graphs or hierarchies. Formally, the local index of significance (LIS) for hypothesis i is defined as
\[ \mathrm{LIS}_i = P_{\vartheta}(H_i \text{ is null} \mid \text{all the observations at the } m \text{ hypotheses}), \tag{2.30} \]
where ϑ are the parameters specifying the dependency structure of the m hypotheses. In [172], Sun and Cai studied the situation in which the dependency structure comes from a chain, and used hidden Markov models to parameterize the conditional independence. They then used the forward-backward procedure, an inference algorithm specialized to hidden Markov models, to calculate the local significance indices for all m hypotheses. Finally, their procedure employed a decision rule of the form δ = [I(LISi < λ) : i = 1, ..., m] as the final output, where λ is the cut-off point. They also gave an adaptive procedure to determine λ for a given FDR level. Let Rλ = Σ_{i=1}^{m} I(LISi < λ) be the number of rejections, Vλ = Σ_{i=1}^{m} I(LISi < λ, Hi is null) the number of false rejections, and Q(λ) = E(Vλ)/E(Rλ) the marginal false discovery rate yielded by the decision rule δ = [I(LISi < λ) : i = 1, ..., m]. It can
be shown that, if k hypotheses are rejected, the marginal false discovery rate can be approximated by
\[ Q(k) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{LIS}_{(i)}, \tag{2.31} \]
where LIS(1), ..., LIS(k) are the k smallest LIS values.
Note that the approximated mFDR is the average of the LIS values of the rejected hypotheses, which parallels the relation between FDR and the local false discovery rate.
However, Sun and Cai only studied the situation in which the dependency takes the form of an HMM. We continue with their "local-index-of-significance" framework, but generalize the dependency to more general forms such as pairwise Markov random fields.
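The adaptive cut-off based on (2.31) can be sketched as follows. This is my own illustration, assuming the LIS values have already been computed by some inference procedure; the function name is mine.

```python
import numpy as np

def lis_rejections(lis, alpha=0.1):
    """Adaptive LIS procedure: reject the k hypotheses with the smallest LIS,
    where k is the largest value keeping the running average (2.31) <= alpha."""
    lis = np.asarray(lis, dtype=float)
    order = np.argsort(lis)
    running_avg = np.cumsum(lis[order]) / np.arange(1, len(lis) + 1)
    k = np.searchsorted(running_avg > alpha, True)   # first index exceeding alpha
    reject = np.zeros(len(lis), dtype=bool)
    reject[order[:k]] = True
    return reject
```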
2.3 Graphical Models
Graphical models [192] are probabilistic models representing the conditional independence between random variables via a variety of graphs. General graphical models include Bayesian networks [138, 88], which are directed, Markov random fields (a.k.a. Markov networks) [91], which are undirected, and factor graphs [109], which emphasize the factorization of the distribution they depict. Essentially, graphical models represent a joint probability of all the variables compactly, with the conditional independence expressed by graphs. For a Bayesian network on a set of d variables x = (x1, ..., xd) with conditional independence specified by a directed acyclic graph, the joint probability is
\[ P(\mathbf{x}) = \prod_{i=1}^{d} P(x_i \mid x_{\pi_i}), \tag{2.32} \]
where P(xi | xπi) is the local conditional probability of xi given its parents, and πi is the set of indices of the parent nodes of xi. For a Markov random field on a set of variables xi whose conditional independence is specified by an undirected graph, the joint probability is
\[ P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \phi_C(\mathbf{x}_C), \tag{2.33} \]
where φC(xC) is a potential function over the variables in a clique xC and 1/Z is a normalization constant. A Markov random field is said to be pairwise if the potentials are defined only over pairs of variables. Pairwise Markov random fields are related to Potts models [200]. In addition, if every variable in a pairwise Markov random field has only two states (possible values), it is an Ising model [91]. One part of the work on graphical models is learning, such as parameter learning and structure learning. Another part is inference, such as calculating the marginal probabilities of variables and finding the most probable states of the variables.
2.3.1 Maximum Likelihood Parameter Learning
Undirected graphical models (a.k.a. Markov random fields or Markov networks) are useful in many applications, but their parameter learning is difficult due to the global normalizing constant (partition function). Suppose for simplicity that we have a pairwise Markov random field on a random vector X ∈ X^d described by an undirected graph G(V, E) with node set V and edge set E, where X = {0, 1, ..., m − 1} is a discrete space. The probability of a sample x given a known parameter vector θ = {θα | α ∈ I} (I is some index set) is
\[ P(\mathbf{x}; \boldsymbol{\theta}) = \exp\{ \boldsymbol{\theta}^{T} \boldsymbol{\phi}(\mathbf{x}) - A(\boldsymbol{\theta}) \}, \tag{2.34} \]
where φ = {φα | α ∈ I} is a vector of sufficient statistics, and A(θ) is the log partition function,
\[ A(\boldsymbol{\theta}) = \log \sum_{\mathbf{x} \in \mathcal{X}^d} \exp\{ \boldsymbol{\theta}^{T} \boldsymbol{\phi}(\mathbf{x}) \}. \tag{2.35} \]
The log partition function A(θ) has the following nice properties. First, for any index α ∈ I,
\[ \frac{\partial A(\boldsymbol{\theta})}{\partial \theta_\alpha} = E_{\boldsymbol{\theta}} \phi_\alpha = \sum_{\mathbf{x} \in \mathcal{X}^d} P(\mathbf{x}; \boldsymbol{\theta}) \phi_\alpha(\mathbf{x}). \tag{2.36} \]
Second, for any indices α, β ∈ I,
\[ \frac{\partial^2 A(\boldsymbol{\theta})}{\partial \theta_\alpha \partial \theta_\beta} = E_{\boldsymbol{\theta}} \phi_\alpha \phi_\beta - E_{\boldsymbol{\theta}} \phi_\alpha \, E_{\boldsymbol{\theta}} \phi_\beta. \tag{2.37} \]
Assume that we have n independent samples X = {x1, x2, ..., xn} generated from (2.34), and we want to estimate the parameters θ. Maximum likelihood estimation (MLE) is the common method, which maximizes the log-likelihood function
\[ L(\boldsymbol{\theta} \mid \mathcal{X}) \propto \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\theta}^{T} \boldsymbol{\phi}(\mathbf{x}_j) - A(\boldsymbol{\theta}). \tag{2.38} \]
The partial derivative of L(θ|X) with respect to θα is
\[ \frac{\partial L(\boldsymbol{\theta} \mid \mathcal{X})}{\partial \theta_\alpha} = \frac{1}{n} \sum_{j=1}^{n} \phi_\alpha(\mathbf{x}_j) - E_{\boldsymbol{\theta}} \phi_\alpha. \tag{2.39} \]
From (2.37), the Hessian matrix of A(θ) is positive semidefinite because it is the covariance matrix of φ. Therefore, A(θ) is convex and L(θ|X) is concave, so in principle we can use gradient ascent to find the global maximum of the likelihood function and hence the MLE of θ. The problem, however, is that A(θ) is usually intractable according to (2.35).
The partial derivative in (2.39) can be rewritten as
\[ \frac{\partial L(\boldsymbol{\theta} \mid \mathcal{X})}{\partial \theta_\alpha} = E_{\mathcal{X}} \phi_\alpha - E_{\boldsymbol{\theta}} \phi_\alpha. \tag{2.40} \]
When the partial derivatives reach 0, we arrive at the global maximizer of L(θ|X). (There might be more than one global maximizer if (2.34) is an over-complete representation.) From (2.40),
we are looking for the estimate of the parameters θ that matches the empirical moments of the observed samples, and the method is called moment matching. Therefore, the key question is how to calculate the moments of the statistics for a specific parameter vector θ. If we can do that, we can use gradient ascent to find the global maximizer of the log-likelihood. However, except for simple models such as tree-structured graphs, exact maximum likelihood learning is intractable, because exact computation of Eθφα takes time exponential in the treewidth of the graph [153]. Another type of Markov random field with a simple closed-form MLE of the parameters (with complete data) is the chordal Markov network [94].
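For a model small enough to enumerate, moment matching with the exact gradient of (2.40) can be sketched directly. This is an illustrative toy on a 3-node binary chain (all names and the learning-rate and iteration choices are my own); for realistic models the exact moments are intractable and the sampling-based methods below are needed.

```python
import itertools
import numpy as np

# Toy pairwise binary MRF: a 3-node chain with node and edge statistics.
EDGES = [(0, 1), (1, 2)]

def suff_stats(x):
    """phi(x): node statistics followed by edge (pairwise) statistics."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [x[i] * x[j] for i, j in EDGES]])

STATES = [suff_stats(s) for s in itertools.product([0, 1], repeat=3)]

def exact_moments(theta):
    """E_theta[phi] by summing over the whole state space (tractable here only)."""
    weights = np.exp([theta @ phi for phi in STATES])
    probs = weights / weights.sum()          # dividing by the partition function
    return probs @ np.array(STATES)

def fit_mle(data, lr=0.5, n_iter=2000):
    """Gradient ascent (2.41) with exact moments: moment matching."""
    emp = np.mean([suff_stats(x) for x in data], axis=0)   # E_X[phi]
    theta = np.zeros(len(STATES[0]))
    for _ in range(n_iter):
        theta += lr * (emp - exact_moments(theta))         # gradient (2.40)
    return theta
```

At convergence the model moments match the empirical moments, which is exactly the stationarity condition of (2.40).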
Sampling-Based Methods
A few methods have been proposed to solve the problem of calculating the moments for a specific θ and then using gradient ascent to find the MLE of the parameters: MCMC-MLE [66, 218], contrastive divergence [80] and particle-filtered MCMC-MLE [4]. Essentially, all of these methods use iterative gradient ascent to find the MLE of the parameters; in iteration i the parameter update is
\[ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i + \eta \nabla L(\boldsymbol{\theta}_i \mid \mathcal{X}) = \boldsymbol{\theta}_i + \eta \big( E_{\mathcal{X}} \boldsymbol{\phi} - E_{\boldsymbol{\theta}_i} \boldsymbol{\phi} \big), \tag{2.41} \]
where η is the learning rate.
The key difference among these methods is how they sample particles and compute Eθiφ from the samples. MCMC-MLE uses importance sampling to generate particles and computes Eθiφ as
\[ E_{\boldsymbol{\theta}_i} \boldsymbol{\phi} \approx \frac{1}{s} \sum_{j=1}^{s} w_i^j \, \boldsymbol{\phi}(\mathbf{x}_0^j), \tag{2.42} \]
where s is the number of particles and w_i^j is the weight of particle x_0^j in iteration i. It can be shown that
\[ w_i^j = \frac{ \exp\{ (\boldsymbol{\theta}_i - \boldsymbol{\theta}_0)^{T} \boldsymbol{\phi}(\mathbf{x}_0^j) \} }{ \frac{1}{s} \sum_{k=1}^{s} \exp\{ (\boldsymbol{\theta}_i - \boldsymbol{\theta}_0)^{T} \boldsymbol{\phi}(\mathbf{x}_0^k) \} }, \tag{2.43} \]
where θ0 are the parameters under which the particles x_0^j (j = 1, ..., s) were generated. Note that across the iterations of (2.41), the particles x_0^j (j = 1, ..., s) stay the same; only the weights change according to (2.43) as we update θi. The use of importance sampling allows us to reuse the particles, but the weights of the particles can suffer from degeneracy when θi is far away from θ0.
In contrast, contrastive divergence (CD) methods generate samples (particles) according to θi using a Markov chain. Usually, the chain needs to reach equilibrium to generate an accurate sample, but CD's rationale is that only a rough estimate of the gradient is needed to determine the direction in which to update the parameters. Accordingly, two versions of CD have been proposed. One is CD-n, which generates a sample in iteration i by running a Markov chain for n steps under parameter θi, starting from a training sample. The other is persistent contrastive divergence, or PCD-n [178], which advances the particles from the last iteration (under parameters θi−1) for n steps under the new parameters θi. Since n is usually chosen to be 1 in CD-n, the Markov chains generating the particles are usually far from equilibrium. Because θi is close to θi+1 when the learning rate is small, persistent Markov chains are attractive. [179] discussed the interaction between the learning rate of the parameters and the mixing rate of the Markov chains, and accordingly proposed using a set of fast weights to speed up the mixing of the persistent Markov chains.
Particle-filtered MCMC-MLE essentially strikes a balance between MCMC-MLE and contrastive divergence. It uses sampling-importance-resampling and a rejuvenation step to overcome the degeneracy of the particles. In more detail, it uses the effective sample size (ESS) to monitor the quality of the particles, calculated as
\[ \mathrm{ESS}(\{w^1, ..., w^s\}) = \frac{ \big( \sum_{j} w^j \big)^{2} }{ \sum_{j} (w^j)^{2} }. \tag{2.44} \]
When the ESS drops below a certain threshold, particle-filtered MCMC-MLE invokes sampling-importance-resampling followed by a rejuvenation step. Note that this does not happen in every iteration of the parameter update (2.41), which can save computation in generating particles, a potentially costly step in high-dimensional models.
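The reweighting (2.43) and the ESS monitor (2.44) can be sketched as follows. This is illustrative code; the function names and the max-subtraction for numerical stability are my own additions.

```python
import numpy as np

def importance_weights(particles, theta_i, theta_0, suff_stats):
    """Weights (2.43) for particles drawn under theta_0 and reused at theta_i."""
    logw = np.array([(theta_i - theta_0) @ suff_stats(x) for x in particles])
    w = np.exp(logw - logw.max())      # subtract max to avoid overflow
    return w / w.mean()                # normalized so the weights average to 1

def effective_sample_size(w):
    """ESS (2.44): equals s for uniform weights, degenerates toward 1
    when a few particles dominate."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)
```

When theta_i equals theta_0 every weight is 1 and the ESS equals the number of particles; as theta_i drifts away, the ESS shrinks, signaling weight degeneracy.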
When the original probability distribution is multimodal, tempered transitions can be used to help the Markov chain jump among the multiple modes [153]. There has been other work on improving the efficiency of sampling, and hence the calculation of Eθiφ, such as considering a mixture of proposal distributions [218], which also extends the gradient ascent algorithm by taking the Hessian matrix into account.
To sum up, all of these methods use sampling to approximate the population moments and thereby calculate the gradient. They differ in (1) how frequently they generate particles, (2) where they start a Markov chain, (3) what parameters they use when running the Markov chain, and (4) how many Markov chain steps they run to generate a particle (CD-n and PCD-n).
Variational Methods
Fenchel-Legendre Duality: Let f : R^k → R. The function f* : R^k → R defined as
\[ f^{*}(\mathbf{y}) = \sup_{\mathbf{x}} \; \mathbf{y}^{T} \mathbf{x} - f(\mathbf{x}) \tag{2.45} \]
is the conjugate of the function f. The domain of the conjugate function f* consists of all y ∈ R^k for which the supremum is finite, namely for which the difference y^T x − f(x) is bounded above on the domain of f.
If f is differentiable and convex, the supremum can be found by setting the derivative with respect to x to zero, namely ∇f(x) = y; denote the solution by x*. Since y^T x − f(x) is concave in x when f is convex, x* attains the supremum. If the solution is unique, the pair (x*, y) = (x*, ∇f(x*)) is a Legendre conjugate pair.
The right-hand side of formula (2.38) can be rewritten as
\[ \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\theta}^{T} \boldsymbol{\phi}(\mathbf{x}_j) - A(\boldsymbol{\theta}) = \boldsymbol{\mu}^{T} \boldsymbol{\theta} - A(\boldsymbol{\theta}), \tag{2.46} \]
where
\[ \boldsymbol{\mu} = \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\phi}(\mathbf{x}_j) = E_{\mathcal{X}} \boldsymbol{\phi}. \tag{2.47} \]
Denote by μ(θ) the expectation of φ under the parameters θ, namely
\[ \boldsymbol{\mu}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}} \boldsymbol{\phi}. \tag{2.48} \]
When (2.34) is a minimal representation, A(θ) is strictly convex and θ is identifiable. Therefore, (θ, μ(θ)) is a Legendre conjugate pair. Denote the conjugate function of A(θ) by A*(μ). The dual parameterization of the model in terms of μ and A*(μ) is the mean value parametrization. The domain of A*(μ), namely the set {μ(θ) | θ is valid}, is called the marginal polytope of the exponential family model. It also turns out that the function A*(μ) is the negative entropy.
We have
\[ A^{*}(\boldsymbol{\mu}) = \sup_{\boldsymbol{\theta}} \; \boldsymbol{\mu}^{T} \boldsymbol{\theta} - A(\boldsymbol{\theta}). \tag{2.49} \]
Therefore, the conjugate dual problem is
\[ A(\boldsymbol{\theta}) = \sup_{\boldsymbol{\mu}} \; \boldsymbol{\theta}^{T} \boldsymbol{\mu} - A^{*}(\boldsymbol{\mu}). \tag{2.50} \]
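As a concrete illustration of this conjugate pair (a standard worked example, not from the original text), consider a single Bernoulli variable with sufficient statistic φ(x) = x:

```latex
A(\theta) = \log(1 + e^{\theta}), \qquad
\mu(\theta) = A'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} \in (0, 1).
% Solving A'(\theta) = \mu gives \theta = \log\bigl(\mu / (1 - \mu)\bigr);
% substituting into (2.49) yields the negative entropy:
A^{*}(\mu) = \sup_{\theta} \, \mu\theta - A(\theta)
           = \mu \log \mu + (1 - \mu) \log(1 - \mu).
```

Here the marginal polytope is simply the interval [0, 1], and A*(μ) is indeed the negative entropy of the Bernoulli(μ) distribution.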
There are other methods that bound the log partition function. [188] introduce a new class of upper bounds on the log partition function, based on convex combinations of distributions in the exponential domain, that is applicable to arbitrary undirected graphical models. They show that when the convex combination is over tree-structured distributions, the variational problems are convex and have a unique global minimum, which gives an upper bound on the log partition function.
Theoretical Aspects
There has been theoretical work on the convergence of MLE learning of the parameters of Markov random fields [208, 153]. Denote by θ* the maximizer of the log-likelihood function L(θ|X) in (2.38). If we use MCMC to generate s particles and approximate the log-likelihood function via (2.42) and (2.43), the approximated log-likelihood is denoted by Ls(θ|X). If the Markov chain is ergodic, then Ls(θ|X) → L(θ|X) for all θ [153]. It can also be shown that, under mild conditions, if θs is the maximizer of the approximated log-likelihood function Ls(θ|X), then θs → θ* almost surely [153]. The convergence properties of contrastive divergence are discussed in [211, 25, 173].
2.3.2 Bayesian Parameter Learning
Suppose that a Markov random field (MRF) on X (X ∈ X^d, where X is a discrete space) is parameterized by θ, and its probability mass function is P(X; θ) = P̄(X; θ)/Z(θ), where P̄(X; θ) is an unnormalized probability measure and Z(θ) is the normalizing constant, or partition function. Given a prior on θ and n i.i.d. observed data points X = {x1, ..., xn}, Bayesian parameter estimation provides the posterior distribution of θ, denoted by P(θ|X). This posterior distribution is very informative, not only because its first moment E(θ|X) (a.k.a. the Bayesian estimate) is optimal in many problems, but also because its standard deviation depicts the variability of θ, which is useful for statistical inference. However, Bayesian parameter estimation for general MRFs is known to be doubly intractable [128]. With the prior π(θ), the posterior is P(θ|X) ∝ π(θ)P̄(X; θ)/Z(θ). If we use the Metropolis-Hastings (MH) algorithm to generate posterior samples of θ, then in each MH step we have to calculate the MH ratio for the move from θ to θ*:
\[ a(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}) = \frac{ \pi(\boldsymbol{\theta}^{*}) P(\mathcal{X}; \boldsymbol{\theta}^{*}) Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{*}) }{ \pi(\boldsymbol{\theta}) P(\mathcal{X}; \boldsymbol{\theta}) Q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}) } = \frac{ \pi(\boldsymbol{\theta}^{*}) \bar{P}(\mathcal{X}; \boldsymbol{\theta}^{*}) Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{*}) Z(\boldsymbol{\theta}) }{ \pi(\boldsymbol{\theta}) \bar{P}(\mathcal{X}; \boldsymbol{\theta}) Q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}) Z(\boldsymbol{\theta}^{*}) }, \tag{2.51} \]
where Q(θ*|θ) is a proposal distribution from θ to θ*, and we accept the move from θ to θ* with probability min{1, a(θ*|θ)}.
The real hurdle in Bayesian parameter estimation for general MRFs is the intractable MH ratio in (2.51). There are three methods for calculating it in the literature. The first is to use importance sampling to estimate r = Z(θ)/Z(θ*) [118] by
\[ \hat{r}_{\mathrm{IS}} = \frac{ s_2^{-1} \sum_{t=1}^{s_2} \bar{P}(\mathbf{x}_2^{(t)}; \boldsymbol{\theta}) \, \alpha(\mathbf{x}_2^{(t)}) }{ s_1^{-1} \sum_{t=1}^{s_1} \bar{P}(\mathbf{x}_1^{(t)}; \boldsymbol{\theta}^{*}) \, \alpha(\mathbf{x}_1^{(t)}) }, \tag{2.52} \]
where x_1^{(1)}, ..., x_1^{(s_1)} are sampled from P(X; θ), x_2^{(1)}, ..., x_2^{(s_2)} are sampled from P(X; θ*), and α(X) is an arbitrary function defined on the same support as P. Theoretically, r̂_IS is a consistent estimator of Z(θ)/Z(θ*) as long as the sample averages in (2.52) converge to their corresponding population averages, which is satisfied by Markov chain Monte Carlo under regular conditions. However, the optimal choice of α depends on the ground truth of r; [118] provided several options for α, such as the geometric function α(X) = (P̄(X; θ)P̄(X; θ*))^{−1/2}, which is included in this thesis as a baseline.
The second method is to introduce auxiliary variables and cancel Z(θ)/Z(θ*) in (2.51). [122] introduces an auxiliary variable Y on the same space as X, and the state variable is extended to (θ, Y). They set the proposal distribution for the extended state to
\[ Q(\boldsymbol{\theta}, \mathbf{Y} \mid \boldsymbol{\theta}^{*}, \mathbf{Y}^{*}) = Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{*}) \, \bar{P}(\mathbf{Y}; \boldsymbol{\theta}) / Z(\boldsymbol{\theta}) \tag{2.53} \]
so as to cancel Z(θ)/Z(θ*) in (2.51). Therefore, by ignoring Y, we can generate the posterior samples of θ via Metropolis-Hastings. Technically, this auxiliary variable approach requires perfect sampling [143], but [122] pointed out that other, simpler Markov chain methods also work, with the proviso that they converge adequately to the equilibrium distribution. [128] extended the single auxiliary variable method to multiple auxiliary variables for improved efficiency, and also pointed out that the single auxiliary variable method can be simplified into a single-variable exchange algorithm. Both the single auxiliary variable algorithm and the single-variable exchange algorithm can be interpreted as importance sampling. In the auxiliary variable algorithm, r = Z(θ)/Z(θ*) is estimated by
\[ \hat{r}_{\mathrm{aux}} = \frac{ s_2^{-1} \sum_{t=1}^{s_2} \bar{P}(\mathbf{x}_2^{(t)}; \hat{\boldsymbol{\theta}}) / \bar{P}(\mathbf{x}_2^{(t)}; \boldsymbol{\theta}^{*}) }{ s_1^{-1} \sum_{t=1}^{s_1} \bar{P}(\mathbf{x}_1^{(t)}; \hat{\boldsymbol{\theta}}) / \bar{P}(\mathbf{x}_1^{(t)}; \boldsymbol{\theta}) }, \tag{2.54} \]
where x_1^{(1)}, ..., x_1^{(s_1)} are sampled from P(X; θ), x_2^{(1)}, ..., x_2^{(s_2)} are sampled from P(X; θ*), and θ̂ is some estimate of θ. In the single-variable exchange algorithm, r = Z(θ)/Z(θ*) is
estimated by
\[ \hat{r}_{\mathrm{exch}} = s^{-1} \sum_{t=1}^{s} \frac{ \bar{P}(\mathbf{x}^{(t)}; \boldsymbol{\theta}) }{ \bar{P}(\mathbf{x}^{(t)}; \boldsymbol{\theta}^{*}) }, \tag{2.55} \]
where x^{(1)}, ..., x^{(s)} are sampled from P(X; θ*). Both importance sampling and the auxiliary variable method are computationally intensive and do not perform well for large-scale models or high-dimensional parameter spaces, because in each MH step they require generating samples from P(X; θ) for a given θ via computationally expensive perfect sampling [143] or standard Gibbs sampling with long runs.
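The estimator (2.55) can be illustrated on a model small enough that Z(θ) is available in closed form and exact sampling is possible. This is a toy 2-node binary model of my own construction (all names and the sufficient statistics are my choices); for real MRFs the samples would come from an MCMC run, not enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def unnorm(x, theta):
    """Unnormalized probability P-bar(x; theta) of a toy 2-node binary model."""
    x = np.asarray(x, dtype=float)
    phi = np.array([x[0], x[1], x[0] * x[1]])     # node + edge statistics
    return np.exp(theta @ phi)

def partition(theta):
    """Exact Z(theta) by enumeration (possible only for this toy model)."""
    return sum(unnorm(x, theta) for x in itertools.product([0, 1], repeat=2))

def r_exch(theta, theta_star, s=20000):
    """Estimator (2.55) of Z(theta)/Z(theta*) from samples drawn under theta*."""
    states = list(itertools.product([0, 1], repeat=2))
    probs = np.array([unnorm(x, theta_star) for x in states])
    probs /= probs.sum()
    idx = rng.choice(len(states), size=s, p=probs)      # exact sampling (toy only)
    ratios = [unnorm(states[i], theta) / unnorm(states[i], theta_star) for i in idx]
    return np.mean(ratios)
```

Comparing the estimate against the exact ratio of partition functions confirms that (2.55) is unbiased for r.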
The third method is to use the pseudolikelihood [18] to approximate P(X; θ*) and P(X; θ) in (2.51). The pseudolikelihood approximation requires less computation, but its approximate nature means that the Markov chain no longer satisfies detailed balance, which may yield unsatisfactory performance.
2.3.3 Inference Algorithms
So far, many inference algorithms have been studied, including variable elimination, belief prop-
agation [206], junction trees [99], sampling methods [60], and variational methods [89]. For
undirected graphs without cycles or for tree-structured directed graphs, message-passing algo-
rithms provide exact inference results with a computational cost linear in the number of variables,
namely the sum-product algorithm for computing the marginal probabilities and the max-product
algorithm for computing the most probable states. For graphical models with cycles, loopy be-
lief propagation [127, 197] and the tree-reweighted algorithm [189] can be used for approximate
inference.
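As a minimal illustration of exact message passing, the sketch below runs the sum-product algorithm on a small chain MRF, computing node marginals in time linear in the number of variables; the shared edge potential and all names are simplifications for this sketch, not part of the thesis.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product on a chain MRF: a forward and a backward pass of
    messages give exact node marginals in linear time.
    unary: (n, k) node potentials; pairwise: (k, k) edge potential
    shared by all edges (a simplifying assumption for this sketch)."""
    n, k = unary.shape
    fwd, bwd = np.zeros((n, k)), np.zeros((n, k))
    fwd[0] = unary[0]
    for i in range(1, n):                       # forward pass
        fwd[i] = unary[i] * (fwd[i - 1] @ pairwise)
        fwd[i] /= fwd[i].sum()                  # normalize for stability
    bwd[-1] = 1.0
    for i in range(n - 2, -1, -1):              # backward pass
        bwd[i] = pairwise @ (unary[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    marg = fwd * bwd
    return marg / marg.sum(axis=1, keepdims=True)

unary = np.array([[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
pairwise = np.array([[1.0, 0.4], [0.4, 1.0]])   # favors agreement
marginals = chain_marginals(unary, pairwise)
```

For a chain of n variables with k states each, the cost is O(n k²), against O(kⁿ) for brute-force enumeration.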
2.4 Feature and Variable Selection
The dimensionality of real-world machine learning problems keeps increasing, and feature se-
lection becomes a necessary procedure in many applications, resulting in improved performance,
greater efficiency and better interpretability [73]. Features can be selected with different goals,
typically either finding all features relevant to the target class variable (termed all-relevant) or
finding a minimal feature subset optimal for classification (termed minimal-optimal) [133]. Fea-
ture selection algorithms can be categorized into three types: simple filters, filters with redundancy
removal, and wrappers (Figure 2.1). Simple filters assume independence between features and
rank them individually according to some relevance criterion. They address all-relevant problems
and are efficient. Filters with redundancy removal typically first try to identify all the relevant
features, similar to simple filters, and then remove redundant features in a second step [209]. They
address the minimal-optimal problem and require more computation. Wrapper methods iteratively
generate a candidate feature subset and evaluate it by a specific learning algorithm's performance,
until some criterion is satisfied [93]. They target the minimal-optimal problem and are compute-intensive.
When necessary, simple filters can first reduce the dimension by filtering out non-relevant
features before wrappers are used.
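The first category is the simplest to state in code. The sketch below is an illustrative simple filter (the scoring criterion, absolute Pearson correlation with the label, is one choice among many):

```python
import numpy as np

def simple_filter(X, y, k):
    """A 'simple filter': score each feature independently (here by
    absolute Pearson correlation with the class label) and keep the
    top k. No interaction between features is considered."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300).astype(float)
X = rng.normal(size=(300, 6))
X[:, 4] = y + 0.1 * rng.normal(size=300)   # plant one strongly relevant feature
top = simple_filter(X, y, 2)
```

A filter with redundancy removal would add a second pass over `top`, dropping features that are highly correlated with an already-kept feature; a wrapper would instead retrain a classifier on each candidate subset.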
[Figure 2.1 here: workflow diagrams of (a) simple filters (original set, relevance analysis, selected subset), (b) filters with redundancy removal (original set, relevance analysis, relevant subset, redundancy analysis, selected subset), and (c) wrappers (subset generation, subset evaluation, stopping criterion, selected subset).]

Figure 2.1: The workflow of the three different feature selection approaches.
Recently, a variety of feature and variable selection algorithms have appeared in both the statistics
and machine learning communities, such as FCBF [209], Relief [92], DISR [119], and MRMR
[139]. With the rapid increase in the number of features, some approaches focus on high-dimensional or
ultrahigh-dimensional feature selection [194, 44]. One particularly popular family of approaches
is based on penalized least squares or penalized pseudo-likelihood. Specific algorithms include,
but are not restricted to, LASSO [176], SCAD [43], Lars [36], the Dantzig selector [23] and elastic net
[219]. Several recent algorithms also take into account the structure in the covariate space, such as
group lasso [210], fused lasso with a chain structure [177], overlapping group lasso [87, 86] and
graph lasso [86]. However, almost all of the penalized least squares or penalized pseudo-likelihood
feature selection methods (except elastic net) aim to find a minimal feature subset optimal for
regression or classification, which is a minimal-optimal problem. If we instead regard
genome-wide association studies as variable selection problems, the goal of feature selection is to
identify all the features relevant to the response variable, which is an all-relevant problem.
Chapter 3

High-Dimensional Structured Feature Screening Using Markov Random Fields
Feature screening is a useful feature selection approach for high-dimensional data when the goal is
to identify all the features relevant to the response variable. However, common feature screening
methods do not take into account the correlation structure of the covariate space. We propose the
concept of a feature relevance network, a binary Markov random field that represents the relevance
of each individual feature by potentials on the nodes and the correlation structure by potentials
on the edges. By performing inference on the feature relevance network, we can accordingly
select relevant features. The procedure does not yield sparsity, which distinguishes it from
the particularly popular family of feature selection approaches based on penalized least squares or
penalized pseudo-likelihood. We give one concrete algorithm under this framework and show its
superior performance over common feature selection methods in terms of prediction error and
recovery of the truly relevant features on real-world data and synthetic data.
3.1 Introduction
The dimensionality of machine learning problems keeps increasing, and feature selection becomes
a necessary procedure in many applications, resulting in improved performance, greater efficiency
and better interpretability [73]. However, feature selection in many applications becomes more
and more challenging due to both the increasing number of features and the complex correlation
structure among the features. For instance, in genome-wide association studies (GWAS),
researchers are interested in identifying all relevant genetic markers (single-nucleotide polymorphisms,
or SNPs) among millions of candidates, with only hundreds or thousands of samples. Usually
the truly relevant markers are rare and only weakly associated with the response variable. A
screening feature selection procedure is usually the only computationally feasible method at
this dimension, but it is typically unreliable and suffers from a high false-positive rate. On
the other hand, the features are usually correlated with one another. For example in GWAS, most
SNPs are highly correlated with one or more nearby SNPs, with squared Pearson correlation co-
efficients well above 0.8. In the next paragraph, we give a toy example showing that taking into
account the correlation between features can be beneficial.
Suppose that our measured features are correlated because they are all influenced by some
hidden variable. This is often the case in GWAS, where our features are markers that are easy
to measure, but the actual underlying causal genetic variation is not measured. Suppose that our
data are generated from the Bayesian network in Figure 3.1(a). All variables are binary. Hidden
variables are denoted by H1 and H2. H1 is weakly associated with the class variable. H2 is not
associated. Both H1 and H2 have a probability of 0.5 of being 1. Observed variables A and B
are associated with H1. Observed variables C and D are associated with H2. We label the arc
from H1 to A with a 0.8 to denote that A is 0 with probability 0.8 when H1 is 0, and A is 1 with
probability 0.8 when H1 is 1. Under this distribution, the probability that A and the class variable
take the same value is 0.8 × 0.6 + (1 − 0.8) × (1 − 0.6) = 0.56, and it is the same for B. The
probability that H2 takes the same value as the class variable is 0.5, and C and D each agree
with the class variable with probability 0.5. The probability that A and B take the same value is
0.68, and it is the same for C and D.

[Figure 3.1 here: (a) the generating Bayesian network, with edge labels 0.6 (Class to H1) and 0.8 (H1 to A, H1 to B, H2 to C, H2 to D); (b) the same network annotated with sample-based probabilities of agreement with the class variable.]

Figure 3.1: One Bayesian network example.

Suppose that there are more nonassociated
hidden variables than associated ones and we generate a small sample set from this distribution
specified by the Bayesian network. There will be some nonassociated variables (e.g., C) that appear
to be as promising as associated features (e.g., B) if we only look at the sample-based probability
of agreement with the class variable. Suppose that C appears as promising as A and B, with a
probability of 0.56 of agreement with the class variable. In Figure 3.1(b), the numbers on the dotted
edges stand for the sample-based probabilities of agreement with the class variable. Since D is
expected to agree with C with probability 0.68, we expect the sample-based probability
of agreement between D and the class variable to be 0.56 × 0.68 + (1 − 0.56) × (1 − 0.68) ≈ 0.52.
Any screening method that evaluates the features individually will therefore rank A, B and C equally
high. However, in this case we should exploit the information that C is more likely to be
a false positive, because its highly correlated feature D does not appear as relevant as the
highly correlated partners of A and B do. Therefore, we seek a way of taking the correlation
structure into account in this manner during feature selection.
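The agreement probabilities in this example all follow one simple rule and can be checked mechanically (the helper below is illustrative only):

```python
def agree(a, b):
    # If U agrees with V with probability a and V agrees with W with
    # probability b (independently), then U agrees with W with
    # probability a*b + (1 - a)*(1 - b).
    return a * b + (1 - a) * (1 - b)

p_A_class = agree(0.8, 0.6)    # A vs. class, through H1
p_A_B = agree(0.8, 0.8)        # A vs. B, through H1
p_D_class = agree(0.56, 0.68)  # expected D vs. class, given C looks like A
assert round(p_A_class, 2) == 0.56
assert round(p_A_B, 2) == 0.68
assert round(p_D_class, 2) == 0.52
```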
Markov random fields provide a natural way of representing the relevance of each feature
and the correlation structure among the features. The relevance of each feature is represented as a
node that takes values in {0, 1}. The correlation structure among the features is captured by the
potentials on the edges. We can regard the feature selection problem in the original covariate space
as an inference problem on this binary Markov random field which is called a feature relevance
network. Section 3.2 gives a precise description of the feature relevance network and introduces
one feature selection algorithm. Sections 3.3 and 3.4 evaluate the algorithm on synthetic data and
real-world data respectively. We finally conclude in Section 3.5.
3.2 Method
3.2.1 Feature Relevance Network
Suppose that we have a supervised learning problem with d features and n samples (d ≫ n).
A feature relevance network (FRN) is a binary Markov random field on a random vector X =
(X_1, ..., X_d) ∈ {0, 1}^d, described by an undirected graph G(V, E) with node set V and
edge set E. The relevance of feature_i is represented by the state of node_i in V. X_i = 1 represents
that feature_i is relevant to the response variable, whereas X_i = 0 represents that it is not.
Correlation between X_i and X_j is denoted by an edge connecting node_i and node_j in
E. The potential on node_i, φ(X_i), depicts the relative probability that feature_i is relevant to the
response variable when feature_i is analyzed individually. The potential on the edge connecting
node_i and node_j, ψ(X_i, X_j), depicts the relative probability that feature_i and feature_j are
relevant to the response variable jointly. For a given FRN, the probability of a given relevance
state x = (x_1, ..., x_d) is

$$P(x) = \frac{1}{Z}\prod_{i=1}^{|V|}\phi(x_i)\prod_{(i,j)\in E}\psi(x_i, x_j) = \frac{1}{Z}\exp\Big(\sum_{i=1}^{|V|}\log\phi(x_i) + \sum_{(i,j)\in E}\log\psi(x_i, x_j)\Big), \qquad (3.1)$$

where Z is a normalization constant and |V| = d.
Performing feature selection with an FRN involves a construction step and an inference step.
            feature_i = 0   feature_i = 1   Total
  Y = 1         u0              u1            u
  Y = 0         v0              v1            v
  Total         n0              n1            n

Table 3.1: Empirical counts at feature_i with a binary response variable Y.
To construct an FRN, one needs to set φ(X_i) for i = 1, ..., |V| and ψ(X_i, X_j) for (i, j) ∈ E. Section
3.2.2 discusses the construction step in detail. In the second step, one has to find the
most probable state (maximum a posteriori, or MAP) of the FRN, and the features can be selected
according to the MAP state. For a binary pairwise Markov random field, finding the MAP state
is equivalent to an energy-minimization problem [21] which can be exactly solved by a
graph-cut algorithm [95]. Section 3.2.3 discusses the inference step in detail.
3.2.2 The Construction Step
In the construction step, we set the potential functions φ(X_i) and ψ(X_i, X_j). Suppose that we are
using hypothesis testing to evaluate the relevance of each individual feature, and we observe the
test statistics S = (S_1, ..., S_d). We assume that the S_i's are independent given X. Suppose that the
probability density function of S_i given X_i = 0 is f_0, and the density of S_i given X_i = 1 is f_1. If
f_0 and f_1 are Gaussian, the model is essentially a coupled mixture of Gaussians model [191]. Here
we give one concrete example. Suppose that we are trying to identify whether a binary feature_i is
relevant to the binary response variable Y ∈ {0, 1}, with the empirical counts from data shown in
Table 3.1.
If we use a two-proportion z-test to test the relevance of feature_i with Y, the test statistic is

$$S_i = \frac{u_1/u - v_1/v}{\sqrt{u_0 u_1/u^3 + v_0 v_1/v^3}}. \qquad (3.2)$$

S_i | X_i = 0 is approximately standard normally distributed. S_i | X_i = 1 is approximately normally
distributed with variance 1 and some nonzero mean δ_i. Many GWAS applications employ
logistic regression followed by a likelihood ratio test to identify associated SNPs. We call this
testing procedure LRLR. In this situation, S_i | X_i = 0 has an asymptotic χ² distribution with 2
degrees of freedom and S_i | X_i = 1 has an asymptotic non-central χ² distribution with 2 degrees
of freedom.
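Computing the statistic of (3.2) from the counts of Table 3.1 takes one line; the counts in the sketch below are made up for illustration.

```python
import math

def two_prop_z(u0, u1, v0, v1):
    """Two-proportion z statistic of equation (3.2), computed from the
    empirical counts of Table 3.1 (u = u0 + u1, v = v0 + v1)."""
    u, v = u0 + u1, v0 + v1
    return (u1 / u - v1 / v) / math.sqrt(u0 * u1 / u**3 + v0 * v1 / v**3)

# a feature whose value 1 is more common among cases (Y = 1: 60/100)
# than among controls (Y = 0: 45/100)
z = two_prop_z(40, 60, 55, 45)
assert abs(z - 2.148) < 0.005   # a moderately large statistic
```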
In the FRN, we only connect a pair of nodes if their corresponding features are correlated.
After specifying the structure of the FRN, we have a parameter learning problem in the Markov
random field. The parameters include φ(X_i) for i = 1, ..., |V| and ψ(X_i, X_j) for (i, j) ∈ E.
We claim that learning all these parameters is extremely difficult and practically unrealistic, for three
reasons. First, parameter learning is difficult by nature in undirected graphical models due to the
global normalization constant Z [190, 198]. Second, there are too many parameters to estimate.
Last but not least, X is latent and we only have one training sample, which is S. Therefore, we
propose a compromise solution as follows. Although this solution looks arbitrary, it can be easily
applied in practice and has an interpretation given in formula (3.9).
The way of setting ψ(X_i, X_j) comes from the observation that the chance that X_i and X_j agree
increases as the magnitude of the correlation between feature_i and feature_j increases. Therefore,
if we can estimate the Pearson correlation coefficient r_ij between feature_i and feature_j, we set

$$\psi(X_i, X_j) = e^{\lambda |r_{ij}|\, I(X_i = X_j)}, \qquad (3.3)$$

where λ (λ > 0) is a tradeoff parameter and I(X_i = X_j) is an indicator variable that indicates
whether X_i and X_j take the same value.
The way of setting φ(X_i) is as follows. We set

$$\phi(X_i) = e^{|X_i - q_i|}, \qquad (3.4)$$

where q_i = 1 − p_i and p_i = P(feature_i is relevant). With the hypothesis test in (3.2), we
usually set p_i to be 1 if the absolute value of the test statistic is greater than or equal to some
threshold ξ, and 0 otherwise. We call the p_i from such a "hard" thresholding method p_i^H, namely

$$p_i^H = \begin{cases} 1, & \text{if } |S_i| \ge \xi, \\ 0, & \text{otherwise.} \end{cases}$$

We can also set p_i by Bayes' rule if we know f_1 and f_0. We call it p_i^B:

$$p_i^B = \frac{1}{\alpha f_0(s_i) + 1}, \qquad (3.5)$$

where

$$\alpha = \frac{P(X_i = 0)}{f_1(s_i)\, P(X_i = 1)}. \qquad (3.6)$$
However, in most cases the parameter δ_i in f_1 is unknown to us. In the two-proportion
z-test in (3.2), δ_i refers to the mean parameter of the Gaussian f_1. In LRLR, δ_i refers to
the non-centrality parameter of the non-central χ² density f_1. We can use its data-driven version δ_i^*.
This step has a flattening effect on the calculation of p_i because it assumes the values of the test statistic
for relevant features are uniformly distributed. Therefore, we introduce an adaptive procedure for
calculating p_i:

$$p_i = \gamma p_i^H + (1 - \gamma) p_i^B, \qquad (3.7)$$

where 0 ≤ γ ≤ 1. We choose ξ in p_i^H to be the test-statistic value that makes p_i^B equal 0.5 in (3.5).
Eventually, we have three parameters in the construction step, namely λ, γ and α. In practice, one
can tune the three parameters by cross-validation.
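For the Gaussian z-test case, the adaptive probability of (3.7) can be sketched as follows; here α is treated as a single tuning constant (as in the experiments later in the chapter) rather than computed from (3.6), and the function name is illustrative.

```python
import math

def node_probability(s, alpha, gamma):
    """p_i = gamma * p_H + (1 - gamma) * p_B from (3.7), for the
    z-test case where f0 is the standard normal density; alpha is
    a tuning constant (alpha > sqrt(2*pi) assumed)."""
    f0 = math.exp(-s * s / 2.0) / math.sqrt(2.0 * math.pi)
    p_b = 1.0 / (alpha * f0 + 1.0)          # Bayes-rule version, (3.5)
    # xi is the |statistic| at which p_b crosses 0.5: alpha * f0(xi) = 1
    xi = math.sqrt(2.0 * math.log(alpha / math.sqrt(2.0 * math.pi)))
    p_h = 1.0 if abs(s) >= xi else 0.0      # hard-threshold version
    return gamma * p_h + (1.0 - gamma) * p_b

assert node_probability(0.0, 1000, 0.5) < 0.01   # null-looking statistic
assert node_probability(5.0, 1000, 0.5) > 0.95   # strongly relevant-looking
```

γ = 1 recovers the pure hard-threshold rule and γ = 0 the pure Bayes rule, matching the flattening discussion above.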
3.2.3 The Inference Step
For a given FRN, we need to find the most probable state, which maximizes the posterior probability
(3.1), so as to select the relevant features. Finding the MAP state of the Markov random
field specified by (3.1) is equivalent to minimizing its corresponding energy function E, which is
defined as

$$E(x) = -\sum_{i=1}^{|V|}\log\phi(x_i) - \sum_{(i,j)\in E}\log\psi(x_i, x_j). \qquad (3.8)$$

As long as $-\log\psi(X_i, X_j)$ is submodular, the energy-minimization problem can be exactly
solved by the graph-cut algorithm on a weighted directed graph F(V′, E′) [95] in polynomial
time. If φ(X_i) and ψ(X_i, X_j) are set as in formulas (3.4) and (3.3), the optimization problem
is

$$\min_x \; \sum_{i=1}^{|V|} |x_i - p_i| + \lambda \sum_{(i,j)\in E} I(x_i \ne x_j)\,|r_{ij}|, \qquad (3.9)$$
which can be interpreted as seeking a state of the FRN with two different goals. The first goal
is that the MAP state is close to the relevance of the features when evaluated individually, which
is implied by the first term. The second goal is that strongly correlated features arrive at the same
state, which is implied by the second term. We can run a max-flow-min-cut algorithm, such as the
push-relabel algorithm [69] or the augmenting path algorithm [51], to find the minimum-weight
cut of this directed graph; a cut is a set of edges whose removal eliminates all paths between the
source and sink nodes. Finally, after we cut the graph, every feature node is either connected to
the source node or connected to the sink node. We select the features that are connected with the
source node.
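The whole inference step can be sketched compactly. The graph construction below follows the standard reduction of a submodular binary energy to an s-t min cut; the max-flow routine is a plain Edmonds-Karp BFS augmentation for readability, not the push-relabel or augmenting-path variants cited above, and all names are illustrative.

```python
from collections import defaultdict, deque

def frn_select(p, corr, lam):
    """MAP state of the energy (3.9) via an s-t min cut.
    p: relevance probabilities p_i; corr: dict (i, j) -> r_ij over the
    FRN edges; lam: the tradeoff parameter lambda."""
    s, t = "S", "T"
    cap = defaultdict(lambda: defaultdict(float))
    for i, pi in enumerate(p):
        cap[s][i] += pi              # cost p_i paid when x_i = 0 (sink side)
        cap[i][t] += 1.0 - pi        # cost 1 - p_i paid when x_i = 1
    for (i, j), r in corr.items():
        cap[i][j] += lam * abs(r)    # cost lam * |r_ij| when x_i != x_j
        cap[j][i] += lam * abs(r)
    while True:                      # Edmonds-Karp: BFS augmenting paths
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= f
            cap[w][u] += f
    reach, queue = {s}, deque([s])   # source side of the min cut -> x_i = 1
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reach:
                reach.add(v)
                queue.append(v)
    return sorted(i for i in range(len(p)) if i in reach)

# two correlated pairs: the confident pair is kept, the weak pair dropped
selected = frn_select([0.9, 0.8, 0.3, 0.1], {(0, 1): 0.9, (2, 3): 0.9}, 1.0)
assert selected == [0, 1]
```

The minimum cut value in this example is 0.7, which matches the minimum of (3.9) at x = (1, 1, 0, 0).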
3.2.4 Related Methods
A variety of feature selection algorithms appear in both the statistics and machine learning com-
munities, such as FCBF [209], Relief [92], DISR [119], MRMR [139], “cat” score [221] and CAR
score [222]. Variables can be selected within SVM [74, 213, 205]. With the rapid increase of fea-
ture size, some approaches focus on high-dimensional or ultrahigh-dimensional feature selection
[194, 44]. One particular popular family of approaches is based on penalized least squares or
penalized pseudo-likelihood. Specific algorithms include but are not restricted to LASSO [176],
SCAD [43], Lars [36], Dantzig selector[23], elastic net [219], adaptive elastic net [220], Bayesian
lasso [78], pairwise elastic net [110], exclusive Lasso [215] and regularization for nonlinear vari-
able selection [151]. Several recent algorithms also take into account the structure in the covariate
space, such as group lasso [210], fused lasso with a chain structure [177], overlapping group lasso
[87, 86], graph lasso [86], group Dantzig selector [104] and EigenNet [144]. However, most of
the penalized least squares or penalized pseudo-likelihood feature selection methods are to find
a minimal feature subset optimal for regression or classification, which is termed the minimal-
optimal problem [133]. However in this chapter, the goal of feature screening is to identify all
the features relevant to the response variable which is termed the all-relevant problem [133]. The
hidden Markov random field model in our FRN has also been used in other problems, such as
image segmentation [26] and gene clustering [185].
3.3 Simulation Experiments
In this section, we generate synthetic data and compare the FRN-based feature selection algorithm
with other feature selection algorithms. We generate binary classification samples with an equal
number (n) of positive samples and negative samples. In order to generate correlated features,
we introduce h hidden Bernoulli random variables H_1, ..., H_h. For each hidden variable H_i, we
generate m observable Bernoulli random variables X_ij (j = 1, ..., m), where X_ij takes the same
value as H_i with probability t_i. We set the first πh hidden variables to be the truly associated
hidden variables, and accordingly we have πhm truly associated observable features, where π is
the prior probability of association. For each associated hidden variable H_i, we set P(H_i = 1) to be
uniformly distributed on the interval [0.01, 0.5]. We also set the relative risk, defined as

$$rr = \frac{P(\mathrm{positive} \mid H_i = 1)}{P(\mathrm{positive} \mid H_i = 0)}. \qquad (3.10)$$

For each nonassociated hidden variable H_i, we also set P(H_i = 1) to be uniformly distributed
on the interval [0.01, 0.5]; this stays the same for the positive samples and negative samples.
[Figure 3.2 here: six ROC panels (TPR vs. FPR) for relative risk ∈ {1.1, 1.2, 1.3} and prior π ∈ {0.025, 0.05}, comparing the two-proportion z-test, the feature relevance network and the elastic net.]

Figure 3.2: ROC curves of two-proportion z-test, FRN and elastic net for different prior probabilities and different relative risks.
One baseline feature screening method is the two-proportion z-test given in formula
(3.2). We rank the features by the P-values from the tests. The other baseline feature selection
method is the elastic net (in the R package "glmnet"). Unlike other penalized least squares or
penalized pseudo-likelihood feature selection methods, the elastic net approach does not select
a sparse subset of features and is usually good at recovering all the relevant features. For the
elastic net penalty, we set α to 0.5, and we use a series of 20 values for λ. For our FRN
energy-minimizing algorithm, we exactly follow formulas (3.5), (3.6) and (3.7).
We choose a series of 20 values for α, and set γ to 0 and λ to 1. Since we have the ground
truth of which features are relevant to the response variable, we can compare the ROC curves and
the precision-recall curves for feature recovery (i.e., we treat associated features as positives).
For the first set of experiments, we set n = 500, h = 1000, m = 5, t_i uniformly distributed
on the interval (0.8, 1.0), π ∈ {0.025, 0.05}, and rr ∈ {1.1, 1.2, 1.3}. Because we have 2 values
for π and 3 values for the relative risk rr, we run the simulation a total of 6 times for the different
combinations of the two parameters. The results are shown in Figure 3.2 and Figure 3.3. When
the relative risk is 1.1, it is difficult for all three algorithms to recover the relevant features. When
the relative risk is 1.2 or 1.3, our FRN algorithm outperforms the two baseline algorithms. The
prior probability of association π does not make much difference to the ROC curves. However, for the
precision-recall curves, a larger π yields higher precision at the same recall value under
the same parameter configuration.

For the second set of experiments, we set n = 500, h = 1000, π = 0.05, rr uniformly
distributed on the interval (1.1, 1.3), m ∈ {2, 5, 10}, and t_i uniformly distributed on the interval
(τ, 1.0), where τ ∈ {0.5, 0.8, 0.9}. Because we have 3 values for m and 3 choices for t_i, we run
the simulation a total of 9 times for the different combinations of the two parameters. The results are
shown in Figure 3.4 and Figure 3.5. When the features have many highly correlated neighbors,
the FRN approach shows an advantage over the ordinary screening method and the elastic net.
However, when the features do not have many neighbors, or when the neighbors are not highly
correlated, the FRN does not help much.
[Figure 3.3 here: six precision-recall panels for relative risk ∈ {1.1, 1.2, 1.3} and prior π ∈ {0.025, 0.05}, comparing the two-proportion z-test, the feature relevance network and the elastic net.]

Figure 3.3: Precision-recall curves of two-proportion z-test, FRN and elastic net for different prior probabilities and different relative risks.
3.4 Real-world Application
3.4.1 Background
A genome-wide association study analyzes genetic variation across the entire human genome,
searching for variations that are associated with a given heritable disease or trait. The GWAS
dataset on breast cancer for our experiment comes from NCI's Cancer Genetic Markers of Susceptibility
website (http://cgems.cancer.gov/data/). We name this dataset the CGEMS data. It includes
[Figure 3.4 here: nine ROC panels (TPR vs. FPR) for m ∈ {2, 5, 10} and t ~ U(τ, 1.0) with τ ∈ {0.5, 0.8, 0.9}, comparing the two-proportion z-test, the feature relevance network and the elastic net.]

Figure 3.4: ROC curves of two-proportion z-test, FRN and elastic net when we choose different correlation structures of covariates.
528,173 SNPs as features for 1,145 patients and 1,142 controls. Details about the data can be
found in the original study [82]. This GWAS also exhibits weak association; the relative risks
of the several identified SNPs are between 1.07 and 1.26 [142]. The reasons for the weak association
are that (i) it is estimated that genetics only accounts for about 27% of breast cancer risk and the
[Figure 3.5 here: nine precision-recall panels for m ∈ {2, 5, 10} and t ~ U(τ, 1.0) with τ ∈ {0.5, 0.8, 0.9}, comparing the two-proportion z-test, the feature relevance network and the elastic net.]

Figure 3.5: Precision-recall curves of two-proportion z-test, FRN and elastic net when we choose different correlation structures of covariates.
rest is caused by environment [102] and (ii) breast cancer and many other diseases are polygenic,
namely the genetic component is spread over multiple genes. Therefore, given equal numbers of
breast cancer patients and controls without breast cancer, the highest predictive accuracy we might
reasonably expect from genetic features alone is about 63.5%, obtainable by correctly predicting
the controls and correctly recognizing 27% of the cancer cases based on genetics. If we select
SNPs which have already been identified as associated with breast cancer by other studies (for example,
one study [142] uses a much larger dataset, which includes 4,398 cases and 4,316 controls, and
confirms its results on 21,860 cases and 22,578 controls), we get a set of 19 SNPs (the closest thing
we have to ground truth for this task). Using these 19 SNPs as input to leading classification
algorithms, such as support vector machines, results in at most a 55% predictive accuracy.
3.4.2 Experiments on CGEMS Data
Since we do not know which SNPs are truly associated, we are unable to evaluate the recovery of
the truly relevant features as we do in Section 3.3. Instead, we compare the performance of
supervised learning when coupled with each feature selection algorithm. The baseline feature
selection methods include (i) logistic regression with likelihood ratio test (LRLR), (ii) FCBF [209],
(iii) Relief [92] and (iv) lasso-penalized logistic regression (LassoLR) [203]. Because SVMs have
been shown to perform particularly well on high-dimensional data such as genetic data [196], we
employ an SVM as the machine learning algorithm with which to test the performance of the feature
selection methods.
All the experiments are run in a stratified 10-fold cross-validation fashion, using the same folds
for each approach, and each feature selection method is paired with a linear SVM. For running
the SVM, we convert the SNP value AA into 1, AB into 0, and BB into −1 where A stands for
the common allele at this locus and B stands for the rare allele. For each fold, the entire training
process (feature selection and supervised learning) is repeated using only the training data in that
fold before predictions are made on the test set of that fold, to ensure a fair evaluation. For all
feature selection approaches, we tune the parameters in a nested cross-validation fashion. In each
training-testing experiment of the 10-fold cross-validation, we have 9 folds for training and 1 fold
for testing. On the 9 folds of training data, we carry out a 9-fold cross-validation (8 folds for
training and 1 fold for tuning) to select the best parameters. Since we have almost equal numbers
of cases and controls, we use accuracy to measure the classification performance for both inner
and outer cross-validation.
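The genotype encoding described above is a one-line mapping; the sketch below is illustrative (the genotype strings and helper name are mine, and the entry for "BA" assumes an unordered heterozygote):

```python
# AA -> 1, AB -> 0, BB -> -1, with A the common and B the rare allele
ENCODING = {"AA": 1, "AB": 0, "BA": 0, "BB": -1}  # "BA": unordered heterozygote

def encode_genotypes(rows):
    """rows: one list of per-locus genotype strings per sample."""
    return [[ENCODING[g] for g in row] for row in rows]

assert encode_genotypes([["AA", "AB", "BB"]]) == [[1, 0, -1]]
```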
We build the FRN based on LRLR; namely, we follow the calculation of p_i in Section 3.2.2.
Then we use formulas (3.4) and (3.3) exactly to set φ(X_i) and ψ(X_i, X_j). α in (3.5)
and (3.6) essentially determines the threshold of the mapping function that maps the test statistic
to the association probability p_i. Our tuning considers 5 values of α, namely 500, 1000, 1500,
2500, and 5000. γ in (3.7) determines the slope of the mapping function. We consider 5 values
of γ, namely 0.0, 0.25, 0.5, 0.75, and 1.0. λ in (3.9) is the tradeoff parameter between fitness
and smoothness. Our tuning considers 4 values of λ, namely 0.25, 0.5, 0.75, and 1.0. Usually, when
there are multiple parameters to tune in supervised learning, one might use grid search. However,
since grid search here would yield 100 parameter configurations in total, it might overfit
the parameters. Instead, we tune the parameters one by one. We first tune α based on the average
performance over the different γ and λ values. With the best α value, we then tune γ based
on the average performance over the different λ values. Finally we tune λ with the selected α and
γ configuration. Computing the correlations between features can result in high run-time
and space requirements when the number of features is large, and general push-relabel algorithms and
augmenting-path algorithms both have O(|V|²|E|) time complexity. For these two reasons,
it is necessary to remove a portion of irrelevant SNPs in a first step to reduce the complexity
when applying the FRN-based feature selection algorithm to this GWAS data. Therefore, in the
experiments on the GWAS data we only keep the top k SNPs based on the individual relevance
measurements. Tuning k might lead to better performance, but since we already have three parameters
to tune for the energy-minimizing algorithm, we fix k at 50,000. For the baseline algorithms, there
is one parameter f, the number of features to select for supervised learning. We tune it with 20
values, namely 50, 100, 150, ..., and 1000.
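The one-by-one tuning scheme can be sketched generically; `evaluate` and the grids below are placeholders for the nested cross-validation accuracy and the parameter lists above.

```python
import itertools
from statistics import mean

def tune_one_by_one(evaluate, grids):
    """Fix parameters one at a time: score each candidate value by the
    average of evaluate() over all settings of the not-yet-fixed
    parameters, then freeze the best value before moving on."""
    chosen = {}
    for name, values in grids.items():
        free = [k for k in grids if k != name and k not in chosen]
        best_value, best_score = None, float("-inf")
        for v in values:
            scores = []
            for combo in itertools.product(*(grids[k] for k in free)):
                params = dict(zip(free, combo), **chosen)
                params[name] = v
                scores.append(evaluate(params))
            score = mean(scores)
            if score > best_score:
                best_value, best_score = v, score
        chosen[name] = best_value
    return chosen

# toy objective with a unique optimum at alpha=1000, gamma=0.5
best = tune_one_by_one(
    lambda p: -(p["alpha"] - 1000) ** 2 - (p["gamma"] - 0.5) ** 2,
    {"alpha": [500, 1000, 1500], "gamma": [0.0, 0.25, 0.5]},
)
assert best == {"alpha": 1000, "gamma": 0.5}
```

This evaluates 3 + 3 = 6 configurations here instead of the full 3 × 3 = 9 grid; the saving grows with the number of parameters, at the risk of missing interactions between them.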
As listed in Table 3.2, linear SVM’s average accuracy is 53.08% when the FRN algorithm is
used. When LRLR, FCBF, Relief and LassoLR are used, linear SVM’s average accuracies are
50.64%, 51.68%, 50.90% and 48.75% respectively. We perform a significance test on the 10
accuracies from the 10-fold cross-validation using a two-sided paired t-test. The FRN algorithm
significantly outperforms the logistic regression with likelihood ratio test algorithm and the lasso
  Alg   LRLR    FCBF    Relief   LassoLR   FRN
  Acc   50.64   51.68   50.90    48.75     53.08
  P     0.021   0.367   0.069    0.007     -

Table 3.2: The classification accuracy (%) of linear SVM coupled with different feature selection methods, namely logistic regression with likelihood ratio test (LRLR), FCBF, Relief, lasso-penalized logistic regression (LassoLR) and feature relevance network (FRN), followed by the P-values from a significance test (two-sided paired t-test) comparing the baseline algorithms with FRN.
penalized logistic regression algorithm at the 0.05 level.
3.4.3 Validating Findings on Marshfield Data
The Personalized Medicine Research Project [115], sponsored by Marshfield Clinic, was used
as the sampling frame to identify 162 breast cancer cases and 162 controls. The project was
reviewed and approved by the Marshfield Clinic IRB. Subjects were selected using clinical data
from the Marshfield Clinic Cancer Registry and Data Warehouse. Cases were defined as women
having a confirmed diagnosis of breast cancer. Both the cases and controls had to have at least
one mammogram within 12 months prior to having a biopsy. The subjects also had DNA samples
that were genotyped using the Illumina HumanHap660 array as part of the eMERGE (electronic
MEdical Records and GEnomics) network [116]. In total, 522,204 SNPs were genotyped after
the quality assurance step. Despite the difference in genotyping chips and the different quality
assurance process, 493,932 of these SNPs also appear in the CGEMS breast cancer data. Due to the
small sample size, it is undesirable to repeat the experimental procedure of Section 3.4.2
on the Marshfield data. However, we can use it to validate the results from the experiment on the
CGEMS data. We apply FRN and LRLR on the CGEMS data, and compare the log odds-ratios of
the SNPs selected by the two approaches on the Marshfield data. The CGEMS dataset was also used
by another study [201], which proposed a novel multi-SNP test approach, the logistic kernel-machine
test (LKM-test), and demonstrated that it outperformed individual-SNP analysis and other
state-of-the-art multi-SNP test approaches such as the genomic-similarity-based test [199] and the
kernel-based test [126]. Based on the CGEMS data, LKM-test identified 10 SNP sets (genes) to be
associated with breast cancer. The 10 SNP sets include 195 SNPs. We set FRN to select the same
number of relevant SNPs on the CGEMS data, and we compare the SNPs identified by LKM-test
and the SNPs identified by FRN on a different real-world GWAS dataset on breast cancer so as to
compare the performance of LKM-test and FRN.
We run FRN and LRLR on the entire CGEMS dataset and validate the selected SNPs on
Marshfield data. For FRN, we tune the parameters similarly via 10-fold cross-validation.
The selected parameters for FRN are α = 1000, γ = 0.5, and λ = 0.75. In total, FRN selected
428 SNPs from the CGEMS data; 393 of them appear in the Marshfield data. We pick the top
423 SNPs selected by LRLR, which also results in 393 SNPs overlapping with the Marshfield data. On
Marshfield data we compare the log odds-ratio of the 393 SNPs selected by FRN and the 393
SNPs selected by LRLR via the quantile-quantile plot (Q-Q plot), which is given in Figure 3.6(a).
On the CGEMS data the LKM-test selected 195 SNPs, 178 of which appear in Marshfield data. To
ensure a fair comparison, we pick 194 of the 428 SNPs selected by FRN using their individual
P-values, which also yields 178 SNPs in Marshfield data. We also compare the log odds-ratio of
the 178 SNPs selected by FRN and the 178 SNPs selected by LKM-test via a Q-Q plot, which is
given in Figure 3.6(b). If the log odds-ratios of the SNPs selected by the two methods come from
the same distribution, the points should lie on the 45-degree line (the red straight lines in the plots).
However, in both plots we observe obvious discrepancies at the tails.
When comparing the log odds-ratio on a different cohort, the top SNPs picked up by FRN appear
to be much more relevant to the disease than the top SNPs selected by either LRLR or LKM-test.
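As an illustrative sketch of the per-SNP statistic being compared here, the log odds-ratio on a validation cohort can be computed from a 2x2 case/control table. The helper below is hypothetical (not the thesis code), and adds a Haldane-Anscombe 0.5 correction to guard against zero cells.

```python
import math

def log_odds_ratio(case_carriers, case_noncarriers,
                   control_carriers, control_noncarriers):
    """Log odds-ratio for one SNP from a 2x2 case/control table.

    A 0.5 (Haldane-Anscombe) correction guards against zero cells.
    """
    a = case_carriers + 0.5
    b = case_noncarriers + 0.5
    c = control_carriers + 0.5
    d = control_noncarriers + 0.5
    return math.log((a * d) / (b * c))

# A SNP enriched in cases has a positive log odds-ratio;
# equal counts give an odds-ratio of 1 (log odds-ratio 0).
lor = log_odds_ratio(60, 102, 40, 122)
```

Computing this quantity for each selected SNP on the independent Marshfield cohort, and sorting the two resulting lists against each other, yields exactly the Q-Q comparison described above.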
3.5 Discussion
We propose the feature relevance network as a further step for feature screening which takes
into account the correlation structure among features. For simulations in Section 3.3, it took a
few hours to finish all runs on a single CPU. For the results in Section 3.4, we finished all runs, including
Figure 3.6: Q-Q plots (a) comparing the log odds-ratios of the SNPs selected by FRN and the SNPs selected by LRLR, and (b) comparing the log odds-ratios of the SNPs selected by FRN and the SNPs selected by LKM-test. The selection of SNPs is done on CGEMS data; the log odds-ratios are calculated on Marshfield data.
tuning parameters, in two weeks in a parallel computing environment (∼20 CPUs). Besides
the computational burden, another drawback is that our algorithm only returns the variables selected
according to the MAP state; it does not provide P-values or other per-variable measures. In this
chapter, the correlation structure among the features is pairwise, which is represented as edges in
an undirected graph. However, there are also other types of correlation structure which one might
want to provide as prior knowledge, such as the features coming from groups (may or may not
overlap), chain structures or tree structures. Representing all these types of correlation structure
with the help of Markov random fields will be one important direction for future research.
In this chapter, the goal of feature screening is to identify all the features relevant to the
response variable, which is termed the all-relevant problem [133], although we also compare the
prediction performance of supervised learning due to the lack of the ground truth in the real-
world GWAS application in Section 3.4. In some other applications, the goal of feature selection
is to find a minimal feature subset optimal for classification or regression, which is termed the
minimal-optimal problem [133]. We do not address the minimal-optimal problem at all in the
present chapter. For solving the minimal-optimal problem in a high-dimensional structured covariate
space, many approaches have been well studied under the lasso framework [176]. Specific
algorithms include but are not restricted to group lasso [210], fused lasso with a chain structure
[177], overlapping group lasso [87, 86], graph lasso [86] and group Dantzig selector [104].
The material in this chapter first appeared in the 15th International Conference on Artificial
Intelligence and Statistics (AISTATS’2012) as follows:
Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside and David
Page. High-Dimensional Structured Feature Screening Using Binary Markov Random Fields. The
15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
The next chapter reformulates this feature selection approach as a multiple testing procedure
that has many elegant properties, including controlling false discovery rate at a specified level and
significantly improving the power of the tests by leveraging dependence.
Chapter 4
Multiple Testing under Dependence via
Parametric Graphical Models
Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between
individual tests remains a challenging and important problem in statistics. With recent
advances in graphical models, it is feasible to use them to perform multiple testing under depen-
dence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled
mixture model. The ground truth of hypotheses is represented by a latent binary Markov random
field, and the observed test statistics appear as the coupled mixture variables. The parameters in
our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm
to infer the posterior probability that each hypothesis is null (termed local index of significance),
and the false discovery rate can be controlled accordingly. Simulations show that the numerical
performance of multiple testing can be improved substantially by using our procedure. We apply
the procedure to a real-world genome-wide association study on breast cancer, and we identify
several SNPs with strong association evidence.
4.1 Introduction
Observations from large-scale multiple testing problems often exhibit dependence. For instance,
in genome-wide association studies, researchers collect hundreds of thousands of highly corre-
lated genetic markers (single-nucleotide polymorphisms, or SNPs) with the purpose of identifying
the subset of markers associated with a heritable disease or trait. In functional magnetic reso-
nance imaging studies of the brain, thousands of spatially correlated voxels are collected while
subjects are performing certain tasks, with the purpose of detecting the relevant voxels. The most
popular family of large-scale multiple testing procedures is the false discovery rate analysis, such
as the p-value thresholding procedures [12, 14, 63], the local false discovery rate procedure [38],
and the positive false discovery rate procedure [166, 167]. However, all these classical multiple
testing procedures ignore the correlation structure among the individual factors, and the question
is whether we can reduce the false non-discovery rate by leveraging the dependence, while still
controlling the false discovery rate in multiple testing.
Graphical models provide an elegant way of representing dependence. With recent advances
in graphical models, especially more efficient algorithms for inference and parameter learning, it
is feasible to use these models to leverage the dependence between individual tests in multiple
testing problems. One influential paper [172] in the statistics community uses a hidden Markov
model to represent the dependence structure, and has shown its optimality under certain conditions
and its strong empirical performance. It is the first graphical model (and the only one so far) used
in multiple testing problems. However, their procedure can only deal with a sequential dependence
structure, and the dependence parameters are homogeneous. In this chapter, we propose a multiple
testing procedure based on a Markov-random-field-coupled mixture model which allows arbitrary
dependence structures and heterogeneous dependence parameters. This extension requires more
sophisticated algorithms for parameter learning and inference. For parameter learning, we design
an EM algorithm with MCMC in the E-step and persistent contrastive divergence algorithm [178]
in the M-step. We use the MCMC algorithm to infer the posterior probability that each hypothesis
is null (termed local index of significance or LIS). Finally, the false discovery rate can be controlled
by thresholding the LIS. Section 4.2 introduces related work and our procedure. Sections 4.3 and
4.4 evaluate our procedure on a variety of simulations, and the empirical results show that the
numerical performance can be improved substantially by using our procedure. In Section 4.5, we
apply the procedure to a real-world genome-wide association study (GWAS) on breast cancer, and
we identify several SNPs with strong association evidence. We finally conclude in Section 4.6.
4.2 Method
4.2.1 Terminology and Previous Work
              Not rejected   Rejected   Total
  Null            N00           N10       m0
  Non-null        N01           N11       m1
  Total             S             R        m

Table 4.1: Classification of tested hypotheses
Suppose that we carry out m tests whose results can be categorized as in Table 4.1. False
discovery rate (FDR), defined as E(N10/R | R > 0) P(R > 0), depicts the expected proportion
of incorrectly rejected null hypotheses [12]. False non-discovery rate (FNR), defined as
E(N01/S | S > 0) P(S > 0), depicts the expected proportion of false non-rejections in those tests
whose null hypotheses are not rejected [62]. An FDR procedure is valid if it controls FDR at a
nominal level, and optimal if it has the smallest FNR among all the valid FDR procedures [172].
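These definitions are easy to make concrete in a few lines of code. The illustrative sketch below (not thesis code) computes the realized false discovery and non-discovery proportions for a single experiment, using the counts of Table 4.1; FDR and FNR are the expectations of these quantities.

```python
def empirical_rates(truth, rejected):
    """Realized FDP and FNP for one experiment (cf. Table 4.1).

    truth[i] = 1 if hypothesis i is non-null; rejected[i] = 1 if rejected.
    """
    n10 = sum(1 for t, r in zip(truth, rejected) if t == 0 and r == 1)
    n01 = sum(1 for t, r in zip(truth, rejected) if t == 1 and r == 0)
    R = sum(rejected)                 # number of rejections
    S = len(truth) - R                # number of non-rejections
    fdp = n10 / R if R > 0 else 0.0   # false discovery proportion
    fnp = n01 / S if S > 0 else 0.0   # false non-discovery proportion
    return fdp, fnp

# One false rejection out of R = 3; one missed non-null out of S = 2.
fdp, fnp = empirical_rates([0, 0, 1, 1, 1], [1, 0, 1, 1, 0])
```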
The effects of correlation on multiple testing have been discussed, under different assumptions,
with a focus on the validity issue [16, 49, 136, 155, 34, 46, 150, 204, 19]. The efficiency issue has
also been investigated [207, 64, 11, 212], indicating that FNR could be decreased by considering
dependence in multiple testing. Several approaches have been proposed, such as dependence kernels
[100], factor models [53] and principal factor approximation [42]. [172] explicitly use a hidden
Markov model (HMM) to represent the dependence structure and analyze the optimality under the
Figure 4.1: The MRF-coupled mixture model for three dependent hypotheses Hi, Hj and Hk with observed test statistics (xi, xj and xk) and latent ground truth (θi, θj and θk). The dependence is captured by potential functions parameterized by φij, φjk and φik, and the coupled mixtures are parameterized by ψ.
compound decision framework [171]. However, their procedure can only deal with sequential
dependence, and it uses only a single dependence parameter throughout. In this chapter, we replace
HMM with a Markov-random-field-coupled mixture model, which allows richer and more flexi-
ble dependence structures. The Markov-random-field-coupled mixture models are related to the
hidden Markov random field models used in many image segmentation problems [214, 26, 28].
4.2.2 The Multiple Testing Procedure
Let x = (x1, ..., xm) be a vector of test statistics from a set of hypotheses (H1, ...,Hm). The
ground truth of these hypotheses is denoted by a latent Bernoulli vector θ = (θ1, ..., θm) ∈
{0, 1}m, with θi = 0 denoting that the hypothesis Hi is null and θi = 1 denoting that the hypoth-
esis Hi is non-null. The dependence among these hypotheses is represented as a binary Markov
random field (MRF) on θ. The structure of the MRF can be described by an undirected graph
G(V, E) with the node set V and the edge set E . The dependence between Hi and Hj is denoted
by an edge in E connecting node i and node j, and the strength of the dependence is parameterized
by the potential function on the edge. The degree of prior belief that Hi is null is captured by the node
potential function (parameterized by πi, 0 < πi < 1). Suppose that the probability density function
of the test statistic xi given θi = 0 is f0, and the density of xi given θi = 1 is f1. Then, x is an
MRF-coupled mixture. The mixture model is parameterized by a parameter set ϑ = (π,φ,ψ),
where π and φ parameterize the binary MRF and ψ parameterizes f0 and f1. For example, if f0 is
the standard normal N(0, 1) and f1 is the noncentered normal N(µ, 1), then ψ only contains the
parameter µ. Figure 4.1 shows the MRF-coupled mixture model for three dependent hypotheses Hi, Hj and
Hk.
In our MRF-coupled mixture model, x is observable, and θ is hidden. With the parameter set
ϑ = (π,φ,ψ), the joint probability density over x and θ is
P(x, θ; π, φ, ψ) = P(θ; π, φ) ∏_{i=1}^{m} P(xi | θi; ψ).   (4.1)
Define the marginal probability that Hi is null given all the observed statistics x under the
parameters in ϑ, Pϑ(θi = 0|x), to be the local index of significance (LIS) for Hi [172]. If we can
accurately calculate the posterior marginal probabilities of θ (or LIS), then we can use a step-up
procedure to control FDR at the nominal level α as follows [172]. We first sort LIS from the
smallest value to the largest value. Suppose LIS(1), LIS(2), ..., and LIS(m) are the ordered LIS,
and the corresponding hypotheses are H(1), H(2), ..., and H(m). Let
k = max{ i : (1/i) ∑_{j=1}^{i} LIS(j) ≤ α }.   (4.2)
Then we reject H(i) for i = 1, ..., k.
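The step-up rule in (4.2) is straightforward to implement. The sketch below (illustrative, not the thesis implementation) sorts the LIS values and rejects the largest prefix whose running mean stays below the nominal level.

```python
def lis_stepup(lis, alpha):
    """Step-up FDR control on local indices of significance (Eq. 4.2).

    Rejects the k hypotheses with the smallest LIS, where k is the
    largest i such that the running mean of the sorted LIS is <= alpha.
    Returns the set of rejected (original) indices.
    """
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    k, running = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        running += lis[idx]
        if running / rank <= alpha:
            k = rank
    return set(order[:k])

# Sorted LIS: 0.01, 0.05, 0.30, ...; running means 0.01, 0.03, 0.12, ...
# so k = 2 and hypotheses 0 and 2 are rejected at alpha = 0.10.
rejected = lis_stepup([0.01, 0.6, 0.05, 0.30, 0.9], alpha=0.10)
```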
Therefore, the key inferential problem that we need to solve is that of computing the posterior
marginal distribution of the hidden variables θi given the test statistics x, namely Pϑ(θi = 0|x),
for i = 1, ...,m. It is a typical inference problem if the parameters in ϑ are known. Section 4.2.3
provides possible inference algorithms for calculating Pϑ(θi = 0|x) for given ϑ. However, ϑ is
usually unknown in real-world applications, and we need to estimate it. Section 4.2.4 provides a
novel EM algorithm for parameter learning in our MRF-coupled mixture model.
4.2.3 Posterior Inference
Now we are interested in calculating Pϑ(θi = 0|x) for a given parameter set ϑ. One popular fam-
ily of inference algorithms is the sum-product family [97], also known as belief propagation [206].
For loop-free graphs, belief propagation algorithms provide exact inference results with a compu-
tational cost linear in the number of variables. In our MRF-coupled mixture model, the structure
of the latent MRF is described by a graph G(V, E). When G is chain structured, the instantiation of
belief propagation is the forward-backward algorithm [8]. When G is tree structured, the instan-
tiation of belief propagation is the upward-downward algorithm [32]. For graphical models with
cycles, loopy belief propagation [127, 197] and the tree-reweighted algorithm [189] can be used
for approximate inference. Other inference algorithms for graphical models include junction trees
[99], sampling methods [60], and variational methods [89]. Recent papers [160, 159] discuss exact
inference algorithms on binary Markov random fields which allow loops. In our simulations, we
use belief propagation when the graph G has no loops. When G has loops (e.g. in the simulations
on genetic data and the real-world application), we use a Markov chain Monte Carlo (MCMC)
algorithm to perform inference for Pϑ(θi = 0|x).
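For concreteness, a minimal Gibbs sampler for the posterior marginals Pϑ(θi = 0|x) might look as follows. This is an illustrative sketch under simplifying assumptions (a single shared node prior pi, a single symmetric edge parameter phi, and f0 = N(0,1), f1 = N(mu,1)); the thesis procedure allows heterogeneous parameters.

```python
import math, random

def gibbs_lis(x, edges, pi, phi, mu, n_iter=2000, burn_in=200, seed=0):
    """Estimate LIS_i = P(theta_i = 0 | x) by Gibbs sampling (a sketch)."""
    rng = random.Random(seed)
    m = len(x)
    nbrs = [[] for _ in range(m)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def log_lik(xi, s):  # Gaussian log-density, unit variance
        mean = mu if s else 0.0
        return -0.5 * (xi - mean) ** 2

    theta = [0] * m
    null_counts = [0] * m
    for it in range(n_iter):
        for i in range(m):
            logp = [0.0, 0.0]
            for s in (0, 1):
                logp[s] = log_lik(x[i], s) + math.log(pi if s == 0 else 1 - pi)
                for j in nbrs[i]:  # agreement potential with neighbors
                    logp[s] += math.log(phi if theta[j] == s else 1 - phi)
            p1 = 1.0 / (1.0 + math.exp(logp[0] - logp[1]))
            theta[i] = 1 if rng.random() < p1 else 0
        if it >= burn_in:
            for i in range(m):
                null_counts[i] += (theta[i] == 0)
    return [c / (n_iter - burn_in) for c in null_counts]

# A chain of five tests; large x_i pushes that LIS toward 0.
lis = gibbs_lis([0.1, 0.2, 3.5, 3.8, 0.0],
                edges=[(0, 1), (1, 2), (2, 3), (3, 4)],
                pi=0.8, phi=0.8, mu=2.5)
```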
4.2.4 Parameters and Parameter Learning
In our procedure, the dependence among these hypotheses is represented by a graphical model on
the latent vector θ parameterized by π and φ, and observed test statistics x are represented by
the coupled mixture parameterized by ψ. In Sun and Cai’s work on HMMs, φ is the transition
parameter and ψ is the emission parameter. One implicit assumption in their work is that the
transition parameter and the emission parameter stay the same for all i (i = 1, ..., m). Our extension
to MRFs also allows us to untie these parameters. In the second set of basic simulations in Section
4.3, we make φ and ψ heterogeneous and investigate how this affects the numerical performance.
In the simulations on genetic data in Section 4.4 and the real-world GWAS application in Section
4.5, we have different parameters for SNP pairs with different levels of correlation.
In our model, learning (π,φ,ψ) is difficult for two reasons. First, learning parameters is diffi-
cult by nature in undirected graphical models due to the global normalization constant [190, 198].
State-of-the-art MRF parameter learning methods include MCMC-MLE [66], contrastive diver-
gence [80] and variational methods [58]. Several new sampling methods with higher efficiency
have been recently proposed, such as persistent contrastive divergence [178], fast-weight con-
trastive divergence [179], tempered transitions [153], and particle-filtered MCMC-MLE [4]. In
our procedure, we use the persistent contrastive divergence algorithm to estimate parameters π
and φ. Another difficulty is that θ is latent and we only have one observed training sample x.
We use an EM algorithm to solve this problem. In the E-step, we run our MCMC algorithm in
Section 4.2.3 to infer the latent θ based on the currently estimated parameters ϑ = (π,φ,ψ).
In the M-step, we run the persistent contrastive divergence (PCD) algorithm [178] to estimate π
and φ from the currently inferred θ. Note that PCD is also an iterative algorithm, and we run it
until it converges in each M-step. In the M-step, we also do a maximum likelihood estimation
of ψ from the currently inferred θ and observed x. We run the EM algorithm until π, φ and ψ
converge. Although this EM algorithm involves intensive computation in both E-step and M-step,
it converges very quickly in our experiments.
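To make the E-step/M-step alternation concrete, the toy sketch below runs EM on an independence version of the mixture (no MRF), so the E-step is exact and the ψ update is closed-form; in the actual procedure the E-step uses MCMC and π, φ are estimated with PCD. All names and data here are illustrative.

```python
import math

def em_mixture(x, n_iter=50):
    """Stripped-down E-step / M-step alternation (independent tests).

    f0 = N(0,1) and f1 = N(mu,1); pi is the null proportion.
    """
    pi, mu = 0.9, 1.0                      # initial guesses
    for _ in range(n_iter):
        # E-step: posterior P(theta_i = 1 | x_i) under current params.
        post = []
        for xi in x:
            l0 = pi * math.exp(-0.5 * xi ** 2)
            l1 = (1 - pi) * math.exp(-0.5 * (xi - mu) ** 2)
            post.append(l1 / (l0 + l1))
        # M-step: weighted maximum likelihood for pi and mu.
        w = sum(post)
        pi = 1 - w / len(x)
        mu = sum(p * xi for p, xi in zip(post, x)) / w
    return pi, mu

# Five null-like statistics near 0, three non-null statistics near 3.
pi_hat, mu_hat = em_mixture([0.1, -0.3, 0.2, 2.9, 3.1, 0.4, 2.8, -0.1])
```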
4.3 Basic Simulations
In the basic simulations, we investigate the numerical performance of our multiple testing ap-
proach on different fabricated dependence structures where we can control the ground truth pa-
rameters. We first simulate θ from P (θ;π,φ) and then simulate x from P (x|θ;ψ) under a variety
of settings of ϑ = (π,φ,ψ). Because we have the ground truth parameters, we have two versions
of our multiple testing approach, namely the oracle procedure (OR) and the data-driven procedure
(LIS). The oracle procedure knows the true parameters ϑ in the graphical models, whereas the
data-driven procedure does not and has to estimate ϑ. The baseline procedures include the BH
procedure [12] and the adaptive p-value procedure (AP) [14, 63] which are compared by [172].
We include another baseline procedure, the local false discovery rate procedure (localFDR) [38].
The adaptive p-value procedure requires a consistent estimate of the proportion of the true null
hypotheses. The localFDR procedure requires a consistent estimate of the proportion of the true
null hypotheses and the knowledge of the distribution of the test statistics under the null and under
the alternative. In our simulations, we endow AP and localFDR with the ground truth values of
these in order to let these baseline procedures achieve their best performance.
In the simulations, we assume that the observed xi under the null hypothesis (namely θi = 0)
is standard-normally distributed and that xi under the alternative hypothesis (namely θi = 1) is
normally distributed with mean µ and standard deviation 1.0. We choose the setup and parameters
to be consistent with [172] when possible. In total, we consider three MRF models, namely a
chain-structured MRF, tree-structured MRF and grid-structured MRF. For chain-MRF, we choose
the number of hypotheses m = 3,000. For tree-MRF, we choose perfect binary trees of height 12,
which yield a total of 8,191 hypotheses. For grid-MRF, we choose the number of rows
and the number of columns to be 100, which yields a total of 10,000 hypotheses. In all
the experiments, we choose the number of replications N = 500, which is the same as in [172].
In total, we have three sets of simulations with different goals as follows.
Basic simulation 1: We stay consistent with [172] in the simulations except that we use the three
MRF models. In all three structures, (θi)_{i=1}^{m} is generated from the MRFs whose potentials on
the edges are

    [  φ    1−φ ]
    [ 1−φ    φ  ].

Therefore, φ only contains the parameter φ, and ψ only includes the parameter µ.
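For a chain, θ can be sampled directly: with the symmetric edge potential above (and a uniform start), the chain MRF is equivalent to a Markov chain in which each θi copies its left neighbor with probability φ. The sketch below is illustrative, not the simulation code used in the thesis.

```python
import random

def sample_chain_theta(m, phi, pi0=0.5, seed=1):
    """Forward-sample theta from a chain MRF with edge potential
    [[phi, 1-phi], [1-phi, phi]] (a sketch; pi0 is an assumed start prior).
    """
    rng = random.Random(seed)
    theta = [1 if rng.random() < 1 - pi0 else 0]
    for _ in range(m - 1):
        prev = theta[-1]
        # Copy the left neighbor with probability phi, flip otherwise.
        theta.append(prev if rng.random() < phi else 1 - prev)
    return theta

theta = sample_chain_theta(3000, phi=0.8)
```

With φ = 0.8, roughly 80% of adjacent pairs agree, which is exactly the dependence the p-value thresholding baselines ignore.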
Basic simulation 2: One assumption in basic simulation 1 is that the parameters φ and µ are
homogeneous in the sense that they stay the same for all i (i = 1, ..., m). This assumption is carried
over from [172]. However, in many real-world applications, the transition parameters can
be different across the multiple hypotheses. Similarly, the test statistics for the non-null hy-
potheses, although normally distributed and standardized, could have different µ values. There-
fore, we investigate the situation where the parameters can vary in different hypotheses. The
simulations are carried out for all three different dependence structures aforementioned. In the
first set of simulations, instead of fixing φ, we choose φ’s uniformly distributed on the interval
(0.8 −∆(φ)/2, 0.8 + ∆(φ)/2). In the second set of simulations, instead of fixing µ, we choose
µ’s uniformly distributed on the interval (2.0 − ∆(µ)/2, 2.0 + ∆(µ)/2). The oracle procedure
knows the true parameters. The data-driven procedure does not know the parameters, and assumes
the parameters are homogeneous.
Basic simulation 3: Another implicit assumption in basic simulation 1 is that each individual test
in the multiple testing problem is exact. Many widely used hypothesis tests, such as Pearson’s
χ2 test and the likelihood ratio test, are asymptotic in the sense that we only know the limiting
distribution of the test statistics for large samples. As an example, we simulate the two-proportion
z-test in this section and show how the sample size affects the performance of the procedures
when the individual test is asymptotic. Suppose that we have n samples (half of them are positive
samples and half of them are negative samples). For each sample, we have m Bernoulli-distributed
attributes. A fraction of the attributes are relevant. If attribute A is relevant, then the probability
of “heads” in the positive samples (p_A^+) is different from that in the negative samples (p_A^−);
p_A^+ and p_A^− are the same if A is not relevant. For each individual test, the null hypothesis is that
the attribute is not relevant, and the alternative hypothesis is otherwise. The two-proportion z-test
can be used to test whether p_A^+ − p_A^− is zero, which yields an asymptotic N(0, 1) under the null
and N(µ, 1) under the alternative (µ is nonzero). In the simulations, we fix µ, but vary the sample size n, and
apply the aforementioned tree-MRF structure (m = 8,191). The oracle procedure and localFDR
only know the limiting distribution of the test statistics and assume the test statistics exactly follow
the limiting distributions even when the sample size is small.
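The individual test statistic in this simulation is easy to reproduce. The sketch below computes the pooled two-proportion z statistic for one attribute (an illustrative helper, not the thesis code).

```python
import math

def two_prop_z(heads_pos, n_pos, heads_neg, n_neg):
    """Two-proportion z statistic for H0: p+ = p- (pooled variance)."""
    p1, p2 = heads_pos / n_pos, heads_neg / n_neg
    p = (heads_pos + heads_neg) / (n_pos + n_neg)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_pos + 1 / n_neg))
    return (p1 - p2) / se

# A relevant attribute: "heads" rate 0.7 in positives vs 0.5 in negatives.
z = two_prop_z(70, 100, 50, 100)
```

Only for large n is z approximately N(0, 1) under the null, which is why the oracle procedure and localFDR, which trust the limiting distribution, become liberal at small sample sizes.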
Figure 4.2 shows the numerical results in basic simulation 1. Figures (1a)-(1f) are for the
chain structure. Figures (2a)-(2f) are for tree structure. Figures (3a)-(3f) are for the grid structure.
In Figures (1a)-(1c), (2a)-(2c) and (3a)-(3c), we set µ = 2 and plot FDR, FNR and the average
number of true positives (ATP) when we vary φ between 0.2 and 0.8. In Figures (1d)-(1f), (2d)-(2f)
Figure 4.2: Comparison of BH (○), AP (△), localFDR (×), OR (+), and LIS (□) in basic simulation 1: (1) chain-MRF, (2) tree-MRF, (3) grid-MRF; (a) FDR vs φ, (b) FNR vs φ, (c) ATP vs φ, (d) FDR vs µ, (e) FNR vs µ, (f) ATP vs µ.
and (3d)-(3f), we set φ = 0.8 and plot FDR, FNR and ATP when we vary µ between 1.0 and 4.0.
The nominal FDR level is set to be 0.10. From Figure 4.2, we can observe comparable numerical
results between the chain structure and tree structure. The FDR levels of all five procedures are
controlled at 0.10 and BH is conservative. From the plots for FNR and ATP, we can observe that
the data-driven procedure performs almost the same as the oracle procedure, and they dominate the
p-value thresholding procedures BH and AP. The oracle procedure and the data-driven procedure
also dominate localFDR except when φ = 0.5, when they perform comparably. This is to be
expected because the dependence structure is no longer informative when φ is 0.5. In this situation
when the hypotheses are independent, our procedure reduces to the localFDR procedure. As φ
departs from 0.5 and approaches either 0 or 1.0, the difference between OR/LIS and the baselines
gets larger. When the individual hypotheses are easy to test (large µ values), the differences
between them are not substantial. When we turn to the grid structure, the numerical performance
is similar to that in the chain structure and the tree structure except for two observations. First, the
data-driven procedure does not appear to control the FDR at 0.1 when µ is small (e.g. µ = 1.0),
although the oracle procedure does, which indicates the parameter estimation in the EM algorithm
is difficult when µ is small. In other words, with a limited number of hypotheses, it is difficult to
estimate the pairwise potential parameters if the test statistics of the non-nulls do not look much
different from the test statistics of the nulls. The second observation is that the slopes of the FNR
curve and ATP curve for the grid structure are different from those in the chain and tree structures.
The reason is that the connectivity in the grid structure is higher than that in the chain and tree.
Therefore we can observe that even when the individual hypotheses are difficult to test (small µ
values), the FNR is still low because each individual hypothesis has more neighbors in the grid
than in the chain or tree, and the neighbors are informative.
Figure 4.3 shows the numerical performance in basic simulation 2. Figures (1a)-(1f), (2a)-(2f),
and (3a)-(3f) correspond to the chain structure, the tree structure and the grid structure respectively.
In Figures (1a)-(1c), (2a)-(2c), and (3a-3c), we set µ = 2 and vary ∆(φ) between 0 and 0.4. In
Figures (1d)-(1f), (2d)-(2f), and (3d)-(3f), we set φ = 0.8 and vary ∆(µ) between 0 and 4.0.
Figure 4.3: Comparison of BH (○), AP (△), localFDR (×), OR (+), and LIS (□) in basic simulation 2: (1) chain-MRF, (2) tree-MRF, (3) grid-MRF; (a) FDR vs ∆(φ), (b) FNR vs ∆(φ), (c) ATP vs ∆(φ), (d) FDR vs ∆(µ), (e) FNR vs ∆(µ), (f) ATP vs ∆(µ).
Again, the nominal FDR level is set to be 0.10. From Figure 4.3, we observe that all five proce-
dures control FDR at the nominal level and BH is conservative when the transition parameter φ is
heterogeneous. However, the data-driven procedure becomes more and more conservative as we
increase the variance of φ in the grid-structure. Nevertheless, the data-driven procedure does not
lose much efficiency compared with the oracle procedure based on FNR and ATP. Both the data-
driven procedure and the oracle procedure dominate the three baselines. When the µ parameter
is heterogeneous, all five procedures are still valid, but the data-driven procedure becomes more
and more conservative as we increase the variance of µ. The data-driven procedure can be more
conservative than the BH procedure when ∆(µ) is large enough. The conservativeness appears
most severe in the grid-structure. However when we look at the FNR and ATP, the data-driven
procedure still dominates BH, AP and localFDR substantially in all the situations, although the
data-driven procedure loses a certain amount of efficiency compared with the oracle procedure
when the variance of µ gets large.
Figure 4.4: Comparison of BH (○), AP (△), localFDR (×), OR (+), and LIS (□) in basic simulation 3: (a) FDR vs n, (b) FNR vs n, (c) ATP vs n.
Figure 4.4 shows the results from basic simulation 3. The oracle procedure and localFDR are
liberal when the sample size is small. This is because when the sample size is small, there exists
a discrepancy between the true distribution of the test statistic and the limiting distribution. Quite
surprisingly, the data-driven procedure stays valid. The reason is that the data-driven procedure
can estimate the parameters from data. The data-driven procedure and the oracle procedure still
have comparable performance and enjoy a much lower level of FNR compared with the baselines.
For all the basic simulations, we set the nominal FDR level to be 0.10. We have also replicated the
basic simulations by setting the nominal level to be 0.05, and similar conclusions can be made.
4.4 Simulations on Genetic Data
Unlike the fabricated dependence structures in the basic simulations in Section 4.3, the depen-
dence structure in the simulations on genetic data in this section is real. We simulate the linkage
disequilibrium structure of a segment on human chromosome 22, and treat a test of whether a SNP
is associated as one individual test. We follow the simulation settings in [201]. We use HAPGEN2
[170] and the CEU sample of HapMap [175] (Release 22) to generate SNP genotype data at each
of the 2, 420 loci between bp 14431347 and bp 17999745 on Chromosome 22. A total of 685 out
of 2, 420 SNPs can be genotyped with the Affymetrix 6.0 array. These are the typed SNPs that we
use for our simulations. Within the overall 2, 420 SNPs, we randomly select 10 SNPs to be the
causal SNPs. All the SNPs on the Affymetrix 6.0 array whose r2 values, according to HapMap,
with any of the causal SNPs are above t are set to be the associated SNPs. In the simulations, we
report results for three different t values, namely 0.8, 0.5 and 0.25. We also simulate three differ-
ent genetic models (additive model, dominant model, and recessive model) with different levels
of relative risk (1.2 and 1.3). In total, we simulate 250 cases and 250 controls. The experiment
is replicated 100 times, and the average result is reported. With the simulated data, we apply
our multiple testing procedure (LIS) and three baseline procedures: the BH procedure, the adap-
tive p-value procedure (AP), and the local false discovery rate procedure (localFDR). Because the
dependence structure is real and the ground truth parameters are unknown to us, we do not have
the oracle procedure in the simulations on genetic data.
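As an illustration of how the associated SNPs are labeled, the r2-thresholding rule above could be sketched as follows (function and variable names are hypothetical; the actual pipeline uses HAPGEN2 genotypes and HapMap r2 values):

```python
def label_associated(r2, causal_idx, t):
    """Mark SNP i as associated when its maximum r2 with any causal SNP exceeds t.
    `r2` is an m x m matrix of pairwise r2 values (hypothetical toy input)."""
    return [max(row[j] for j in causal_idx) > t for row in r2]

# Toy example: 4 SNPs, SNP 0 chosen as the causal SNP, threshold t = 0.8.
r2 = [[1.0, 0.9, 0.3, 0.1],
      [0.9, 1.0, 0.2, 0.1],
      [0.3, 0.2, 1.0, 0.4],
      [0.1, 0.1, 0.4, 1.0]]
assoc = label_associated(r2, [0], t=0.8)  # [True, True, False, False]
```

SNP 1 is labeled associated because its r2 with the causal SNP 0 is 0.9 > 0.8, while SNPs 2 and 3 fall below the threshold.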
With the simulated genetic data, we use two commonly used tests in genetic association stud-
ies, namely two-proportion z-test and Cochran-Armitage’s trend test (CATT) [31, 3, 163, 52] as
the individual tests for the association of each SNP. CATT also yields a test statistic that is asymptotically N(0, 1) under the null and N(µ, 1) under the alternative (µ is nonzero). Therefore, we parameterize ψ = (µ1, σ1²), where µ1 and σ1² are the mean and variance of the test statistics under the alternative.
The graph structure is built as follows. Each SNP becomes a node in the graph. We connect each SNP to the SNP that has the highest r2 value with it. There are in total 490 edges in the
graph. We further categorize the edges into a high correlation edge set Eh (r2 above 0.8), medium
correlation edge set Em (r2 between 0.5 and 0.8) and low correlation edge set El (r2 between 0.25
and 0.5). We have three different parameters (φh, φm, and φl) for the three sets of edges. Then
the density of θ in formula (4.1) takes the form
P(θ; φ) ∝ exp{ Σ_{(i,j)∈Eh} φh I(θi = θj) + Σ_{(i,j)∈Em} φm I(θi = θj) + Σ_{(i,j)∈El} φl I(θi = θj) },    (4.3)
where I(θi = θj) is an indicator variable that indicates whether θi and θj take the same value.
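A minimal sketch of the unnormalized log-density in (4.3), assuming the edges have been grouped by correlation level into three lists (all names are hypothetical illustrations, not the thesis's implementation):

```python
def log_potential(theta, edges_by_level, phi):
    """Unnormalized log-density of (4.3): each edge (i, j) in level s
    contributes phi[s] when theta_i == theta_j, and 0 otherwise."""
    total = 0.0
    for level, edges in edges_by_level.items():
        for (i, j) in edges:
            if theta[i] == theta[j]:
                total += phi[level]
    return total

# Toy example with one edge per correlation level.
theta = [1, 1, 0, 0]
edges = {"h": [(0, 1)], "m": [(1, 2)], "l": [(2, 3)]}
phi = {"h": 2.0, "m": 1.0, "l": 0.5}
val = log_potential(theta, edges, phi)  # 2.0 (edge (0,1)) + 0.5 (edge (2,3)) = 2.5
```

Only the agreeing edges (0, 1) and (2, 3) contribute, mirroring the indicator terms in (4.3); the normalization constant is deliberately omitted, which is why MCMC and PCD are needed for inference and learning.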
In the MCMC algorithm, we run the Markov chain for 20,000 iterations with a burn-in of 100
iterations. In the PCD algorithm, we generate 100 particles. In each iteration of PCD learning,
the particles move forward for 5 iterations (the n parameter in PCD-n). The learning rate in PCD
gradually decreases as suggested by [178]. The EM algorithm converges after about 10 to 20
iterations, which usually take less than 10 minutes on a 3.00GHz CPU.
Figure 4.5 shows the performance of the procedures in the additive models with the homozy-
gous relative risk set to 1.2 and 1.3. The test statistics are from a two-proportion z-test. We
have also replicated the simulations on Cochran-Armitage’s trend test, and the results are almost
the same. In Figure 4.5, table (1) summarizes the empirical FDR and the total number of true
positives (#TP) of our LIS procedure, BH, AP and localFDR (lfdr), in the additive models with
different (homozygous) relative risk levels, when we vary t and when we vary the nominal FDR
level α. We regard a SNP having r2 above t with any causal SNP as an associated SNP, and we
regard a rejection of the null hypothesis for an associated SNP as a true positive. Our LIS pro-
cedure and localFDR are valid while being conservative. BH and AP appear liberal in some of
(1)
                        t = 0.8                     t = 0.5                     t = 0.25
                  LIS    BH     AP     lfdr   LIS    BH     AP     lfdr   LIS    BH     AP     lfdr
rr = 1.2, α = 0.05
             FDR: 0.018  0.059  0.059  0.010  0.018  0.059  0.059  0.010  0.018  0.058  0.058  0.009
             #TP: 12     11     11     1      12     11     11     1      20     18     19     7
rr = 1.2, α = 0.10
             FDR: 0.077  0.089  0.089  0.010  0.077  0.089  0.089  0.010  0.076  0.079  0.079  0.009
             #TP: 13     11     11     1      13     11     11     1      21     20     20     8
rr = 1.3, α = 0.05
             FDR: 0.047  0.044  0.054  0.015  0.047  0.044  0.064  0.005  0.046  0.044  0.064  0.014
             #TP: 16     4      4      1      16     4      4      1      22     10     10     6
rr = 1.3, α = 0.10
             FDR: 0.067  0.104  0.104  0.015  0.067  0.104  0.104  0.005  0.066  0.103  0.103  0.014
             #TP: 18     15     15     1      18     15     15     1      27     21     21     6

[Subfigures (2a)-(2f) and (3a)-(3f): ROC and PR curves at t = 0.8, 0.5 and 0.25; plots not reproduced.]

Figure 4.5: Comparison of BH, AP, localFDR and LIS in the additive models when we vary relative risk rr, t and the nominal FDR level α. Table (1) summarizes results. Subfigures (2a)-(2f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) with rr = 1.2. Subfigures (3a)-(3f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) with rr = 1.3.
the configurations. In all of the circumstances, our LIS procedure identifies more associated SNPs than the baselines. A clue to why our LIS procedure is conservative can be found in the results in Figure 4.3. In basic simulation 2, we observe that when the parameters µ and φ are
heterogeneous and we carry out the data-driven procedure under the homogeneous parameter as-
sumption, the data-driven procedure is conservative. The discrepancy between the nominal FDR
level and the empirical FDR level increases as the parameters move further away from homogene-
ity. Although we assign three different parameters φh, φm, and φl to Eh, Em and El respectively,
the edges within the same set (e.g. El) may still be heterogeneous. The fact that the LIS procedure
recaptures more true positives than the baselines while remaining more conservative in many con-
figurations indicates that the local indices of significance provide a ranking more efficient than the
ranking provided by the p-values from the individual tests. Therefore, we further plot the ROC
curves and precision-recall (PR) curves when we rank SNPs by LIS and by the p-values from the
two-proportion z-test. The ROC curve and PR curve are vertically averaged from 100 replications.
Subfigures (2a)-(2f) are for the additive model with homozygous relative risk level set to be 1.2.
Subfigures (3a)-(3f) are for the additive model with homozygous relative risk level set to be 1.3.
It is observed that the curves from LIS dominate those from the p-values from individual tests in
most places, which further suggests that LIS provides a more efficient ranking of the SNPs than
the individual tests.
Figure 4.6 shows the performance of the procedures in the dominant model and the recessive
model with the homozygous relative risk set to be 1.2. The test statistics are from a two-proportion
z-test. In Figure 4.6, table (1) summarizes the empirical FDR and the total number of true pos-
itives (#TP) of our LIS procedure, BH, AP and localFDR (lfdr) in the dominant model and the
recessive model when we vary t and when we vary the nominal FDR level α. Our LIS proce-
dure and localFDR are valid while being conservative in all configurations, and they appear more
conservative in the recessive model than in the dominant model. On the other hand, BH and AP
appear liberal in the recessive model. Our LIS procedure still confers an advantage over the base-
lines in the dominant model. The LIS procedure also recaptures almost the same number of true
positives as BH and AP while maintaining a much lower FDR in the recessive model. Again,
we further plot the ROC curves and precision-recall curves when we rank SNPs by LIS and by
the p-values from individual tests. Subfigures (2a)-(2f) are for the dominant model. Subfigures
(3a)-(3f) are for the recessive model. It is also observed that the curves from LIS dominate those
from the p-values from individual tests in most places, which also suggests that LIS provides a
more efficient ranking.
4.5 Real-world Application
Our primary GWAS dataset on breast cancer is the CGEMS dataset. Details about the CGEMS dataset are provided in Subsection 3.4.1. Our secondary GWAS dataset comes from Marshfield Clinic. Details about the Marshfield Clinic dataset are provided in Subsection 3.4.3.
We apply our multiple testing procedure on the CGEMS data. The settings of the procedure are
the same as in the simulations on genetic data in Section 4.4. The individual test is the two-proportion
z-test. Our procedure reports 32 SNPs with LIS value of 0.0 (an estimated probability 1.0 of
being associated). We further calculate the per-allele odds-ratio of these SNPs on the Marshfield
data, and 14 of them have an odds-ratio around 1.2 or above. There are two clusters among
them. First, rs3870371, rs7830137 and rs920455 (on chromosome 8) are located near each other
and near the gene hyaluronan synthase 2 (HAS2) which has been shown to be associated with
invasive breast cancer by many studies [182, 101, 17]. The other cluster includes rs11200014,
rs2981579, rs1219648, and rs2420946 on chromosome 10. They are exactly the 4 SNPs reported
by [82]. Their associated gene FGFR2 is also well known to be associated with breast cancer.
SNP rs4866929 on chromosome 5 is also very likely to be associated because it is highly correlated
(r2=0.957) with SNP rs981782 (not included in our data) which was identified from a much larger
dataset (4,398 cases and 4,316 controls and a follow-up confirmation stage on 21,860 cases and
22,578 controls) by [33].
(1)
                        t = 0.8                     t = 0.5                     t = 0.25
                  LIS    BH     AP     lfdr   LIS    BH     AP     lfdr   LIS    BH     AP     lfdr
Dominant, α = 0.05
             FDR: 0.026  0.040  0.040  0.010  0.026  0.040  0.040  0.010  0.025  0.039  0.039  0.009
             #TP: 14     4      4      2      14     4      4      2      21     10     10     7
Dominant, α = 0.10
             FDR: 0.051  0.079  0.089  0.010  0.048  0.079  0.109  0.010  0.044  0.079  0.109  0.009
             #TP: 20     12     12     3      22     12     12     3      33     19     29     18
Recessive, α = 0.05
             FDR: 0.009  0.079  0.079  0.009  0.009  0.079  0.079  0.009  0.009  0.079  0.079  0.009
             #TP: 11     11     11     11     11     11     11     11     18     17     18     17
Recessive, α = 0.10
             FDR: 0.018  0.104  0.104  0.009  0.018  0.104  0.114  0.009  0.017  0.104  0.114  0.009
             #TP: 11     12     12     11     11     12     12     11     22     21     21     17

[Subfigures (2a)-(2f) and (3a)-(3f): ROC and PR curves at t = 0.8, 0.5 and 0.25; plots not reproduced.]

Figure 4.6: Comparison of BH, AP, localFDR and LIS in the dominant model and the recessive model with different t values and different nominal FDR α values. Table (1) summarizes results. Subfigures (2a)-(2f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) in the dominant model. Subfigures (3a)-(3f) show ROC and PR curves of LIS and individual p-values in the recessive model.
4.6 Discussion
In this chapter, we use an MRF-coupled mixture model to leverage the dependence in multiple
testing problems, and show the improved numerical performance on a variety of simulations and
its applicability in a real-world GWAS problem. A theoretical question of interest is whether this
graphical model based procedure is optimal in the sense that it has the smallest FNR among all
the valid procedures. The optimality of the oracle procedure can be proved under the compound
decision framework [171, 172], as long as an exact inference algorithm exists or an approximate
inference algorithm can be guaranteed to converge to the correct marginal probabilities. The
asymptotic optimality of the data-driven procedure (the FNR yielded by the data-driven procedure
approaches the FNR yielded by the oracle procedure as the number of tests m → ∞) requires
consistent estimates of the unknown parameters in the graphical models. Parameter learning in
undirected models is more complicated than in directed models due to the normalization constant.
To the best of our knowledge, asymptotic properties of parameter learning for hidden MRFs and
MRF-coupled mixture models have not been investigated. Therefore, we cannot prove the asymp-
totic optimality of the data-driven procedure so far, although we can observe its close-to-oracle
performance in the basic simulations.
The material in this chapter first appeared in the 28th Conference on Uncertainty in Artificial
Intelligence (UAI’2012) as follows:
Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside and David
Page. Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-
wide Association Studies. The 28th Conference on Uncertainty in Artificial Intelligence (UAI),
2012.
The graphical model used in this chapter is fully parametric. However in practice, f1 is often
heterogeneous, and cannot be estimated with a simple parametric distribution. The next chapter
proposes a semiparametric graphical model for multiple testing under dependence, which esti-
mates f1 adaptively. This semiparametric approach remains effective at capturing the dependence
among multiple hypotheses, and it exactly generalizes the local FDR procedure [38] and connects
with the BH procedure [12].
Chapter 5
Multiple Testing under Dependence via
Semiparametric Graphical Models
By extending earlier work of Sun and Cai [172], the previous chapter shows that graphical models
can be used to leverage the dependence in large-scale multiple testing problems, with significantly
improved performance [107]. These graphical models are fully parametric and require that we
know the parameterization of f1 — the density function of the test statistic under the alternative
hypothesis. However in practice, f1 is often heterogeneous, and cannot be estimated with a sim-
ple parametric distribution. This chapter proposes a novel semiparametric approach for multiple
testing under dependence, which estimates f1 adaptively. This semiparametric approach exactly
generalizes the local FDR procedure [38] and connects with the BH procedure [12]. A variety
of simulations show that our semiparametric approach outperforms classical procedures which
assume independence and the parametric approaches which capture dependence.
5.1 Introduction
High-throughput computational biology studies, such as gene expression analysis and genome-
wide association studies, often involve large-scale multiple testing problems which exhibit dependence in the sense that whether the null hypothesis of one test is true or not depends on the
ground truth of other tests. Recently, new multiple testing procedures have been proposed with
such dependence explicitly captured by graphical models such as hidden Markov models [172]
and Markov-random-field-coupled mixture models [107]. These graphical models are fully para-
metric, and they assume that we know not only the parameterization form of f0, but also the
parameterization form of f1.¹ Eventually, a fully parametric graphical model is learned, and
the multiple testing problem becomes an inference problem on the graphical model. This para-
metric approach is effective in some simple situations, but the assumptions for f1 often make it
impractical, as discussed next.
A long tradition in hypothesis testing is to derive test statistics and calculate P -values all under
the null hypothesis H0. When testing multiple hypotheses, we control familywise error rate via
Bonferroni correction, or we control false discovery rate via the Benjamini-Hochberg (BH) pro-
cedure [12], both of which are P -value thresholding procedures, and all calculation is done under
H0. Statisticians avoid making assumptions about f1 because the distribution of the test statistic
under H1 sometimes can be difficult to derive. Take for instance a two-proportion z-test, which
tests whether two Bernoulli variables have the same parameter (i.e., P(head) in coin flipping);
the two-proportion z-test is widely used in case-control studies (e.g. comparing the minor allele
frequencies in cases and controls). Under H0 (the two proportions are the same), the test statistic
X asymptotically follows a standard normal N (0, 1). Under H1 (the two proportions are dif-
ferent), X asymptotically follows a standardized non-centered normal N(µ, 1) (µ ≠ 0) where
µ depends on the odds-ratio of this genetic marker. When there are multiple genetic markers to
be tested, f0 remains N (0, 1), but f1 becomes a mixture of Gaussians because these associated
markers can have different odds-ratios and therefore different µ values (i.e. different effect sizes).
In this situation, f1 is no longer a simple parametric distribution.

¹ f0 and f1 are the probability density functions of the test statistic under the null hypothesis H0 and the alternative hypothesis H1, respectively. In the HMM model [172] and the MRF-coupled mixture model [107], f0 and f1 are the emitting probabilities for state 0 and state 1, respectively.

Figure 5.1: Estimated f1 in a real-world genome-wide association study on breast cancer. [Density plot not reproduced.]

In a real-world genome-wide association study on breast cancer, we plot the estimated f1 in Figure 5.1; obviously it is inappropriate to estimate f1 with a simple parametric distribution. Note that this is not a problem for
classical multiple testing procedures such as the BH procedure, whose calculations of P -values
are done under H0, but this is a serious problem for the graphical-model-based procedures which
require f1 to be estimated parametrically. Therefore, the key question is whether we can still make
use of the graphical models to leverage the dependence among the hypotheses without making
assumptions about f1.
In this paper, we propose a semiparametric graphical model to leverage the dependence among
the hypotheses. In our model, f1 is estimated nonparametrically and the remaining parts are esti-
mated parametrically. More algorithmic details are introduced in Section 5.3 after we summarize
the terminology in Section 5.2. Section 5.4 shows that the two widely-used multiple testing pro-
cedures, the BH procedure [12] and the local FDR procedure [38], estimate their parameters in the
same semiparametric way to avoid assumptions about f1. This unification demonstrates that the
most appropriate way of using graphical models to capture the dependence is the semiparametric
model in our paper rather than the fully parametric models [172, 107]. Simulations in Section
5.5 show that our semiparametric approach controls false discovery rate and reduces false non-
discovery rate, compared with the baseline procedures. We apply the procedure to a real-world
genome-wide association study on breast cancer in Section 5.6 and identify a number of genetic
variants.
               RETAINED   REJECTED   TOTAL
H0 IS TRUE     N00        N10        m0
H0 IS FALSE    N01        N11        m1
TOTAL          S          R          m

Table 5.1: Classification of tested hypotheses
5.2 Preliminaries
FDR, FNR, Validity and Efficiency: When we test m hypotheses simultaneously, various out-
comes can be described by Table 5.1 based on their ground truth and whether the hypotheses are re-
jected. False discovery rate (FDR), E(N10/R | R>0) P(R>0), is the expected proportion of incorrectly rejected null hypotheses [12]. False non-discovery rate (FNR), E(N01/S | S>0) P(S>0),
is the expected proportion of false non-rejections in those tests whose null hypotheses are not
rejected [62]. An FDR procedure is valid if it controls FDR at a nominal level α. One valid proce-
dure is more efficient than another if it has a smaller FNR. In multiple testing problems, we would
like to control FDR at the nominal level and reduce FNR as much as possible.
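For a single realized experiment, the false discovery and false non-discovery proportions implied by the counts in Table 5.1 can be computed directly (FDR and FNR are the expectations of these quantities; the function name is illustrative):

```python
def empirical_rates(n00, n01, n10, n11):
    """Realized false discovery / non-discovery proportions from the
    counts of Table 5.1: N10 false rejections among R = N10 + N11
    rejections, N01 false non-rejections among S = N00 + N01 retained."""
    r = n10 + n11  # total rejections R
    s = n00 + n01  # total retained S
    fdp = n10 / r if r > 0 else 0.0
    fnp = n01 / s if s > 0 else 0.0
    return fdp, fnp

# Toy counts: 90 true nulls retained, 5 non-nulls retained,
# 2 nulls rejected, 3 non-nulls rejected.
fdp, fnp = empirical_rates(90, 5, 2, 3)  # (0.4, 5/95)
```

The guards for R = 0 and S = 0 mirror the conditioning on R > 0 and S > 0 in the definitions above.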
Dependence in Multiple Testing: Classical multiple testing procedures usually assume inde-
pendence among the hypotheses. The effects of dependence on multiple testing have been inves-
tigated with a focus on the validity issue, namely how to control FDR at the nominal level when
dependence exists [16, 49, 147, 136, 155, 34, 46, 150, 169, 204, 19]. Despite FDR-control challenges, dependence also brings opportunities for decreasing FNR. This efficiency issue has been
investigated [207, 64, 11, 212], indicating FNR could be decreased by leveraging the dependence
among hypotheses. Several approaches have been proposed, such as dependence kernels [100],
factor models [53] and principal factor approximation [42]. [172] use a hidden Markov model to explicitly leverage chain dependence structures. [107] extend such graphical-model-based
approaches to general dependence structures via a Markov-random-field-coupled mixture model.
Capturing the dependence in multiple testing in such an explicit manner is innovative, but it relies
on the strong assumption that we know the parameterization of f1, which is unrealistic in all but
the simplest situations. Improper assumption of f1 may make the testing procedure too liberal,
e.g. Figure 4 of [172], or conservative, e.g. Figure 3 of [107]. In this paper, we build on the
approach of [107] and take the major step of relaxing this assumption by estimating f1 adaptively.
5.3 Methods
5.3.1 Graphical models for Multiple Testing
Let x = (x1, ..., xm) be a vector of test statistics from hypotheses (H1, ...,Hm) with their ground
truth denoted by a latent Bernoulli vector θ = (θ1, ..., θm) ∈ {0, 1}m, with θi = 0 denoting
that the hypothesis Hi is null and θi = 1 denoting that the hypothesis Hi is non-null. In [107],
the dependence among these hypotheses is represented as a binary Markov random field (MRF)
on θ. The structure of the MRF is assumed to be known, and described by an undirected graph
G(V, E) with the node set V and the edge set E . The dependence between Hi and Hj is denoted
by an edge connecting node i and node j. The strength of dependence is captured by a potential
function (parametrized by φij , 0<φij<1) on this edge. The degree of prior belief that Hi is
null is captured by the node potential function (parametrized by πi, 0<πi<1). Suppose that the
probability density function of the test statistic xi|θi=0 is f0, and the density of xi|θi=1 is f1.
Then (x,θ;π,φ, f0, f1) forms an MRF-coupled mixture model where π and φ are node potential
functions and edge potential functions in the MRF. In the MRF-coupled mixture model, x is
observed, and θ is hidden. We also need to estimate π, φ and f1.²
For the reasons discussed in Section 5.1, it is often difficult to estimate f1 with a simple para-
metric distribution. In order to avoid the f1 assumption, we estimate f1 adaptively via an indirect,
nonparametric way, as introduced in Section 5.3.2. Then we estimate π and φ via a contrastive di-
vergence style algorithm, as introduced in Section 5.3.3. Therefore the graphical model is learned
semiparametrically — f1 is learned nonparametrically and the MRF part is learned by estimating
parameters φ and π. Finally, we perform marginal inference of θ|x with the learned model and
reject hypotheses with a step-up procedure to control FDR, as introduced in Section 5.3.4. Figure 5.2 shows the semiparametric MRF-coupled mixture model for the three dependent hypotheses Hi, Hj and Hk.

² f0 is usually known to us in hypothesis testing.

Figure 5.2: The semiparametric graphical model for hypotheses Hi, Hj and Hk with observed test statistics (xi, xj, xk) and latent ground truth (θi, θj, θk).
5.3.2 Nonparametric Estimation of f1
We cannot directly estimate f1 from observed x because the ground truth θ is hidden. However,
we can estimate f from observed x nonparametrically via kernel density estimation. Therefore,
we can estimate f1 indirectly using the rule of total probability
f(x) = p0 f0(x) + (1 − p0) f1(x),    (5.1)
where p0 is the proportion of null hypotheses. Since we know f0 in advance (e.g. N (0, 1)),
we only need to estimate f and p0 so as to estimate f1.
Estimating p0: We can estimate p0 with the method in [166], namely
p̂0(λ) = W(λ) / ((1 − λ) m),    (5.2)
where λ ∈ [0, 1) is a tuning parameter, and W (λ) is the total number of hypotheses whose
P -values are above λ. The motivation of this estimation is that the P -values of null hypotheses are
uniformly distributed on the interval (0, 1). If we assume all the hypotheses with P -values greater
than λ are from null hypotheses, then W (λ)/(1 − λ) is the total number of null hypotheses.
Therefore the right hand side of (5.2) is an estimate of p0. Obviously, p̂0(λ) over-estimates p0
because there might be nonnull hypotheses whose P -values are greater than λ, especially when
λ is small. Therefore, a bias-variance trade-off presents in the choice of λ — a larger λ value
yields less bias but brings in more variance. [168] showed that the BH procedure coupled with
p̂0(λ) maintains strong control of FDR under mild conditions. In simulations, we test different λ
values, and the results show that the performance of our multiple testing procedure is insensitive
to different choices of λ.
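Estimator (5.2) translates directly into code (a minimal sketch; the function name is hypothetical):

```python
def estimate_p0(pvals, lam=0.5):
    """Estimate (5.2): p0_hat(lam) = W(lam) / ((1 - lam) * m), where
    W(lam) counts the P-values strictly above the tuning parameter lam."""
    m = len(pvals)
    w = sum(p > lam for p in pvals)
    return w / ((1 - lam) * m)

# Toy example: 2 of 4 P-values exceed lam = 0.5, so the estimate is
# 2 / (0.5 * 4) = 1.0, i.e. all hypotheses are estimated to be null.
p0_hat = estimate_p0([0.05, 0.2, 0.6, 0.9], lam=0.5)
```

As the surrounding text notes, the estimate is biased upward; the choice of lam trades this bias against variance.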
Estimating f : Since we can observe all the test statistics x, we can estimate f directly via
kernel density estimation [152]. One may choose any kernel function and bandwidth parameter
as long as they provide a reasonable estimate. A Gaussian kernel would be a natural choice.
Nevertheless in our experiments, we use the Epanechnikov kernel because its computational burden is low and it is optimal in a minimum variance sense [39]. Finally we obtain f̂, the nonparametric estimate of f.
Estimating f1: With the estimated p̂0 and f̂, we estimate f1 as

f̂1(x) = (f̂(x) − p̂0 f0(x)) / (1 − p̂0).    (5.3)
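The two estimation steps can be sketched together: an Epanechnikov kernel density estimate of the mixture density f, followed by the plug-in formula (5.3). The truncation at zero is an addition of this sketch (the subtraction can go slightly negative in finite samples), not something stated in the text:

```python
def epanechnikov_kde(data, h):
    """Return a kernel density estimate of f using the Epanechnikov
    kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1, with bandwidth h."""
    n = len(data)
    def f_hat(x):
        s = 0.0
        for xi in data:
            u = (x - xi) / h
            if abs(u) <= 1.0:
                s += 0.75 * (1.0 - u * u)
        return s / (n * h)
    return f_hat

def estimate_f1(x, f_hat, f0, p0):
    """Indirect plug-in estimate (5.3): f1 = (f - p0 * f0) / (1 - p0),
    truncated at zero so the estimate stays a valid density value."""
    return max(0.0, (f_hat(x) - p0 * f0(x)) / (1.0 - p0))
```

For instance, with a single observation at 0 and bandwidth 1, the KDE evaluates to 0.75 at 0; if f0(0) = 0.5 and p0 = 0.5, equation (5.3) then gives f1(0) = (0.75 − 0.25) / 0.5 = 1.0.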
5.3.3 Parametric Estimation of φ and π
The pairwise potential functions φ and the node potential functions π parametrize the Markov
random field part of the model. In the simulations, we tie all the pairwise potential functions
together, i.e. φ={φ}. In the real-world application in Section 5.6, we assume there are three types
of edges (high correlation edges, medium correlation edges and low correlation edges), and there
are three parameters, φ={φh, φm, φl}, corresponding to the three levels of correlation. We also
tie all the node potentials in both the simulations and the real-world application, i.e. π={π}.
Parameter learning for MRFs is generally difficult due to the partition function. So far, the
state-of-the-art parameter learning algorithms are based on contrastive divergence [80], such as the
persistent contrastive divergence (PCD) algorithm [178]. Contrastive divergence algorithms are
iterative algorithms which gradually update parameters by generating particles based on current
estimates of parameters and then comparing the moments from the particles with the moments
from the data. Contrastive divergence is related to pseudo-likelihood [18] and ratio matching
[83, 84]. However, contrastive divergence algorithms cannot be directly applied to our model
because θ is hidden. Therefore, we modify the PCD algorithm as follows. Suppose we have already generated particles for θ as in the standard PCD algorithm. We further generate the particles for x
using f0 and f1 conditional on the generated particles for θ. Then we update the parameters by
comparing the moments from particles for x and the moments from the observed x.
5.3.4 Inference of θ and FDR Control
After we estimate f1, φ and π, the MRF-coupled mixture model is fully specified, and the next important step is to calculate the posterior probability that Hi is null given all the observed statistics x, namely P(θi=0|x) for i = 1, ...,m. This quantity is termed the local index of significance
(LIS) [172], which reduces to local false discovery rate P (θi=0|xi) when the hypotheses are in-
dependent. In our simulations and the real-world application, we use a Markov chain Monte Carlo
(MCMC) algorithm to perform posterior inference for P (θi=0|x).
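The thesis does not spell the sampler out; the following is one plausible single-site Gibbs sketch under homogeneous, tied parameters, where `pi` plays the role of the prior probability of the null (a simplification of the node potential π assumed only for this sketch) and `phi` rewards agreeing neighbors. LIS is then the fraction of post-burn-in samples with θi = 0:

```python
import math
import random

def norm_pdf(x, mu=0.0, sd=1.0):
    """Normal density, used here for f0 = N(0, 1) and an illustrative f1."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def cond_p_null(i, theta, x, f0, f1, pi, nbrs, phi):
    """Conditional probability that theta_i = 0 given x_i and the current
    states of i's neighbors (nbrs[i]); phi rewards agreement."""
    s0 = math.log(pi) + math.log(f0(x[i])) + phi * sum(theta[j] == 0 for j in nbrs[i])
    s1 = math.log(1 - pi) + math.log(f1(x[i])) + phi * sum(theta[j] == 1 for j in nbrs[i])
    return 1.0 / (1.0 + math.exp(s1 - s0))

def gibbs_lis(x, f0, f1, pi, nbrs, phi, n_iter=2000, burn_in=100, seed=0):
    """Estimate LIS_i = P(theta_i = 0 | x) by averaging Gibbs samples."""
    rng = random.Random(seed)
    m = len(x)
    theta = [0] * m
    null_counts = [0] * m
    for it in range(n_iter):
        for i in range(m):
            p_null = cond_p_null(i, theta, x, f0, f1, pi, nbrs, phi)
            theta[i] = 0 if rng.random() < p_null else 1
        if it >= burn_in:
            for i in range(m):
                null_counts[i] += (theta[i] == 0)
    return [c / (n_iter - burn_in) for c in null_counts]

# Illustrative run: two chained hypotheses, f1 centered at 3.
f1 = lambda v: norm_pdf(v, mu=3.0)
lis = gibbs_lis([3.0, 0.1], norm_pdf, f1, pi=0.5, nbrs=[[1], [0]], phi=0.8)
```

With no neighbors and phi = 0 the conditional reduces to f0(xi)/(f0(xi) + f1(xi)), i.e. the independent-case local false discovery rate mentioned above.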
After we calculate the posterior marginal probabilities of θ (i.e. LIS), we use a step-up pro-
cedure [172] to decide which of the hypotheses should be rejected so as to control FDR at the
nominal level α. We first sort LIS from the smallest value to the largest value. Suppose LIS(1),
LIS(2), ..., and LIS(m) are the ordered LIS, and the corresponding hypotheses are H(1), H(2),...,
and H(m). Let
k = max{ i : (1/i) Σ_{j=1}^{i} LIS(j) ≤ α }.    (5.4)

Then we reject H(i) for i = 1, ..., k.
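The step-up rule (5.4) can be sketched directly (the function name is illustrative):

```python
def step_up_reject(lis, alpha):
    """Step-up rule (5.4): sort LIS ascending; k is the largest rank whose
    running mean of LIS values is at most alpha; reject those k hypotheses."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    total, k = 0.0, 0
    for rank, idx in enumerate(order, start=1):
        total += lis[idx]
        if total / rank <= alpha:
            k = rank
    return set(order[:k])  # indices of rejected hypotheses

# Sorted LIS: 0.01, 0.02, 0.2, 0.5; running means 0.01, 0.015, 0.0767, 0.1825,
# so k = 3 at alpha = 0.1 and hypotheses 0, 2 and 3 are rejected.
rejected = step_up_reject([0.01, 0.5, 0.02, 0.2], alpha=0.1)  # {0, 2, 3}
```

Note that the running mean of the rejected LIS values is itself an estimate of the FDR incurred, which is why thresholding it at α controls FDR.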
5.4 Connections with Classical Multiple Testing Procedures
We show that both the local FDR procedure [38] and the BH procedure [14, 63] can be regarded
as semiparametric graphical models which do not consider dependence among the hypotheses.
The local FDR procedure uses Bayes' Theorem to calculate the posterior probability that Hi is null given its observed test statistic xi, namely

P(Hi is null | Xi = xi) = p0 f0(xi) / (p0 f0(xi) + p1 f1(xi)).    (5.5)
This posterior probability is termed the local false discovery rate [37]. Note that our LIS
reduces to local false discovery rate under the assumption of independence. [37] recommend
using empirical Bayes inference [148] to calculate local false discovery rate as
P(Hi is null | Xi = xi) = p̂0 f0(xi) / f̂(xi),    (5.6)

where f̂ is the empirical density of the test statistic and p̂0 is an estimate of p0. If we use θi
to denote the ground truth of Hi, its local false discovery rate is P (θi = 0|Xi=xi). Therefore,
we can use the graphical model in Figure 5.3(a) to denote it. Obviously, this model is exactly
our semiparametric model in Figure 5.2, except that there are no pairwise potentials capturing
the dependence because the local FDR procedure assumes independence among the hypotheses.
The model for the local FDR procedure is also semiparametric because f1 is nonparametrically
estimated. Also note that the parameter π in our model reduces to the prior parameter p0 in this
simplified model.
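Equation (5.6) translates into a one-line computation; capping the ratio at 1 is an addition of this sketch (the empirical ratio can exceed 1), and the function name is illustrative:

```python
def local_fdr(x, f_hat, f0, p0_hat):
    """Empirical Bayes local false discovery rate (5.6):
    p0_hat * f0(x) / f_hat(x), capped at 1 (an addition of this sketch)."""
    return min(1.0, p0_hat * f0(x) / f_hat(x))

# Toy example: p0_hat = 0.8, f0(x) = 0.2, f_hat(x) = 0.4 gives lfdr = 0.4.
lfdr = local_fdr(1.0, lambda v: 0.4, lambda v: 0.2, 0.8)
```

Here `f_hat` would come from the kernel density estimate of f described in Section 5.3.2 and `p0_hat` from estimator (5.2).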
The following shows that the BH procedure is also a semiparametric model, but the observed
Figure 5.3: The plate presentation of the semiparametric graphical models for (a) the local FDR procedure and (b) the BH procedure.
statistic is modeled by a cumulative distribution function (CDF). Let P(1)<...<P(m) be the ordered
P -values from the m tests and P(0)=0. The BH procedure rejects any hypothesis whose P -value
satisfies P ≤ P ∗ with
P* = max{ P(i) : P(i) ≤ (i/m)(α/p̂0) },    (5.7)
which controls FDR at the level α [12, 166, 62]. The inequality in (5.7) can be rewritten as
p̂0 P(i) / (i/m) ≤ α.    (5.8)
Because a P -value is the CDF of f0 at the value of its test statistic x, and i/m is the empirical
CDF of f at the test statistic of H(i), (5.8) is further rewritten as
p̂0 F0(x) / F̂(x) ≤ α,    (5.9)

where F0 and F are the CDFs of f0 and f respectively, and F̂ is an empirical version of F. Note that the left hand side of (5.9) is also an empirical Bayes inference, similar to (5.6).
Therefore, both the BH procedure and the local FDR procedure can be interpreted as empirical
[Figure 5.4 plots not reproduced: panels (1a)-(1c) and (2a)-(2c) show FDR, FNR and ATP against µ for the Oracle, Semiparametric, BH, local FDR and Parametric procedures.]

Figure 5.4: Performance of the procedures under Model 1 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is chain.
Bayes inference, and the difference is that the BH procedure uses the CDFs whereas the local
FDR procedure uses the density functions. Thus, we can present the BH procedure as the graph-
ical model in Figure 5.3(b). This model is also semiparametric because F1 is nonparametrically
estimated. Therefore, both the local FDR procedure and the BH procedure are semiparametric
graphical models which do not consider dependence among the hypotheses.
5.5 Simulations
We explore the empirical performance of our multiple testing procedure and three baseline pro-
cedures, including the local FDR procedure [38], the BH procedure [14, 63] and the procedure
based on a parametric graphical model [107]. Because we have the ground truth parameters, we
have two versions of our multiple testing approach, namely an oracle procedure and a data-driven
procedure. The oracle procedure knows the true parameters in the graphical model (including φ,
π and f1), whereas the data-driven procedure does not and has to estimate the graphical model in
the semiparametric way introduced in Sections 5.3.2 and 5.3.3. Both the BH procedure and the
local FDR procedure need an estimate of p0; we use the same estimation method as in Section 5.3.2
Figure 5.5: Performance of the procedures under Model 2 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is chain.
for a fair comparison. The local FDR procedure also needs an estimate of f , and we estimate it in
the same way as in our data-driven procedure.
We choose the setup to be consistent with previous work [172, 107] when possible. We con-
sider two dependence structures, namely a chain structure and a grid structure. For the chain
structure, we choose the number of hypotheses m=10,000. For the grid structure, we choose a
100×100 grid, which also yields 10,000 hypotheses. We test two levels of dependence strength,
i.e. φ=0.8 and φ=0.6. We set π to be 0.4. We first simulate the ground truth of the hypotheses θ
from P (θ;φ,π) and then simulate the test statistics x from P (x|θ; f0, f1). We assume that the
observed xi under the null hypothesis (namely θi=0) is from a standard normal N (0, 1). We test
two different models for xi under the alternative hypothesis (namely θi=1) as follows.
Model 1: xi|θi=1 comes from a mixture of Gaussians

(1/3) N(1, 1) + (1/3) N(µ, 1) + (1/3) N(5, 1).   (5.10)
In total, we test nine values for µ, namely 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.2 and 4.6. Different
µ values yield different f1 with different shapes.
Model 2: xi|θi=1 comes from a Gaussian N (µ, 1) and µ has a prior of Gamma(2.0, β)
where β is the scale parameter. We test six different values for β, namely 1.0, 1.2, 1.4, 1.6, 1.8
and 2.0. This model is designed to mimic the common situation in GWAS that common genetic
variants have small effect sizes and rare genetic variants have large effect sizes [113].
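The alternative distributions of both models are straightforward to simulate. The sketch below uses our own helper names and omits sampling θ from the chain or grid MRF itself; it only draws the test statistics given the hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_model1(n, mu):
    """Model 1 alternative (5.10): equal-weight mixture of N(1,1), N(mu,1), N(5,1)."""
    centers = rng.choice([1.0, mu, 5.0], size=n)
    return rng.normal(loc=centers, scale=1.0)

def draw_model2(n, beta):
    """Model 2 alternative: mu_i ~ Gamma(shape=2.0, scale=beta), x_i ~ N(mu_i, 1)."""
    mus = rng.gamma(shape=2.0, scale=beta, size=n)
    return rng.normal(loc=mus, scale=1.0)

def draw_statistics(theta, mu=3.0):
    """Emit test statistics given hidden states theta (0 = null, 1 = alternative):
    N(0,1) under the null, the Model 1 mixture under the alternative."""
    theta = np.asarray(theta)
    x = rng.normal(size=theta.size)  # null draws
    alt = theta == 1
    x[alt] = draw_model1(int(alt.sum()), mu)
    return x
```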
We compare three measures from these procedures. First, we check whether these procedures
are valid, namely whether the FDR yielded from these procedures is controlled at the nominal level
α. The nominal FDR level α is 0.10, which is consistent with the multiple testing literature [35].
Second, we compare the FNR yielded from these procedures. The third measure is the average
number of true positives (ATP) of these procedures. Valid procedures with a lower FNR and a
higher ATP are considered to be more efficient (or powerful). In the simulations, each experiment
is replicated 500 times and the average results are reported.
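For one replication, the three measures reduce to simple counts over the rejection decisions and the ground truth θ; averaging over the 500 replications then gives FDR, FNR and ATP. The function name below is ours.

```python
import numpy as np

def fdp_fnp_tp(reject, theta):
    """False discovery proportion, false non-discovery proportion, and the
    number of true positives for a single replication."""
    reject = np.asarray(reject, dtype=bool)
    theta = np.asarray(theta, dtype=bool)
    fdp = (reject & ~theta).sum() / max(reject.sum(), 1)     # false discoveries / rejections
    fnp = (~reject & theta).sum() / max((~reject).sum(), 1)  # missed signals / acceptances
    tp = int((reject & theta).sum())
    return fdp, fnp, tp
```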
Performance under chain structure: The performance of the five procedures under the chain
dependence structure is shown in Figures 5.4 and 5.5, which correspond to Model 1 and Model
2, respectively. It is observed that all five procedures are valid. The parametric procedure [107]
is conservative, which agrees with the observations in Figure 3(1d) of [107]. Our semiparametric
data-driven procedure, the BH procedure and the local FDR procedure are slightly conservative.
The oracle procedure slightly outperforms the semiparametric data-driven procedure based on the
plots for FNR and ATP. These two completely dominate the three baselines, which indicates the
benefit of leveraging dependence among the hypotheses via the semiparametric graphical model.
We also observe that the advantage of the oracle procedure and our semiparametric data-driven
procedure over the local FDR procedure is larger when φ = 0.8 than when φ = 0.6. The reason
is that as φ decreases from 0.8 to 0.6, the dependence strength among the hypotheses decreases,
and we benefit less from leveraging the dependence. When φ = 0.5, the edge potentials in our
graphical model are no longer informative, the node potentials become the priors in the local
FDR procedure, and our procedure reduces exactly to the local FDR procedure.
Performance under grid structure: The performance of the five procedures under the grid
dependence structure is shown in Figures 5.6 and 5.7, which correspond to Model 1 and Model
Figure 5.6: Performance of the procedures under Model 1 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is grid.
Figure 5.7: Performance of the procedures under Model 2 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is grid.
2, respectively. All five procedures are valid. The parametric procedure is considerably conser-
vative, which agrees with the observations in Figure 3(3d) of [107]. Again, our semiparamet-
ric data-driven procedure significantly outperforms the three baselines in all the configurations,
demonstrating the benefit of leveraging dependence among the hypotheses via the semiparametric
Figure 5.8: Performance of our procedure when λ = 0.2 (dotted lines), 0.5 (dashed lines) and 0.8 (solid lines).
graphical model. The difference between our semiparametric data-driven procedure and the base-
lines is even larger compared with simulations under the chain structure. The reason is that in the
grid structure, each hypothesis has more neighbors than in the chain structure, and we can benefit
more from leveraging the dependence among the hypotheses.
Robustness of λ: In the previous simulations, λ is fixed at 0.8. We test another two values for
λ, namely 0.2 and 0.5, and repeat previous simulations. The performance of our semiparametric
procedure under the chain dependence structure and Model 1 with φ = 0.8 is provided in Figure
5.8. Our data-driven semiparametric procedure remains valid for all three values of λ and is
slightly conservative for most of the configurations; moreover, its FNR and ATP are almost
the same across the three values of λ. Therefore, although our approach needs to pick a λ
parameter, its performance is robust to the choice of λ. The robustness of λ was also observed
in [166]. Sensitivity analyses of λ in the other configurations yield similar observations.
Efficiency of Ranking: Although ranking the hypotheses by the probability that H0 is false is
a secondary goal in multiple testing, readers may wonder how well our semiparametric procedure
performs in terms of ranking the hypotheses. For the oracle procedure, the parametric procedure
[107] and our semiparametric procedure, we rank the hypotheses by the posterior probability that
H0 is false, namely 1 − LIS. For the BH procedure, we use 1 − P-value. For the local FDR procedure, we use
1 − lfdr. Here we plot the ROC curves and PR curves yielded by the five procedures in Figure
5.9 for µ = 1.4 and φ = 0.8 in the chain structure under model 1. We observe that the oracle
Figure 5.9: ROC/PR curves from these procedures.
procedure produces the most efficient ranking, followed by the semiparametric procedure and the
parametric procedure. The rankings yielded by local FDR and BH procedure are less efficient. The
ROC curves and PR curves of these procedures under other configurations show similar behavior.
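The ROC comparison needs only the per-hypothesis scores and the ground truth. A minimal sketch (helper names and the trapezoidal AUC are ours):

```python
import numpy as np

def roc_points(scores, theta):
    """Sweep the decision threshold over the ranked hypotheses (scores such
    as 1 - LIS, 1 - lfdr, or 1 - P-value) and return (FPR, TPR) arrays."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(theta)[order]
    tpr = np.cumsum(labels) / max(labels.sum(), 1)
    fpr = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)
    return fpr, tpr

def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```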
Run Time: In the chain-structure simulations, it took our data-driven procedure about 10
hours to finish the 500 replications sequentially (for one µ value in (5.10)) on one 3GHz CPU. In
the grid-structure simulations, it took our procedure around 30 hours to finish the 500 replications
sequentially (for one µ value in (5.10)) on one 3GHz CPU.
5.6 Application
We apply our procedure to a real-world GWAS on breast cancer [82] which involves 528,173 SNPs
for 1,145 cases and 1,142 controls. In total, we test 528,173 hypotheses, and they are dependent
because SNPs nearby tend to be highly correlated. We query the squared correlation coefficients
(r2 values) among the SNPs from HapMap [175], and build the dependence structure as follows.
Each SNP becomes a node in the graph. For each SNP, we connect it with the SNP having the
highest r2 value with it. We further categorize the edges into a high correlation edge set Eh (r2
above 0.8), a medium correlation edge set Em (r2 between 0.5 and 0.8) and a low correlation edge
set El (r2 between 0.25 and 0.5). We have three parameters (φh, φm, and φl) for the three sets of
edges.
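The graph construction above can be sketched as follows. The data structures and the exact treatment of the category boundaries are our reading of the text; in the actual study the r² values are queried from HapMap, and a SNP's candidate partners would in practice be restricted to nearby SNPs rather than all 528,173 pairs.

```python
def build_dependence_graph(snps, r2):
    """snps: list of SNP ids.  r2: dict mapping a pair (a, b) to its squared
    correlation.  Connect each SNP to the SNP with the highest r2 with it,
    then bucket edges into high (> 0.8), medium (0.5-0.8) and low
    (0.25-0.5) correlation sets; weaker edges are dropped."""
    def lookup(a, b):
        return r2.get((a, b), r2.get((b, a), 0.0))

    edges = set()
    for s in snps:
        best = max((t for t in snps if t != s), key=lambda t: lookup(s, t))
        edges.add(frozenset((s, best)))

    E_h, E_m, E_l = set(), set(), set()
    for e in edges:
        val = lookup(*tuple(e))
        if val > 0.8:
            E_h.add(e)
        elif val > 0.5:
            E_m.add(e)
        elif val >= 0.25:
            E_l.add(e)
    return E_h, E_m, E_l
```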
When we apply our procedure to the dataset, the individual test is a two-proportion z-test.
We set λ = 0.8, and the value of p0 is estimated to be 0.978, which means that about 2.2% of the
SNPs are associated with breast cancer. The estimated f1 in this study is plotted in Figure 5.1. The
whole experiment takes around 30 hours on a single processor. Our procedure reports 20 SNPs
with LIS values below 0.01. There are five clusters covering 18 of them. All 18 SNPs have very
small P-values from the two-proportion z-test, and those in the same cluster are located near one another.
The first cluster on Chr2, the cluster on Chr4, the cluster on Chr9 and the cluster on Chr10 are
identified in [82] and [156]. The second cluster on Chr2 is associated with a telomere, and telomeres
are known to be related to breast cancer [174]. We further use a second cohort to validate the 18
SNPs, and 16 of them show a moderate level of association on the second cohort. We also would
like to mention that there is some work on estimating less conservative significance thresholds for
controlling family-wise error rate in GWAS [154, 76].
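The individual test mentioned above is the standard pooled two-proportion z-test; a sketch (function name ours):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic comparing a minor-allele count x1 out of n1 case
    chromosomes against x2 out of n2 control chromosomes, using the
    pooled estimate of the common proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

The two-sided P-value follows by evaluating the standard normal tail at |z|.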
5.7 Discussion
We propose a novel semiparametric graphical model to leverage the dependence in multiple testing
problems. Although our semiparametric approach may seem incremental over the previous fully
parametric approaches [172, 107] from the viewpoint of graphical models, such a modification is
nontrivial for the multiple testing area, for both methodological and application reasons. From the
methodological standpoint, our semiparametric approach naturally generalizes the local FDR pro-
cedure and connects with the BH procedure — we show that both the BH procedure and the local
FDR procedure estimate their parameters in the same semiparametric way to avoid assumptions
about f1. The methodological unification demonstrates that such a modification is necessary for
multiple testing. From the application aspect, our semiparametric approach no longer requires the
investigators to know the parameterization of f1, which is generally unknown in practical
problems. Improper parameterization assumptions for f1 can make the fully parametric approach either
too liberal, which makes the procedure invalid, or too conservative, which makes the procedure lose
power, as illustrated by both our simulations and previous work [172, 107]. Our semiparametric
approach better controls FDR and is more powerful. For these reasons, we suggest that investi-
gators choose the semiparametric approach for their large-scale multiple testing problems if (i)
they speculate that there exists dependence among the hypotheses, and (ii) there is no suitable
parametric distribution for f1.
The material in this chapter first appeared in the 31st International Conference on Machine
Learning (ICML’2014) as follows:
Jie Liu, Chunming Zhang, Elizabeth Burnside and David Page. Multiple Testing under De-
pendence via Semiparametric Graphical Models. The 31st International Conference on Machine
Learning (ICML), 2014.
In both the semiparametric graphical model in this chapter and the fully parametric graphical
model in last chapter, we assume that φi’s in the pairwise potential functions are homogeneous.
However, in large-scale graphical models, φi’s can be heterogeneous. There are two situations.
First, there is some background knowledge about how these parameters may change; for example
if the HapMap [175] resource shows an r2 of 0.99 between two SNPs, this background knowledge
provides some evidence that the parameterization should make them more likely to take the same
value than if the r2 were, say, 0.8. Second, these parameters are latently tied; for example, pairs
of SNPs, and consequently the parameters on the pairs, might naturally cluster into four groups–
highly-correlated, intermediately-correlated, weakly-correlated, and uncorrelated–or some other
number of groups based on correlation. Chapter 6 deals with the first situation, namely capturing
the heterogeneity within parameter learning in hidden Markov random fields with the help of
background knowledge. Chapter 7 deals with the second situation, namely estimating latently-
grouped parameters in undirected graphical models.
Chapter 6
Learning Heterogeneous Hidden
Markov Random Fields
Hidden Markov random fields (HMRFs) are conventionally assumed to be homogeneous in the
sense that the potential functions are invariant across different sites. However in some biological
applications, it is desirable to make HMRFs heterogeneous, especially when there exists some
background knowledge about how the potential functions vary. We formally define heterogeneous
HMRFs and propose an EM algorithm whose M-step combines a contrastive divergence learner
with a kernel smoothing step to incorporate the background knowledge. Simulations show that
our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning
methods. We learn a heterogeneous HMRF in a real-world study.
6.1 Introduction
Hidden Markov models (HMMs) and hidden Markov random fields (HMRFs) are useful ap-
proaches for modelling structured data such as speech, text, vision and biological data. HMMs
and HMRFs have been extended in many ways, such as the infinite models [9, 54, 29], the factorial
models [67, 90], the high-order models [98] and the nonparametric models [81, 164]. HMMs are
homogeneous in the sense that the transition matrix stays the same across different sites. HM-
RFs, intensively used in image segmentation tasks [214, 26, 28], are also homogeneous. The
homogeneity assumption for HMRFs in image segmentation tasks is legitimate, because people
usually assume that the neighborhood system on an image is invariant across different regions.
However, it is necessary to bring heterogeneity to HMMs and HMRFs in some biological applica-
tions where the correlation structure can change over different sites. For example, a heterogeneous
HMM is used for segmenting array CGH data [114], and the transition matrix depends on some
background knowledge, i.e. some distance measurement which changes over the sites. A hetero-
geneous HMRF is used to filter SNPs in genome-wide association studies [108], and the pairwise
potential functions depend on some background knowledge, i.e. some correlation measure be-
tween the SNPs which can be different between different pairs. In both of these applications,
the transition matrix and the pairwise potential functions are heterogeneous and are parameter-
ized as monotone parametric functions of the background knowledge. Although the algorithms
tune the parameters in the monotone functions, there is no justification that the parameterization
of the monotone functions is correct. Can we adopt the background knowledge about these het-
erogeneous parameters adaptively during HMRF learning, and recover the relation between the
parameters and the background knowledge nonparametrically?
This chapter is the first to learn HMRFs with heterogeneous parameters by adaptively incor-
porating the background knowledge. It is an EM algorithm whose M-step combines a contrastive
divergence style learner with a kernel smoothing step to incorporate the background knowledge.
Details about our EM-kernel-PCD algorithm are given in Section 6.3 after we formally define
heterogeneous HMRFs in Section 6.2. Simulations in Section 6.4 show that our EM-kernel-PCD
algorithm is effective for learning heterogeneous HMRFs and outperforms alternative methods. In
Section 6.5, we learn a heterogeneous HMRF in a real-world genome-wide association study. We
conclude in Section 6.6.
6.2 Models
6.2.1 HMRFs And Homogeneity Assumption
Suppose that X = {0, 1, ...,m−1} is a discrete space, and we have a Markov random field (MRF)
defined on a random vector X ∈ Xd. The conditional independence is described by an undirected
graph G(V,E). The node set V consists of d nodes. The edge set E consists of r edges. The
probability of x from the MRF with parameters θ is
P(x; θ) = Q(x; θ)/Z(θ) = (1/Z(θ)) ∏_{c∈C(G)} φc(x; θc),   (6.1)
where Z(θ) is the normalizing constant. Q(x;θ) is some unnormalized measure with C(G) being
some subset of the cliques in G. The potential function φc is defined on the clique c and is
parameterized by θc. For simplicity in this chapter, we consider pairwise MRFs, whose potential
functions are defined on the edges, namely |C(G)| = r. We further assume that each pairwise
potential function is parameterized by a single parameter, i.e. θc = {θc}.
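For small graphs, (6.1) can be evaluated directly by enumerating Z(θ). The sketch below (function name ours) uses an agreement-style potential, θc when the two endpoints agree and 1 − θc otherwise, which is the form used in Example 1 below; enumeration is feasible only for small d.

```python
import itertools

def mrf_prob(x, edges, thetas, m=2):
    """Probability (6.1) of configuration x under a pairwise MRF whose
    potential on edge (u, v) is theta if x_u == x_v and 1 - theta otherwise.
    Z(theta) is computed by brute-force enumeration of all m^d states."""
    d = len(x)

    def q(state):
        val = 1.0
        for (u, v), th in zip(edges, thetas):
            val *= th if state[u] == state[v] else 1.0 - th
        return val

    Z = sum(q(s) for s in itertools.product(range(m), repeat=d))
    return q(tuple(x)) / Z
```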
A hidden Markov random field [214, 26, 28] consists of a hidden random field X ∈ Xd and
an observable random field Y ∈ Yd where Y is another space (either continuous or discrete).
The random field X is a Markov random field with density P (x;θ), as defined in Formula (6.1),
and its instantiation x cannot be measured directly. Instead, we can observe the emitted random
field Y with its individual dimension Yi depending on Xi for i = 1, ..., d, namely P(y|x; ϕ) =
∏_{i=1}^{d} P(yi|xi; ϕ), where ϕ = {ϕ0, ..., ϕm−1} and ϕxi parameterizes the emitting distribution of
Yi under the state xi. Therefore, the joint probability of x and y is

P(x, y; θ, ϕ) = P(x; θ) P(y|x; ϕ) = (1/Z(θ)) ∏_{c∈C(G)} φc(x; θc) ∏_{i=1}^{d} P(yi|xi; ϕ).   (6.2)
Example 1: One pairwise HMRF model with three latent variables (X1, X2, X3) and three
observable variables (Y1, Y2, Y3) is given in Figure 6.1. Let X = {0, 1}. X1, X2 and X3 are
Figure 6.1: The pairwise HMRF model with three latent nodes (X1,X2,X3) and observable nodes(Y1, Y2, Y3) with parameters θ = {θ1, θ2, θ3} and ϕ = {ϕ0, ϕ1}.
connected by three edges. The pairwise potential function φi on edge i (connecting Xu and
Xv) parameterized by θi (0 < θi < 1) is φi(X; θi) = θi^I(Xu=Xv) (1 − θi)^I(Xu≠Xv) for i = 1, 2, 3,
where I(·) is an indicator function. Let Y = R. For i = 1, 2, 3, Yi|Xi=0 ∼ N(µ0, σ0²) and
Yi|Xi=1 ∼ N(µ1, σ1²), namely ϕ0 = {µ0, σ0} and ϕ1 = {µ1, σ1}.
In common applications of HMRFs, we observe only one instantiation y which is emitted
according to the hidden state vector x, and the task is to infer the most probable state configuration
of X, or to compute the marginal probabilities of X. In both tasks, we need to estimate the
parameters θ = {θ1, ..., θr} and ϕ = {ϕ0, ..., ϕm−1}. Usually, we seek maximum likelihood
estimates of θ and ϕ which maximize the log likelihood
L(θ, ϕ) = log P(y; θ, ϕ) = log ∑_{x∈X^d} P(x, y; θ, ϕ).   (6.3)
Since we only have one instantiation (x,y), we usually have to assume that θi’s are the same
for i = 1, ..., r for effective parameter learning. This homogeneity assumption is widely used in
computer vision problems because people usually assume that the neighborhood system on an im-
age is invariant across its different regions. Therefore, conventional HMRFs refer to homogeneous
HMRFs, similar to conventional HMMs whose transition matrix is invariant across different sites.
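For the three-node model of Example 1, the log likelihood (6.3) can be evaluated exactly by enumerating the 2³ hidden configurations. A sketch under the triangle structure and Gaussian emissions of Example 1 (function and variable names are ours):

```python
import itertools
import math

EDGES = ((0, 1), (1, 2), (0, 2))

def log_likelihood(y, thetas, phi):
    """log P(y; theta, phi) = log sum_x P(x; theta) P(y|x; phi), as in (6.3),
    for the pairwise HMRF of Figure 6.1.  phi = ((mu0, sigma0), (mu1, sigma1))."""
    def q(x):
        v = 1.0
        for (u, w), th in zip(EDGES, thetas):
            v *= th if x[u] == x[w] else 1.0 - th
        return v

    states = list(itertools.product((0, 1), repeat=3))
    Z = sum(q(x) for x in states)

    def emit(x):
        v = 1.0
        for xi, yi in zip(x, y):
            mu, s = phi[xi]
            v *= math.exp(-0.5 * ((yi - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        return v

    return math.log(sum(q(x) / Z * emit(x) for x in states))
```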
6.2.2 Heterogeneous HMRFs
In a heterogeneous HMRF, the potential functions on different cliques can be different. Taking
the model in Figure 6.1 as an example, θ1, θ2 and θ3 can be different if the HMRF is heteroge-
neous. As with conventional HMRFs, we want to be able to address applications that have one
instantiation (x,y) where y is observable and x is hidden. Therefore, learning an HMRF from
one instantiation y is infeasible if we free all θ’s. To partially free the parameters, we assume that
there is some background knowledge k = {k1, ..., kr} about the parameters θ = {θ1, ..., θr} in the
form of some unknown smooth mapping function which maps θi to ki for i = 1, ..., r. The back-
ground knowledge describes how these potential functions are different across different cliques.
Taking pairwise HMRFs for example, the potentials on the edges with similar background knowl-
edge should have similar parameters. We can regard the homogeneity assumption in conventional
HMRFs as an extreme type of background knowledge that k1 = k2 = ... = kr. The problem we
solve in this chapter is to estimate θ and ϕ which maximize the log likelihood L(θ,ϕ) in Formula
(6.3), subject to the condition that the estimate of θ is smooth with respect to k.
6.3 Parameter Learning Methods
Learning heterogeneous HMRFs in the above manner involves three difficulties: (i) the intractable
Z(θ), (ii) the latent x, and (iii) the heterogeneous θ. The way we handle the intractable Z(θ)
is similar to using contrastive divergence [80] to learn MRFs. We review contrastive divergence
and its variations in Section 6.3.1. To handle the latent x in HMRF learning, we introduce an
EM algorithm in Section 6.3.2, which is applicable to conventional HMRFs. In Section 6.3.3, we
further address the heterogeneity of θ in the M-step of the EM algorithm.
6.3.1 Contrastive Divergence for MRFs
Assume that we observe s independent samples X = {x1,x2, ...,xs} from (6.1), and we want to
estimate θ. The log likelihood L(θ|X) is concave w.r.t. θ, and we can use gradient ascent to find
the MLE of θ. The partial derivative of L(θ|X) with respect to θi is
∂L(θ|X)/∂θi = (1/s) ∑_{j=1}^{s} ψi(xj) − Eθψi = EXψi − Eθψi,   (6.4)
where ψi is the sufficient statistic corresponding to θi, and Eθψi is the expectation of ψi with
respect to the distribution specified by θ. In the i-th iteration of gradient ascent, the parameter
update is
θ(i+1) = θ(i) + η∇L(θ(i)|X) = θ(i) + η(EXψ − Eθ(i)ψ),
where η is the learning rate. However, the exact computation of Eθψi takes time exponential in the
treewidth of G. A few sampling-based methods have been proposed to solve this problem. The key
differences among these methods are how to draw particles and how to compute Eθψ from the
particles. MCMC-MLE [66, 218] uses importance sampling, but might suffer from degeneracy
when θ(i) is far away from θ(1). Contrastive divergence [80] generates new particles in each
iteration according to the current θ(i) and does not require the particles to reach equilibrium, so
as to save computation. Variations of contrastive divergence include particle-filtered MCMC-
MLE [5], persistent contrastive divergence (PCD) [178] and fast PCD [179]. Because PCD is
efficient and easy to implement, we employ it in this chapter. Its pseudo-code is provided in
Algorithm 1. Besides contrastive divergence, MRFs can also be learned via ratio matching [85], non-
local contrastive objectives [184], noise-contrastive estimation [72] and minimum KL contraction
[112].
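As a runnable illustration (not the chapter's implementation), PCD with one Gibbs sweep per update can be exercised on a toy chain MRF with a single tied agreement parameter θ. The sufficient statistic ψ is then the fraction of agreeing consecutive pairs; for a chain, the edge agreements are independent Bernoulli(θ), so the moment-matching fixed point of the update recovers θ. Helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_chain(s, d, theta):
    """Draw s independent samples from the chain MRF with tied agreement
    parameter theta (x_1 uniform; each x_{j+1} equals x_j w.p. theta)."""
    X = np.zeros((s, d), dtype=int)
    X[:, 0] = rng.integers(0, 2, size=s)
    for j in range(1, d):
        keep = rng.random(s) < theta
        X[:, j] = np.where(keep, X[:, j - 1], 1 - X[:, j - 1])
    return X

def gibbs_sweep(states, theta):
    """One systematic Gibbs sweep over all sites of every particle."""
    s, d = states.shape
    for j in range(d):
        w1 = np.ones(s)
        w0 = np.ones(s)
        for nb in (j - 1, j + 1):
            if 0 <= nb < d:
                nb_is_one = states[:, nb] == 1
                w1 *= np.where(nb_is_one, theta, 1 - theta)
                w0 *= np.where(nb_is_one, 1 - theta, theta)
        states[:, j] = (rng.random(s) < w1 / (w0 + w1)).astype(int)
    return states

def pcd(X, T=300, n=1, eta=0.05, n_particles=100):
    """PCD-n as in Algorithm 1: particles persist across iterations and are
    advanced only n Gibbs sweeps per parameter update."""
    def suff(states):  # fraction of agreeing consecutive pairs
        return float((states[:, :-1] == states[:, 1:]).mean())

    target = suff(X)                       # E_X psi
    theta = 0.5
    particles = rng.integers(0, 2, size=(n_particles, X.shape[1]))
    for _ in range(T):
        for _ in range(n):
            particles = gibbs_sweep(particles, theta)
        theta = float(np.clip(theta + eta * (target - suff(particles)), 0.01, 0.99))
    return theta
```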
6.3.2 Expectation-Maximization for Learning Conventional HMRFs
We begin with a lower bound of the log likelihood function, and then introduce the EM algorithm
which handles the latent variables in HMRFs. Let qx(x) be any distribution on x∈Xd. It is well
known that there exists a lower bound of the log likelihood L(θ,ϕ) in (6.3), which is provided by
an auxiliary function F(qx(x), {θ,ϕ}) defined as follows,
Algorithm 1 PCD-n Algorithm [178]
1: Input: independent samples X = {x1, x2, ..., xs} from P(x; θ), maximum iteration number T
2: Output: θ from the last iteration
3: Procedure:
4: Initialize θ(1) and initialize particles
5: Calculate EXψ from X
6: for i = 1 to T do
7:   Advance particles n steps under θ(i)
8:   Calculate Eθ(i)ψ from the particles
9:   θ(i+1) = θ(i) + η(EXψ − Eθ(i)ψ)
10:  Adjust η
11: end for
F(qx(x), {θ, ϕ}) = ∑_{x∈X^d} qx(x) log [P(x, y; θ, ϕ)/qx(x)] = L(θ, ϕ) − KL[qx(x) | P(x|y; θ, ϕ)],   (6.5)

where KL[qx(x)|P(x|y; θ, ϕ)] is the Kullback-Leibler divergence between qx(x) and P(x|y; θ, ϕ),
the posterior distribution of the hidden variables. This Kullback-Leibler divergence is the distance
between L(θ,ϕ) and F(qx(x), {θ,ϕ}).
Expectation-Maximization: We maximize L(θ,ϕ) with an expectation-maximization (EM)
algorithm which iteratively maximizes its lower bound F(qx(x), {θ,ϕ}). We first initialize θ(0)
and ϕ(0). In the t-th iteration, the updates in the expectation (E) step and the maximization (M)
step are
q(t)x = arg max_{qx} F(qx(x), {θ(t−1), ϕ(t−1)})   (E),

θ(t), ϕ(t) = arg max_{θ,ϕ} F(q(t)x, {θ, ϕ})   (M).
In the E-step, we maximize F(qx(x), {θ(t−1),ϕ(t−1)}) with respect to qx(x). Because the
difference between F(qx(x), {θ,ϕ}) and L(θ,ϕ) is KL[qx(x)|P (x|y,θ,ϕ)], the maximizer in
the E-step q(t)x is P (x|y,θ(t−1),ϕ(t−1)), namely the posterior distribution of x|y under the current
estimated parameters θ(t−1) and ϕ(t−1). This posterior distribution can be calculated by Markov
chain Monte Carlo for general graphs.
In the M-step, we maximize F(q(t)x(x), {θ, ϕ}) with respect to {θ, ϕ}, which can be rewritten as

arg max_{θ,ϕ} F(q(t)x(x), {θ, ϕ})
= arg max_{θ,ϕ} ∑_{x∈X^d} q(t)x(x) log P(x, y; θ, ϕ)
= arg max_{θ,ϕ} ∑_{x∈X^d} q(t)x(x) {log P(x; θ) + log P(y|x; ϕ)}.
It is obvious that this function can be maximized with respect to ϕ and θ separately as
θ(t) = arg max_θ ∑_{x∈X^d} q(t)x(x) log P(x; θ),

ϕ(t) = arg max_ϕ ∑_{x∈X^d} q(t)x(x) log P(y|x; ϕ).   (6.6)
Estimating ϕ: Estimating ϕ in this maximum likelihood manner is straightforward, because
the maximization can be rewritten as follows,
arg max_ϕ ∑_{x∈X^d} q(t)x(x) log P(y|x; ϕ) = arg max_ϕ ∑_{i=1}^{d} ∑_{xi∈X} q(t)xi(xi) log P(yi|xi; ϕ),

where q(t)x(x) = ∏_{i=1}^{d} q(t)xi(xi).
Estimating θ: Estimating θ in Formula (6.6) is difficult due to the intractable Z(θ). Some
approaches [214, 26] use pseudo-likelihood [18] to estimate θ in the M-step. It can be shown that
∑_{x∈X^d} q(t)x(x) log P(x; θ) is concave with respect to θ. Therefore, we can use gradient ascent to
find the MLE of θ, which is similar to using contrastive divergence [80] to learn MRFs in Section
6.3.1.
Denote ∑_{x∈X^d} q(t)x(x) log P(x; θ) by LM(θ|q(t)x). The partial derivative of LM(θ|q(t)x) with
respect to θi is

∂LM(θ|q(t)x)/∂θi = ∑_{x∈X^d} q(t)x(x) {ψi(x) − Eθψi}.
Therefore, the derivative here is similar to the derivative in contrastive divergence in Formula
(6.4), except that we reweight by q(t)x. We run the EM algorithm until both θ and ϕ converge.
Note that when learning homogeneous HMRFs with this algorithm, we tie all θ’s all the time,
namely θ = {θ}. Therefore, we name this parameter learning algorithm for conventional HMRFs
the EM-homo-PCD algorithm.
6.3.3 Learning Heterogeneous HMRFs
Learning heterogeneous HMRFs is different from learning conventional homogeneous HMRFs in
two ways. First, we need to free the θ’s in heterogeneous HMRFs. Second, there is some back-
ground knowledge k about how the θ’s are different, as introduced in Section 6.2. Therefore, we
make two modifications to the EM-homo-PCD algorithm in order to learn heterogeneous HMRFs
with background knowledge. First, we estimate the θ’s separately, which obviously brings more
variance in estimation. Second, within each iteration of the contrastive divergence update, we ap-
ply a kernel regression to smooth the estimate of the θ’s with respect to the background knowledge
k. Specifically, in the i-th iteration of PCD update, we advance the particles under θ(i) for n steps,
and calculate the moments Eθ(i)ψ from the particles. Therefore, we can update the estimate as
θ(i+1) = θ(i) + η∇LM(θ(i)|q(t)x).
Then we regress θ(i+1) with respect to k via Nadaraya-Watson kernel regression [129, 195],
and set θ(i+1) to be the fitted values. For ease of notation, we drop the iteration index (i +
1). Suppose that θ̂ = {θ̂1, ..., θ̂r} is the estimate before kernel smoothing; we set the smoothed
estimate θ̄ = {θ̄1, ..., θ̄r} as

θ̄j = ∑_{i=1}^{r} γij θ̂i,   ∀ j = 1, ..., r,

where

γij = K((ki − kj)/h) / ∑_{m=1}^{r} K((km − kj)/h).
For the kernel function K, we use the Epanechnikov kernel, which is usually computationally
more efficient than a Gaussian kernel. We tune the bandwidth h through cross-validation, namely
we select the bandwidth which minimizes the leave-one-out score
(1/r) ∑_{i=1}^{r} ((θ̂i − θ̄i) / (1 − γii))².
Tuning the bandwidth is usually computation-intensive, so we tune it every t0 iterations to
save computation. We name our parameter learning algorithm for heterogeneous HMRFs the
EM-kernel-PCD algorithm. Its pseudo-code is given in Algorithm 2.
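The smoothing and bandwidth-selection steps can be written compactly in matrix form (function names are ours):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, nonzero only on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nw_smooth(theta_hat, k, h):
    """Nadaraya-Watson fit of the raw estimates theta_hat against the
    background knowledge k.  Returns the smoothed estimates and the weight
    matrix gamma, gamma[i, j] = K((k_i - k_j)/h) / sum_m K((k_m - k_j)/h)."""
    k = np.asarray(k, dtype=float)
    u = (k[:, None] - k[None, :]) / h
    K = epanechnikov(u)
    gamma = K / K.sum(axis=0, keepdims=True)  # K(0) > 0, so no zero columns
    return gamma.T @ np.asarray(theta_hat, dtype=float), gamma

def loo_score(theta_hat, k, h):
    """Leave-one-out bandwidth-selection score from Section 6.3.3."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    fit, gamma = nw_smooth(theta_hat, k, h)
    gii = np.diag(gamma)
    return float(np.mean(((theta_hat - fit) / (1.0 - gii)) ** 2))
```

Minimizing `loo_score` over a grid of candidate bandwidths implements the cross-validation step; the score is undefined when a point's window contains only itself (γii = 1), which in practice bounds the admissible bandwidths from below.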
Another intuitive way of handling background knowledge about these heterogeneous param-
eters is to create bins according to the background knowledge and tie the θ’s that are in the
same bin. Suppose that we have b bins after we carefully select the binwidth, namely we have
θ = {θ1, ..., θb}. The rest of the algorithm is the same as the EM-homo-PCD algorithm in Section
6.3.2. We name this parameter learning algorithm via binning the EM-binning-PCD algorithm.
We can also regard our EM-kernel-PCD algorithm as a soft-binning version of EM-binning-PCD.
6.3.4 Geometric Interpretation
Before providing empirical evaluations of the algorithms, we first present an example showing
why adopting the background knowledge helps when we are learning the heterogeneous parame-
Algorithm 2 EM-kernel-PCD Algorithm
1: Input: sample y, background knowledge k, max iteration number T, initial bandwidth h
2: Output: θ from the last iteration
3: Procedure:
4: Initialize θ, ϕ and particles
5: while not converged do
6:   E-step: infer x from y
7:   Calculate Exψ from x
8:   for i = 1 to T do
9:     Advance particles for n steps under θ(i)
10:    Calculate Eθ(i)ψ from the particles
11:    θ(i+1) = θ(i) + η∇LM(θ|q(t)x)
12:    θ(i+1) = kernelRegFit(θ(i+1), k, h)
13:    Adjust η and tune bandwidth h
14:  end for
15:  MLE ϕ from x and y
16: end while
X1 X2 X3
 1  0  0
 1  1  1
 1  1  1
 0  1  1
 1  0  0
 0  0  0
 0  1  1
 0  0  1
 0  0  1
 1  1  0

[Figure 6.2 graphics omitted: the data table above and the log-likelihood surface plot.]

Figure 6.2: Geometric interpretation of the parameter learning algorithms with a small Markov random field model on {X1, X2, X3} parameterized by {θ1, θ2}; we observe ten samples X. The plot on the right is the log likelihood of the parameters L(θ1, θ2|X).
ters via gradient ascent. Suppose that we have a Markov random field on (X1, X2, X3) parame-
terized by θ1 and θ2 (0 < θ1 < 1, 0 < θ2 < 1), and we observe ten samples X generated from the
model, as shown in Figure 6.2. The ground truth is θ1 = 0.65 and θ2 = 0.55, and we have the
background knowledge θ1 > θ2. The plot on the right is the log likelihood L(θ1, θ2|X) which is
concave with respect to (θ1, θ2). If the parameterization of the model is minimal, the global max-
imum is unique. The global maximum is at the point (0.6, 0.7) for our observed data X because
X1 agrees with X2 six times, and X2 agrees with X3 seven times among the ten samples. We can
run the standard contrastive divergence algorithm, namely gradient ascent in the feasible region
{(θ1, θ2)|0 < θ1 < 1, 0 < θ2 < 1} to reach the global maximum point (0.6, 0.7), although it is
far from the ground truth point (0.65, 0.55) due to the small sample size (recall that in HMRFs we
only have one example). If we make the homogeneity assumption θ1 = θ2, we actually perform
the gradient ascent on the blue curve, which is the intersection of the log-likelihood surface and
the hyperplane θ1 = θ2. There is one maximum point on the blue curve and we can also achieve
it; we usually get better gradient information and reach the maximum point faster because we
pool the data. However, this maximum point on the blue curve can be far from the ground truth
if the strong assumption θ1 = θ2 is far from correct (see the performance of the EM-homo-PCD
algorithm in Figure 6.3). Now if we adopt the background knowledge θ1 > θ2, we are oper-
ating in the region {(θ1, θ2)|0 < θ2 < 1, θ2 < θ1 < 1} which is smaller than the full region
{(θ1, θ2)|0 < θ1 < 1, 0 < θ2 < 1}. If the background knowledge only contains order constraints
such as θ1 > θ2, the feasible region is still convex and we are guaranteed to find a global maxi-
mum in that region. In the meantime, if the background knowledge is correct, we are guaranteed
to find a better solution than searching in the full region, and therefore we can get a more accurate
estimate.
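The likelihood surface of Figure 6.2 can be reproduced by brute-force enumeration over the eight configurations. The following is an illustrative sketch (not the thesis code), using the ten samples of Figure 6.2:

```python
import itertools
import math

# Ten observed samples of (X1, X2, X3) from Figure 6.2.
samples = [(1, 0, 0), (1, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 0),
           (0, 0, 0), (0, 1, 1), (0, 0, 1), (0, 0, 1), (1, 1, 0)]

def log_lik(t1, t2):
    """Log likelihood L(theta1, theta2 | X) of the chain MRF X1 - X2 - X3 with
    edge potentials phi(u, v; t) = t**I(u == v) * (1 - t)**I(u != v)."""
    def unnorm(x):
        p1 = t1 if x[0] == x[1] else 1.0 - t1
        p2 = t2 if x[1] == x[2] else 1.0 - t2
        return p1 * p2
    Z = sum(unnorm(x) for x in itertools.product((0, 1), repeat=3))
    return sum(math.log(unnorm(x) / Z) for x in samples)

# Grid search: the unconstrained maximum sits at (0.6, 0.7), since X1 agrees
# with X2 in 6 of 10 samples and X2 agrees with X3 in 7 of 10.
grid = [round(0.05 * i, 2) for i in range(1, 20)]
best = max(itertools.product(grid, grid), key=lambda p: log_lik(*p))
# With the background knowledge theta1 > theta2, we search only that region.
best_con = max((p for p in itertools.product(grid, grid) if p[0] > p[1]),
               key=lambda p: log_lik(*p))
```

The constrained search lands near the diagonal, close to the ground truth, illustrating why a correct order constraint yields a more accurate estimate here.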
6.4 Simulations

We investigate the performance of our EM-kernel-PCD algorithm on heterogeneous HMRFs with
different structures, namely a tree-structure HMRF and a grid-structure HMRF. In the simulations,
we first set the ground truth of the parameters, and then set the background knowledge. We then
generate one example x and then generate one example y|x. With the observable y, we apply
EM-kernel-PCD, EM-binning-PCD and EM-homo-PCD to learn the parameters θ. We eventually
compare the three algorithms by their average absolute estimate error $\frac{1}{r}\sum_{i=1}^{r} |\hat\theta_i - \theta_i|$, where $\hat\theta_i$ is the estimate of $\theta_i$.
For the HMRFs, each dimension of X takes values in {0, 1}. The pairwise potential function $\phi_i$ on edge i (connecting $X_u$ and $X_v$) parameterized by $\theta_i$ ($0 < \theta_i < 1$) is $\phi_i(X; \theta_i) = \theta_i^{I(X_u = X_v)}(1-\theta_i)^{I(X_u \neq X_v)}$, where I is an indicator variable. For the tree structure, we choose a
perfect binary tree of height 12, which yields a total number of 8,191 nodes and 8,190 parameters,
i.e. d = 8,191 and r = 8,190. For the grid-structure HMRFs, we choose a grid of 100 rows and
100 columns, which yields a total number of 10,000 nodes and 19,800 parameters, i.e. d = 10,000
and r = 19,800. For both models, we generate $\theta_i \sim U(0.5, 1)$ independently and then generate the background knowledge $k_i$. We have two types of background knowledge. In the first type, we set $k_i = \sin\theta_i + \varepsilon$. In the second type, we set $k_i = \theta_i^2 + \varepsilon$, where $\varepsilon$ is random Gaussian noise from $N(0, \sigma_\varepsilon^2)$. We try three values for $\sigma_\varepsilon$, namely 0.0, 0.01 and 0.02. Then we generate one instantiation x. Finally, we generate one observable y from a d-dimensional multivariate normal distribution $N(\mu x, \sigma^2 I)$, where $\mu = 2$ is the strength of signal, $\sigma^2 = 1.0$ is the variance of the manifestation, and I is
the identity matrix of dimension d. For our EM-kernel-PCD algorithm, we use an Epanechnikov
kernel with α = β = 5. For tuning bandwidth h, we try 100 values in total, namely 0.005, 0.01,
0.015, ..., 0.5. For the EM-binning-PCD algorithm, we set the binwidth to be 0.005. The rest of
the parameter settings for the three algorithms are the same, including the n parameter in PCD
which is set to be 1 and the number of particles which is set to be 100. We also replicate each
experiment 20 times, and the averaged results are reported.
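The tree-structure part of this setup can be sketched as follows (a hypothetical sketch, not the thesis code). For a tree, forward sampling from the root with P(child = parent) = θ_edge draws exactly from the pairwise model above, because the partition function factorizes along the tree.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tree_hmrf(height, mu=2.0, sigma=1.0):
    """Generate one (x, y, theta, k) instance for a perfect-binary-tree HMRF,
    following the simulation setup: theta_i ~ U(0.5, 1), edge i keeps a child
    equal to its parent with probability theta_i, and y ~ N(mu * x, sigma^2 I)."""
    d = 2 ** (height + 1) - 1            # number of nodes (8191 when height = 12)
    r = d - 1                            # number of edges/parameters
    theta = rng.uniform(0.5, 1.0, size=r)
    k = np.sin(theta) + rng.normal(0.0, 0.01, size=r)  # background knowledge
    x = np.empty(d, dtype=int)
    x[0] = rng.integers(0, 2)            # root is uniform
    for child in range(1, d):            # node `child` meets its parent on edge child-1
        parent = (child - 1) // 2
        same = rng.random() < theta[child - 1]
        x[child] = x[parent] if same else 1 - x[parent]
    y = rng.normal(mu * x, sigma)        # observable manifestation
    return x, y, theta, k

x, y, theta, k = sample_tree_hmrf(height=12)
```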
Performance of the algorithms The results from the tree-structure HMRFs and the grid-
structure HMRFs are reported in Figure 6.3. We plot the average absolute error of the estimate of
the three algorithms against the number of iterations of PCD update. We have separate plots for
background knowledge $k_i = \sin\theta_i + \varepsilon$, and background knowledge $k_i = \theta_i^2 + \varepsilon$. Since there are
three noise levels for background knowledge, both the EM-kernel-PCD algorithm and the EM-
binning-PCD algorithm have three variations. All the three algorithms converge as they iterate.
[Figure 6.3 plots omitted: four panels, (1a) tree structure with k_i = sin(θ_i) + ε, (1b) tree structure with k_i = θ_i² + ε, (2a) grid structure with k_i = sin(θ_i) + ε, and (2b) grid structure with k_i = θ_i² + ε, each plotting absolute estimate error against iteration number for EM-homo-PCD, EM-binning-PCD (σ_ε = 0, 0.01, 0.02) and EM-kernel-PCD (σ_ε = 0, 0.01, 0.02).]

Figure 6.3: Performance of EM-homo-PCD, EM-binning-PCD and EM-kernel-PCD in tree-HMRFs and grid-HMRFs for two types of background knowledge: (a) $k_i = \sin\theta_i + \varepsilon$, and (b) $k_i = \theta_i^2 + \varepsilon$.
It is observed that the absolute estimate error of the EM-homo-PCD algorithm reduces to 0.125
as it converges. Since the parameters θi’s are drawn independently from the uniform distribution
on the interval [0.5, 1], the EM-homo-PCD algorithm ties all the θi’s and estimates them to be
0.75. Therefore, the averaged absolute error is $\int_{0.5}^{1.0} 2|x - 0.75|\,dx = 0.125$. Our EM-kernel-
PCD algorithm significantly outperforms the EM-binning-PCD algorithm and the EM-homo-PCD
algorithm. It is also observed that as the noise level of background knowledge increases, the
performance of the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm deteriorates.
However, as long as the noise level is moderate, the performance of our EM-kernel-PCD algorithm
is satisfactory. The results from the tree-structure HMRFs and the grid-structure HMRFs are
comparable except that it takes more iterations to converge in grid-structure HMRFs than in tree-
structure HMRFs.
Behavior of the algorithms We then plot the estimated parameters against their background
knowledge in the iterations of our EM-kernel-PCD algorithm. We provide plots for after 100
iterations, after 200 iterations and after convergence respectively, to show how the EM-kernel-
PCD algorithm behaves during the gradient ascent. Figure 6.4 shows the plots for the background
knowledge $k_i = \sin\theta_i + \varepsilon$ and the background knowledge $k_i = \theta_i^2 + \varepsilon$ with three levels of noise
(namely σε=0, 0.01 and 0.02) for both the tree-structure HMRFs and the grid-structure HMRFs.
It is observed that as the algorithm iterates, it gradually recovers the relationship between the
parameters and the background knowledge. There is still a gap between our estimate and the
ground truth. This is because we only have one hidden instantiation x and we have to infer x from
the observed y in the E-step. Especially at the boundaries, we can observe a certain amount of
estimate bias. The boundary bias is very common in kernel regression problems because there are
fewer data points at the boundaries [41].
Choosing parameter n One parameter in contrastive divergence algorithms is n, the number
of MCMC steps we need to perform under the current parameters in order to generate the particles.
The rationale of contrastive divergence is that it is enough to find the direction to update the
parameters by a few MCMC steps using the current parameters, and we do not have to reach the
Figure 6.4: The behavior of the EM-kernel-PCD algorithm during gradient ascent for different types of background knowledge with different levels of noise in the tree-structure HMRFs and the grid-structure HMRFs. The red dots show the mapping pattern between the ground truth of the parameters and their background knowledge.
equilibrium. Therefore, the parameter n is usually set to be very small to save computation when
we are learning general Markov random fields. Here we explore how we should choose the n
parameter in our EM-kernel-PCD algorithm for learning HMRFs. We choose three values for n
in the simulations, namely 1, 5 and 10. In Figure 6.5, the running time and absolute estimate
error are plotted for the three choices in the tree-structure HMRFs and grid-structure HMRFs
under different levels of noise in the background knowledge $k_i = \sin\theta_i + \varepsilon$ and the background knowledge $k_i = \theta_i^2 + \varepsilon$. The running time increases as n increases, but the estimation accuracy
does not increase. This observation stays the same for different structures and different levels of
noise in different types of background knowledge. This suggests that we can simply choose n = 1
in our EM-kernel-PCD algorithm.
6.5 Real-world Application

We use our EM-kernel-PCD algorithm to learn a heterogeneous HMRF model in a real-world
genome-wide association study on breast cancer. The dataset is from NCI’s Cancer Genetics
Markers of Susceptibility (CGEMS) study [82]. Details about the CGEMS dataset are provided in
Subsection 3.4.1. We build a heterogeneous HMRF model to identify the associated SNPs. In
the HMRF model, the hidden vector X ∈ {0, 1}d denotes whether the SNPs are associated with
breast cancer, i.e. Xi = 1 means that the SNPi is associated with breast cancer. For each SNP,
we can perform a two-proportion z-test from the minor allele count in cases and the minor allele
count in controls. Denote Yi to be the test statistic from the two-proportion z-test for SNPi. It
can be derived that $Y_i|X_i{=}0 \sim N(0, 1)$ and $Y_i|X_i{=}1 \sim N(\mu_1, 1)$ for some unknown $\mu_1$ ($\mu_1 \neq 0$).
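A minimal sketch of the per-SNP two-proportion z-test with a pooled-proportion standard error is given below; the allele counts in the usage line are illustrative, not the CGEMS data.

```python
import math

def two_prop_z(count1, n1, count2, n2):
    """Two-proportion z statistic with a pooled-proportion standard error.
    count1/n1: minor-allele count and allele total in cases;
    count2/n2: the same in controls. Under H0, z ~ N(0, 1)."""
    p1, p2 = count1 / n1, count2 / n2
    p = (count1 + count2) / (n1 + n2)                       # pooled proportion
    se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))   # pooled standard error
    return (p1 - p2) / se

# Illustrative counts: 980 minor alleles out of 2290 case chromosomes
# vs 860 out of 2284 control chromosomes.
z = two_prop_z(980, 2290, 860, 2284)
```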
We assume that X forms a pairwise Markov random field with respect to the graph G. The
graph G is built as follows. We query the squared correlation coefficients ($r^2$ values) among the SNPs from HapMap [175]. Each SNP becomes a node in the graph. For each SNP, we connect it with the SNP having the highest $r^2$ value with it. We also remove the edges whose $r^2$ values are below 0.25. There are in total 340,601 edges in the graph. The pairwise potential
function $\phi_i$ on edge i (connecting $X_u$ and $X_v$) parameterized by $\theta_i$ ($0 < \theta_i < 1$) is $\phi_i(X; \theta_i) = \theta_i^{I(X_u = X_v)}(1-\theta_i)^{I(X_u \neq X_v)}$ for i = 1, ..., 340,601, where I is an indicator variable. It is believed
that two SNPs with a higher level of correlation are more likely to agree in their association with
breast cancer. Therefore, we set the background knowledge k about the parameters to be the
$r^2$ values between the SNPs on the edge. We first perform the two-proportion z-test and set y
[Figure 6.5 plots omitted: twelve panels, (1a)-(1c) tree structure with k_i = sin(θ_i) + ε at σ_ε = 0, 0.01, 0.02; (2a)-(2c) tree structure with k_i = θ_i² + ε at σ_ε = 0, 0.01, 0.02; (3a)-(3c) grid structure with k_i = sin(θ_i) + ε at σ_ε = 0, 0.01, 0.02; and (4a)-(4c) grid structure with k_i = θ_i² + ε at σ_ε = 0, 0.01, 0.02, each plotting running time and absolute estimate error against iteration number for n = 1, 5, 10.]

Figure 6.5: Absolute estimate error (plotted in blue, in the units on the right axes) and running time (plotted in black, in seconds on the left axes) of the EM-kernel-PCD algorithm in the tree-structure HMRFs and the grid-structure HMRFs when we choose different n values; n is the number of MCMC steps for advancing particles in the PCD algorithm. The absolute estimate error in the first 400 iterations is not shown in the plots.
to be the calculated test statistics. Then we estimate θ|y,k in the heterogeneous HMRF with
respect to G using our EM-kernel-PCD algorithm. After we estimate θ and µ1, we calculate the
marginal probabilities of the hidden X. Eventually, we rank the SNPs by the marginal probabilities
Figure 6.6: The estimated parameters against their background knowledge, namely the $r^2$ values between the pairs of SNPs.
P (Xi = 1|y; θ, µ1), and select the SNPs with the largest marginal probabilities.
The algorithm ran for 46 days on a single processor (AMD Opteron Processor, 3300 MHz)
before it converged. We plotted the estimated parameters against their background knowledge,
namely the $r^2$ values between the pairs of SNPs on the edges. The plot is provided in Figure 6.6. It
is observed that the mapping between the estimated parameters and the background knowledge is
monotone increasing, as we expect. Finally we calculated the marginal probabilities of the hidden
X, and ranked the SNPs by the marginal probabilities P (Xi = 1|y; θ, µ1). There are in total five
SNPs with P (Xi = 1|y; θ, µ1) greater than 0.99, which means they are associated with breast
cancer with a probability greater than 0.99 given the observed test statistics y under the estimated
parameters θ and µ1. There is strong evidence in the literature that supports the association with
breast cancer for three of them. The two SNPs rs2420946 and rs1219648 on chromosome 10 were reported by Hunter et al. (2007), and have been further validated by 1,776 cases and 2,072 controls
from three additional studies. Their associated gene FGFR2 is very well known to be associated
with breast cancer in the literature. There is also strong evidence supporting the association of the
SNP rs7712949 on chromosome 5. The SNP rs7712949 is highly correlated ($r^2 = 0.948$) with SNP rs4415084, which has been identified to be associated with breast cancer by another six large-scale
studies.¹
6.6 Discussion

Capturing parameter heterogeneity is an important issue in machine learning and statistics, and it is particularly challenging in HMRFs due to both the intractable Z(θ) and the latent x. In this
chapter, we propose the EM-kernel-PCD algorithm for learning the heterogeneous parameters
with background knowledge. Our algorithm is built upon the PCD algorithm which handles the
intractable Z(θ). The EM part we add is for dealing with the hidden x. The kernel smoothing
part we add is to adaptively incorporate the background knowledge about the heterogeneity in
parameters in the gradient ascent learning. Eventually, the relation between the parameters and
the background knowledge is recovered in a nonparametric way, which is also adaptive to the
data. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and
outperforms alternative binning methods.
Similar to other EM algorithms, our algorithm only converges to a local maximum of the likelihood L(θ, ϕ), although the lower bound F(q_x(x), {θ, ϕ}) is nondecreasing over the EM iterations (except for some MCMC error introduced in the E-step). Our algorithm also suffers from long running times due to the computationally expensive PCD algorithm within each M-step. These two issues are
important directions for future work.
The material in this chapter first appeared in the 17th International Conference on Artificial
Intelligence and Statistics (AISTATS’2014) as follows:
Jie Liu, Chunming Zhang, Elizabeth Burnside and David Page. Learning Heterogeneous
Hidden Markov Random Fields. The 17th International Conference on Artificial Intelligence and
Statistics (AISTATS), 2014.
¹ http://snpedia.com/index.php/rs4415084

This chapter discusses learning heterogeneous parameters in graphical models with some background knowledge about these parameters. When there is no such background knowledge, it can be beneficial to group the parameters for more efficient learning. The next chapter imposes Dirichlet process priors over the parameters to specify these latent parameter groups, and estimates the
parameters in a Bayesian framework. The chapter demonstrates that it can indeed be beneficial to
group the parameters, even if we do not have domain-specific background knowledge about what
the grouping should be.
Chapter 7
Bayesian Estimation of
Latently-grouped Parameters in
Graphical Models
In large-scale applications of undirected graphical models, such as social networks and biological
networks, similar patterns occur frequently and give rise to similar parameters. In this situation,
it is beneficial to group the parameters for more efficient learning. We show that even when
the grouping is unknown, we can infer these parameter groups during learning via a Bayesian
approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually
involves calculating intractable terms, and we propose two approximation algorithms, namely
a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with
“stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform
conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs
sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better
than the models learned by MLE on real-world Senate voting data.
7.1 Introduction
Undirected graphical models, a.k.a. Markov random fields (MRFs), have many real-world ap-
plications such as social networks and biological networks. In these large-scale networks, similar
kinds of relations can occur frequently and give rise to repeated occurrences of similar parameters,
but the grouping pattern among the parameters is usually unknown. For a social network example,
suppose that we collect voting data over the last 20 years from a group of 1,000 people who are
related to each other through different types of relations (such as family, co-workers, classmates,
friends and so on), but the relation types are usually unknown. If we use a binary pairwise MRF
to model the data, each binary node denotes one person’s vote, and two nodes are connected if the
two people are linked in the social network. Eventually we want to estimate the pairwise potential
functions on edges, which can provide insights about how the relations between people affect their
decisions. This can be done via standard maximum likelihood estimation (MLE), but the latent
grouping pattern among the parameters is totally ignored, and the model can be overparametrized.
Therefore, two questions naturally arise. Can MRF parameter learners automatically identify
these latent parameter groups during learning? Will this further abstraction make the model gen-
eralize better, analogous to the lessons we have learned from hierarchical modeling [61] and topic
modeling [20]?
This chapter shows that it is feasible and potentially beneficial to identify the latent parameter
groups during MRF parameter learning. Specifically, we impose a Dirichlet process prior on the
parameters to accommodate our uncertainty about the number of the parameter groups. Posterior
inference can be done by Markov chain Monte Carlo with proper approximations. We propose
two approximation algorithms, a Metropolis-Hastings algorithm with auxiliary variables and a
Gibbs sampling algorithm with stripped Beta approximation (Gibbs SBA). Algorithmic details
are provided in Section 7.3 after we review related parameter estimation methods in Section 7.2.
In Section 7.4, we evaluate our Bayesian estimates and the classical MLE on different models,
and both algorithms outperform classical MLE. The Gibbs SBA algorithm performs very close to
the Gibbs sampling algorithm with exact likelihood calculation. Models learned with Gibbs SBA
also generalize better than the models learned by MLE on real-world Senate voting data in Section
7.5. We finally conclude in Section 7.6.
7.2 Maximum Likelihood Estimation and Bayesian Estimation for
MRFs
Let $\mathcal{X} = \{0, 1, ..., m-1\}$ be a discrete space. Suppose that we have an MRF defined on a random vector $\mathbf{X} \in \mathcal{X}^d$ described by an undirected graph $G(\mathcal{V}, \mathcal{E})$ with d nodes in the node set $\mathcal{V}$ and r edges in the edge set $\mathcal{E}$. The probability of one sample $\mathbf{x}$ from the MRF parameterized by $\boldsymbol\theta$ is

$$P(\mathbf{x};\boldsymbol\theta) = \tilde{P}(\mathbf{x};\boldsymbol\theta)/Z(\boldsymbol\theta), \qquad (7.1)$$

where $Z(\boldsymbol\theta)$ is the partition function. $\tilde{P}(\mathbf{x};\boldsymbol\theta)=\prod_{c\in\mathcal{C}(G)} \phi_c(\mathbf{x};\theta_c)$ is some unnormalized measure, $\mathcal{C}(G)$ is some subset of cliques in G, and $\phi_c$ is the potential function defined on the clique c parameterized by $\theta_c$. In this chapter, we consider binary pairwise MRFs for simplicity, i.e. $\mathcal{C}(G)=\mathcal{E}$ and $m=2$. We also assume that each potential function $\phi_c$ is parameterized by one parameter $\theta_c$, namely $\phi_c(X; \theta_c)=\theta_c^{I(X_u=X_v)}(1-\theta_c)^{I(X_u \neq X_v)}$, where $I(X_u=X_v)$ indicates whether the two nodes u and v connected by edge c take the same value, and $0<\theta_c<1$ for all $c=1, ..., r$. Thus, $\boldsymbol\theta=\{\theta_1, ..., \theta_r\}$. Suppose that we have n independent samples $\mathbf{X}=\{\mathbf{x}^1, ..., \mathbf{x}^n\}$ from (7.1), and we want to estimate $\boldsymbol\theta$.
Maximum Likelihood Estimate: The MLE of $\boldsymbol\theta$ maximizes the log-likelihood function $L(\boldsymbol\theta|\mathbf{X})$, which is concave w.r.t. $\boldsymbol\theta$. Therefore, we can use gradient ascent to find the global maximum of the likelihood function and find the MLE of $\boldsymbol\theta$. The partial derivative of $L(\boldsymbol\theta|\mathbf{X})$ with respect to $\theta_i$ is

$$\frac{\partial L(\boldsymbol\theta|\mathbf{X})}{\partial \theta_i} = \frac{1}{n}\sum_{j=1}^{n}\psi_i(\mathbf{x}^j) - E_{\boldsymbol\theta}\psi_i = E_{\mathbf{X}}\psi_i - E_{\boldsymbol\theta}\psi_i,$$

where $\psi_i$ is the sufficient statistic corresponding to $\theta_i$ after we rewrite the density into the exponential family form, and $E_{\boldsymbol\theta}\psi_i$ is the expectation of $\psi_i$ with respect to the distribution specified by $\boldsymbol\theta$. However, the exact computation of $E_{\boldsymbol\theta}\psi_i$ takes time exponential in the treewidth of G. A few sampling-based methods have been proposed, with different ways of generating particles and computing $E_{\boldsymbol\theta}\psi$ from the particles, including MCMC-MLE [66, 218], particle-filtered MCMC-MLE [5], contrastive divergence [80] and its variations
such as persistent contrastive divergence (PCD) [178] and fast PCD [179]. Note that contrastive
divergence is related to pseudo-likelihood [18], ratio matching [83, 84], and together with other
MRF parameter estimators [72, 184, 71] can be unified as minimum KL contraction [112].
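To make the PCD recipe concrete, here is a minimal sketch (not the thesis implementation) for the binary pairwise MRF above. It ascends in the natural parameters $\eta_c = \log(\theta_c/(1-\theta_c))$, where the gradient is exactly $E_{\mathbf{X}}\psi - E_{\boldsymbol\theta}\psi$ with $\psi_c(\mathbf{x}) = I(x_u = x_v)$, and estimates the model expectation from persistent Gibbs particles. The usage line applies it to the ten 3-node samples of Figure 6.2, whose agreement rates are 6/10 and 7/10.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcd(X, edges, d, n_iters=1500, n_particles=200, n_steps=1, lr=0.05):
    """Persistent contrastive divergence (sketch) for a binary pairwise MRF with
    potentials phi_c = theta_c**I(Xu == Xv) * (1 - theta_c)**I(Xu != Xv)."""
    X = np.asarray(X)
    nbrs = [[] for _ in range(d)]                   # (neighbor, edge index) lists
    for c, (u, v) in enumerate(edges):
        nbrs[u].append((v, c)); nbrs[v].append((u, c))
    data_psi = np.mean([[float(x[u] == x[v]) for (u, v) in edges] for x in X], axis=0)
    eta = np.zeros(len(edges))                      # theta = sigmoid(eta), starts at 0.5
    P = rng.integers(0, 2, size=(n_particles, d))   # persistent particles
    for _ in range(n_iters):
        theta = 1.0 / (1.0 + np.exp(-eta))
        for _ in range(n_steps):                    # advance particles n MCMC steps
            for i in range(d):                      # one Gibbs sweep over the nodes
                w1 = np.ones(n_particles); w0 = np.ones(n_particles)
                for (j, c) in nbrs[i]:
                    agree = P[:, j] == 1
                    w1 *= np.where(agree, theta[c], 1.0 - theta[c])
                    w0 *= np.where(~agree, theta[c], 1.0 - theta[c])
                P[:, i] = rng.random(n_particles) < w1 / (w0 + w1)
        model_psi = np.mean([[float(p[u] == p[v]) for (u, v) in edges] for p in P], axis=0)
        eta += lr * (data_psi - model_psi)          # gradient ascent in natural params
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical usage on the ten samples of Figure 6.2 (chain X1 - X2 - X3).
samples = [(1, 0, 0), (1, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 0),
           (0, 0, 0), (0, 1, 1), (0, 0, 1), (0, 0, 1), (1, 1, 0)]
theta_hat = pcd(samples, edges=[(0, 1), (1, 2)], d=3)
```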
Bayesian Estimate: Let $\pi(\boldsymbol\theta)$ be a prior of $\boldsymbol\theta$; then its posterior is $P(\boldsymbol\theta|\mathbf{X}) \propto \pi(\boldsymbol\theta)\tilde P(\mathbf{X};\boldsymbol\theta)/Z(\boldsymbol\theta)$. The Bayesian estimate of $\boldsymbol\theta$ is its posterior mean. Exact sampling from $P(\boldsymbol\theta|\mathbf{X})$ is known to be doubly-intractable for general MRFs [128]. If we use the Metropolis-Hastings algorithm, then the Metropolis-Hastings ratio is

$$a(\boldsymbol\theta^*|\boldsymbol\theta) = \frac{\pi(\boldsymbol\theta^*)\,\tilde P(\mathbf{X};\boldsymbol\theta^*)\,Q(\boldsymbol\theta|\boldsymbol\theta^*)/Z(\boldsymbol\theta^*)}{\pi(\boldsymbol\theta)\,\tilde P(\mathbf{X};\boldsymbol\theta)\,Q(\boldsymbol\theta^*|\boldsymbol\theta)/Z(\boldsymbol\theta)}, \qquad (7.2)$$

where $Q(\boldsymbol\theta^*|\boldsymbol\theta)$ is some proposal distribution from $\boldsymbol\theta$ to $\boldsymbol\theta^*$, and with probability $\min\{1, a(\boldsymbol\theta^*|\boldsymbol\theta)\}$ we accept the move from $\boldsymbol\theta$ to $\boldsymbol\theta^*$. The real hurdle is that we have to evaluate the intractable ratio $Z(\boldsymbol\theta)/Z(\boldsymbol\theta^*)$. In [123], Møller et al. introduce one auxiliary variable $\mathbf{y}$ on the same space as $\mathbf{x}$, and the state variable is extended to $(\boldsymbol\theta,\mathbf{y})$. They set the new proposal distribution for the extended state to $Q(\boldsymbol\theta,\mathbf{y}|\boldsymbol\theta^*,\mathbf{y}^*)=Q(\boldsymbol\theta|\boldsymbol\theta^*)\tilde P(\mathbf{y};\boldsymbol\theta)/Z(\boldsymbol\theta)$ to cancel $Z(\boldsymbol\theta)/Z(\boldsymbol\theta^*)$ in (7.2). Therefore, by ignoring $\mathbf{y}$, we can generate the posterior samples of $\boldsymbol\theta$ via Metropolis-Hastings. Technically, this auxiliary variable approach requires perfect sampling [143], but [123] pointed out that other, simpler Markov chain methods also work, with the proviso that the chain converges adequately to the equilibrium distribution.
7.3 Bayesian Parameter Estimation for MRFs with Dirichlet Process
Prior
In order to model the latent parameter groups, we impose a Dirichlet process prior on θ, which
accommodates our uncertainty about the number of groups. Then, the generating model is
$$\begin{aligned} G &\sim \mathrm{DP}(\alpha_0, G_0) \\ \theta_i \mid G &\sim G, \qquad i = 1, ..., r \\ \mathbf{x}^j \mid \boldsymbol\theta &\sim F(\boldsymbol\theta), \qquad j = 1, ..., n, \end{aligned} \qquad (7.3)$$
where F (θ) is the distribution specified by (7.1). G0 is the base distribution (e.g. Unif(0, 1)),
and α0 is the concentration parameter. With probability 1.0, the distribution G drawn from
DP(α0, G0) is discrete, and places its mass on a countably infinite collection of atoms drawn
from G0. In this model, X={x1, ...,xn} is observed, and we want to perform posterior inference
for θ = (θ1, θ2, ..., θr), and regard its posterior mean as its Bayesian estimate. We propose two
Markov chain Monte Carlo (MCMC) methods. One is a Metropolis-Hastings algorithm with aux-
iliary variables, as introduced in Section 7.3.1. The second is a Gibbs sampling algorithm with
stripped Beta approximation, as introduced in Section 7.3.2. In both methods, the state of the
Markov chain is specified by two vectors, c and φ. In vector c = (c1, ..., cr), ci denotes the group
to which $\theta_i$ belongs. $\boldsymbol\phi = (\phi_1, ..., \phi_k)$ records the k distinct values in $\{\theta_1, ..., \theta_r\}$, with $\phi_{c_i} = \theta_i$ for i = 1, ..., r. This way of specifying the Markov chain is more efficient than setting the state
variable directly to be (θ1, θ2, ..., θr) [131].
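A draw of (c, φ) from this prior can be sketched via the equivalent Chinese restaurant process (an illustrative sketch, not the thesis code; r is scaled down and $G_0 = \mathrm{Unif}(0,1)$ as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_crp_prior(r, alpha0, base=lambda rng: rng.uniform(0.0, 1.0)):
    """Draw (c, phi) from the Dirichlet process prior via the Chinese restaurant
    process: theta_i joins an existing group g with probability proportional to
    its size n_g, or opens a new group with probability proportional to alpha0,
    in which case a new atom phi ~ G0 is drawn (here G0 = Unif(0, 1))."""
    c, phi, counts = [], [], []
    for i in range(r):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= probs.sum()              # n_g/(i + alpha0) or alpha0/(i + alpha0)
        g = rng.choice(len(probs), p=probs)
        if g == len(counts):              # open a new group with a fresh atom
            counts.append(1); phi.append(base(rng))
        else:
            counts[g] += 1
        c.append(g)
    return np.array(c), np.array(phi)

c, phi = sample_crp_prior(r=3406, alpha0=1.0)
theta = phi[c]                            # theta_i = phi_{c_i}
```

The expected number of groups grows like $\alpha_0 \ln r$, which also motivates the $K = \lfloor \alpha_0 \ln r \rfloor$ initialization used below.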
7.3.1 Metropolis-Hastings (MH) with Auxiliary Variables
In the MH algorithm (see Algorithm 3), the initial state of the Markov chain is set by performing
K-means clustering on the MLE of $\boldsymbol\theta$ (e.g. from the PCD algorithm [178]) with $K=\lfloor\alpha_0 \ln r\rfloor$. The
Markov chain resembles Algorithm 5 in [131], and it is ergodic. We move the Markov chain
forward for T steps. In each step, we update c first and then update φ. We update each element
of c in turn; when resampling ci, we fix c−i, all elements in c other than ci. When updating ci,
we repeatedly for M times propose a new value c∗i according to proposal Q(c∗i |ci) and accept the
move with probability min{1, a(c∗i |ci)} where a(c∗i |ci) is the MH ratio. After we update every
element of c in the current iteration, we draw a posterior sample of φ according to the current
grouping c. We iterate T times, and get T posterior samples of θ. Unlike the tractable Algorithm
5 in [131], we need to introduce auxiliary variables to bypass MRF’s intractable likelihood in two
places, namely calculating the MH ratio (in Section 7.3.1) and drawing samples of φ|c (in Section
7.3.1).
Algorithm 3 The Metropolis-Hastings algorithm
Input: observed data X = {x¹, ..., xⁿ}
Output: θ⁽¹⁾, ..., θ⁽ᵀ⁾; T samples of θ|X
Procedure:
Perform PCD algorithm to get θ̂, the MLE of θ
Initialize c and φ via K-means on θ̂; K = ⌊α₀ ln r⌋
for t = 1 to T do
  for i = 1 to r do
    for l = 1 to M do
      Draw a candidate c*_i from Q(c*_i | c_i)
      If c*_i ∉ c, draw a value for φ_{c*_i} from G₀
      Set c_i = c*_i with prob min{1, a(c*_i | c_i)}
    end for
  end for
  Draw a posterior sample of φ according to current c, and set θ^(t)_i = φ_{c_i} for i = 1, ..., r
end for

Calculating Metropolis-Hastings Ratio
The MH ratio of proposing a new value $c_i^*$ for $c_i$ according to the proposal $Q(c_i^*|c_i)$ is

$$a(c_i^*|c_i) = \frac{\pi(c_i^*, \mathbf{c}_{-i})\,P(\mathbf{X};\boldsymbol\theta_{\cdot i}^*)\,Q(c_i|c_i^*)}{\pi(c_i, \mathbf{c}_{-i})\,P(\mathbf{X};\boldsymbol\theta)\,Q(c_i^*|c_i)} = \frac{\pi(c_i^*|\mathbf{c}_{-i})\,\tilde P(\mathbf{X};\boldsymbol\theta_{\cdot i}^*)\,Q(c_i|c_i^*)/Z(\boldsymbol\theta_{\cdot i}^*)}{\pi(c_i|\mathbf{c}_{-i})\,\tilde P(\mathbf{X};\boldsymbol\theta)\,Q(c_i^*|c_i)/Z(\boldsymbol\theta)},$$

where $\boldsymbol\theta_{\cdot i}^*$ is the same as $\boldsymbol\theta$ except that its i-th element is replaced with $\phi_{c_i^*}$. The conditional prior $\pi(c_i^*|\mathbf{c}_{-i})$ is

$$\pi(c_i=c\,|\,\mathbf{c}_{-i})=\begin{cases} \dfrac{n_{-i,c}}{r-1+\alpha_0}, & \text{if } c \in \mathbf{c}_{-i} \\[6pt] \dfrac{\alpha_0}{r-1+\alpha_0}, & \text{if } c \notin \mathbf{c}_{-i} \end{cases}$$

where $n_{-i,c}$ is the number of $c_j$ with $j \neq i$ and $c_j=c$. We choose the proposal $Q(c_i^*|c_i)$ to be the conditional prior $\pi(c_i^*|\mathbf{c}_{-i})$, and the Metropolis-Hastings ratio can be further simplified as

$$a(c_i^*|c_i)=\tilde P(\mathbf{X};\boldsymbol\theta_{\cdot i}^*)Z(\boldsymbol\theta)\,/\,\tilde P(\mathbf{X};\boldsymbol\theta)Z(\boldsymbol\theta_{\cdot i}^*).$$
However, $Z(\boldsymbol\theta)/Z(\boldsymbol\theta_{\cdot i}^*)$ is intractable. Similar to [123], we introduce an auxiliary variable $\mathbf{Z}$ on the same space as $\mathbf{X}$, and the state variable is extended to $(\mathbf{c},\mathbf{Z})$. When proposing a move, we propose $c_i^*$ first and then propose $\mathbf{Z}^*$ with proposal $P(\mathbf{Z};\boldsymbol\theta_{\cdot i}^*)$ to cancel the intractable $Z(\boldsymbol\theta)/Z(\boldsymbol\theta_{\cdot i}^*)$. We set the target distribution of $\mathbf{Z}$ to be $P(\mathbf{Z}; \hat{\boldsymbol\theta})$, where $\hat{\boldsymbol\theta}$ is some estimate of $\boldsymbol\theta$ (e.g. from PCD [178]). Then, the MH ratio with the auxiliary variable is

$$a(c_i^*,\mathbf{Z}^*|c_i,\mathbf{Z}) = \frac{\tilde P(\mathbf{Z}^*; \hat{\boldsymbol\theta})\,\tilde P(\mathbf{X};\boldsymbol\theta_{\cdot i}^*)\,\tilde P(\mathbf{Z};\boldsymbol\theta)}{\tilde P(\mathbf{Z}; \hat{\boldsymbol\theta})\,\tilde P(\mathbf{X};\boldsymbol\theta)\,\tilde P(\mathbf{Z}^*;\boldsymbol\theta_{\cdot i}^*)}.$$
Thus, the intractable computation of the MH ratio is replaced by generating particles Z∗ and
Z under θ.∗i and θ respectively. Ideally, we should use perfect sampling [143], but it is intractable
for general MRFs. As a compromise, we use standard Gibbs sampling with long runs to generate
these particles.
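The auxiliary-variable cancellation above can be sketched in a few lines. This is a minimal sketch, not the thesis implementation: `unnorm_loglik` and `draw_particle` are hypothetical helpers standing in for the unnormalized MRF likelihood and the long-run Gibbs particle generator.

```python
import math
import random

def log_accept_ratio(x, z_cur, theta_prop, theta_cur, theta_bar,
                     unnorm_loglik, draw_particle):
    """Log of the auxiliary-variable MH ratio
    a = P(Z*; th_bar) P(X; th*) P(Z; th) / [P(Z; th_bar) P(X; th) P(Z*; th*)].
    The intractable Z(theta)/Z(theta*) cancels because the particle Z* is
    proposed under the candidate parameters theta_prop.

    unnorm_loglik(data, theta) -> log unnormalized MRF likelihood (assumed helper)
    draw_particle(theta)       -> particle sampled under theta, e.g. by a
                                  long-run Gibbs sampler (assumed helper)
    """
    z_star = draw_particle(theta_prop)
    return (unnorm_loglik(z_star, theta_bar) + unnorm_loglik(x, theta_prop)
            + unnorm_loglik(z_cur, theta_cur)
            - unnorm_loglik(z_cur, theta_bar) - unnorm_loglik(x, theta_cur)
            - unnorm_loglik(z_star, theta_prop))

def accept(log_a):
    """Accept the move with probability min{1, exp(log_a)}."""
    return math.log(random.random()) < log_a
```

Because particles enter only through likelihood ratios, working in log space avoids underflow for large graphs.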
Drawing Posterior Samples of φ|c

We draw posterior samples of φ under grouping c via the MH algorithm, again following [123].
The state of the Markov chain is φ. The initial state of the Markov chain is set by running
PCD [178] with parameters tied according to c. The proposal Q(φ^* | φ) is a k-variate Gaussian
N(φ, σ_Q^2 I_k), where σ_Q^2 I_k is the covariance matrix. The auxiliary variable Y is on the same
space as X, and the state is extended to (φ, Y). The proposal distribution for the extended state
variable is Q(φ, Y | φ^*, Y^*) = Q(φ | φ^*) P(Y; φ)/Z(φ). We set the target distribution of Y to be
P(Y; φ̄), where φ̄ is some estimate of φ, such as the estimate from the PCD algorithm [178]. Then
the MH ratio for the extended state is

$$a(\phi^*, Y^* \mid \phi, Y) = \frac{I(\phi^* \in \Theta)\, P(Y^*; \bar{\phi})\, P(X; \phi^*)\, P(Y; \phi)}{P(Y; \bar{\phi})\, P(X; \phi)\, P(Y^*; \phi^*)},$$

where I(φ^* ∈ Θ) indicates that every dimension of φ^* is in the domain of G_0. We set the state
to the new values with probability min{1, a(φ^*, Y^* | φ, Y)}. We move the Markov chain for S
steps, obtain S samples of φ by ignoring Y, and eventually draw one sample from them at random.
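The proposal and domain-check steps can be sketched as follows, assuming G_0 = Unif(0, 1) as in the simulations; coordinate-wise Gaussian draws realize the N(φ, σ_Q^2 I_k) proposal.

```python
import random

def propose_phi(phi, sigma_q=0.001):
    """k-variate Gaussian random-walk proposal N(phi, sigma_q^2 I_k),
    drawn coordinate-wise; sigma_q = 0.001 follows the setting used in
    the simulations (Section 7.4.2)."""
    return [random.gauss(p, sigma_q) for p in phi]

def in_domain(phi_star, lo=0.0, hi=1.0):
    """The indicator I(phi* in Theta): every coordinate must lie in the
    domain of G0 (assumed Unif(0, 1) here)."""
    return all(lo < p < hi for p in phi_star)
```

A proposal that leaves the domain is rejected outright, which is exactly the effect of the indicator in the MH ratio above.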
7.3.2 Gibbs Sampling with Stripped Beta Approximation
Algorithm 4 The Gibbs sampling algorithm
Input: observed data X = {x_1, x_2, ..., x_n}
Output: θ^(1), ..., θ^(T); T posterior samples of θ|X
Procedure:
  Perform the PCD algorithm to get the MLE θ̄
  Initialize c and φ via K-means on θ̄; K = ⌊α_0 ln r⌋
  for t = 1 to T do
    for i = 1 to r do
      If the current c_i is unique in c, remove φ_{c_i} from φ
      Update c_i according to (7.4)
      If the new c_i ∉ c, draw a value for φ_{c_i} and add it to φ
    end for
    Draw a posterior sample of φ according to the current c, and set θ_i^(t) = φ_{c_i} for i = 1, ..., r
  end for
In the Gibbs sampling algorithm (see Algorithm 4), the initialization of the Markov chain is
exactly the same as in the MH algorithm in Section 7.3.1. The Markov chain resembles Algorithm
2 in [131] and it can be shown to be ergodic. We move the Markov chain forward for T steps. In
each of the T steps, we update c first and then update φ. When we update c, we fix the values in
φ, except we may add one new value to φ or remove a value from φ. We update each element of c
in turn. When we update ci, we first examine whether ci is unique in c. If so, we remove φci from
φ first. We then update ci by assigning it to an existing group or a new group with a probability
proportional to a product of two quantities, namely
$$P(c_i = c \mid c_{-i}, X, \phi_{c_{-i}}) \propto \begin{cases} \dfrac{n_{-i,c}}{r-1+\alpha_0}\, P(X; \phi_c, \phi_{c_{-i}}), & \text{if } c \in c_{-i}, \\[6pt] \dfrac{\alpha_0}{r-1+\alpha_0} \displaystyle\int P(X; \theta_i, \phi_{c_{-i}})\, dG_0(\theta_i), & \text{if } c \notin c_{-i}. \end{cases} \tag{7.4}$$
The first quantity is n−i,c, the number of members already in group c. For starting a new
group, the quantity is α0. The second quantity is the likelihood of X after assigning ci to the new
value c conditional on φc−i . When considering a new group, we integrate the likelihood w.r.t.
G0. After ci is resampled, it is either set to be an existing group or a new group. If a new group
is assigned, we draw a new value for φci , and add it to φ. After updating every element of c
in the current iteration, we draw a posterior sample of φ under the current grouping c. In total,
we run T iterations, and get T posterior samples of θ. This Gibbs sampling algorithm involves
two intractable calculations, namely (i) calculating P(X; φ_c, φ_{c_{-i}}) and ∫ P(X; θ_i, φ_{c_{-i}}) dG_0(θ_i)
in (7.4), and (ii) drawing posterior samples for φ. We use a stripped Beta approximation in both
places, as described in the two subsections below.
Calculating P(X; φ_c, φ_{c_{-i}}) and ∫ P(X; θ_i, φ_{c_{-i}}) dG_0(θ_i) in (7.4)

In Formula (7.4), we evaluate P(X; φ_c, φ_{c_{-i}}) for different φ_c values with φ_{c_{-i}} fixed and X =
{x_1, x_2, ..., x_n} observed. For ease of notation, we rewrite this quantity as a likelihood function
of θ_i, L(θ_i | X, θ_{-i}), where θ_{-i} = {θ_1, ..., θ_{i-1}, θ_{i+1}, ..., θ_r} is fixed. Suppose that edge i
connects variables X_u and X_v, and denote by X_{-uv} the variables other than X_u and X_v. Then

$$L(\theta_i \mid X, \theta_{-i}) = \prod_{j=1}^{n} P(x_u^j, x_v^j \mid x_{-uv}^j; \theta_i, \theta_{-i})\, P(x_{-uv}^j; \theta_i, \theta_{-i}) \approx \prod_{j=1}^{n} P(x_u^j, x_v^j \mid x_{-uv}^j; \theta_i, \theta_{-i})\, P(x_{-uv}^j; \theta_{-i}) \propto \prod_{j=1}^{n} P(x_u^j, x_v^j \mid x_{-uv}^j; \theta_i, \theta_{-i}).$$
Above we approximate P(x_{-uv}^j; θ_i, θ_{-i}) with P(x_{-uv}^j; θ_{-i}) because the density of X_{-uv}
mostly depends on θ_{-i}. The term P(x_{-uv}^j; θ_{-i}) can be dropped since θ_{-i} is fixed, and we only
have to consider P(x_u^j, x_v^j | x_{-uv}^j; θ_i, θ_{-i}). Since θ_{-i} is fixed and we are conditioning on
x_{-uv}^j, they together can be regarded as a fixed potential function telling how likely the rest of the
graph thinks X_u and X_v should take the same value. Suppose that this fixed potential function (the
message from the rest of the network x_{-uv}^j) is parameterized as η_i (0 < η_i < 1). Then
$$\prod_{j=1}^{n} P(x_u^j, x_v^j \mid x_{-uv}^j; \theta_i, \theta_{-i}) \propto \prod_{j=1}^{n} \lambda^{I(x_u^j = x_v^j)} (1-\lambda)^{I(x_u^j \neq x_v^j)} = \lambda^{\sum_{j=1}^{n} I(x_u^j = x_v^j)}\, (1-\lambda)^{\sum_{j=1}^{n} I(x_u^j \neq x_v^j)}, \tag{7.5}$$
where λ = θ_i η_i / {θ_i η_i + (1−θ_i)(1−η_i)}. The right-hand side of (7.5) resembles a Beta
distribution with parameters (∑_{j=1}^{n} I(x_u^j = x_v^j) + 1, n − ∑_{j=1}^{n} I(x_u^j = x_v^j) + 1), except that
only part of λ, namely θ_i, is random. We want to use a Beta distribution to approximate the
likelihood with respect to θ_i, so we need to remove the contribution of η_i and keep only the
contribution from θ_i. We choose Beta(⌊nθ̄_i⌋+1, n−⌊nθ̄_i⌋+1), where θ̄_i is the MLE of θ_i (e.g.
from the PCD algorithm). This approximation is named the Stripped Beta Approximation. The
simulation results in Section 7.4.2
indicate that the performance of the stripped Beta approximation is very close to using exact calcu-
lation. Also this approximation only requires as much computation as in the tractable tree-structure
MRFs, and it does not require generating expensive particles as in the MH algorithm with auxil-
iary variables. The integral∫P (X; θi, φc−i) dG0(θi) in (7.4) can be calculated via Monte Carlo
approximation. We draw a number of samples of θ_i from G_0, evaluate P(X; θ_i, φ_{c_{-i}}), and take
the average.
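Both quantities can be sketched as follows, assuming G_0 = Unif(0, 1); `theta_i_mle` stands in for the PCD point estimate θ̄_i, and the helper names are illustrative rather than the thesis code.

```python
import math
import random

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x (0 < x < 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm

def stripped_beta_loglik(theta_i, theta_i_mle, n):
    """Stripped Beta approximation to the likelihood of edge parameter
    theta_i given the rest: Beta(floor(n*mle)+1, n - floor(n*mle)+1)."""
    a = math.floor(n * theta_i_mle) + 1
    b = n - math.floor(n * theta_i_mle) + 1
    return log_beta_pdf(theta_i, a, b)

def mc_new_group_lik(theta_i_mle, n, g0_sampler, num_samples=100):
    """Monte Carlo approximation of the new-group integral in (7.4):
    draw theta_i ~ G0 and average the (approximate) likelihood."""
    draws = [g0_sampler() for _ in range(num_samples)]
    return sum(math.exp(stripped_beta_loglik(t, theta_i_mle, n))
               for t in draws) / num_samples
```

With G_0 = Unif(0, 1), `g0_sampler` can simply be `random.random`.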
Drawing Posterior Samples of φ|c
The stripped Beta approximation also allows us to draw posterior samples from φ|c approximately.
Suppose that there are k groups according to c, and that we have estimates for φ, denoted
φ̄ = (φ̄_1, ..., φ̄_k). We denote the numbers of elements in the k groups by m = {m_1, ..., m_k}. For
group i, we draw a posterior sample for φ_i from Beta(⌊m_i n φ̄_i⌋+1, m_i n − ⌊m_i n φ̄_i⌋+1).
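Assuming the estimates φ̄ come from PCD with parameters tied by the grouping, the draw is a one-liner (function name hypothetical):

```python
import math
import random

def sample_phi_posterior(phi_bar_i, m_i, n):
    """Approximate posterior draw for group i's parameter under the stripped
    Beta approximation: Beta(floor(m_i*n*phi_bar)+1, m_i*n - floor(...)+1).
    m_i is the group size, n the number of training samples, and phi_bar_i
    the point estimate for the group (e.g. from PCD with tied parameters)."""
    a = math.floor(m_i * n * phi_bar_i) + 1
    b = m_i * n - math.floor(m_i * n * phi_bar_i) + 1
    return random.betavariate(a, b)
```

As m_i·n grows, the Beta concentrates around φ̄_i, so larger groups get sharper posterior draws.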
7.4 Simulations
We investigate the performance of our Bayesian estimators on three models: (i) a tree-MRF, (ii)
a small grid-MRF whose likelihood is tractable, and (iii) a large grid-MRF whose likelihood is
intractable. We first set the ground truth of the parameters, and then generate training and testing
samples. On training data, we apply our grouping-aware Bayesian estimators and two baseline
estimators, namely a grouping-blind estimator and an oracle estimator. The grouping-blind esti-
mator does not know groups exist in the parameters, and estimates the parameters in the normal
MLE fashion. The oracle estimator knows the ground truth of the groupings, and ties the parame-
ters from the same group and estimates them via MLE. For the tree-MRF, our Bayesian estimator
is exact since the likelihood is tractable. For the small grid-MRF, we have three variations for the
Bayesian estimator, namely Gibbs sampling with exact likelihood computation, MH with auxil-
iary variables, and Gibbs sampling with stripped Beta approximation. For the large grid-MRF, the
computational burden only allows us to apply Gibbs sampling with stripped Beta approximation.
We compare the estimators by three measures. The first is the average absolute error of estimate,
(1/r) ∑_{i=1}^{r} |θ̂_i − θ_i|, where θ̂_i is the estimate of θ_i. The second measure is the log likelihood
of the testing data, or the log pseudo-likelihood [18] of the testing data when exact likelihood is
intractable. Thirdly, we evaluate how informative the grouping yielded by the Bayesian estimator
is. We use the variation of information metric [117] between the inferred grouping Ĉ and the
ground-truth grouping C, namely VI(Ĉ, C). Since VI(Ĉ, C) is sensitive to the number of groups
in Ĉ, we contrast it with VI(C̃, C), where C̃ is a random grouping with the same number of groups
as Ĉ. Eventually, we evaluate Ĉ via the VI difference, namely VI(C̃, C) − VI(Ĉ, C). A larger
VI difference indicates a more informative grouping yielded by our Bayesian estimator. Because
we obtain one grouping in each of the T MCMC steps, we average the VI difference over the T steps.
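The VI metric itself is straightforward to compute from the label contingency counts: VI(C1, C2) = H(C1) + H(C2) − 2·I(C1; C2). A minimal sketch:

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """Variation of information between two groupings of the same items,
    each given as a list of group labels (one label per item).
    VI = H(C1) + H(C2) - 2*I(C1; C2); it is 0 iff the groupings coincide
    up to relabeling."""
    n = len(c1)
    p1 = Counter(c1)          # marginal counts of grouping 1
    p2 = Counter(c2)          # marginal counts of grouping 2
    joint = Counter(zip(c1, c2))  # joint counts
    h1 = -sum(c / n * math.log(c / n) for c in p1.values())
    h2 = -sum(c / n * math.log(c / n) for c in p2.values())
    mi = sum(c / n * math.log((c / n) / ((p1[a] / n) * (p2[b] / n)))
             for (a, b), c in joint.items())
    return h1 + h2 - 2 * mi
```

The VI difference of the text is then `variation_of_information(c_random, c_true) - variation_of_information(c_inferred, c_true)`.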
Figure 7.1: Performance of the grouping-blind MLE, the oracle MLE and our Bayesian estimator on tree-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping.
7.4.1 Simulations on Tree-structure MRFs
For the structure of the MRF, we choose a perfect binary tree of height 12 (i.e. 8,191 nodes and
8,190 edges). We assume there are 25 groups among the 8,190 parameters. The base distribution
G0 is Unif(0, 1). We first generate the true parameters for the 25 groups from Unif(0, 1). We
then randomly assign each of the 8,190 parameters to one of the 25 groups. We then generate
1,000 testing samples and n training samples (n=100, 200, ..., 1,000). Eventually, we apply the
grouping-blind MLE, the oracle MLE, and our grouping-aware Bayesian estimator on the training
samples. For tree-structure MRFs, both MLE and Bayesian estimation have a closed form solu-
tion. For the Bayesian estimator, we set the number of Gibbs sampling steps to be 500 and set
α0=1.0. We replicate the experiment 500 times, and the averaged results are in Figure 7.1.
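The ground-truth construction for these simulations can be sketched as below; `make_ground_truth` is a hypothetical helper mirroring the description, with G_0 = Unif(0, 1).

```python
import random

def make_ground_truth(num_params, num_groups, seed=0):
    """Generate latently-grouped ground-truth parameters as in the tree-MRF
    simulation: draw one value per group from Unif(0, 1) (the base
    distribution G0), then assign each edge parameter to a random group."""
    rng = random.Random(seed)
    group_values = [rng.random() for _ in range(num_groups)]
    assignment = [rng.randrange(num_groups) for _ in range(num_params)]
    theta = [group_values[g] for g in assignment]
    return theta, assignment, group_values

# 8,190 edge parameters in 25 groups, as in Section 7.4.1
theta, c, phi = make_ground_truth(8190, 25)
```

The same helper covers the grid settings by changing the two counts (24 parameters in 5 groups, or 1,740 parameters in 10 groups).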
Our grouping-aware Bayesian estimator has a lower estimate error and a higher log likelihood
of test data, compared with the grouping-blind MLE, demonstrating the “blessing of abstraction”.
Our Bayesian estimator performs worse than oracle MLE, as we expect. In addition, as the training
sample size increases, the performance of our Bayesian estimator approaches that of the oracle
MLE. The VI difference in Figure 7.1(c) indicates that the Bayesian estimator also recovers the
latent grouping to some extent, and the inferred groupings become more and more reliable as the
training size increases. The number of groups inferred by the Bayesian estimator and its running
time are in Figure 7.2.
7.4.2 Simulations on Small Grid-MRFs
For the structure of the MRF, we choose a 4×4 grid with 16 nodes and 24 edges. Exact likeli-
hood is tractable in this small model, which allows us to investigate how good the two types of
approximation are. We apply the grouping-blind MLE (the PCD algorithm), the oracle MLE (the
PCD algorithm with the parameters from same group tied) and three Bayesian estimators: Gibbs
sampling with exact likelihood computation (Gibbs ExactL), Metropolis-Hastings with auxiliary
variables (MH AuxVar), and Gibbs sampling with stripped Beta approximation (Gibbs SBA). We
assume there are five parameter groups. The base distribution is Unif(0, 1). We first generate
Figure 7.2: Number of groups inferred by the Bayesian estimator and its run time.
Figure 7.3: The number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA.
the true parameters for the five groups from Unif(0, 1). We then randomly assign each of the
24 parameters to one of the five groups. We then generate 1,000 testing samples and n training
samples (n=100, 200, ..., 1,000). For Gibbs ExactL and Gibbs SBA, we set the number of Gibbs
sampling steps to be 100. For MH AuxVar, we set the number of MH steps to be 500 and its pro-
posal number M to be 5. The parameter σQ in Section 7.3.1 is set to be 0.001 and the parameter
S is set to be 100. For all three Bayesian estimators, we set α0=1.0. We replicate the experiment
50 times, and the averaged results are in Figure 7.4.
Our grouping-aware Bayesian estimators have a lower estimate error and a higher log like-
lihood of test data, compared with the grouping-blind MLE, demonstrating the blessing of ab-
straction. All three Bayesian estimators perform worse than oracle MLE, as we expect. The VI
difference in Figure 7.4(c) indicates that the Bayesian estimators also recover the grouping to
some extent, and the inferred groupings become more and more reliable as the training size in-
creases. In Figure 7.3, we provide the boxplots of the number of groups inferred by Gibbs ExactL,
MH AuxVar and Gibbs SBA. All three methods recover a reasonable number of groups, and
Gibbs SBA slightly over-estimates the number of groups.
Among the three Bayesian estimators, Gibbs ExactL has the lowest estimate error and the
highest log likelihood of test data. Gibbs SBA also performs quite well, with performance close
to that of Gibbs ExactL. MH AuxVar works slightly worse, especially when there is less training
data. However, MH AuxVar recovers better groupings than Gibbs SBA
Figure 7.4: Performance of grouping-blind MLE, oracle MLE, Gibbs ExactL, MH AuxVar, and Gibbs SBA on the small grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimators and random grouping.
              n=100      n=500      n=1,000
Gibbs ExactL  88,136.3   91,055.0   92,503.4
MH AuxVar     540.2      3,342.2    4,546.7
Gibbs SBA     8.1        10.8       14.2

Table 7.1: The run time (in seconds) of Gibbs ExactL, MH AuxVar and Gibbs SBA when training size is n.
when there are more training data. The run times of the three Bayesian estimators are listed in Ta-
ble 7.1. Gibbs ExactL has a computational complexity that is exponential in the dimensionality d,
and cannot be applied to situations when d > 20. MH AuxVar is also computationally intensive
because it has to generate expensive particles. Gibbs SBA runs fast, with its burden mainly from
running PCD under a specific grouping in each Gibbs sampling step, and it scales well.
7.4.3 Simulations on Large Grid-MRFs
The large grid consists of 30 rows and 30 columns (i.e. 900 nodes and 1,740 edges). Exact like-
lihood is intractable for this large model, and we cannot run Gibbs ExactL. The high dimension
also prohibits MH AuxVar. Therefore, we only run the Gibbs SBA algorithm on this large grid-
structure MRF. We assume that there are 10 groups among the 1,740 parameters. We also evaluate
the estimators by the log pseudo-likelihood of testing data. The other settings of the experiments
stay the same as Section 7.4.2. We replicate the experiment 50 times, and the averaged results are
in Figure 7.5.
For all 10 training sets, our Bayesian estimator Gibbs SBA has a lower estimate error and a
higher log likelihood of test data, compared with the grouping-blind MLE (via the PCD algorithm).
Gibbs SBA has a higher estimate error and a lower pseudo-likelihood of test data than the oracle
MLE. The VI difference in Figure 7.5(c) indicates that Gibbs SBA gradually recovers the grouping
as the training size increases. The number of groups inferred by Gibbs SBA and its running
time are provided in Figure 7.6. Similarly to the observation in Section 7.4.2, Gibbs SBA over-
estimates the number of groups. Gibbs SBA finishes the simulations on 900 nodes and 1,740
Figure 7.5: Performance of the grouping-blind MLE, the oracle MLE and the Bayesian estimator (Gibbs SBA) on large grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping.
Figure 7.6: Number of groups inferred by Gibbs SBA and its run time.
        LPL-train                 LPL-test                  # groups   Run time (mins)
        MLE        Gibbs SBA      MLE        Gibbs SBA
Exp 1   -10716.75  -10721.34      -9022.01   -8989.87       7.89       204
Exp 2   -8306.17   -8322.34       -11490.47  -11446.45      7.29       183

Table 7.2: Log pseudo-likelihood (LPL) of training and testing data from MLE (PCD) and Bayesian estimate (Gibbs SBA), the number of groups inferred by Gibbs SBA, and its run time in the Senate voting experiments.
edges in hundreds of minutes (depending on the training size), which is fast for a model of this size.
7.5 Real-world Application
We apply the Gibbs SBA algorithm on US Senate voting data from the 109th Congress (available
at www.senate.gov). The 109th Congress has two sessions, the first session in 2005 and the second
session in 2006. There are 366 votes and 278 votes in the two sessions, respectively. There are 100
senators in both sessions, but Senator Corzine only served the first session and Senator Menendez
only served the second session. We remove them. In total, we have 99 senators in our experiments,
and we treat the votes from the 99 senators as the 99 variables in the MRF. We only consider
contested votes; namely, we remove the votes with fewer than ten or more than ninety supporters.
In total, there are 292 votes and 221 votes left in the two sessions, respectively. The structure of
the MRF is from Figure 13 in [7]. There are in total 279 edges. The votes are coded as −1 for no
and 1 for yes. We replace all missing votes with −1, staying consistent with [7]. We perform two
experiments. First, we train the MRF using the first session data, and test on the second session
data. Then, we train on the second session and test on the first session. We compare our Bayesian
estimator (via Gibbs SBA) and MLE (via PCD) by the log pseudo-likelihood of testing data since
exact likelihood is intractable. We set the number of Gibbs sampling steps to 3,000. Both
experiments finished in around three hours on a single CPU. The results are summarized
in Table 7.2. In the first experiment, the log pseudo-likelihood of test data is−9022.01 from MLE,
whereas it is −8989.87 from our Bayesian estimate. In the second experiment, the log pseudo-likelihood
of test data is −11490.47 from MLE, whereas it is −11446.45 from our Bayesian
estimate. The increase of log pseudo-likelihood is comparable to the increase of log (pseudo-
)likelihood we gain in the simulations (please refer to Figures 7.1b, 7.4b and 7.5b at the points
when we simulate 200 and 300 training samples). Both experiments indicate that the models
trained with the Gibbs SBA algorithm generalize considerably better than the models trained with
MLE. Gibbs SBA also infers there are around eight different types of relations among the senators.
The estimated parameters in the two models are consistent.
7.6 Discussion
Bayesian nonparametric approaches [135, 65], such as the Dirichlet process [48], provide an ele-
gant way of modeling mixtures with an unknown number of components. These approaches have
yielded advances in different machine learning areas, such as the infinite Gaussian mixture models
[145], the infinite mixture of Gaussian processes [146], infinite HMMs [9, 54], infinite HMRFs
[29], DP-nonlinear models [161], DP-mixture GLMs [77], infinite SVMs [217, 216], and the in-
finite latent attribute models [137]. In this chapter, we play the same trick of replacing the prior
distribution with a prior stochastic process to accommodate our uncertainty about the number of
parameter groups. To the best of our knowledge, this is the first time a Bayesian nonparamet-
ric approach is applied to models whose likelihood is intractable. Accordingly, we propose two
types of approximation, namely a Metropolis-Hastings algorithm with auxiliary variables and a
Gibbs sampling algorithm with stripped Beta approximation. Both algorithms show superior per-
formance over conventional MLE, and Gibbs SBA can also scale well to large-scale MRFs. The
Markov chains in both algorithms are ergodic, but may not satisfy detailed balance because we rely
on approximation. Thus, both algorithms are guaranteed to converge for general MRFs, but they
may not converge exactly to the target distribution.
In this chapter, we only consider the situation where the potential functions are pairwise and
there is only one parameter in each potential function. For graphical models with more than
one parameter in the potential functions, it is appropriate to group the parameters on the level
of potential functions. A more sophisticated base distribution G0 (such as some multivariate
distribution) needs to be considered. In this chapter, we also assume the structures of the MRFs
are given. When the structures are unknown, we still need to perform structure learning. Allowing
structure learners to automatically identify structure modules will be another very interesting topic
to explore in future research.
The material in this chapter first appeared in the Advances in Neural Information Processing
Systems (NIPS’2013) as follows:
Jie Liu and David Page. Bayesian Estimation of Latently-grouped Parameters in Undirected
Graphical Models. Advances in Neural Information Processing Systems (NIPS), 2013.
Chapters 3, 4 and 5 focus on the statistical inference aspect of GWAS with the help of graphical
models. Chapters 6 and 7 further discuss issues related to learning graphical models. The next
chapter shifts gears to the application aspect of GWAS, namely the clinical translation of GWAS
discoveries to personalized breast cancer diagnosis.
Chapter 8
Genetic Variants Improve Personalized
Breast Cancer Diagnosis
Recently, a number of genome-wide association studies have identified genetic variants associated
with breast cancer. However, the degree to which these genetic variants improve breast cancer
diagnosis in concert with mammography remains unknown. We conducted a retrospective case-
control study, collecting mammographic findings and high-frequency/low-penetrance genetic vari-
ants from an existing personalized medicine data repository. A Bayesian network was developed
on the mammographic findings, with and without the genetic variants collected. We analyzed the
predictive performance using the area under the ROC curve, and found that the genetic variants
significantly improved breast cancer diagnosis on mammograms.
8.1 Introduction
Large multi-relational databases containing variables that confer disease risk are increasingly
available, providing the opportunity for informatics tools to better stratify individuals for
appropriate healthcare decisions and to explore disease mechanism and behavior. Coincident with
this, policy-makers have recommended that interventions, like breast cancer screening with
mammography, be increasingly based on individualized risk and shared decision-making [132, 158].
Targeting at-risk individuals for intervention after mammographic screening has the potential to decrease
recommendations for breast biopsy in women most likely to have an unnecessary procedure for
benign findings. Recent large-scale genome-wide association studies have identified 77 suscepti-
bility loci associated with breast cancer. In addition, there is a long history of development and
codification of features observed by radiologists on mammography that also predict a woman’s
risk of breast cancer. However, genetics and mammography abnormality findings have not yet
been used together to predict risk. Furthermore, the opportunity to use this data to interpret geno-
type/phenotype association, explain family aggregation of breast cancer, and shed light on disease
mechanism or natural history is just becoming possible.
There have been several attempts to incorporate these genetic variants into the Gail model [57]
which is a standard clinical breast cancer risk model including the number of first-degree relatives
with a diagnosis of breast cancer, age at menarche, age at first live birth and the number of previous
breast biopsies. Seven associated SNPs, when added to the Gail model, increase the area under the
receiver operating characteristic (ROC) curve from 0.607 to 0.632 [55, 56]. When ten associated
SNPs are added to the Gail model, the area under the ROC curve of the risk model increases
from 0.580 to 0.618 on another dataset [186]. However, the Gail model does not include any
mammography features which are clinically used by radiologists. Therefore, it is still unknown
how much these genetic variants improve breast cancer diagnosis and clinical decision-making
after an abnormal mammogram.
The main purpose of this chapter is to examine the impact of genetic information on im-
proving breast cancer risk prediction on mammograms. We incorporate genetic polymorphisms
with the descriptors that radiologists observe on mammograms while making medical decisions,
the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) [1],
version 4, including the shape and the margin of masses, the shape and the distribution of micro-
calcifications, background breast density and other associated findings as defined by this standard
lexicon in breast imaging. We also include a small number of predictive variables not included in
BI-RADS currently. Specifically, we employ these mammographic findings (49 mammography
descriptors) and the 77 genetic variants associated with breast cancer in a personalized medicine
data repository at the Marshfield Clinic. We train a Bayesian network on the mammographic
findings, with and without the 77 genetic variants.
8.2 Materials and Methods
8.2.1 Data
Subjects
The Personalized Medicine Research Project [115] at the Marshfield Clinic was used as the sam-
pling frame to identify breast cancer cases and controls. The project was reviewed and approved
by the Marshfield Clinic IRB. Subjects were selected using clinical data from Marshfield Clinic
Cancer Registry and Data Warehouse. We employed a retrospective case-control design. Women
with a plasma sample available, a mammogram, and a breast biopsy within 12 months after the
mammogram were included in the study. Cases were defined as women having a confirmed di-
agnosis of breast cancer obtained from the institutional cancer registry. Controls were confirmed
through the electronic medical records (and absence from the cancer registry) as never having had
a breast cancer diagnosis. In our case cohort, we included both invasive breast cancer (ductal
and lobular) as well as ductal carcinoma in situ. In order to construct case and control cohorts
that were similar in age distribution, we employed an age matching strategy. Specifically, we se-
lected a control whose age was within five years of the age of each case. Of note, we decided to
focus on high-frequency/low-penetrance genes that affect breast cancer risk as opposed to low fre-
quency genes with high penetrance (BRCA1 and BRCA2) or intermediate penetrance (CHEK-2).
High-frequency/low-penetrance SNPs generally have frequencies for the rarest allele of > 25%, as
opposed to the low-frequency, high-penetrance mutations with population frequencies of < 1%.
We excluded individuals who had a known high penetrance genetic mutation.
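The age-matching step can be sketched as a greedy within-five-years match. The function and field names here are illustrative, not the actual study code.

```python
def age_match_controls(cases, candidates, max_diff=5):
    """Greedy age matching: for each case, pick one unused control candidate
    whose age is within max_diff years (five years, as in our design).
    cases / candidates: lists of (subject_id, age) tuples.
    Returns a list of (case_id, control_id) pairs; a case with no eligible
    candidate is left unmatched."""
    used = set()
    pairs = []
    for case_id, case_age in cases:
        for ctrl_id, ctrl_age in candidates:
            if ctrl_id not in used and abs(ctrl_age - case_age) <= max_diff:
                used.add(ctrl_id)
                pairs.append((case_id, ctrl_id))
                break
    return pairs
```

Greedy matching is a simple choice; optimal matching would instead minimize total age discrepancy over all pairs.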
Genetic Variants
Our study included 77 genetic variants which have been identified by recent large-scale genome-
wide association studies. Table 1 in [106] summarizes detailed information about the 77 SNPs,
including the IDs, the original publications associating them with breast cancer and their chro-
mosomes. The seven SNPs used in Gail study [55, 56] were also included in our study. Nine of
the ten SNPs used in Wacholder et al study [186] were included in our study, and the remaining
SNP rs7716600 from that study had a proxy rs10941679 in our study. We observed that each SNP
only confers a slight increase or decrease in the risk of breast cancer, in accordance with prior
literature. Among the 77 SNPs, 22 were evaluated in the previous study [105]. Among the 55
new SNPs, 41 were identified by COGS [120], and 14 SNPs were included based on several other
recent studies [193, 165, 162, 68, 50, 180, 2]. It is estimated that the current list of SNPs explains
14% of familial breast cancer risk [120].
Mammography Features
The American College of Radiology developed the BI-RADS lexicon [1] to homogenize mammo-
graphic findings and recommendations. The BI-RADS lexicon consists of a number of mammog-
raphy descriptors, including the characteristics of masses and microcalcifications, background
breast density and other associated findings, which can be organized in a hierarchy as shown
in Figure 8.1. Datasets containing mammography descriptors have been used to build several
successful breast cancer risk models and classifiers [6, 22]. Mammography data was originally
recorded as free text reports in the Marshfield database, and thus it was difficult to directly access
the information contained therein. We used a parser to extract mammography features from the
text reports; the parser has been shown to outperform manual extraction [130, 140]. After ex-
traction, every mammography feature takes the value “present” or “not present” except that the
variable mass size is discretized into three values, “not present”, “small” and “large”, depending
on whether there is a reported mass size and whether any dimension of the reported mass size is
larger than 30 mm.
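The discretization rule can be sketched as follows (helper name hypothetical; dimensions assumed to come from the report parser in millimeters):

```python
def discretize_mass_size(dims_mm):
    """Map a parsed mass size to {"not present", "small", "large"}:
    "large" if any reported dimension exceeds 30 mm, "small" otherwise.
    dims_mm is a list of dimensions in millimeters, or None/empty when
    no mass size was reported."""
    if not dims_mm:
        return "not present"
    return "large" if max(dims_mm) > 30 else "small"
```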
Figure 8.1: Mammography features adopted from the American College of Radiology (BI-RADS lexicon). Asterisks mark predictive features not included in BI-RADS.
143
Each mammogram also has a BI-RADS category assigned by the radiologist who read the
mammogram. The BI-RADS category indicates the radiologist’s opinion of the absence or pres-
ence of breast cancer. In our study, the BI-RADS assessment category can take values, with an
order of increasing probability of malignancy, of 1, 2, 3, 0, 4a, 4, 4b, 4c and 5. We used the
BI-RADS assessment category as the predictions from the radiologists. Our experiment only in-
cluded diagnostic mammograms, and all the screening mammograms were excluded. Since most
of the subjects have multiple diagnostic mammograms in the electronic medical records, we se-
lected one mammogram for each subject as follows, to mimic the scenario of the most important
doctor visit before diagnosis. For cases, we selected the mammograms within one year prior to di-
agnosis. For controls, we selected the mammograms within one year prior to biopsy. If there were
still multiple mammograms left for each subject, we selected the mammogram with a more suspi-
cious BI-RADS category, with subsequent tiebreakers being, in order, recency and the number of
extracted mammography features.
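The per-subject selection rule above is a lexicographic maximum and can be sketched as follows. The tuple record layout and ISO-format date strings are illustrative assumptions, not the actual data schema:

```python
# Suspicion ranks follow the ordering 1, 2, 3, 0, 4a, 4, 4b, 4c, 5
BIRADS_RANK = {"1": 0, "2": 1, "3": 2, "0": 3, "4a": 4, "4": 5, "4b": 6, "4c": 7, "5": 8}

def select_mammogram(records):
    """Pick one mammogram per subject: most suspicious BI-RADS category,
    then most recent date, then most extracted features.

    Each record is a (birads, iso_date, n_features) tuple; ISO dates
    compare correctly as strings."""
    return max(records, key=lambda r: (BIRADS_RANK[r[0]], r[1], r[2]))
```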
8.2.2 Model
We built breast cancer diagnosis models using Naive Bayes, which can be regarded as the weighted
average of risk factors. Naive Bayes assumes that all features are conditionally independent of one
another given the class [111]. Although this assumption seems strong, it generally works well in
practical problems and provides easy interpretation of the risk contribution from different factors.
In our experiments, we used the Naive Bayes implementation in WEKA [75].
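As a minimal sketch of this model class (not the WEKA implementation used in our experiments), a Bernoulli Naive Bayes over binary present/not-present features can be written as:

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes with Laplace smoothing.

    X: (n, d) binary matrix (1 = feature present); y: (n,) binary labels."""
    priors, likelihoods = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        # Smoothed P(feature present | class c), one value per feature
        likelihoods[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return priors, likelihoods

def predict_proba(priors, likelihoods, x):
    """Posterior P(class = 1 | x): each feature contributes an additive
    log-likelihood term, reflecting the conditional-independence assumption."""
    logp = {}
    for c in (0, 1):
        p = likelihoods[c]
        logp[c] = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    m = max(logp.values())
    z = {c: np.exp(v - m) for c, v in logp.items()}  # stable normalization
    return z[1] / (z[0] + z[1])
```

The additive log-likelihood terms are what makes the per-factor risk contributions easy to read off.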
In total, we constructed three types of models on different sets of features. The first model was
built purely on the 49 mammography features, namely the Breast Imaging model. The second type
of model was based purely on genetic variants, namely the genetic models. Since we would like
to align our study with previous work [105], we tested three sets of genetic variants. The first set
consisted of the 10 SNPs in [186]. The second included the 22 SNPs in the study [105]. The last
set was our full list of the 77 SNPs. We denote the three genetic models as Genetic-10, Genetic-22
and Genetic-77 models. The third type of model was built on the 49 mammography features and
144
the genetic variants together, namely the combined models. Since we had three sets of genetic
variants with different sizes, we had three combined models, namely Combined-10, Combined-22
and Combined-77 models. In both the genetic models and the combined models, rather than using
the original genotype of each SNP, we introduced a single additional variable: the total count of
risk alleles the person carries. This way of coding genetic variants was used in several previous
models [186], and is helpful for building risk models when each SNP contributes only a small
amount to the overall risk.
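The risk-allele count feature can be sketched as below; the dictionary layout for genotypes is an illustrative assumption:

```python
def risk_allele_count(genotype, risk_allele):
    """Total count of risk alleles a person carries across SNPs.

    genotype: dict mapping SNP id -> (allele, allele) pair;
    risk_allele: dict mapping SNP id -> the risk allele for that SNP."""
    return sum(pair.count(risk_allele[snp]) for snp, pair in genotype.items())
```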
We treated the BI-RADS category scores as the radiologists' predictions, namely the baseline
clinical assessment. We constructed ROC curves for each model and used the area under the
curve (AUC) as a measure of performance. We also provide precision-recall (PR) curves for the
models. We evaluated the models using 10-fold cross-validation.
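AUC can be computed directly from model scores via its rank interpretation, as in the sketch below (our experiments used standard tooling rather than this hand-rolled version):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random case outscores a random control
    (the Mann-Whitney statistic divided by n_pos * n_neg); ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```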
8.3 Results
We identified 362 cases and 377 controls. Among the cases, there were 358 Caucasians, three
non-Caucasians and one case whose race information was unknown. Among the controls, there
were 373 Caucasians and four non-Caucasians. We do not disclose the race/ethnicity of these
non-Caucasians due to privacy concerns. Subject characteristics, including age distribution
and family history of breast cancer, are described in Table 8.1. There were more young subjects
(age < 50) in the case group than in the control group, and the proportion of elderly subjects (age
≥ 65) was roughly the same in the two groups. For family history of breast cancer, we observed a
considerably larger proportion of subjects with family history in the case group (45.3%) than in
the control group (33.7%), demonstrating the familial aggregation of breast cancer.
AGE             CASES         CONTROLS      ALL
< 50            81 (22.4%)    58 (15.4%)    139 (18.8%)
≥ 50, < 65      123 (34.0%)   168 (44.6%)   291 (39.4%)
≥ 65            158 (43.6%)   151 (40.0%)   309 (41.8%)

FAMILY HISTORY  CASES         CONTROLS      ALL
YES             164 (45.3%)   127 (33.7%)   291 (39.4%)
NO              188 (51.9%)   236 (62.6%)   424 (57.4%)
UNKNOWN         10 (2.8%)     14 (3.7%)     24 (3.2%)

Table 8.1: The distribution of age at mammogram and family history of breast cancer in the cases and the controls.
Figure 8.2: The ROC curves and PR curves for the baseline clinical assessment, the Breast Imaging model and the three combined models. [Plots omitted. ROC AUCs: Combined-77 (0.760), Combined-22 (0.733), Combined-10 (0.712), Breast Imaging model (0.693). PR AUCs: Combined-77 (0.775), Combined-22 (0.754), Combined-10 (0.739), Breast Imaging model (0.730).]
8.3.1 Performance of Combined Models
The ROC and the PR curves for the baseline clinical assessment, the Breast Imaging model and
the three combined models are provided in Figure 8.2. For each model, we vertically average [47]
the ROC curves from the ten replications of the 10-fold cross-validation to obtain the final curve;
we do likewise for the PR curves. The area under the ROC curves for the Breast Imaging model,
Figure 8.3: The ROC and PR curves for the three genetic models. [Plots omitted. ROC AUCs: Genetic-77 (0.684), Genetic-22 (0.622), Genetic-10 (0.591). PR AUCs: Genetic-77 (0.668), Genetic-22 (0.613), Genetic-10 (0.578).]
the Combined-10 model, the Combined-22 model and the Combined-77 model are 0.693, 0.712,
0.733 and 0.760. The ROC curve of the Combined-77 model almost completely dominates the
ROC curve of the Breast Imaging model, which suggests that the 77 genetic variants can help to
improve breast cancer diagnosis based on mammographic findings. We perform a two-sided paired
t-test on the area under the ten ROC curves of the Breast Imaging model and the area under the ten
ROC curves of the combined model from the 10-fold cross-validation, and the difference between
them is significant with a P-value 0.00047. We further compare the AUROC of the Combined-
77 model and the Combined-22 model with a two-sided paired t-test, and the difference between
them is significant with a P-value 0.0046, which demonstrates the discriminative power of the
55 recently identified SNPs. From the PR curves, we note that the combined models dominate the
Breast Imaging model and the baseline clinical assessment in the high-recall region (recall > 0.8),
the region in which clinicians operate and which we therefore want to optimize.
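The fold-level significance comparisons above can be sketched as follows: the t statistic is computed on the paired per-fold AUC differences and compared against a t distribution with n-1 degrees of freedom (in practice one would use a library routine such as scipy.stats.ttest_rel, which also returns the P-value):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_stat(auc_a, auc_b):
    """t statistic for a two-sided paired t-test on per-fold AUC differences."""
    d = [a - b for a, b in zip(auc_a, auc_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```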
Figure 8.4: The ROC curves and PR curves for the Breast Imaging model, the Genetic-77 model and the Combined-77 model. [Plots omitted. ROC AUCs: Combined-77 (0.760), Breast Imaging model (0.693), Genetic-77 (0.684). PR AUCs: Combined-77 (0.775), Breast Imaging model (0.730), Genetic-77 (0.668).]
8.3.2 Performance of Genetic Models
Furthermore, we compare the discriminative power of the three genetic models, namely the Genetic-
10 model, the Genetic-22 model and the Genetic-77 model. The ROC curves and the PR curves
for the three genetic models are provided in Figure 8.3. For each model, we vertically average
the curves from the 10-fold cross-validation to obtain the final curve. The area under
the ROC curves for the Genetic-10 model, the Genetic-22 model and the Genetic-77 model are
0.591, 0.622 and 0.684, which demonstrates that the more associated SNPs the genetic model
includes, the more discriminative the model becomes. We also use a two-sided paired t-test to
compare the area under the ROC curves yielded by the three genetic models. The Genetic-77
model outperforms both the Genetic-22 model (P=0.028) and the Genetic-10 model (P=0.0068).
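The vertical averaging used for these curves [47] interpolates each fold's curve onto a common FPR grid and averages TPR pointwise; a minimal sketch:

```python
import numpy as np

def vertical_average(curves, n_grid=101):
    """Average ROC curves vertically: interpolate each (fpr, tpr) curve onto
    a shared FPR grid and take the pointwise mean of the TPR values."""
    grid = np.linspace(0.0, 1.0, n_grid)
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    return grid, np.mean(tprs, axis=0)
```

The area under the averaged curve can then be taken with `np.trapz(mean_tpr, grid)`.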
8.3.3 Comparing Breast Imaging Model and Genetic Model
We compare the performance of the Breast Imaging model, the Genetic-77 model and the Combined-
77 model. The corresponding ROC curves and the PR curves for the three models are shown in
Figure 8.4. We observe that the mammography features are more predictive for women with a
high probability of cancer (low FPR region in ROC space) whereas genetic variants are more pre-
dictive for women with a low probability of cancer (mid/high FPR region in ROC space). Note
that the Genetic-77 model describes the patient's inherited breast cancer risk encoded in DNA. However,
after the patient starts developing malignant features on mammograms, mammographic findings
(Breast Imaging model) provide superior discrimination. Still, knowing the genetic information
can further improve the accuracy of breast cancer diagnosis even at higher baseline risk.
8.4 Discussion
The primary contribution of this chapter is to show that genetic variants can significantly
improve breast cancer diagnosis based on mammographic findings, resulting in fewer false positives
and a reduced risk of overdiagnosis. This result indicates promise for translating discoveries from
massive collaborative GWAS into clinical breast cancer diagnosis. Our study includes the most up-
to-date breast cancer associated SNPs, the majority identified and/or verified through the massive
COGS (over 55k cases and over 54k controls), and therefore these new SNPs are credible and can
explain a larger proportion of familial breast cancer risk. Indeed, we observe that the Combined-
77 model significantly outperforms the Combined-22 model used in our previous study [105].
We also demonstrate that the Genetic-77 model significantly outperforms the Genetic-22 model.
The increased discriminative power derived from the 55 new SNPs identified by recently published
studies [186] highlights the rapid progress the breast cancer GWAS community has made since
2010. Furthermore, we make a novel discovery that mammography features are more predictive
for high-risk women whereas genetic variants are more predictive for low-risk women, which
explains the benefit of combining genetic variants and mammographic findings for personalized
breast cancer diagnosis.
Our study differs from the previous study of Wacholder et al. (2010) [186], which added ten
genetic variants to the Gail model, a risk model based on self-reported demographic and
personal risk factors. The unique contribution of our study is that we include mammography
features which represent richer phenotypic data directly relevant to breast cancer diagnosis and
thus provide high signal. Therefore, our study contributes the potential clinical impact of trans-
lating exciting discoveries from GWAS to the patient experience at diagnosis. The additional
discriminative power from these genetic variants can significantly rule out the false positives of
mammogram screening, and therefore has the potential to decrease recommendations for unneces-
sary breast biopsies. Of course, it will be interesting to combine the epidemiology features in Gail
model, the mammography features and the SNPs for more accurate personalized breast cancer
diagnosis.
Limitations of our study include small sample size and the pitfalls of data extraction from text
reports. We understand that parsing mammography features from text reports may introduce noise
into the data. However, despite the challenges inherent in extracting accurate data, which may
affect our results, we are encouraged that improvements in predictive accuracy remain, especially
after observing the discriminative power of genetic factors alone in the genetic models. Further-
more, we recognize that methodological issues in our study may represent shortcomings but also
signify opportunities for future investigation. First, we do not explicitly model how individual
SNPs function to alter breast cancer risk, nor do we model potential SNP interactions [181]. Our
current model only adds one extra feature which simply counts the total number of risk alleles,
assuming that the effect sizes of the genetic variants are the same and that their effects are
non-mechanistic and additive. We do not model the individual SNPs because of curse-of-dimensionality
concerns; each individual SNP confers only a fairly mild relative risk, and if we modeled them
individually, the model would perform poorly on test data unless a larger cohort of training
data were available. Modeling SNP-SNP interactions is even harder and requires more
training data.
Second, we do not differentiate the different subtypes of breast cancers (for example, the
estrogen-receptor status and progesterone-receptor status) in the current study. Breast cancer is
a complex and heterogeneous disease with different subtypes, including two main subtypes of
estrogen receptor (ER) negative tumors (basal-like and human epidermal growth factor receptor-2
positive/ER- subtype) and at least two types of ER positive tumors (luminal A and luminal B)
[24, 141]. These molecular subtypes are important predictors of breast cancer mortality [79] and
have different genetic susceptibility [59]. Therefore it is desirable to tease them apart in the pursuit
of increasingly personalized breast cancer care.
Nevertheless, we are encouraged by these promising results in our current study, especially
after the disappointment [70] and caution [96] in the early years of translating GWAS discoveries
to personalized risk prediction. We hope that the rapid progress being made through these massive
collaborative studies together with our growing knowledge about breast cancer mechanisms and
genotype-phenotype relationships will bring us even closer to practical personalized breast
cancer diagnosis and treatment.
The material in this chapter first appeared in AMIA’2013 and AMIA-TBI’2014 as follows:
Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy Peissig, Catherine McCarty, Ade-
dayo A. Onitilo and Elizabeth Burnside. Genetic Variants Improve Breast Cancer Risk Prediction
on Mammograms. American Medical Informatics Association Symposium (AMIA), 2013.
Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Adedayo A. Onitilo, Amy Trentham-
Dietz and Elizabeth Burnside. New Genetic Variants Improve Personalized Breast Cancer Diag-
nosis. AMIA Summit on Translational Bioinformatics (AMIA-TBI), 2014.
Chapter 9
Future Work
This thesis develops statistical and probabilistic methods designed for genome-wide genetic
variation data (single-nucleotide polymorphisms); these methods will become insufficient for
forthcoming next-generation genomics data due to their lack of scalability and inability to
handle heterogeneous structured data from multiple sources. Therefore, it is urgent to
extend these methods in synchronization with the rapid development of biotechnologies. There
are three important directions for future research on the next generation genomics data, including
integration, probabilistic modeling and statistical inference.
First, there is an emerging problem related to big data analysis: integration methods for multi-
source, multi-assay next generation genomics data. Nowadays, biotechnology is moving forward
much faster than our capacity to process and understand the data it generates. On one hand, the
volume of the generated data keeps increasing, worsening the data-rich/information-poor
dilemma. On the other hand, these new technologies bring in new types of genomics data from
new perspectives, making it extremely challenging to analyze the data jointly and coherently. For
example, the next generation sequencing technologies will assay a host of meta-genomic features,
including DNA methylation, nucleosome position, binding of transcription factors to genomic
DNA, histone modifications and the 3D structure of the DNA. On the transcriptomics and pro-
teomics levels, expression data such as RNA-seq data and mass spectrometry data, respectively,
provide more comprehensive molecular portraits of cells. How to analyze the biological data from
all these different types of data, in isolation and in combination, to better understand basic molec-
ular biology will be extremely important and exciting. One key step is to integrate the information
from individual types of assays, not only producing a representation containing all important infor-
mation from the original individual platforms, but also preserving the inter-platform information
and facilitating downstream analysis. This area is new and extremely important for analyzing next
generation panomic data at the right granularity.
Second, it is desirable to develop machine learning methods, especially probabilistic methods
that are useful for modeling heterogeneous, hierarchical and dynamic information within struc-
tured data. Graphical models are powerful and elegant tools for modeling joint probabilities by
compactly encoding the structured dependencies among random variables, and have been one of the
most popular areas in machine learning over the last 20 years. However, there are a few new
challenges in learning graphical models from genomics, transcriptomics and proteomics data (even
after the data have been integrated), such as (1) capturing heterogeneity in the data from different
cell types, different types of assays, and different species; (2) adaptively recovering the
hierarchical structure embedded within genomics, transcriptomics and proteomics data; (3) modeling
dynamic information such as time series of cellular processes. In order to deal with heterogeneity and
hierarchical structures, Bayesian techniques and nonparametric approaches can be used to strike
a balance between richness and simplicity of the models, while maintaining the interpretability
of the resulting graphical models. For capturing the dynamic information in the data, we can use
time series models, continuous time models and their nonparametric variations.
Last but not least, it is desirable to continue with the work in this thesis on large-scale sta-
tistical inference (a.k.a. multiple testing), especially the unsolved challenges such as dependence
among the hypotheses and massive scale hypotheses testing. Multiple testing has emerged as one
of the most active research areas in statistics over the last 15 years, currently contributing about
8% of the articles in the leading methodological statistics journals [10]. We still need multiple
testing in the era of big data. In practice, hypothesis testing remains the standard way for biol-
ogists and geneticists to report their scientific discoveries, and multiple testing (such as control
of the false discovery rate) is still one important tool to quantify the credibility of the discoveries
from genomics and proteomics data. However, more and more genomics and proteomics problems
involve moderate or strong dependencies among the hypotheses, due to the correlation structure
in the biological data. This dependence has been ignored or under-utilized historically. Therefore,
multiple testing under dependence is and will continue to be the most important direction in mul-
tiple testing research, and improving the power of testing via leveraging dependencies will be very
important to make discoveries from next generation genomics and proteomics data. In addition,
emerging genomics and proteomics data also present new types of heterogeneity and latent struc-
ture. Furthermore, the area of large-scale inference still contains many open questions of great
theoretical interest, such as optimality and consistency of multiple testing procedures.
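As one concrete example of such a tool, the Benjamini-Hochberg step-up procedure [12] controls the false discovery rate under independence (and under certain positive dependence [16]); a minimal sketch:

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest i
    with p_(i) <= i * q / m. Returns the set of rejected hypothesis indices."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    return set(order[:k])
```

Procedures of this kind treat the hypotheses exchangeably; the dependence-aware methods developed in this thesis aim to improve on exactly that limitation.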
All three proposed directions will prominently enhance the linkage between genomics and
proteomics on the one hand and the field of big data analytics on the other. With the advent of
the next generation meta-genomic and meta-proteomic data, we will no longer be able to ana-
lyze the raw data directly without a further step of integration. The intrinsic heterogeneity and
ever-growing complex structures within data make probabilistic modeling and statistical inference
increasingly more difficult. These methodology changes and analytic challenges are likely to hap-
pen in other big data analytic problems such as social networks and business enterprise networks.
I hope that the advances achieved in the future work can further motivate and stimulate the related
research topics in the general big data analytic areas.
Bibliography
[1] American College of Radiology, BI-RADS Committee. Breast imaging reporting and data system. American
College of Radiology, 1998.
[2] Antonis C Antoniou, Xianshu Wang, Zachary S Fredericksen, Lesley McGuffog, Robert Tarrell, Olga M Sinil-
nikova, Sue Healey, Jonathan Morrison, Christiana Kartsonaki, Timothy Lesnick, et al. A locus on 19p13
modifies risk of breast cancer in brca1 mutation carriers and is associated with hormone receptor-negative breast
cancer in the general population. Nature genetics, 42(10):885–892, 2010.
[3] Peter Armitage. Tests for linear trends in proportions and frequencies. Biometrics, 11:375–386, 1955.
[4] Arthur U. Asuncion, Qiang Liu, Alexander T. Ihler, and Padhraic Smyth. Particle filtered MCMC-MLE with
connections to contrastive divergence. In ICML, 2010.
[5] Arthur U. Asuncion, Qiang Liu, Alexander T. Ihler, and Padhraic Smyth. Particle filtered MCMC-MLE with
connections to contrastive divergence. In ICML, 2010.
[6] Jay A Baker, Phyllis J Kornguth, Joseph Y Lo, Margaret E Williford, and Carey E Floyd. Breast cancer:
prediction with artificial neural network based on bi-rads standardized lexicon. Radiology, 196(3):817–822,
1995.
[7] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. Model selection through sparse maximum
likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
[8] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. ANN MATH STAT, 41(1):164–171, 1970.
[9] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In NIPS,
2002.
[10] Yoav Benjamini. Simultaneous and selective inference: current successes and future challenges. Biometrical
Journal, 52(6):708–721, 2010.
[11] Yoav Benjamini and Ruth Heller. False discovery rates for spatial signals. Journal of the American Statistical
Association, 102:1272–1281, 2007.
[12] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach
to multiple testing. Journal of The Royal Statistical Society Series B-Statistical Methodology, 57(1):289–300,
1995.
[13] Yoav Benjamini and Yosef Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of
Statistics, 24:407–418, 1997.
[14] Yoav Benjamini and Yosef Hochberg. On the adaptive control of the false discovery rate in multiple testing with
independent statistics. Journal of Educational and Behavioral Statistics, 25(1):60–83, 2000.
[15] Yoav Benjamini, Abba M. Krieger, and Daniel Yekutieli. Adaptive linear step-up procedures that control the
false discovery rate. Biometrika, 93:491–507, 2006.
[16] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under depen-
dency. Annals of Statistics, 29:1165–1188, 2001.
[17] Berit Bernert, Helena Porsch, and Paraskevi Heldin. Hyaluronan synthase 2 (HAS2) promotes breast cancer
cell invasion by suppression of tissue metalloproteinase inhibitor 1 (TIMP-1). J BIOL CHEM, 286(49):42349–
42359, 2011.
[18] Julian Besag. Statistical analysis of non-lattice data. JRSS-D, 24(3):179–195, 1975.
[19] Gilles Blanchard and Etienne Roquain. Adaptive false discovery rate control under independence and depen-
dence. J MACH LEARN RES, 10:2837–2871, December 2009.
[20] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[21] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE
Trans. Pattern Anal. Mach. Intell., 23(11):1222–1239, November 2001.
[22] Elizabeth S. Burnside, Jesse Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M. Geller,
Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. A probabilistic computer
model developed from clinical data in the national mammography database format to classify mammographic
findings. Radiology, 251:663–672, 2009.
156
[23] Emmanuel Candes and Terence Tao. Rejoinder: The Dantzig selector: statistical estimation when p is much
larger than n, 2007.
[24] Lisa A Carey, Charles M Perou, Chad A Livasy, Lynn G Dressler, David Cowan, Kathleen Conway, Gamze
Karaca, Melissa A Troester, Chiu Kit Tse, Sharon Edmiston, et al. Race, breast cancer subtypes, and survival in
the carolina breast cancer study. Jama, 295(21):2492–2502, 2006.
[25] Miguel A. Carreira-Perpinan and Geoffrey E. Hinton. On contrastive divergence learning. In AISTATS, 2005.
[26] Gilles Celeux, Florence Forbes, and Nathalie Peyrard. EM procedures using mean field-like approximations for
Markov model-based image segmentation. Pattern Recognition, 36:131–144, 2003.
[27] J. M. Chapman, J. D. Cooper, J. A. Todd, and D. G. Clayton. Detecting disease associations due to linkage
disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity,
56:18–31, 2003.
[28] Sotirio P. Chatzis and Theodora A. Varvarigou. A fuzzy clustering approach toward hidden Markov random field
models for enhanced spatially constrained image segmentation. IEEE Transactions on Fuzzy Systems, 16:1351
– 1361, 2008.
[29] Sotirios P. Chatzis and Gabriel Tsechpenakis. The infinite hidden Markov random field model. In ICCV, 2009.
[30] James M Cheverud. A simple correction for multiple comparisons in interval mapping genome scans. Heredity,
87:52–58, 2001.
[31] William G. Cochran. Some methods for strengthening the common chi-square tests. Biometrics, 10:417–451,
1954.
[32] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden
Markov models. IEEE T SIGNAL PROCES, 46(4):886–902, April 1998.
[33] Douglas F. Easton, Karen A. Pooley, Alison M. Dunning, Paul D. P. Pharoah, Deborah Thompson, Dennis G.
Ballinger, Jeffery P. Struewing, Jonathan Morrison, Helen Field, Robert Luben, Nicholas Wareham, Shahana
Ahmed, Catherine S. Healey, Richard Bowman, Kerstin B. Meyer, Christopher A. Haiman, Laurence K. Kolonel,
Brian E. Henderson, Loic Le Marchand, Paul Brennan, Suleeporn Sangrajrang, Valerie Gaborieau, Fabrice Ode-
frey, Chen-Yang Shen, Pei-Ei Wu, Hui-Chun Wang, Diana Eccles, Gareth D. Evans, Julian Peto, Olivia Fletcher,
Nichola Johnson, Sheila Seal, Michael R. Stratton, Nazneen Rahman, Georgia Chenevix-Trench, Stig E. Bo-
jesen, Børge G. Nordestgaard, Christen K. Axelsson, Montserrat Garcia-Closas, Louise Brinton, Stephen
Chanock, Jolanta Lissowska, Beata Peplonska, Heli Nevanlinna, Rainer Fagerholm, Hannaleena Eerola, Dae-
hee Kang, Keun-Young Yoo, Dong-Young Noh, Sei-Hyun Ahn, David J. Hunter, Susan E. Hankinson, David G.
Cox, Per Hall, Sara Wedren, Jianjun Liu, Yen-Ling Low, Natalia Bogdanova, Peter Schurmann, Thilo Dork, Rob
A. E. M. Tollenaar, Catharina E. Jacobi, Peter Devilee, Jan G. M. Klijn, Alice J. Sigurdson, Michele M. Doody,
Bruce H. Alexander, Jinghui Zhang, Angela Cox, Ian W. Brock, Gordon Macpherson, Malcolm W. R. Reed,
Fergus J. Couch, Ellen L. Goode, Janet E. Olson, Hanne Meijers-Heijboer, Ans van den Ouweland, Andre Uit-
terlinden, Fernando Rivadeneira, Roger L. Milne, Gloria Ribas, Anna Gonzalez-Neira, Javier Benitez, John L.
Hopper, Margaret Mccredie, Melissa Southey, Graham G. Giles, Chris Schroen, Christina Justenhoven, Hiltrud
Brauch, Ute Hamann, Yon-Dschun Ko, Amanda B. Spurdle, Jonathan Beesley, Xiaoqing Chen, Arto Manner-
maa, Veli-Matti Kosma, Vesa Kataja, Jaana Hartikainen, Nicholas E. Day, David R. Cox, and Bruce A. J. Ponder.
Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 447:1087–1093, May
2007.
[34] Bradley Efron. Correlation and large-scale simultaneous significance testing. J AM STAT ASSOC, 102(477):93–
103, March 2007.
[35] Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cam-
bridge University Press, 2010.
[36] Bradley Efron, Trevor Hastie, Lain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics,
32:407–499, 2004.
[37] Bradley Efron and Robert Tibshirani. Empirical bayes methods and false discovery rates for microarrays. Ge-
netic Epidemiology, 23(1):70–86, 2002.
[38] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher. Empirical bayes analysis of a microarray
experiment. Journal of The American Statistical Association, 96:1151–1160, 2001.
[39] V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. THEOR PROBAB APPL,
14(1):153–158, 1969.
[40] Eleazar Eskin. Increasing power in association studies by using linkage disequilibrium structure and molecular
function as prior information. Genome Research, 18(4):653–660, 2008.
[41] Jianqing Fan. Design-adaptive nonparametric regression. Journal of the American Statistical Association,
87(420):998–1004, 1992.
[42] Jianqing Fan, Xu Han, and Weijie Gu. Control of the false discovery rate under arbitrary covariance dependence.
(to appear) J AM STAT ASSOC, 2012.
[43] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[44] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: Beyond the linear
model. Journal of Machine Learning Research, 10:2013–2038, 2009.
[45] Ruzong Fan and Michael Knapp. Genome association studies of complex diseases by case-control designs.
American Journal of Human Genetics, 72:850–868, 2003.
[46] Alessio Farcomeni. Some results on the control of the false discovery rate under dependence. SCAND J STAT,
34(2):275–297, June 2007.
[47] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[48] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–
230, 1973.
[49] H. Finner and M. Roters. Multiple hypotheses testing and expected number of type I errors. ANN STAT, 30:220–
238, 2002.
[50] Olivia Fletcher, Nichola Johnson, Nick Orr, Fay J Hosking, Lorna J Gibson, Kate Walker, Diana Zelenika,
Ivo Gut, Simon Heath, Claire Palles, et al. Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. Journal of the National Cancer Institute, 103(5):425–435, 2011.
[51] L. R. Ford and D. R. Fulkerson. Constructing maximal dynamic flows from static flows. Operations Research,
6(3):419–433, 1958.
[52] B. Freidlin, G. Zheng, Z. Li, and J. L. Gastwirth. Trend tests for case-control studies of genetic markers: power,
sample size and robustness. Human Heredity, 53(3):146–152, 2002.
[53] Chloe Friguet, Maela Kloareg, and David Causeur. A factor model approach to multiple testing under dependence. Journal of the American Statistical Association, 104(488):1406–1415, December 2009.
[54] Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden
Markov model. In ICML, 2008.
[55] M. H. Gail. Discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. Journal of the National Cancer Institute, 100(14):1037–1041, 2008.
[56] M. H. Gail. Value of adding single-nucleotide polymorphism genotypes to a breast cancer risk model. Journal of the National Cancer Institute, 101(13):959–963, 2009.
[57] Mitchell H Gail, Louise A Brinton, David P Byar, Donald K Corle, Sylvan B Green, Catherine Schairer, and
John J Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are
being examined annually. Journal of the National Cancer Institute, 81(24):1879–1886, 1989.
[58] Varun Ganapathi, David Vickrey, John Duchi, and Daphne Koller. Constrained approximate maximum entropy
learning of Markov random fields. In UAI, 2008.
[59] Montserrat Garcia-Closas, Fergus J Couch, Sara Lindstrom, Kyriaki Michailidou, Marjanka K Schmidt, Mark N
Brook, Nick Orr, Suhn Kyong Rhie, Elio Riboli, Heather S Feigelson, et al. Genome-wide association studies
identify four ER-negative-specific breast cancer risk loci. Nature Genetics, 45(4):392–398, 2013.
[60] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal
of the American Statistical Association, 85(410):398–409, 1990.
[61] Andrew Gelman and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cam-
bridge University Press, New York, 2007.
[62] Christopher Genovese and Larry Wasserman. Operating characteristics and extensions of the false discovery
rate procedure. Journal of The Royal Statistical Society Series B-Statistical Methodology, 64:499–517, 2002.
[63] Christopher Genovese and Larry Wasserman. A stochastic process approach to false discovery control. Annals
of Statistics, 32:1035–1061, 2004.
[64] Christopher R. Genovese, Kathryn Roeder, and Larry Wasserman. False discovery control with p-value weighting. Biometrika, 93:509–524, 2006.
[65] Samuel J. Gershman and David M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical
Psychology, 56(1):1–12, 2012.
[66] Charles J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics, pages 156–163, 1991.
[67] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.
[68] Maya Ghoussaini, Olivia Fletcher, Kyriaki Michailidou, Clare Turnbull, Marjanka K Schmidt, Ed Dicks, Joe
Dennis, Qin Wang, Manjeet K Humphreys, Craig Luccarini, et al. Genome-wide association analysis identifies
three new breast cancer susceptibility loci. Nature Genetics, 44(3):312–318, 2012.
[69] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In Proceedings of the 18th
ACM Symposium on Theory of Computing, pages 136–146, 1986.
[70] D. B. Goldstein. Common genetic variation and human traits. New England Journal of Medicine, 360(17):1696, 2009.
[71] Michael Gutmann and Jun-ichiro Hirayama. Bregman divergence as general framework to estimate unnormal-
ized statistical models. In UAI, pages 283–290, Corvallis, Oregon, 2011. AUAI Press.
[72] Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnor-
malized statistical models. In AISTATS, 2010.
[73] Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. Journal of Machine
Learning Research, 3:1157–1182, 2003.
[74] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification
using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
[75] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[76] Buhm Han and Eleazar Eskin. Multiple testing in genetic epidemiology. Encyclopedia of Life Sciences, 2010.
[77] Lauren A. Hannah, David M. Blei, and Warren B. Powell. Dirichlet process mixtures of generalized linear
models. Journal of Machine Learning Research, 12:1923–1953, 2011.
[78] Chris Hans. Bayesian lasso regression. Biometrika, 96(4):835–845, December 2009.
[79] Reina Haque, Syed A Ahmed, Galina Inzhakova, Jiaxiao Shi, Chantal Avila, Jonathan Polikoff, Leslie Bernstein,
Shelley M Enger, and Michael F Press. Impact of breast cancer subtypes and treatment on survival: an analysis
spanning two decades. Cancer Epidemiology Biomarkers & Prevention, 21(10):1848–1855, 2012.
[80] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[81] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In
COLT, 2009.
[82] David J. Hunter, Peter Kraft, Kevin B. Jacobs, David G. Cox, Meredith Yeager, Susan E. Hankinson, Sholom
Wacholder, Zhaoming Wang, Robert Welch, Amy Hutchinson, Junwen Wang, Kai Yu, Nilanjan Chatterjee,
Nick Orr, Walter C. Willett, Graham A. Colditz, Regina G. Ziegler, Christine D. Berg, Saundra S. Buys, Catherine A. McCarty, Heather S. Feigelson, Eugenia E. Calle, Michael J. Thun, Richard B. Hayes, Margaret Tucker, Daniela S. Gerhard, Joseph F. Fraumeni, Robert N. Hoover, Gilles Thomas, and Stephen J. Chanock. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39(7):870–874, 2007.
[83] Aapo Hyvarinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.
[84] Aapo Hyvarinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.
[85] Aapo Hyvarinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–
2512, 2007.
[86] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In
Proceedings of the International Conference on Machine Learning, 2009.
[87] Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsity-inducing
norms. Technical report, 2009.
[88] Finn V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, 1996.
[89] Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational
methods for graphical models. Machine Learning, 37:183–233, 1999.
[90] Junhwan Kim and Ramin Zabih. Factorial Markov random fields. In ECCV, pages 321–334, 2002.
[91] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications (Contemporary Mathe-
matics ; V. 1). Amer Mathematical Society, 1980.
[92] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In Proceedings of the ninth interna-
tional workshop on Machine learning, pages 249–256, 1992.
[93] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324,
1997.
[94] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[95] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[96] Peter Kraft and David J Hunter. Genetic risk prediction: are we there yet? New England Journal of Medicine, 360(17):1701–1703, 2009.
[97] F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.
[98] Xiangyang Lan, Stefan Roth, Daniel Huttenlocher, and Michael J. Black. Efficient belief propagation with learned higher-order Markov random fields. In ECCV, pages 269–282, 2006.
[99] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures
and their application to expert systems. Journal of The Royal Statistical Society Series B-Statistical Methodology,
50(2):157–224, 1988.
[100] Jeffrey T. Leek and John D. Storey. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105(48):18718–18723, 2008.
[101] Yuejuan Li, Lingli Li, Tracey J. Brown, and Paraskevi Heldin. Silencing of hyaluronan synthase 2 suppresses
the malignant phenotype of invasive breast cancer cells. International Journal of Cancer, 120(12):2557–2567, 2007.
[102] Paul Lichtenstein, Niels V. Holm, Pia K. Verkasalo, Anastasia Iliadou, Jaakko Kaprio, Markku Koskenvuo, Eero
Pukkala, Axel Skytthe, and Kari Hemminki. Environmental and heritable factors in the causation of cancer–analyses of cohorts of twins from Sweden, Denmark, and Finland. New England Journal of Medicine, 343:78–85, 2000.
[103] D. Y. Lin. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics, 21:781–787, 2005.
[104] Han Liu, Jian Zhang, Xiaoye Jiang, and Jun Liu. The group Dantzig selector. In AISTATS, 2010.
[105] Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy Peissig, Catherine McCarty, Adedayo A Onitilo, and
Elizabeth Burnside. Genetic variants improve breast cancer risk prediction on mammograms. In AMIA Summit
on Translational Bioinformatics (AMIA-TBI), 2014.
[106] Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Adedayo A Onitilo, Amy Trentham-Dietz, and Elizabeth
Burnside. New genetic variants improve personalized breast cancer diagnosis. In AMIA Summit on Translational
Bioinformatics (AMIA-TBI), 2014.
[107] Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside, and David Page. Graphical-
model based multiple testing under dependence, with applications to genome-wide association studies. In UAI,
2012.
[108] Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside, and David Page. High-
dimensional structured feature screening using binary Markov random fields. In AISTATS, 2012.
[109] Hans-Andrea Loeliger. An introduction to factor graphs. IEEE Signal Processing Magazine, 21:28–41, 2004.
[110] A. Lorbert, D. Eis, V. Kostina, D. Blei, and P. Ramadge. Exploiting covariate similarity in sparse regression via
the pairwise elastic net. In AISTATS, 2010.
[111] Daniel Lowd and Pedro Domingos. Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 529–536, 2005.
[112] Siwei Lyu. Unifying non-maximum likelihood learning objectives with minimum KL contraction. In NIPS,
pages 64–72, 2011.
[113] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hindorff, David J Hunter, Mark I
McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, et al. Finding the missing heritability of
complex diseases. Nature, 461(7265):747–753, 2009.
[114] J. C. Marioni, N. P. Thorne, and S. Tavare. BioHMM: a heterogeneous hidden Markov model for segmenting
array CGH data. Bioinformatics, 22:1144–1146, 2006.
[115] CA McCarty, RA Wilke, PF Giampietro, SD Wesbrook, and MD Caldwell. Marshfield Clinic Personalized
Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank.
Personalized Medicine, 2:49–79, 2005.
[116] Catherine A McCarty, Rex L Chisholm, Christopher G Chute, Iftikhar J Kullo, Gail P Jarvik, Eric B Larson,
Rongling Li, Daniel R Masys, Marylyn D Ritchie, Dan M Roden, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics, 4(1):13, 2011.
[117] Marina Meila. Comparing clusterings by the variation of information. In COLT, pages 173–187, 2003.
[118] Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: a
theoretical exploration. Statistica Sinica, 6(4):831–860, 1996.
[119] Patrick Emmanuel Meyer, Colas Schretter, and Gianluca Bontempi. Information-theoretic feature selection in
microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2:261–274, 2008.
[120] Kyriaki Michailidou, Per Hall, Anna Gonzalez-Neira, Maya Ghoussaini, Joe Dennis, Roger L Milne, Mar-
janka K Schmidt, Jenny Chang-Claude, Stig E Bojesen, Manjeet K Bolla, et al. Large-scale genotyping identifies
41 new loci associated with breast cancer risk. Nature Genetics, 45(4):353–361, 2013.
[121] Sujit Kumar Mitra. On the limiting power function of the frequency chi-square test. Annals of Mathematical Statistics, 29:1221–1233, 1958.
[122] J. Møller, A.N. Pettitt, R. Reeves, and K.K. Berthelsen. An efficient Markov chain Monte Carlo method for
distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.
[123] J. Møller, A.N. Pettitt, R. Reeves, and K.K. Berthelsen. An efficient Markov chain Monte Carlo method for
distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.
[124] Valentina Moskvina and Karl Michael Schmidt. On multiple-testing correction in genome-wide association
studies. Genetic Epidemiology, 32(6):567–573, 2008.
[125] I. Mukhopadhyay, E. Feingold, D. E. Weeks, and A. Thalamuthu. Association tests using kernel-based measures
of multi-locus genotype similarity between individuals. Genetic Epidemiology, 34(3):213–221, April 2010.
[126] I. Mukhopadhyay, E. Feingold, D. E. Weeks, and A. Thalamuthu. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genetic Epidemiology, 34(3):213–221, April 2010.
[127] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An
empirical study. In UAI, pages 467–475, 1999.
[128] Iain Murray, Zoubin Ghahramani, and David J. C. MacKay. MCMC for doubly-intractable distributions. In UAI,
2006.
[129] E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9(1):141–142, 1964.
[130] H Nassif, R Wood, E S Burnside, M Ayvaci, J Shavlik, and D Page. Information extraction for clinical data
mining: a mammography case study. In IEEE International Conference on Data Mining (ICDM’09) Workshops,
pages 37–42, Miami, Florida, 2009.
[131] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computa-
tional and Graphical Statistics, 9(2):249–265, 2000.
[132] Heidi D Nelson, Kari Tyne, Arpana Naik, Christina Bougatsos, Benjamin K Chan, and Linda Humphrey. Screening for breast cancer: an update for the US Preventive Services Task Force. Annals of Internal Medicine, 151:727–737, 2009.
[133] Roland Nilsson, Jose M. Pena, Johan Bjorkegren, and Jesper Tegner. Consistent feature selection for pattern
recognition in polynomial time. Journal of Machine Learning Research, 8:589–612, 2007.
[134] Dale R. Nyholt. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage dise-
quilibrium with each other. American Journal of Human Genetics, 74(4):765–769, 2004.
[135] P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. Springer,
2010.
[136] Art B. Owen. Variance of the number of false discoveries. Journal of The Royal Statistical Society Series B-Statistical Methodology, 67:411–426, 2005.
[137] Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network
data. In ICML, 2012.
[138] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 1988.
[139] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 27(8):1226–1238, 2005.
[140] B Percha, H Nassif, J Lipson, E Burnside, and D Rubin. Automatic classification of mammography reports by
BI-RADS breast tissue composition class. Journal of the American Medical Informatics Association, 19(5):913–916, 2012.
[141] Charles M Perou, Therese Sørlie, Michael B Eisen, Matt van de Rijn, Stefanie S Jeffrey, Christian A Rees,
Jonathan R Pollack, Douglas T Ross, Hilde Johnsen, Lars A Akslen, et al. Molecular portraits of human breast
tumours. Nature, 406(6797):747–752, 2000.
[142] Paul D. P. Pharoah, Antonis C. Antoniou, Douglas F. Easton, and Bruce A. J. Ponder. Polygenes, risk prediction,
and targeted prevention of breast cancer. New England Journal of Medicine, 358(26):2796–2803, 2008.
[143] James Gary Propp and David Bruce Wilson. Exact sampling with coupled Markov chains and applications to
statistical mechanics. Random Structures and Algorithms, 9(1-2):223–252, 1996.
[144] Yuan Qi and Feng Yan. EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning.
In NIPS, 2011.
[145] Carl Edward Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.
[146] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In NIPS, 2001.
[147] Anat Reiner, Daniel Yekutieli, and Yoav Benjamini. Identifying differentially expressed genes using false dis-
covery rate controlling procedures. Bioinformatics, 19(3):368–375, 2003.
[148] Herbert Robbins. An empirical Bayes approach to statistics. In The 3rd Berkeley Symposium I, pages 157–163,
1956.
[149] K Roeder, SA Bacanu, V Sonpar, X Zhang, and B Devlin. Analysis of single-locus tests to detect gene/disease
associations. Genetic Epidemiology, 28(3):207–219, 2005.
[150] Joseph Romano, Azeem Shaikh, and Michael Wolf. Control of the false discovery rate under dependence using
the bootstrap and subsampling. TEST, 17:417–442, 2008.
[151] Lorenzo Rosasco, Matteo Santoro, Sofia Mosci, Alessandro Verri, and Silvia Villa. A regularization approach
to nonlinear variable selection. In AISTATS, 2010.
[152] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27(3):832–837, 1956.
[153] Ruslan Salakhutdinov. Learning in Markov random fields using tempered transitions. In NIPS, pages 1598–
1606, 2009.
[154] Daria Salyakina, Shaun R Seaman, Brian L Browning, Frank Dudbridge, and Bertram Muller-Myhsok. Evaluation of Nyholt's procedure for multiple testing correction. Human Heredity, 60(1):19–25, 2005.
[155] Sanat K. Sarkar. False discovery and false nondiscovery rates in single-step multiple testing procedures. Annals of Statistics, 34(1):394–415, 2006.
[156] Pal Satrom, Jacob Biesinger, Sierra M Li, David Smith, Laurent F Thomas, Karim Majzoub, Guillermo E
Rivas, Jessica Alluin, John J Rossi, Theodore G Krontiris, Jeffrey Weitzel, Mary B Daly, Al B Benson, John M
Kirkwood, Peter J O'Dwyer, Rebecca Sutphen, James A Stewart, David Johnson, and Garrett P Larson. A risk variant in a miR-125b binding site in BMPR1B is associated with breast cancer pathogenesis. Cancer Research, 69(18):7459–7465, 2009.
[157] Daniel J. Schaid, Shannon K. McDonnell, Scott J. Hebbring, Julie M. Cunningham, and Stephen N. Thibodeau.
Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics,
76(5):780–793, May 2005.
[158] John T Schousboe, Karla Kerlikowske, Andrew Loh, and Steven R Cummings. Personalizing mammography
by breast density and other risk factors for breast cancer: analysis of health benefits and cost-effectiveness. Annals of Internal Medicine, 155:10–20, 2011.
[159] Nicol N. Schraudolph. Polynomial-time exact inference in NP-hard binary MRFs via reweighted perfect match-
ing. In AISTATS, 2010.
[160] Nicol N. Schraudolph and Dmitry Kamenetsky. Efficient exact inference in planar Ising models. In NIPS, 2009.
[161] Babak Shahbaba and Radford Neal. Nonlinear models using Dirichlet process mixtures. Journal of Machine
Learning Research, 10:1829–1850, 2009.
[162] Afshan Siddiq, Fergus J Couch, Gary K Chen, Sara Lindstrom, Diana Eccles, Robert C Millikan, Kyriaki
Michailidou, Daniel O Stram, Lars Beckmann, Suhn Kyong Rhie, et al. A meta-analysis of genome-wide
association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Human Molecular Genetics, 21(24):5373–5384, 2012.
[163] S. L. Slager and D. J. Schaid. Case-control studies of genetic markers: power and sample size approximations
for armitage’s test for trend. Human Heredity, 52(3):149–153, 2001.
[164] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models.
In ICML, 2010.
[165] Kristen N Stevens, Zachary Fredericksen, Celine M Vachon, Xianshu Wang, Sara Margolin, Annika Lindblom,
Heli Nevanlinna, Dario Greco, Kristiina Aittomaki, Carl Blomqvist, et al. 19p13.1 is a triple-negative-specific breast cancer susceptibility locus. Cancer Research, 72(7):1795–1803, 2012.
[166] John D. Storey. A direct approach to false discovery rates. Journal of The Royal Statistical Society Series
B-Statistical Methodology, 64:479–498, 2002.
[167] John D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics,
31(6):2013–2035, 2003.
[168] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and
simultaneous conservative consistency of false discovery rates: a unified approach. Journal of The Royal Statistical Society Series B-Statistical Methodology, 66(1):187–205, 2004.
[169] Korbinian Strimmer. A unified approach to false discovery rate estimation. BMC Bioinformatics, 9(1):303, 2008.
[170] Zhan Su, Jonathan Marchini, and Peter Donnelly. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 2011.
[171] Wenguang Sun and T. Tony Cai. Oracle and adaptive compound decision rules for false discovery rate control.
Journal of the American Statistical Association, 102(479):901–912, 2007.
[172] Wenguang Sun and T. Tony Cai. Large-scale multiple testing under dependence. Journal of The Royal Statistical
Society Series B-Statistical Methodology, 71:393–424, 2009.
[173] I. Sutskever and T. Tieleman. On the convergence properties of Contrastive Divergence. In AISTATS, 2010.
[174] Ulrika Svenson, Katarina Nordfjall, Birgitta Stegmayr, Jonas Manjer, Peter Nilsson, Bjorn Tavelin, Roger Hen-
riksson, Per Lenner, and Goran Roos. Breast cancer survival is associated with telomere length in peripheral
blood cells. Cancer Research, 68(10):3618–3623, 2008.
[175] The International HapMap Consortium. The International HapMap Project. Nature, 426:789–796, 2003.
[176] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of The Royal Statistical Society
Series B-Statistical Methodology, 58(1):267–288, 1996.
[177] Robert Tibshirani and Michael Saunders. Sparsity and smoothness via the fused lasso. Journal of The Royal
Statistical Society Series B-Statistical Methodology, 67:91–108, 2005.
[178] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In
ICML, pages 1064–1071, 2008.
[179] Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In
ICML, pages 1033–1040, 2009.
[180] Clare Turnbull, Shahana Ahmed, Jonathan Morrison, David Pernet, Anthony Renwick, Mel Maranian, Sheila
Seal, Maya Ghoussaini, Sarah Hines, Catherine S Healey, et al. Genome-wide association study identifies five
new breast cancer susceptibility loci. Nature Genetics, 42(6):504–507, 2010.
[181] Clare Turnbull, Sheila Seal, Anthony Renwick, Margaret Warren-Perry, Deborah Hughes, Anna Elliott, David
Pernet, Susan Peock, Julian W Adlard, Julian Barwell, et al. Gene–gene interactions in breast cancer susceptibility. Human Molecular Genetics, 21(4):958–962, 2012.
[182] Lishanthi Udabage, Gary R. Brownlee, Susan K. Nilsson, and Tracey J. Brown. The over-expression of HAS2,
Hyal-2 and CD44 is implicated in the invasiveness of breast cancer. Experimental Cell Research, 310(1):205–217, 2005.
[183] Mark J. van der Laan, Sandrine Dudoit, and Katherine S. Pollard. Augmentation procedures for control of
the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical
Applications in Genetics and Molecular Biology, 3, 2004.
[184] David Vickrey, C Lin, and Daphne Koller. Non-local contrastive objectives. In ICML, 2010.
[185] Matthieu Vignes and Florence Forbes. Gene clustering via integrated Markov models combining individual and pairwise features. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6:260–270, April 2009.
[186] S. Wacholder, P. Hartge, R. Prentice, M. Garcia-Closas, H. S. Feigelson, W. R. Diver, M. J. Thun, D. G. Cox,
S. E. Hankinson, P. Kraft, B. Rosner, C. D. Berg, L. A. Brinton, J. Lissowska, M. E. Sherman, R. Chlebowski,
C. Kooperberg, R. D. Jackson, D. W. Buckman, P. Hui, R. Pfeiffer, K. B. Jacobs, G. D. Thomas, R. N. Hoover,
M. H. Gail, S. J. Chanock, and D. J. Hunter. Performance of common genetic variants in breast-cancer risk
models. New England Journal of Medicine, 362(11):986–993, 2010.
[187] Sholom Wacholder, Stephen Chanock, Montserrat Garcia-Closas, Laure El Ghormli, and Nathaniel Rothman.
Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. Journal
of the National Cancer Institute, 96(6):434–442, March 2004.
[188] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition
function. In UAI, pages 536–543, 2002.
[189] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for
analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:2003, 2003.
[190] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-reweighted belief propagation algorithms
and approximate ML estimation via pseudo-moment matching. In AISTATS, 2003.
[191] Martin J. Wainwright and Michael I. Jordan. Log-determinant relaxation for approximate inference in discrete
Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099–2109, 2006.
[192] Martin J Wainwright and Michael I Jordan. Graphical Models, Exponential Families, and Variational Inference.
Now Publishers Inc., Hanover, MA, USA, 2008.
[193] Helen Warren, Frank Dudbridge, Olivia Fletcher, Nick Orr, Nichola Johnson, John L Hopper, Carmel Apicella,
Melissa C Southey, Maryam Mahmoodi, Marjanka K Schmidt, et al. 9q31.2-rs865686 as a susceptibility locus for estrogen receptor-positive breast cancer: evidence from the Breast Cancer Association Consortium. Cancer Epidemiology Biomarkers & Prevention, 21(10):1783–1791, 2012.
[194] Larry Wasserman and Kathryn Roeder. High-dimensional variable selection. Annals of Statistics, 37(5):2178–
2201, 2009.
[195] Geoffrey S. Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.
[196] Zhi Wei, Kai Wang, Hui-Qi Qu, Haitao Zhang, Jonathan Bradfield, Cecilia Kim, Edward Frackleton, Cuiping
Hou, Joseph T. Glessner, Rosetta Chiavacci, Charles Stanley, Dimitri Monos, Struan F. A. Grant, Constantin
Polychronakos, and Hakon Hakonarson. From disease association to risk assessment: An optimistic view from
genome-wide association studies on type 1 diabetes. PLoS Genetics, 5:e1000678, 2009.
[197] Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation,
12(1):1–41, January 2000.
[198] Max Welling and Charles Sutton. Learning in Markov random fields with contrastive free energies. In AISTATS,
2005.
[199] Jennifer Wessel and Nicholas J. Schork. Generalized genomic distance-based regression methodology for multilocus association analysis. American Journal of Human Genetics, 79(5):792–806, November 2006.
[200] Fa-Yueh Wu. The Potts model. Reviews of Modern Physics, 54:235–268, 1982.
[201] Michael C. Wu, Peter Kraft, Michael P. Epstein, Deanne M. Taylor, Stephen J. Chanock, David J. Hunter, and
Xihong Lin. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of
Human Genetics, 86(6):929–942, June 2010.
[202] Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare variant association
testing for sequencing data using the sequence kernel association test (SKAT). American Journal of Human
Genetics, 2011.
[203] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric M. Sobel, and Kenneth Lange. Genome-wide association
analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 2009.
[204] Wei Biao Wu. On false discovery control under dependence. Annals of Statistics, 36(1):364–380, 2008.
[205] Gui-Bo Ye, Yifei Chen, and Xiaohui Xie. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In AISTATS, 2011.
[206] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS, pages
689–695. MIT Press, 2000.
[207] Daniel Yekutieli and Yoav Benjamini. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82:171–196, 1999.
[208] Laurent Younes. Estimation and annealing for Gibbsian fields. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, 24(2):269–294, 1988.
[209] Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine
Learning Research, 5:1205–1224, 2004.
[210] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of The
Royal Statistical Society Series B-Statistical Methodology, 68:49–67, 2006.
[211] Alan Yuille. The convergence of contrastive divergences. In NIPS, 2004.
[212] Chunming Zhang, Jianqing Fan, and Tao Yu. Multiple testing via FDRL for large-scale imaging data. Annals of Statistics, 39(1):613–642, 2011.
[213] Hao Helen Zhang, Jeongyoun Ahn, and Xiaodong Lin. Gene selection using support vector machines with
nonconvex penalty. Bioinformatics, 22(1):88–95, 2006.
[214] Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain MR images through a hidden
Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imag-
ing, 2001.
[215] Yang Zhou, Rong Jin, and Steven Hoi. Exclusive lasso for multi-task feature selection. In AISTATS, 2010.
[216] Jun Zhu, Ning Chen, and Eric P. Xing. Infinite latent SVM for classification and multi-task learning. In NIPS,
2011.
[217] Jun Zhu, Ning Chen, and Eric P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel ma-
chines. In ICML, 2011.
[218] Song Chun Zhu and Xiuwen Liu. Learning in Gibbsian fields: how accurate and how fast can it be? IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24:1001–1006, 2002.
[219] Hui Zou and Trevor Hastie. Regularization and variable selection via the Elastic Net. Journal of The Royal
Statistical Society Series B-Statistical Methodology, 67:301–320, 2005.
[220] Hui Zou and Hao Helen Zhang. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37:1733, 2009.
[221] Verena Zuber and Korbinian Strimmer. Gene ranking and biomarker discovery under correlation. Bioinformatics,
25(20):2700–2707, 2009.
[222] Verena Zuber and Korbinian Strimmer. High-dimensional regression and variable selection using CAR scores.
Statistical Applications in Genetics and Molecular Biology, 10(1), 2011.