A Statistical Framework for Expression Trait Loci (ETL) Mapping

Meng Chen

Prelim Paper in partial fulfillment of the requirements

for the Ph.D. program

in the

Department of Statistics

University of Wisconsin-Madison

Committee Members:

Professor Christina Kendziorski

Professor Alan Attie

Professor Michael Newton

Professor Brian Yandell

Contents

1 Introduction

2 ETL mapping experiments

3 QTL Mapping Methods

3.1 Single Phenotype - Single QTL Models

3.2 Single Phenotype - Multiple QTL Models

3.3 Multiple Phenotype - Single or Multiple QTL Models

4 ETL Mapping Methods

4.1 Transcript Based Approach

4.2 Transcript Based Approach with FDR control

4.3 Marker Based Approach

4.4 Mixture Over Markers Model

4.5 Current Status of ETL Mapping Methods

5 Research Plan

5.1 ETL Interval Mapping

5.1.1 Pseudomarker-MOM

5.1.2 Two-Stage Approach

5.1.3 Theoretical Result

5.1.4 Simulation Results

5.2 Multiple ETL mapping

5.2.1 Simulation Results

6 Future Research Questions

References

Appendix

1 Introduction

Identifying the genetic loci responsible for variation in quantitative traits is of great importance to

biologists. Quantitative trait loci (QTL) mapping studies have been conducted for over 80

years, beginning with Sax (1923), who proposed that the association between seed weight and

seed coat color in beans was due to linkage between the genes controlling weight and the genes

controlling color (see also Rasmusson 1933; Thoday 1961). Nevertheless, the vast majority of studies

have taken place in the last 20 years. The increased rate was due largely to two major advances in

the 1980s: the advent of restriction fragment length polymorphisms (RFLPs) (Botstein et al. 1980),

which made it possible to genotype markers on a large scale, and the advent of statistical methods for

data analysis (Lander and Botstein 1989).

A recent advance of comparable significance has been made in the area of phenotyping.

With high throughput technologies now widely available, investigators can measure thousands

of phenotypes at once. Gene expression measurements are particularly amenable to QTL mapping

and much excitement abounds for this field of “genetical genomics” (Jansen and Nap 2001; Jansen

2003; Cox 2004; Broman 2005).

The so-called expression QTL (eQTL) or expression trait loci (ETL) studies have been used to

identify candidate genes (Dumas et al. 2000; Eaves et al. 2002; Karp et al. 2000; Wayne et al.

2003; Schadt et al. 2003; Bystrykh et al. 2005; Hubner et al. 2005), to infer not only correlative

but also causal relationships among modulator and modulated genes (Brem et al. 2002; Schadt et

al. 2003; Yvert et al. 2003), to better define traditional phenotypes (Schadt et al. 2003), and to

serve as a bridge between genetic variation and the traditional complex traits of interest (Schadt et

al. 2003).

Although successful in many ways, the results obtained from ETL studies to date are limited.

In the early published studies, the ETL mapping problem was addressed by treating each

transcript separately as a phenotype for QTL mapping. Single trait QTL analysis was then carried

out thousands of times (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003). Notably, although

adjustments were made for multiple tests across the genome, no adjustments were made

for multiple tests across transcripts. There are hundreds of test locations across the genome but

tens of thousands of transcripts, leading to a potentially serious multiple testing problem and an

inflated false discovery rate (FDR). For some labs, an inflated FDR is tolerable as many genes can

be tested quickly for certain properties and discarded if found to be false positives. However, for

many labs, such tests are prohibitively expensive. Statistical methods that control error rates and

that are more sensitive and more specific are needed. A few recent studies have attempted

to account for both sets of multiplicities (Chesler et al. 2005; Hubner et al.

2005; Bystrykh et al. 2005). Permutation tests were performed to derive the genome-wide LOD

score threshold and q-values were computed for the set of transcripts declared significant using the

corresponding genome-wide empirical p-values. As discussed in Section 4.2, this last approach

may not properly control FDR and may suffer from very low power.

The main aim of the proposed thesis is to develop a statistical framework for ETL mapping

that properly accounts for multiplicities while maintaining or improving upon the operating

characteristics of currently used approaches. Section 2 provides a brief background on the

questions addressed and data collected in ETL mapping experiments. Statistical methods for ETL

mapping are reviewed in Section 4. As discussed there, the mixture over markers (MOM) model is

the only statistically rigorous ETL mapping method developed to account for multiplicities across

both markers and transcripts. However, MOM has a number of shortcomings. For experiments

with sparse maps, the MOM model is lacking as information between markers is not available.

When dense maps are available, the MOM model by itself may not be applicable as the number of

mixture components is too big to fit. Finally, MOM does not allow for multiple ETL. Statistical

methods to address each of these shortcomings are detailed in Section 5 and preliminary results

are demonstrated on simulated data and data from a study of diabetes in mice. Future directions

are discussed in Section 6.

2 ETL mapping experiments

The data collected in an ETL mapping experiment consist of a genetic map, marker

genotypes, and microarray data (phenotypes) collected on a set of individuals. A genetic marker is

a region of the genome of known location. These locations make up the genetic map. The distance

between markers is given as genetic distance, in units of centimorgans (cM). One cM corresponds to an

expected 0.01 crossovers between two loci per meiosis. At each marker, genotypes

are obtained. ETL mapping studies take place in both human and experimental populations. We

focus on the latter. For these populations, the set of possible marker genotypes is small.

For example, studies with experimental populations most often involve arranging a cross

between two inbred strains differing substantially in some trait of interest to produce F1 offspring.

Segregating progeny are then typically derived from a B1 backcross (F1 x Parent) or an F2

intercross (F1 x F1). Repeated intercrossing (Fn x Fn) can also be done to generate so-called

recombinant inbred (RI) lines. For simplicity of notation, we focus on a backcross population.

Consider two inbred parental populations P1 and P2, genotyped as AA and aa, respectively, at M

markers. The offspring of the first generation (F1) have genotype Aa at each marker (allele A from

parent P1 and a from parent P2). In a backcross, the F1 offspring are crossed back to a parental

line, say P1, resulting in a population with genotypes AA or Aa at a given marker. We denote AA

by 0 and Aa by 1.
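
The 0/1 coding of backcross genotypes can be illustrated with a small simulation. The following is a minimal sketch, not part of the original study: it assumes Haldane's map function (no crossover interference) to convert map distances in cM into recombination fractions, and the function names and marker positions are hypothetical.

```python
import math
import random


def haldane_recomb_fraction(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_backcross(n, marker_pos_cm, rng):
    """Return n genotype vectors coded 0 (AA) and 1 (Aa), one entry per marker."""
    progeny = []
    for _ in range(n):
        # At the first marker, AA and Aa are equally likely in a backcross.
        g = [rng.randint(0, 1)]
        for left, right in zip(marker_pos_cm, marker_pos_cm[1:]):
            r = haldane_recomb_fraction(right - left)
            # A recombination between adjacent markers flips the genotype.
            g.append(1 - g[-1] if rng.random() < r else g[-1])
        progeny.append(g)
    return progeny


# Hypothetical example: 200 progeny, 5 markers on one chromosome.
rng = random.Random(0)
geno = simulate_backcross(200, [0.0, 10.0, 20.0, 40.0, 80.0], rng)
```

Each row of `geno` is one animal; closely spaced markers tend to share the same code, which is the correlation that interval mapping later exploits.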

For each member of the backcross population, phenotypes are collected via microarrays.

Microarrays provide a snapshot of the expression of thousands of genes at the same time.

Oligonucleotide and cDNA microarrays are the two most widely used technologies.

A review of microarray technologies can be found in Nguyen et al. (2002). We present a

very brief and by no means complete review here.

Affymetrix is one company that produces oligonucleotide chips which contain tens of

thousands of probe sets, or DNA sequences related to a gene. We will refer to these sequences

throughout this paper as “transcripts”. Each gene is represented by some number (usually 11 to 20)

of features. Many Affymetrix arrays use 20 features (Nguyen et al. 2002). Each feature is

a short oligonucleotide sequence. The features come in pairs of perfect match (PM) and

mismatch (MM) sequences. The PM is a segment of the gene, 25 nucleotides in length; the corresponding

MM is identical to the PM except at the middle (i.e., 13th) position. After some pre-processing and

normalization, one summary score is derived for each probe set. There are a number of methods

for processing the probe set intensities and for normalization, such as DNA chip Analyzer by Li

and Wong (2001), and Robust Multi-array Analysis (RMA) by Irizarry et al. (2003).

In a cDNA array experiment, a gene is represented by a long cDNA fragment (500 to 1000

bases). The experimental sample of interest is often labeled with a red fluorescent dye, and a

reference sample is labeled green. The amount of cDNA hybridized to each probe can be captured

through some imaging device which measures the amount of the fluorescent intensity. Image files

are processed to give a summary expression score, log2(R/G). Yang et al. (2002) propose methods

for cDNA array data normalization and compare their methods with a number of other approaches.

With proper pre-processing and normalization, from either technology we obtain a single summary

score of expression for each transcript on each array.

3 QTL Mapping Methods

As noted above, ETL studies are very similar to QTL studies, but with thousands of phenotypes.

Perhaps it is not surprising then that early ETL studies repeatedly applied methods developed for

QTL mapping to each transcript. The literature on QTL mapping methods is quite large. We here

review only those methods relevant to this proposal and refer the interested reader to Doerge et al.

(1997) or Lynch and Walsh (1998) for more information on QTL mapping methods.

3.1 Single Phenotype - Single QTL Models

Consider a backcross with n progeny with univariate phenotypes yj measured on all the individuals,

j = 1, . . . , n, together with genotypes for a set of M markers. Let mij = 0 or 1 according to

whether individual j has genotype AA or Aa at the ith marker, i = 1, . . . , M. The simplest

method to test for trait-marker association is marker regression (MR), which tests for differences in mean trait value

between the genotype groups at a particular marker. Specifically, for a test at the

ith marker, the single QTL model is:

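$$ y_j = \mu + \beta_i m_{ij} + \varepsilon_j \qquad (3.1) $$

where the $\varepsilon_j$ are independent and identically distributed (iid) as Normal$(0, \sigma^2)$ and one can test

$$ H_0 : \beta_i = 0 \quad \text{vs.} \quad H_1 : \beta_i \neq 0 $$

or equivalently $H_0 : \mu_0 = \mu_1$ vs. $H_1 : \mu_0 \neq \mu_1$, where $\mu_0 = \mu$ and $\mu_1 = \mu + \beta_i$.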

This is equivalent to an analysis of variance (ANOVA) at each marker (Soller et al. 1976)

when there are more than two marker genotype groups (a t-test for two genotype groups as in a

backcross). Usually, instead of F or t-statistics, geneticists prefer to report a LOD score, which

is defined as the (base 10) log-likelihood ratio comparing the two hypotheses. A LOD score is

calculated at each marker position, and marker loci giving significant LOD scores are identified

as putative QTLs. For these putative QTLs, we loosely say the phenotype is “linked” to them.
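
As a concrete illustration of the LOD score for model (3.1), the sketch below uses the fact that, for a normal model with the variance estimated by maximum likelihood under each hypothesis, the base 10 log-likelihood ratio reduces to (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares under the null and the alternative. The function name and the toy data are hypothetical.

```python
import math


def marker_lod(y, m):
    """LOD score for model (3.1) at one marker; m holds genotypes coded 0/1."""
    n = len(y)
    mean_all = sum(y) / n
    # Null hypothesis: a single common mean.
    rss0 = sum((v - mean_all) ** 2 for v in y)
    # Alternative: one mean per genotype group.
    rss1 = 0.0
    for g in (0, 1):
        grp = [v for v, mj in zip(y, m) if mj == g]
        if grp:
            mu_g = sum(grp) / len(grp)
            rss1 += sum((v - mu_g) ** 2 for v in grp)
    # Assumes rss1 > 0, i.e. the trait is not constant within both groups.
    return (n / 2.0) * math.log10(rss0 / rss1)


# Hypothetical toy data: six backcross animals at one marker.
y = [1.0, 1.2, 0.8, 2.1, 1.9, 2.0]
m = [0, 0, 0, 1, 1, 1]
lod = marker_lod(y, m)
```

For two genotype groups this LOD score is a monotone function of the squared t-statistic, so ranking markers by LOD or by |t| gives the same ordering.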

This approach is conceptually very simple, but it has clear problems (Lander and

Botstein 1989). First, if the true QTL is not located exactly at the marker, its effect will likely be

underestimated because of recombination between the marker and the true QTL. Second, because

QTL effect and location are confounded, the power for QTL detection will decrease, especially when the markers

are widely spaced, requiring more individuals for the test. Third, this approach considers one

marker locus at a time, which is less powerful than multiple QTL models in the

presence of more than one QTL.

Lander and Botstein (1989) proposed interval mapping (IM) which addresses the first two

problems above. Their approach also assumes a single segregating QTL, but allows for tests

between markers where genotypes are not known. Specifically, for a backcross, the proposed

model to test for a QTL in the interval between markers i and i + 1 is

$$ y_j = \mu + \beta^* m^*_{kj} + e_j \qquad (3.2) $$

where $m^*_{kj}$ is the genotype of individual $j$ at the test position $k$ between markers $i$ and $i + 1$. It takes value 0 or 1 with

probability depending on the genotypes of the flanking markers and the test position (see Table 1).

$\beta^*$ is the effect of the putative QTL. Technically, (3.2) is a mixture of two normal distributions, since

$$ p(y_j) = p(m^*_{kj} = 0 \mid m, r)\, p(y_j \mid m^*_{kj} = 0) + p(m^*_{kj} = 1 \mid m, r)\, p(y_j \mid m^*_{kj} = 1) $$

(here, $r$ denotes the flanking marker distances). As in MR, tests are done at each location to test

$$ H_0 : \beta^* = 0 \quad \text{vs.} \quad H_1 : \beta^* \neq 0 $$

The test compares the hypothesis of a single QTL at the current locus to the null hypothesis of

no QTL. The two likelihood functions must be maximized over their respective parameters. The

procedure described above is repeated for each locus in the genome. In practice, test loci are set

up every 1 cM or some other user-defined distance. The likelihood under the alternative varies with

the test locus, so the EM algorithm (Dempster et al. 1977) must be applied at each locus. A LOD

score profile can be constructed by plotting the LOD scores against the test positions.
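
The interval mapping computation at a single test position can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in any cited work: the conditional genotype probabilities are the standard backcross values under no crossover interference (Table 1 is not reproduced here and may be parameterized differently), `r1` and `r2` are the recombination fractions between the test position and the left and right flanking markers, and all function names and data are hypothetical.

```python
import math


def qtl_prob(m_left, m_right, r1, r2):
    """P(QTL genotype = 1 | flanking genotypes) in a backcross, no interference assumed."""
    r = r1 + r2 - 2.0 * r1 * r2  # recombination fraction between the flanking markers
    p_left = (1.0 - r1) if m_left == 1 else r1
    p_right = (1.0 - r2) if m_right == 1 else r2
    p_flank = (1.0 - r) if m_left == m_right else r
    return p_left * p_right / p_flank


def normal_pdf(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def interval_lod(y, prior1, n_iter=200):
    """LOD at one test position; prior1[j] = P(m*_kj = 1 | flanking markers), in (0, 1)."""
    n = len(y)
    # Null model: one normal distribution, closed-form MLEs.
    mu_null = sum(y) / n
    s2_null = sum((v - mu_null) ** 2 for v in y) / n
    loglik0 = sum(math.log(normal_pdf(v, mu_null, s2_null)) for v in y)
    # EM for the two-component mixture with known, individual-specific mixing weights.
    w = list(prior1)
    for _ in range(n_iter):
        # M-step: weighted means and a common variance.
        sw = sum(w)
        mu1 = sum(wj * v for wj, v in zip(w, y)) / sw
        mu0 = sum((1.0 - wj) * v for wj, v in zip(w, y)) / (n - sw)
        s2 = sum(wj * (v - mu1) ** 2 + (1.0 - wj) * (v - mu0) ** 2
                 for wj, v in zip(w, y)) / n
        # E-step: posterior probability that individual j has QTL genotype 1.
        w = []
        for v, p in zip(y, prior1):
            a = p * normal_pdf(v, mu1, s2)
            b = (1.0 - p) * normal_pdf(v, mu0, s2)
            w.append(a / (a + b))
    loglik1 = sum(math.log(p * normal_pdf(v, mu1, s2) + (1.0 - p) * normal_pdf(v, mu0, s2))
                  for v, p in zip(y, prior1))
    return (loglik1 - loglik0) / math.log(10.0)


# Hypothetical data: flanking genotypes for 8 animals, test position midway (r1 = r2 = 0.05).
flank = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (0, 1), (1, 0)]
y = [1.0, 1.2, 0.8, 2.1, 1.9, 2.0, 1.1, 2.05]
prior1 = [qtl_prob(a, b, 0.05, 0.05) for a, b in flank]
lod = interval_lod(y, prior1)
```

Repeating `interval_lod` over a grid of test positions, with `prior1` recomputed at each position, would give the LOD profile described above.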

The LOD score is then compared to a genome-wide threshold. Wherever the LOD profile

exceeds this threshold, we infer that a QTL exists. Generally, the genome-wide threshold is

taken to be the 95th percentile of the distribution of the maximum (genome-wide) LOD score

under the null hypothesis of no segregating QTLs. Much effort has been expended to derive the

appropriate genome-wide LOD score cutoff value (Churchill and Doerge 1994; Dupuis et al.

1995; Dupuis and Siegmund 1999; Feingold et al. 1993; Lander and Botstein 1989; Rebai et al.

1994; Rebai et al. 1995).
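
One of the cited ways to obtain such a threshold is the permutation approach of Churchill and Doerge (1994). The sketch below is a simplified illustration, assuming tests at markers only (marker regression rather than interval mapping); the function names, the number of permutations, and the simulated data are hypothetical.

```python
import math
import random


def marker_lod(y, m):
    """LOD score at one marker for genotypes coded 0/1 (normal model, common variance)."""
    n = len(y)
    mean_all = sum(y) / n
    rss0 = sum((v - mean_all) ** 2 for v in y)
    rss1 = 0.0
    for g in (0, 1):
        grp = [v for v, mj in zip(y, m) if mj == g]
        if grp:
            mu_g = sum(grp) / len(grp)
            rss1 += sum((v - mu_g) ** 2 for v in grp)
    return (n / 2.0) * math.log10(rss0 / rss1)


def permutation_threshold(y, geno, n_perm=1000, level=0.95, seed=0):
    """Empirical `level` quantile of the genome-wide maximum LOD under permutation.

    geno is a list with one genotype vector per marker.
    """
    rng = random.Random(seed)
    y_perm = list(y)
    max_lods = []
    for _ in range(n_perm):
        # Shuffling the trait breaks trait-marker association but keeps
        # the correlation among markers intact.
        rng.shuffle(y_perm)
        max_lods.append(max(marker_lod(y_perm, m) for m in geno))
    max_lods.sort()
    idx = min(n_perm - 1, math.ceil(level * n_perm) - 1)
    return max_lods[idx]


# Hypothetical data: 30 animals, 5 independent markers, no true QTL.
rng = random.Random(1)
geno = [[rng.randint(0, 1) for _ in range(30)] for _ in range(5)]
y = [rng.gauss(0.0, 1.0) for _ in range(30)]
thr = permutation_threshold(y, geno, n_perm=200)
```

A marker whose observed LOD exceeds `thr` would be declared significant at the genome-wide 5% level for this single trait; the threshold does not account for testing many transcripts.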

The major advantage of IM over MR at marker loci is that it gives more precise estimates of the

QTL locations and effects. However, the computational cost of IM is higher, and in the

case of dense genetic markers and complete genotype data, the advantage tends to be small.

More importantly, IM, like MR at marker loci, is still a single-QTL model.

3.2 Single Phenotype - Multiple QTL Models

The principal reasons for modelling multiple QTLs are to increase sensitivity and to achieve better

separation of linked QTLs. Also, epistasis (i.e., interactions between alleles at different QTLs) can

only be identified through multiple-QTL models. When a single QTL model is fit but there are

in fact multiple QTLs, the genetic variation due to other segregating QTLs is incorporated into

the environmental variation. This reduces sensitivity.

A straightforward extension of MR to model multiple QTLs is multiple regression, which

includes a number of different markers in the model, rather than looking at them one at a time. Let

mij = 0 or 1, according to whether individual j has genotype AA or Aa at marker i. The model

becomes

$$ y_j = \mu + \sum_{i \in S} \beta_i m_{ij} + e_j \qquad (3.3) $$

where S is the set of markers for which βi is not 0. To implement this, one must find a way to

search through the model space to find such a set S. As the number of markers gets larger,

it becomes infeasible to consider every possible model in the model space. In addition, there

remains a question of how to choose between candidate models; some form of criterion is needed.

Broman (2002) looked at this problem in a model selection framework.

Direct use of the multiple regression analysis is not easy. Moreover, Zeng (1993) showed

that the partial regression coefficient is generally a biased estimate of the relevant QTL effect.

An approach for multiple QTL mapping that combines ideas from IM and multiple regression is

composite IM (CIM) (Zeng 1993, 1994). The method attempts to reduce the multidimensional

search for QTLs to a series of one-dimensional searches. It conditions on markers outside the

region of interest while performing IM to control for the effects of QTLs in other intervals. That

way, there will be better power for QTL detection and also the QTL effects can be estimated more

accurately.

CIM can be described as follows. One chooses a subset of markers, S, to control background

genetic variation. As in IM, suppose one wants to test for QTL between marker i and i + 1. For a

test at the kth location, between markers i and i + 1, the statistical model can be written as

$$ y_j = \mu + \beta m^*_{kj} + \sum_{l \neq i,\, i+1} \beta_l m_{lj} + e_j $$

where $\beta$ is the effect of the putative QTL. A general guideline in practice is to drop those

markers that are within 10 cM of the test position (Broman 2002 calls this subset of markers $S^*$).

Under this model, the contribution of each individual to the likelihood has the form of a mixture

of two normal distributions with means $\mu + \sum_{l \in S^*} \beta_l m_{lj}$ and $\mu + \beta + \sum_{l \in S^*} \beta_l m_{lj}$, with mixing

proportions equal to the conditional probabilities of the individual having QTL genotype 0 or 1,

given flanking marker genotypes and test position. Zeng (1994) used the ECM algorithm (Meng

and Rubin 1993) to obtain the maximum likelihood estimates.

As in IM, a LOD score is calculated at each test position, comparing the likelihood assuming

there is a QTL at the putative test locus, to the likelihood assuming that there is no QTL there.

The LOD score is then plotted as a function of test positions in the genome, and is compared to a

genome-wide threshold to declare significance.

Jansen independently developed a similar approach for handling multiple QTLs that combines IM and

MR, called multiple-QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994). It fits single QTL

models with selected markers as cofactors in the regression to eliminate the effects of possible

QTLs in other intervals.

The major problem with these approaches is how to choose the set of markers to be included

in the model. Too many markers will give low power for QTL detection, and too few will cause

low accuracy. Zeng (1994) compared the performance of including all other markers to including

only unlinked markers through simulation. He then recommended some combinations of deleting

or inserting some linked markers in practice. Jansen (1993) and Jansen and Stam (1994) used

backward elimination with AIC (Akaike 1969), or a slight variant, to pick the subset of markers

in the model. Broman (2002) recommended the use of the BICδ criterion, with the value δ chosen

by the approximate correspondence between BICδ and a genome-wide threshold on the LOD

score. Kao et al. (1999) proposed multiple IM (MIM), which uses multiple marker intervals

simultaneously to include multiple putative QTLs in the model. In MIM, Kao et al. (1999) adopted

stepwise selection with the likelihood ratio test (LRT) as the selection criterion to identify QTLs.

3.3 Multiple Phenotype - Single or Multiple QTL Models

In many QTL mapping studies, more than one trait is measured. Performing single

trait analysis repeatedly is clearly not optimal because it does not take into account the correlation

structure among the traits. It has been shown that analyzing traits jointly increases the power of

QTL detection (Jiang and Zeng 1995; Knott and Haley 2000). In these studies, a joint distribution

is imposed on the traits, which requires specification of the covariance structure of the traits

(for a review of multi-trait QTL mapping methods, see Lund et al. 2003 and references therein).

Multi-trait methods are very attractive in that they try to capture the inner structure among traits,

because many traits are genetically or environmentally correlated. However, as the number of traits

gets bigger, so does the number of parameters that need to be estimated.

4 ETL Mapping Methods

To date, most ETL mapping methods consider a single transcript at a time. Multi-trait mapping

using available methods has not been attempted. Most investigators recognize that this would be impossible

with thousands of traits since estimation of a phenotype covariance matrix is not feasible. We

detail the methods used below.

4.1 Transcript Based Approach

The earliest ETL mapping studies applied single phenotype-single QTL mapping methods to every

transcript (Brem et al. 2002; Schadt et al. 2003). We call this type of approach transcript

based (TB). In Brem et al. (2002), a Wilcoxon-Mann-Whitney rank sum test was applied to

every transcript and marker pair. Nominal p-values were reported and the number of linkages

expected by chance was estimated by permutation tests (Churchill and Doerge 1994). In Schadt

et al. (2003), transcript specific LOD score profiles were obtained using standard QTL IM. A

common genome-wide LOD score threshold was chosen to account for the potential increase of

type I error induced by testing across multiple markers. Neither study accounted for multiple tests

across transcripts. We view this as a problem that needs serious attention because in ETL studies,

we usually deal with thousands of traits.

4.2 Transcript Based Approach with FDR control

Recently, investigators have made attempts to account for multiplicities across transcripts (Chesler

et al. 2005; Hubner et al. 2005). They first computed genome-wide empirical p-values of the

maximum LOD score for every transcript using permutation tests and then estimated the q-values

(Storey and Tibshirani 2003) accordingly. Although this is an effort to adjust

for multiple testing across the transcripts, it is by no means a systematic way

to approach the multiple testing problem. Preliminary results from simulations such as those

described in Kendziorski et al. (2004) show that FDR is not properly controlled and power is

lower than that of other approaches. This is consistent with the results of Chesler et al. (2005),

where the q-value threshold was increased to 0.25 so that a reasonable number of transcripts could

be identified.

4.3 Marker Based Approach

Instead of conducting the analysis at every transcript, ETL mapping could be done by conducting

the analysis at every marker. We call these marker based (MB) approaches. They consist of

identifying differentially expressed (DE) transcripts across groups of animals where groups are

determined by the genotype at a given marker. Any DE method could be used. Usually, the DE

evidence threshold can be chosen to account for the multiple tests performed across the transcripts.

However, the MB approach does not consider multiplicities across the genome.

As was shown in Kendziorski et al. (2004), both TB and MB approaches share similar flaws.

In TB, separate tests are conducted for each transcript. In MB, each marker is tested separately.

For both, the evidence that a transcript maps to a marker is measured against the evidence

that it does not map there. Since in reality a transcript can map to any of the marker locations, the

evidence that a transcript maps to a particular marker should be judged relative to the possibility

that it maps nowhere or to some other marker. This idea motivates what we call the Mixture Over

Markers (MOM) model (Kendziorski et al. 2004).

4.4 Mixture Over Markers Model

Let $y_t$ be the expression level for the $t$th transcript, $y_t = \{y_{t1}, y_{t2}, \ldots, y_{tn}\}$, where $n$ is the number of

animals in the ETL study. The MOM model assumes a transcript $t$ maps nowhere with probability

$p_0$ and maps to marker $m$ with probability $p_m$, such that $\sum_{m=0}^{M} p_m = 1$, where $M$ denotes the total

number of markers. The marginal distribution of $y_t$ is then given by

$$ p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t) \qquad (4.4) $$

where $f_m$ is the predictive density of the data if transcript $t$ maps to marker $m$; $f_0$ is the

predictive density when the transcript maps nowhere. Specifically, suppose transcript abundance

measurements $y_{tj}$ arise independently from some observation distribution $f_{obs}(\cdot \mid \mu_{t,\cdot}, \theta)$. The

dependence among the underlying means $\mu_{t,\cdot}$ is captured by a distribution $\pi(\mu)$. With this setting,

$$ f_0(y_t) = \int \left( \prod_{j=1}^{n} f_{obs}(y_{tj} \mid \mu) \right) \pi(\mu)\, d\mu. $$

For a transcript that maps to marker $m$, say, the underlying

expression means defined by the marker genotype groups are not equal ($\mu_{t,0} \neq \mu_{t,1}$), but they both

are assumed to come from $\pi(\mu)$. The governing distribution for $y_t$ is then:

$$ f_m(y_t) = f_0(y_t^0)\, f_0(y_t^1) $$

where $y_t^{0(1)}$ denotes the set of transcript $t$ values for animals with genotype 0(1) at marker $m$.

Model fit proceeds via the EM algorithm. Once the parameter estimates are obtained, posterior

probabilities of mapping nowhere or to any of the M locations can be calculated via Bayes rule.

For instance, the posterior probability that transcript $t$ maps to location $l$, $l = 0, \ldots, M$, is given by

$$ \frac{p_l f_l(y_t)}{p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t)} \qquad (4.5) $$

With the MOM approach, a transcript is identified to be DE if the posterior probability of

DE exceeds some threshold, where the threshold is chosen to control the expected posterior false

discovery rate (Newton et al. 2004). In order to make a transcript specific call, the highest posterior

density (HPD) region can be constructed in a straightforward fashion. A 1 − α HPD region is

obtained by adding marker locations in decreasing order of posterior probability until the sum of the posterior probabilities exceeds

1 − α.
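
The posterior calculation in (4.5) and the HPD construction can be sketched as follows, taking the fitted mixing proportions and the log predictive densities as given. This is a hedged illustration only: the function names and numbers are hypothetical, the densities are handled on the log scale purely for numerical stability, and renormalizing the posterior over markers 1, . . . , M before building the region (that is, conditioning on the transcript mapping somewhere) is an assumption of this sketch, not a detail stated above.

```python
import math


def mom_posterior(log_f, p):
    """Posterior probabilities over l = 0..M as in (4.5).

    log_f[l] is log f_l(y_t); p[l] is the mixing proportion p_l (all p_l > 0).
    """
    log_w = [math.log(pl) + lf for pl, lf in zip(p, log_f)]
    top = max(log_w)  # subtract the maximum to avoid underflow
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def hpd_region(marker_post, alpha=0.05):
    """Markers (numbered from 1) added in decreasing posterior order until mass >= 1 - alpha."""
    order = sorted(range(len(marker_post)), key=lambda i: marker_post[i], reverse=True)
    region, mass = [], 0.0
    for i in order:
        region.append(i + 1)
        mass += marker_post[i]
        if mass >= 1.0 - alpha:
            break
    return sorted(region)


# Hypothetical values for one transcript and M = 4 markers.
p = [0.9, 0.025, 0.025, 0.025, 0.025]
log_f = [-10.0, -12.0, -4.0, -5.0, -11.0]
post = mom_posterior(log_f, p)

# Assumption of this sketch: condition on mapping somewhere before forming the region.
mapped = sum(post[1:])
region = hpd_region([v / mapped for v in post[1:]], alpha=0.05)
```

In this toy example most of the posterior mass falls on markers 2 and 3, so the 95% region contains those two markers.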

4.5 Current Status of ETL Mapping Methods

Here we present some of our early simulation results comparing different approaches. We

considered a single ETL simulation with two chromosomes. Marker data was obtained from chr 2

and chr 3 from the F2 data from Dr. Alan Attie’s lab on campus. Chromosome 2 has 17 markers

and there are 6 markers on chromosome 3. A single ETL was simulated at marker 5 of chromosome

2. We generated 20 data sets for each of seven values of ν0, a tuning parameter that

controls the variance pattern in the simulated data (for details, see Kendziorski et al. 2004), so that

operating characteristics could be evaluated without bias towards any one method.

We refer to applying MR to every transcript as transcript based MR (TB-MR). For TB-MR,

the genome wide type I error rate per transcript is controlled at 5% (Dupuis and Siegmund 1999).

We also have four marker based approaches. The first is EBarrays with LogNormal-Normal (LNN)

model where posterior probabilities of DE are computed for all the transcripts at every marker

(MB-EB). See Kendziorski et al. (2003) for details of the LNN model. The second one is a t-test

of equal means for every transcript, followed by calculations of q-values (Storey and Tibshirani

2003) at every marker (MB-Q). The third and fourth methods calculate moderated t-statistics using

SAM (Tusher et al. 2003) and LIMMA (Smyth 2004), followed by q-value calculation at every

marker. We refer to these as MB-SAM and MB-LIMMA. For MB-EB, MB-Q, MB-SAM, and

MB-LIMMA, the false discovery rate per marker is controlled at 5%.

A fifth method considered attempts to test all the transcripts and all the markers simultaneously.

P-values from t-tests for every transcript and marker pair were used to calculate q-values (Storey

and Tibshirani 2003) at once. Note that doing so assumes that a certain dependence structure

among tests holds (Storey 2003), which is unlikely to be true here. We nevertheless include

this method (Q-ALL) in our simulation in order to assess the performance of this kind of

ad-hoc procedure. In addition, Storey and Tibshirani (2003) use the ETL mapping data of Brem

et al. (2002) as motivation for considering calculation of all q-values simultaneously. For Q-ALL,

FDR control is targeted at 5%.

Power is defined as the probability of calling marker 4, 5 or 6 (flanking region of the true

ETL) on chromosome 2 for mapping transcripts. FDR is the proportion of transcripts identified

incorrectly as mapping to chromosome 2 or 3, i.e., they were EE or they were DE but mapped

outside the flanking region of the true ETL. Figure 1 shows the average power and FDR (over the

20 simulated data sets) against each ν0 value, together with 95% point-wise confidence intervals.

As can be seen, the power is between 80% and 90% for all of the methods; however, for all methods

except MOM, the FDR is well above 5% for most values of ν0. These simple and somewhat ad-hoc

approaches fail to control the FDR at their claimed level because they do not adjust for multiple

tests across the markers and the transcripts simultaneously.
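The definitions of power and FDR used above can be made concrete with a small sketch. It assumes each transcript receives at most one called marker and that markers are identified by a single integer label; both are simplifications introduced here, and the names and toy data are hypothetical.

```python
def power_and_fdr(calls, truth, flank):
    """calls: dict transcript -> called marker, or None if nothing is called.
    truth: dict transcript -> True if the transcript truly maps (DE).
    flank: set of markers counted as correct for a mapping transcript."""
    n_mapping = sum(1 for is_de in truth.values() if is_de)
    called = {t: mk for t, mk in calls.items() if mk is not None}
    # A call is correct only if the transcript is DE and the marker is in the flank.
    correct = sum(1 for t, mk in called.items() if truth[t] and mk in flank)
    power = correct / n_mapping if n_mapping else float("nan")
    fdr = (len(called) - correct) / len(called) if called else 0.0
    return power, fdr

truth = {"t1": True, "t2": True, "t3": False, "t4": False}
calls = {"t1": 5, "t2": 9, "t3": 4, "t4": None}
power, fdr = power_and_fdr(calls, truth, flank={4, 5, 6})
```

In this toy example one of the two mapping transcripts is called inside the flanking region, and two of the three calls are incorrect (one EE transcript, one DE transcript called outside the flank).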

Although MOM does adjust appropriately for these multiplicities, the approach has a number

of shortcomings. For experiments with sparse maps, the MOM model is lacking as information

between markers is not available. Also, MOM does not allow for multiple ETL. Some dense

maps are currently available or under development, particularly those that use single nucleotide

polymorphisms (SNPs) (see The SNP Consortium). Because of their proximity to harmful but unknown

mutations, SNPs may be shared among groups of people carrying those mutations and serve as markers for them.

Such markers can help to reveal the mutations and expedite therapeutic drug discovery. When

dense maps are available, the MOM model as it is may not be applicable as the number of mixture

components to fit will be huge.


5 Research Plan

The main objective of the proposed thesis is to develop a powerful statistical framework within

which ETL can be localized. The framework relies on the MOM model, in that it simultaneously

controls for tests done across the genome locations and across all the transcripts. However, the

proposed methods, detailed below, significantly increase the utility and applicability of MOM by

addressing the first two shortcomings listed above. The effort of developing a method to handle

dense maps is ongoing.

5.1 ETL Interval Mapping

The genomic regions identified using MOM are limited by their size, which may be large as

analysis is conducted at genotyped markers only. When dense maps are not available (Attie lab

data, Schadt et al. 2003), this limitation can be a serious one. The biological techniques currently

available to search for genes in large genomic regions (e.g. candidate gene approach, congenic

lines) can take years and, as a result, additional statistical methods capable of narrowing down

regions are necessary.

We here propose a method for IM of ETL. Consider for simplicity of notation a backcross

population genotyped as 0 or 1 at M markers. The observed phenotype data y is a T × n matrix

of transcript abundance levels for transcript t = 1, . . . , T and individuals j = 1, . . . , n; m is an

M × n matrix of marker genotypes for markers i = 1, . . . , M and individuals j = 1, . . . , n.

Considering a set of $L$ locations spanning the entire genome, we model the expression data for transcript $t$ as an $(L+1)$-component mixture. To be specific, we imagine that the transcript may map to nowhere with probability $p_0$, and to any of the $L$ locations with probability $p_l$, $l = 1, 2, \ldots, L$. The $p$'s are mixing proportions. As noted in Section 3.1, transcript $t$ is said to be linked (mapped) to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$. Let $z^l_t$ be an indicator of whether transcript $t$ maps to location $l$. If $l$ is at a marker, then we can decompose the predictive density under the alternative hypothesis such that $f_l(y_t) = f_0(y^0_t)\, f_0(y^1_t)$, where the grouping is determined by genotypes at that marker. However, when $l$ is between markers, the decomposition is no longer valid. Instead, we have

$$
f_l(y_t) = \int\!\!\int\!\!\int \prod_{j \in G^l_0} f_{obs}(y_{tj} \mid \mu_{t,0}) \prod_{j \in G^l_1} f_{obs}(y_{tj} \mid \mu_{t,1})\, \pi(\mu_{t,0})\, \pi(\mu_{t,1})\, p(g^l \mid m)\, d\mu_{t,0}\, d\mu_{t,1}\, dg^l = \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l
$$

where $g^l = (g^l_1, g^l_2, \ldots, g^l_n)$ denotes the unknown genotype vector at location $l$ and $G^l_{0(1)}$ denotes the set of individuals having genotype 0(1) at location $l$. Under the null hypothesis, the predictive density of the data, $f_0(y_t)$, can be calculated as before since it does not rely on genotype groupings.

Parameter estimates for mixing proportions and hyperparameters can be obtained via the EM algorithm (see Appendix A). We show that the posterior probability for transcript $t$ to be mapped to location $l$, after integrating out the $\mu$'s, is given by

$$
p(z^l_t = 1 \mid y, m) = \frac{p(z^l_t = 1) \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p(y_t \mid m)} \qquad (5.6)
$$

where $p(z^l_t = 1)$ is the prior probability that transcript $t$ maps to location $l$.

At a particular location $l$, the conditional distribution of genotype given the marker data, $p(g^l \mid m)$, is assumed to depend only on the two markers flanking $l$. Notice that $g^l$ is a vector of length $n$, so in theory there are $2^n$ possible genotype vectors and the integral in (5.6) is a huge mixture. In practice, one can restrict attention to a smaller number of vectors (Table 1, Zeng 1994, reproduced here as Table 2), since many of them have small probabilities. However, when the number of individuals in the study is large, this $2^n$ problem quickly becomes computationally infeasible.
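To illustrate the dependence on flanking markers, the following sketch computes the conditional probability of genotype 1 at a location between two markers for one backcross individual. It assumes Haldane's map function and no crossover interference, neither of which is stated in this section; the function names are hypothetical.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a distance in cM under Haldane's map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_g1_given_flanks(left, right, d_left, d_right):
    """P(genotype at l is 1 | flanking marker genotypes) in a backcross,
    assuming no crossover interference."""
    r1, r2 = haldane_r(d_left), haldane_r(d_right)

    def trans(a, b, r):
        # P(genotype b at one position | genotype a at the neighbouring position)
        return 1.0 - r if a == b else r

    w1 = trans(left, 1, r1) * trans(1, right, r2)
    w0 = trans(left, 0, r1) * trans(0, right, r2)
    return w1 / (w0 + w1)

# Location midway between two markers that are 10cM apart.
p11 = prob_g1_given_flanks(1, 1, 5.0, 5.0)   # both flanking genotypes are 1
p01 = prob_g1_given_flanks(0, 1, 5.0, 5.0)   # flanking genotypes differ
p00 = prob_g1_given_flanks(0, 0, 5.0, 5.0)   # both flanking genotypes are 0

# Number of possible genotype vectors for n = 100 individuals.
n_vectors = 2 ** 100
```

Because individuals are independent, the probability of a whole genotype vector is the product of such per-individual terms, which is why most of the $2^n$ vectors carry negligible probability.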

5.1.1 Pseudomarker-MOM

To get around the $2^n$ problem, we propose a general framework for ETL IM using importance sampling and pseudomarker generation. The idea of pseudomarkers was introduced by Sen and Churchill (2001) (see Appendix B). Let us introduce some slightly more general notation. Recall that a transcript $t$ is linked to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$. When this is the case, we say that transcript $t$ is in expression pattern P1 at location $l$, denoted $\mathrm{P1}^l_t$; similarly, $\mathrm{P0}^l_t$ denotes the null pattern of expression at location $l$ ($\mu^0_{t,l} = \mu^1_{t,l}$). It is useful to introduce specific patterns in our framework since, when an F2 population is considered, for example, there are numerous patterns (each defines a way in which the latent means can differ across genotype groups at a location; all of the non-null patterns imply linkage to that location). Two $T \times L$ matrices, θ0 and θ1, contain the latent mean levels of expression (θ = (θ0, θ1)); $L$ denotes the total number of locations considered. Let $z^l_t = 1$ if transcript $t$ is in expression pattern P1 at location $l$ and 0 otherwise. Then $z$ is a $T \times L$ indicator matrix specifying QTL locations for each transcript.

Let us first consider a simple case where a transcript is associated with at most one genomic

location l and consider inference at location l. This assumption simplifies algebraic development

and will be relaxed later. At location $l$, of primary interest is the posterior probability that $z^l_{t^*} = 1$ for transcript $t^*$. Re-expressing (5.6) in terms of the patterns, we have

$$
p(\mathrm{P}k^l_{t^*} \mid y, m) \propto p^l_{t^*,\mathrm{P}k} \int f_{\mathrm{P}k}(y_{t^*} \mid g^l)\, p(g^l \mid m)\, dg^l \qquad (5.7)
$$

where $p^l_{t^*,\mathrm{P}k}$ denotes the prior probability that transcript $t^*$ is in pattern $k$ at location $l$ and $f_{\mathrm{P}k}$ describes the predictive density of the data; $k = 0, 1$ for a backcross.

A normalizing constant is not required if further calculations of operating characteristics such as FDR are not of interest, which may be the case if a simple ranking of the genes is desired. For the model we propose here, we are interested in the calculation of estimated FDR. Therefore, we must specify the normalizing constant $p(y_{t^*} \mid m)$. Expanding the probability in terms of the different possible mapping locations, we have

$$
p(y_{t^*} \mid m) = p(y_{t^*} \mid m, z^{\cdot}_{t^*} = 0)\, p(z^{\cdot}_{t^*} = 0) + \sum_{l'=1}^{L} p(y_{t^*} \mid m, z^{l'}_{t^*} = 1)\, p(z^{l'}_{t^*} = 1),
$$

where $p(z^{\cdot}_{t^*} = 0) + \sum_{l'=1}^{L} p(z^{l'}_{t^*} = 1) = 1$ and $z^{\cdot}_{t^*} = 0$ indicates that the transcript does not map to any of the $L$ locations, i.e., the transcript is in pattern P0 at every location. Therefore, the exact form of (5.7) is given by

$$
p(\mathrm{P}k^l_{t^*} \mid y, m) = \frac{p^l_{t^*,\mathrm{P}k} \int f_{\mathrm{P}k}(y_{t^*} \mid g^l)\, p(g^l \mid m)\, dg^l}{\int \left( p_{t^*,\mathrm{P0}}\, f_{\mathrm{P0}}(y_{t^*}) + \sum_{l'=1}^{L} p^{l'}_{t^*,\mathrm{P1}}\, f_{\mathrm{P1}}(y_{t^*} \mid g^{l'}) \right) p(g^{l'} \mid m)\, dg^{l'}} \qquad (5.8)
$$

The $p_{t^*,\mathrm{P0}}$ and $p^l_{t^*,\mathrm{P1}}$'s are unknown. We estimate them from the data, using the average posterior probability that a transcript belongs to a particular pattern. Due to the $2^n$ problem, calculation of $p(g^l \mid y, m)$, the posterior distribution of the unobserved genotype at location $l$ given the expression and marker data, becomes computationally prohibitive. Here we use importance sampling, first simulating multiple versions of pseudomarkers from $p(g^l \mid m)$ and then replacing the exact integral with its Monte Carlo approximation.

In simulating the pseudomarkers, one can use a simple Markov Chain structure where the

putative QTL genotype at a given location only depends on two flanking markers. However, this

might not work well if there is genotyping error or noninformative markers in the marker data.

A hidden Markov Model (HMM) can be considered where the “true” marker genotypes follow a


Markov Chain, and the observed marker genotypes are characterized by distributions conditional

on the underlying state process. Using an HMM, one can account for genotyping error and missing

marker data in a coherent way. R/qtl (Broman et al. 2003) has this option as well. Figure 11 gives

an example. The upper stripe gives the marker data for one animal. Blue denotes AA and yellow

denotes Aa. Missing marker data are shown in light blue. The lower panel shows 20

realizations sampled from the HMM trained on the marker data.
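The simple Markov chain version of pseudomarker simulation can be sketched as follows for a single backcross individual. This is not the HMM used by R/qtl: it assumes fully observed, error-free marker genotypes, Haldane's map function, and grid positions that lie within the marker range. All names and the toy map are hypothetical.

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a distance in cM under Haldane's map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def sample_pseudomarkers(marker_pos, marker_geno, grid, rng):
    """Draw one 0/1 genotype at each grid position, conditional on the
    marker genotypes and on draws already made to the left."""
    draws = []
    prev_pos, prev_g = None, None
    for x in sorted(grid):
        li = max(i for i, p in enumerate(marker_pos) if p <= x)
        ri = min(i for i, p in enumerate(marker_pos) if p >= x)
        if li == ri:
            g = marker_geno[li]            # x sits on a marker
        else:
            left_pos, left_g = marker_pos[li], marker_geno[li]
            # Use the previous draw as left neighbour if it is closer.
            if prev_pos is not None and prev_pos > left_pos:
                left_pos, left_g = prev_pos, prev_g
            r1 = haldane_r(x - left_pos)
            r2 = haldane_r(marker_pos[ri] - x)
            w = []
            for g_try in (0, 1):
                a = 1.0 - r1 if g_try == left_g else r1
                b = 1.0 - r2 if g_try == marker_geno[ri] else r2
                w.append(a * b)
            g = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        draws.append(g)
        prev_pos, prev_g = x, g
    return draws

rng = random.Random(0)
marker_pos = [0.0, 10.0, 20.0]        # cM
marker_geno = [0, 0, 1]
grid = list(range(0, 21, 2))          # 2cM pseudomarker grid
realizations = [sample_pseudomarkers(marker_pos, marker_geno, grid, rng)
                for _ in range(20)]
```

Each realization agrees with the observed genotypes at the markers and varies only in the intervals between them, most often in the interval where the flanking genotypes differ.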

Suppose for each location $l'$, $Q$ genotype vectors are sampled from the proposal distribution $p(g \mid m)$, where $g = \{g^1, g^2, \ldots, g^L\}$, to give $(g^{l'}_1, g^{l'}_2, \ldots, g^{l'}_Q)$ for $l' = 1, \ldots, L$. Then equation (5.8) can be approximated by Monte Carlo integration using importance samples (see Appendix C):

$$
p(\mathrm{P}k^l_{t^*} \mid y, m) \approx \frac{p^l_{t^*,\mathrm{P}k} \sum_{q=1}^{Q} f_{\mathrm{P}k}(y_{t^*} \mid g^l_q)}{p_{t^*,\mathrm{P0}} \sum_{q=1}^{Q} f_{\mathrm{P0}}(y_{t^*}) + \sum_{l'=1}^{L} p^{l'}_{t^*,\mathrm{P1}} \sum_{q=1}^{Q} f_{\mathrm{P1}}(y_{t^*} \mid g^{l'}_q)} \qquad (5.9)
$$

This approach is an extension of the MOM model evaluated by averaging over the pseudomarkers.

We call it pseudomarker-MOM.
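A minimal sketch of the approximation in (5.9) for one transcript in a backcross is given below. The predictive density is an assumption made for illustration: observations are taken to be normal around a latent mean that itself has a normal prior, with fixed hyperparameters, which stands in for the LNN model rather than reproducing it. The mixing proportions are supplied by hand instead of being estimated by EM, and all names are hypothetical.

```python
import math

def log_f0(ys, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Log marginal density of observations sharing one latent mean,
    with y | mu ~ N(mu, sigma2) and mu ~ N(mu0, tau2) (assumed model)."""
    n = len(ys)
    if n == 0:
        return 0.0
    s1 = sum(y - mu0 for y in ys)
    s2 = sum((y - mu0) ** 2 for y in ys)
    quad = (s2 - tau2 / (sigma2 + n * tau2) * s1 ** 2) / sigma2
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - 0.5 * math.log(1 + n * tau2 / sigma2) - 0.5 * quad)

def log_f1(ys, g):
    """Log predictive density under P1: one latent mean per genotype group."""
    y0 = [y for y, gj in zip(ys, g) if gj == 0]
    y1 = [y for y, gj in zip(ys, g) if gj == 1]
    return log_f0(y0) + log_f0(y1)

def posterior_59(ys, pseudo, p0, p1):
    """pseudo[l] holds Q sampled genotype vectors at location l; p0 and p1[l]
    are prior mixing proportions. Returns (P(null), [P(maps to l)])."""
    logs = [math.log(p0) + log_f0(ys)]
    for l, samples in enumerate(pseudo):
        terms = [log_f1(ys, g) for g in samples]
        mx = max(terms)
        # Log of the average over the Q pseudomarker realizations.
        log_mean = mx + math.log(sum(math.exp(t - mx) for t in terms) / len(terms))
        logs.append(math.log(p1[l]) + log_mean)
    mx = max(logs)
    w = [math.exp(v - mx) for v in logs]
    total = sum(w)
    return w[0] / total, [v / total for v in w[1:]]

ys = [-2.0, -2.1, -1.9, -2.0, 2.0, 2.1, 1.9, 2.0]
g_linked = [0, 0, 0, 0, 1, 1, 1, 1]      # genotype that splits the two groups
g_unlinked = [0, 1, 0, 1, 0, 1, 0, 1]    # genotype unrelated to expression
pseudo = [[g_linked, g_linked], [g_unlinked, g_unlinked]]
p_null, p_loc = posterior_59(ys, pseudo, p0=0.8, p1=[0.1, 0.1])
```

Dividing the numerator and denominator of (5.9) by $Q$ turns the sums into averages, which is what the code computes on the log scale for numerical stability.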

5.1.2 Two-Stage Approach

Pseudomarker-MOM scans the genome at some small distance step in order to find potential

locations to which transcripts are mapped. But applying it over the entire genome, even for each

chromosome separately, can be computationally prohibitive. We propose a two-stage approach

where we first apply MOM at markers to identify interesting regions, then follow up by applying

pseudomarker-MOM to those regions to better localize ETLs.

For each transcript, MOM calculates the posterior probability that it maps to a particular marker

or does not map at all. To identify potential ETL regions, we average the linkage evidence across

all the transcripts to give a marker-specific linkage score. We can then choose hot-spot markers

by thresholding the marker-specific linkage evidence, using an HPD region.
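The first stage can be sketched as below. The way the HPD region is formed here, as the smallest set of markers whose normalized scores reach a given level, is an assumption made for illustration since the text does not spell out the construction; the function names and toy posterior probabilities are hypothetical.

```python
def marker_scores(post):
    """post[t][i]: posterior probability that transcript t maps to marker i.
    Returns the average over transcripts for each marker."""
    n_t = len(post)
    n_m = len(post[0])
    return [sum(post[t][i] for t in range(n_t)) / n_t for i in range(n_m)]

def hpd_markers(scores, level=0.95):
    """Smallest set of markers whose normalized scores sum to at least level."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += scores[i] / total
        if mass >= level:
            break
    return sorted(chosen)

# Toy posterior probabilities for three transcripts at four markers.
post = [[0.05, 0.70, 0.20, 0.00],
        [0.00, 0.60, 0.30, 0.05],
        [0.10, 0.80, 0.10, 0.00]]
scores = marker_scores(post)
hot_spots = hpd_markers(scores, level=0.9)
```

The markers returned would then define the regions over which pseudomarker-MOM is run on a finer grid.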

5.1.3 Theoretical Result

We here show that picking the regions with the highest average linkage evidence is

indeed the correct thing to do under simplified conditions.

Theorem 1: If we assume transcripts map to at most one genomic location and the prior

probability of mapping to a particular marker is known and equal for all the markers, and

hyperparameter values are known, then the expected posterior probability of a transcript mapping


to a particular marker is a non-increasing function of the recombination frequency between that

marker and the ETL.

Proof: See Appendix D.

As a result of Theorem 1, we can be sure that under these conditions, the interesting marker

regions picked by using MOM will be those regions that are the closest to the ETL. Once these

regions are defined, we set up an equally spaced pseudomarker grid and use pseudomarker-MOM

to help localize the ETL with greater accuracy. We show by simulation studies in the next

section that this approach works quite well, even under more general conditions.

5.1.4 Simulation Results

A simulation was set up where there are 5000 transcripts and 100 individuals. The proportion

of differentially expressed (DE) transcripts was 10%. The hypothetical marker map consists of one

chromosome with markers equally spaced 10cM apart. There are 10 markers in total (i.e.,

from 0cM to 90cM). Intensity values are simulated as described in Section 4.4. Two simulations

were considered: one with a single ETL at 35cM and one with two ETLs: one at 35cM and one at

75cM.
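The genotype part of such a simulation can be sketched as follows for the single-ETL case. It assumes a backcross and Haldane's map function, which are choices made here for illustration; the simulation of intensity values follows Section 4.4 and is not reproduced. The function name and seed are hypothetical.

```python
import math
import random

def simulate_backcross(positions_cm, n, rng):
    """Simulate 0/1 backcross genotypes at ordered positions for n individuals,
    using Haldane's map function (no interference)."""
    geno = []
    for _ in range(n):
        g = [rng.randint(0, 1)]
        for a, b in zip(positions_cm, positions_cm[1:]):
            r = 0.5 * (1.0 - math.exp(-2.0 * (b - a) / 100.0))
            # Switch genotype with probability r between adjacent positions.
            g.append(1 - g[-1] if rng.random() < r else g[-1])
        geno.append(g)
    return geno

rng = random.Random(1)
markers = list(range(0, 100, 10))            # 10 markers, 0cM to 90cM
positions = sorted(markers + [35])           # add the ETL position at 35cM
geno = simulate_backcross(positions, 100, rng)
etl_index = positions.index(35)
etl_geno = [g[etl_index] for g in geno]      # unobserved in the analysis
marker_geno = [[g[i] for i, p in enumerate(positions) if p != 35] for g in geno]
```

The ETL genotype drives the expression of mapping transcripts but is withheld from the analysis, so only `marker_geno` would be passed to MOM or pseudomarker-MOM.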

The two-stage approach was applied in the simulation. Specifically, two hot-spot marker regions

were selected from the average posterior probability profile at every marker, obtained by MOM.

Pseudomarker-MOM was then applied across a 2cM grid within the hot-spot regions, with 50

pseudomarker realizations (Q = 50). For comparison, we applied traditional QTL IM transcript

by transcript on the same data set and obtained genome-wide cutoffs on LOD scores based on an

approximation formula from Rebai et al. (1994). Here we show results from the two simulations.

Figure 2 (left panel) shows the average posterior probability profile, averaged across the

mapping transcripts. The ETL region is identified both by MOM and pseudomarker-MOM.

However, MOM picks up a wide peak between marker 4 and marker 5, whereas pseudomarker-

MOM identifies the ETL location with much better accuracy. To ensure that non-mapping

transcripts were not falsely identified, we considered the posterior probability profiles averaged

across non-mapping transcripts. They show little structure, as expected (figure not shown).

Figure 2 (right panel) shows the 96.8% HPD regions for true mapping transcripts (96.8% is

used to compare with IM results below). As shown, the ETL is identified correctly for most of

the mapping transcripts. As in the left panel, pseudomarker-MOM provides good ETL

localization.


In comparison, we also considered IM on the same simulated data set. This was implemented

using R/qtl (Broman et al. 2003). Figure 3 (left panel) is the average LOD score, averaged across

mapping transcripts. The region containing the ETL has the highest average LOD, but the average

LOD scores are very high overall for mapping transcripts, and it is not clear what cutoff one should

use in order to correctly identify the ETL region. For example, a cutoff of 5 selects almost

the entire simulated chromosome.

To compare with Figure 2, a “confidence interval” was constructed around the ETL using a

1-LOD drop interval around peak LOD score (Mangin et al. 1994). This procedure is designed

to approximate a 96.8% confidence interval, but in general, these intervals can be biased in that

they are too small, and a bootstrap procedure has been recommended (Visscher et al. 1996). In the

ETL setting, obtaining bootstrap samples for thousands of transcripts does not seem feasible. On

the other hand, confidence intervals that are slightly too small favor IM here, as ETL appear to be

better localized. It is not always clear which peaks to construct confidence intervals around. To

give IM the best results, we consider a 10 cM window around the true ETL (35cM) and define the

LOD peak as the highest LOD within the window. The 1-LOD drop interval is then constructed.

Of course, in practice, one does not have the luxury of knowing where to choose these peaks and

perhaps only the largest peak would be identified. For these reasons, this method of identifying

ETL regions favors IM. Even so, in comparison to pseudomarker MOM, IM does not provide as

precise estimates of ETL location.
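The 1-LOD drop construction described above can be sketched for a LOD profile evaluated on a grid. The interval here is read off the grid without interpolation, which is a simplification, and the function name and toy profile are hypothetical.

```python
def lod_drop_interval(positions, lod, window=None, drop=1.0):
    """Return (peak position, lower, upper) for a LOD support interval.
    window: optional (lo, hi) range in which the peak is searched."""
    idx = [i for i, p in enumerate(positions)
           if window is None or window[0] <= p <= window[1]]
    peak = max(idx, key=lambda i: lod[i])
    cutoff = lod[peak] - drop
    # Extend left and right while the profile stays above the cutoff.
    lo = peak
    while lo > 0 and lod[lo - 1] >= cutoff:
        lo -= 1
    hi = peak
    while hi < len(lod) - 1 and lod[hi + 1] >= cutoff:
        hi += 1
    return positions[peak], positions[lo], positions[hi]

positions = [20, 25, 30, 35, 40, 45, 50, 55]          # cM
lod = [0.5, 1.0, 2.0, 3.2, 4.0, 3.5, 2.1, 1.0]
# Search for the peak in a 10cM window around the true ETL at 35cM.
interval = lod_drop_interval(positions, lod, window=(30, 40))
```

Restricting the peak search to a window around the true ETL mirrors the choice made in the comparison, which is only possible in a simulation.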

Similar results were obtained for the 2-ETL case. As shown in Figure 4 (left panel), for a

particular simulation, the two ETL regions are identified both by MOM and pseudomarker-MOM.

However, spurious peaks show up in different places in different simulations, but in general

they have low average posterior probability compared with the main peaks (e.g., in the bottom row). As

shown, pseudomarker-MOM improves localization of the ETL compared with MOM. In Figure

4 (right panel), the distinct ETLs are identified for most of the mapping transcripts. We see that

pseudomarker-MOM provides good ETL localization.

We again considered IM on the same two simulated data sets. Figure 5 (left panel) shows the

average LOD score profile, averaged across mapping transcripts. The regions containing the ETLs

have the highest average LOD, but distinct ETLs are not as clear as in Figure 3. As before, 96.8%

“confidence intervals” were constructed around each ETL using a 1-LOD drop interval around peak LOD

scores, with peaks chosen from the two 10cM windows around the true ETLs (35cM and 75cM).

IM suffers from the same problem as before in that it has relatively more false positive calls.


5.2 Multiple ETL mapping

The MOM approach can be extended to account for multiple ETL. For example, if transcript $t$ is possibly affected by two genotype locations $l_1$ and $l_2$, then four latent means are of interest: $\mu^{0,0}_{t,(l_1,l_2)}$, $\mu^{0,1}_{t,(l_1,l_2)}$, $\mu^{1,0}_{t,(l_1,l_2)}$ and $\mu^{1,1}_{t,(l_1,l_2)}$, where $\mu^{j,k}_{t,(l_1,l_2)}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype $(j, k)$ at locations $l_1$ and $l_2$. These latent means can be arranged in 15 possible expression patterns, all of which may be of interest. For simplicity, we consider:

P0: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$

P1: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$

P2: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$

P3: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$

P4: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$

P5: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$

Pattern P0 allows for the possibility that a transcript maps to neither location. The latent means of

a transcript mapping only to location l1 would satisfy pattern P1 since only allelic differences at l1

affect the mean level of expression. Similarly, the means of transcripts mapping only to location l2

satisfy P2. Patterns P3-P5 describe possible ways in which the allelic variation at both locations

can act and interact to affect expression level. P3 describes a scenario in which the alleles at each

location have equal, but not dominant, effects. A dominant model would be described by P4 and

an additive model by P5.

The multiple ETL MOM model (M-MOM) has $5\binom{M}{2} + 1$ patterns. As before, of primary interest is the posterior probability of particular expression patterns. These can be calculated as in (4.5) for any pattern of interest, where $k = 0, 1, \ldots, 5$.
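The two counts quoted in this section can be checked directly. An expression pattern corresponds to a partition of the four latent means into blocks of equal values, so the number of patterns is the number of set partitions of four items. The sketch below enumerates them and evaluates the M-MOM component count; the labels and function names are hypothetical.

```python
from math import comb

def set_partitions(items):
    """Yield every partition of items into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put the first item into an existing block ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

means = ["mu00", "mu01", "mu10", "mu11"]
patterns = list(set_partitions(means))
n_patterns = len(patterns)

def mmom_components(n_markers, non_null=5):
    """Non-null patterns for every marker pair, plus the single null pattern."""
    return non_null * comb(n_markers, 2) + 1

n_components = mmom_components(10)
```

With the 10-marker map of Section 5.1.4, the model therefore has 226 mixture components, compared with 11 for single-ETL MOM at markers.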

5.2.1 Simulation Results

We apply M-MOM to the simulated 2-ETL data described in Section 5.1.4. We perform a

two dimensional scan, by looking at every possible marker pair and all the expression patterns

simultaneously. In the simulation, the two ETL effect sizes are generated to be the same, thus

corresponding to pattern 3. If we look at the average posterior probabilities for all non-null patterns,

most of them are negligible as expected, except for pattern 3 (figure not shown).

Looking closely at pattern 3, we plot the average posterior probabilities for every marker pair

in Figure 6. The first ETL location is on the Y axis and the second ETL location on the X axis. As


shown, the plot locates the two ETLs at 30cM and 80cM, which is very close to the truth and the

best accuracy that M-MOM can achieve, since the true ETLs are not at markers.

For comparison purposes, we implement a 2-D marker regression scan on the same data set

(see Figure 7). In the upper triangle, we plot the average LOD score for epistasis, which is very

low, as expected. The diagonal is the average LOD score from the 1-D IM scan. The lower

triangle gives the average joint LOD scores. Also shown in the plot are the contour lines over

the range of the LOD scores obtained. The region from 30cM to 40cM and from 70cM to 90cM has

relatively high average LOD scores compared to the others. In order to assess significance,

we randomly sample 10 transcripts corresponding to every 10th percentile of the log of expression

means, and perform permutation tests of size 1000 on each of them. The average 95th percentile of

the LOD scores from each of the 1000 permutation tests is about 3.2. Using this as our 2-D LOD

score cutoff, we see from the contour lines that it gives a much wider region than the actual ETL.

In line with the previous comparisons, using traditional QTL mapping techniques on the ETL data

seems to yield more false positives.
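The permutation step can be sketched as below. For brevity the sketch uses a one-dimensional single-marker regression scan rather than the two-dimensional scan described above, so it illustrates the procedure rather than reproducing the cutoff of 3.2; the data are random and all names are hypothetical.

```python
import math
import random

def lod_marker(y, g):
    """LOD score from regressing y on a 0/1 genotype at one marker."""
    n = len(y)
    ybar = sum(y) / n
    rss0 = sum((v - ybar) ** 2 for v in y)
    rss1 = 0.0
    for k in (0, 1):
        grp = [v for v, gj in zip(y, g) if gj == k]
        if grp:
            m = sum(grp) / len(grp)
            rss1 += sum((v - m) ** 2 for v in grp)
    return 0.5 * n * math.log10(rss0 / rss1)

def permutation_threshold(y, genos, n_perm, rng, q=0.95):
    """Empirical q-th quantile of the maximum LOD over markers when the
    phenotype is permuted relative to the genotypes."""
    maxima = []
    for _ in range(n_perm):
        yp = y[:]
        rng.shuffle(yp)
        maxima.append(max(lod_marker(yp, g) for g in genos))
    maxima.sort()
    return maxima[min(n_perm - 1, int(q * n_perm))]

rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(40)]                 # one transcript
genos = [[rng.randint(0, 1) for _ in range(40)] for _ in range(5)]
threshold = permutation_threshold(y, genos, 200, rng)
```

Repeating this for a handful of transcripts and averaging the resulting thresholds corresponds to the averaging over the 10 sampled transcripts described above.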

6 Future Research Questions

The proposed framework for ETL mapping enables IM of single ETL and the identification of

multiple ETL at markers. Preliminary results from simulated data sets are encouraging. A number

of questions remain to be explored further.

1. Our methods are extensions of the MOM model and a number of questions of

implementation remain open. One question that is important to our extension is:

How sparse is sparse? In other words, how many markers can MOM handle relative

to the number of transcripts? Generally, one might also be concerned about whether

MOM could fit mixtures with so many components. We did a series of simulations,

trying to shed some light on this. Consider a backcross, and suppose the genome has 400

equally spaced markers, 5cM apart. Of the transcripts, 25% are DE, mapping

to 80% of the markers with equal probabilities. We varied the number of transcripts

to be 100, 1000, 5000 and 40000, representing very small, small, moderate and large

numbers of transcripts in a realistic experiment, with the number of animals being 10, 60 and 100.

Here 10 might be a little unrealistic, but we were curious about what the results would look like.


We plot the posterior probabilities for all the transcripts of a particular mixture component

in the MOM model in Figures 8, 9 and 10. The number of transcripts mapping to that

mixture component is 1, 1, 4 and 22, corresponding to the four values of the total number of

transcripts. The true DE transcripts are colored in red. Surprisingly, it seems that MOM can

fit a mixture model with the number of components bigger than the number of transcripts

pretty well (first panel in Figure 9 and Figure 10), as long as there are enough animals in the

experiment. Our impression is that when the number of animals is small, there tend to be

very few recombinations. The degree of distinctiveness between the MOM components is then

not very high, resulting in relatively low power (see Figure 8).

When the number of transcripts is small, there are also very few DE transcripts mapped

to every marker. We might expect the posterior probabilities of those transcripts not to be

very accurate. But as seen from Figure 9, this does not seem to be the case. There are

60 animals, but even when we only have 100 transcripts, MOM still detects the one DE

transcript with posterior probability 1.0. When there are 40000 transcripts, among the

top 22 transcripts with the highest posterior probabilities, 20 of them are true DE (22 DE in

total). The posterior probabilities range from 0.714 to 1.0. In Figure 10, when there are 100

animals and 40000 transcripts, 21 out of the top 22 transcripts are DE, and their posterior

probabilities range from 0.76 to 1.0. Much more work is required to investigate these results

and define precise conditions under which parameters are well estimated.

2. Importance sampling is applied in pseudomarker-MOM. Importance samples are taken from

the proposal distribution p(gl|m) to approximate what we really desire p(gl|y, m). We

will investigate ways to choose Q so that we balance between computational burden and

accuracy. Perhaps a lower bound on Q can be obtained.

3. The proposed two-stage approach relies on the ability of MOM to identify the correct

“interesting” regions for follow-up. We have shown theoretically that under simplified

conditions, the highest posterior probability region obtained using MOM is that region

closest to ETL. The simplified conditions need to be relaxed. In particular, we will

investigate the case of multiple ETL and varying prior probabilities.

4. We have developed a method for mapping multiple-ETL. Preliminary results from

simulations investigating a 2 ETL system are encouraging. We would like to extend this


approach so that interval mapping can be done. All the questions we want to answer for

pseudomarker-MOM in the single ETL case will be applicable here, with greater complexity.

5. Dense SNP maps are becoming available. When dense maps are available, fitting all the marker components

in the MOM model would be impossible. Time permitting, a model selection tool that allows

MOM followed by pseudomarker-MOM to proceed will be considered.


References

1. Brem, R.B., G. Yvert, R. Clinton, and L. Kruglyak. (2002). Genetic Dissection of

Transcriptional Regulation in Budding Yeast. Science 296, 752-755.

2. Broman, K.W., and T.P. Speed. (2002). A model selection approach for the identification

of quantitative trait loci in experimental crosses (with discussion). Journal of the Royal

Statistical Society Series B 64, 641-656 and 737-775 (discussion).

3. Broman, K.W., H. Wu, S. Sen, and G.A. Churchill. (2003). R/qtl: QTL mapping in

experimental crosses. Bioinformatics 19 (7), pages 889-890.

4. Carlin, B., and T. Louis. (1998). Bayes and Empirical Bayes Methods for Data Analysis.

Chapman & Hall, New York, New York.

5. Churchill, G.A. and R.W. Doerge. (1994). Empirical threshold values for quantitative trait

mapping. Genetics, 138, 963-971.

6. Cox, N.J. (2004). An expression of interest. Nature 430, 733-734.

7. Doerge, R. W., Zeng, Z.-B. and Weir, B. S. (1997). Statistical issues in the search for genes

affecting quantitative traits in experimental populations. Statis. Sci., 12, 195-219.

8. Dupuis, J., P.O. Brown and D. Siegmund. (1995). Statistical methods for linkage analysis of

complex traits from high resolution maps of identity by descent. Genetics, 140: 843-856.

9. Dupuis, J., and D. Siegmund. (1999). Statistical Methods for Mapping Quantitative Trait

Loci From a Dense Set of Markers. Genetics 151, 373-386.

10. Feingold, E., P.O. Brown and D. Siegmund. (1993). Gaussian models for genetic linkage

analysis using complete high resolution maps of identity-by-descent. Am. J. Hum. Genet.

53: 234-251.

11. Haley, C., and S. Knott. (1992). A simple regression method for mapping quantitative trait

loci in line crosses using flanking markers. Heredity 69, 315-324.

12. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., Maciver F., Mueller M.,

Hummel, O., Monti, J., Zidek, V., Musilova, A., Kren, V., Causton, H., Game, L., Born, G.,


Schmidt, S., Muller, A., Cook, S.A., Kurtz, T.W., Whittaker, J., Pravenec, M., and Aitman,

T.J. (2005). Integrated transcriptional profiling and linkage analysis for identification of

genes underlying disease. Nature Genetics, Vol 37, Number 3, 243-253.

13. Irizarry, R.A., B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf,

and T.P. Speed. (2003). Exploration, Normalization, and Summaries of High Density

Oligonucleotide Array Probe Level Data. Biostatistics 4(2), 249-264.

14. Jansen, R. (1993). A general mixture model for mapping quantitative trait loci by using

molecular markers. Theoretical and Applied Genetics 85, 252-260.

15. Jansen, R., and P. Stam. (1994). High resolution of quantitative traits into multiple loci via

interval mapping. Genetics 136, 1447-1455.

16. Jansen, R., and J.P. Nap (2001). Genetical genomics: the added value from segregation.

Trends in Genetics 17 388-391.

17. Jansen, R. C. (2003). Studying complex biological systems using multifactorial perturbation.

Nature Rev. Genet. 4: 145-151.

18. Jiang, C., and Z-B. Zeng. (1995). Multiple trait analysis of genetic mapping of quantitative

trait loci. Genetics 140: 1111-1127.

19. Kendziorski, C.M., Chen, M., Yuan, M., Lan, H., and A.D. Attie. (2004). Statistical

Methods for Expression Trait Loci (ETL) Mapping. Department of Biostatistics and Medical

Informatics Technical Report #184, submitted.

20. Kendziorski, C.M., Newton, M.A., Lan, H., and M.N. Gould. (2003). On parametric

empirical Bayes methods for comparing multiple groups using replicated gene expression

profiles. Statistics in Medicine, 22, 3899-3914.

21. Knott, S.A., and C.S. Haley. (2000). Multitrait least squares for quantitative trait loci

detection. Genetics 156: 899-911.

22. Lan, H., Rabaglia, M.E., Stoehr, J.P., Nadler, S.T., Schueler, K.L., Zou, F., Yandell, B.S.,

and A.D. Attie. (2003a). Gene expression profiles of non-diabetic and diabetic obese mice

suggest a role of hepatic lipogenic capacity in diabetes susceptibility. Diabetes, 52(3), 688-

700.


23. Lan, H., Stoehr, J.P., Nadler, S.T., Schueler, K.L., Yandell, B.S., and A.D. Attie. (2003b).

Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics 164,

1607-1614.

24. Lander, E. S., and D. Botstein. (1989). Mapping mendelian factors underlying quantitative

traits using RFLP linkage maps. Genetics 121, 185-199.

25. Lund, M.S., Sorenson, P., Guldbrandtsen, B., and D.A. Sorensen. (2003). Multitrait Fine

Mapping of Quantitative Trait Loci Using Combined Linkage Disequilibria and Linkage

Analysis. Genetics 163(1), 405-410.

26. Lynch, M. and B. Walsh (1998). Genetics and Analysis of Quantitative Traits. Sunderland:

Sinauer.

27. Mangin, B., Goffinet, B., and A. Rebai. (1994). Constructing Confidence Intervals for QTL

Location. Genetics 138, 1301-1308.

28. Morley, M., C.M. Molony, T.M. Weber, J.L. Devlin, K.G. Ewens, R.S. Spielman, and V.G.

Cheung. (2004). Genetic analysis of genome-wide variation in human gene expression.

Nature 430, 743-747.

29. Nguyen, D.V., Arpat, A.B., Wang, N., and R.J. Carroll. (2002). DNA Microarray

Experiments: Biological and Technological Aspects. Biometrics 58, 701-717.

30. Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and K.W. Tsui. (2001).

On differential variability of expression ratios: Improving statistical inference about gene

expression changes from microarray data. Journal of Computational Biology, 8, 37-52.

31. Newton, M.A., Noueiry, A., Sarkar, D., and P. Ahlquist. (2004). Detecting differential gene

expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155-176.

32. R Development Core Team (2004). R: A language and environment for statistical computing.

R Foundation for Statistical Computing, Vienna, Austria.

33. Rebai, A., Goffinet, B. and B. Mangin. (1994). Approximate thresholds for interval mapping

test for QTL detection. Genetics 138: 235-240.


34. Rebai, A., Goffinet, B. and B. Mangin. (1995). Comparing Power of Different Methods of

QTL Detection. Biometrics 51, 87-99.

35. Sax, K. (1923). The association of size differences with seed-coat pattern and pigmentation

in Phaseolus vulgaris. Genetics 8, 552-560.

36. Schadt, E., Monks, S., Drake, T.A., Lusis, A.J., Che, N., Colinayo, V., Ruff, T.G., Milligan,

S.B., Lamb, J.R., Cavet, G., Linsley, P.S., Mao, M., Stoughton, R.B., and S.H. Friend.

(2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297-

302.

37. Sen, S., and G.A. Churchill. (2001). A Statistical Framework for Quantitative Trait

Mapping. Genetics 159, 371-387.

38. Storey, J.D., and R. Tibshirani. (2003). Statistical significance for genomewide studies.

Proceedings of the National Academy of Sciences 100(16), 9440-9445.

39. Storey, J.D. (2003). The positive false discovery rate: A Bayesian interpretation and the

q-value. Annals of Statistics, 31, 2013-2035.

40. Thoday, J.M. (1961). Location of Polygenes. Nature 191: 368-370.

41. Yvert, G., R.B. Brem, J. Whittle, J.M. Akey, E. Foss, E.N. Smith, R. Mackelprang, and L.

Kruglyak. (2003). Trans-acting regulatory variation in Saccharomyces cerevisiae and the

role of transcription factors. Nature Genetics 35(1), 57-64.

42. Zeng, Z.B. (1993). Theoretical basis of separation of multiple linked gene effects on

mapping quantitative trait loci. Proceedings of the National Academy of Sciences 90, 10972-

10976.

43. Zeng, Z.B. (1994). Precision of mapping of quantitative trait loci. Genetics 136, 1457-1468.


Table 1: Probability of QTL genotype conditional on the flanking markers.

                      QTL genotype
Mi    M(i+1)    Aa                       AA
Aa    Aa        (1−rL)(1−rR)/(1−r)       rL rR/(1−r)
Aa    AA        (1−rL)rR/r               (1−rR)rL/r
AA    Aa        (1−rR)rL/r               (1−rL)rR/r
AA    AA        rL rR/(1−r)              (1−rL)(1−rR)/(1−r)

where rL is the recombination frequency between marker i and the putative QTL, rR is the recombination frequency between marker (i + 1) and the putative QTL, and r is the recombination frequency between markers i and i + 1.
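The entries of Table 1 can be computed directly from the two recombination frequencies. The sketch below is only an illustration and is not part of the original analysis: the function name `qtl_genotype_probs` is hypothetical, and it assumes no crossover interference so that r = rL + rR − 2 rL rR.

```python
def qtl_genotype_probs(r_left, r_right):
    """Return Pr(QTL genotype | flanking marker genotypes) for a backcross.

    Assumption (not stated in Table 1): no crossover interference, so the
    recombination frequency between the two markers is
    r = rL + rR - 2*rL*rR.
    """
    r = r_left + r_right - 2.0 * r_left * r_right
    no_rec = (1.0 - r_left) * (1.0 - r_right)  # no recombination on either side
    dbl_rec = r_left * r_right                 # recombination on both sides
    rec_right = (1.0 - r_left) * r_right       # recombination on the right only
    rec_left = (1.0 - r_right) * r_left        # recombination on the left only
    # Keys: (genotype at marker i, genotype at marker i+1).
    # Values: probabilities of QTL genotype Aa and AA, as in Table 1.
    return {
        ("Aa", "Aa"): {"Aa": no_rec / (1.0 - r), "AA": dbl_rec / (1.0 - r)},
        ("Aa", "AA"): {"Aa": rec_right / r, "AA": rec_left / r},
        ("AA", "Aa"): {"Aa": rec_left / r, "AA": rec_right / r},
        ("AA", "AA"): {"Aa": dbl_rec / (1.0 - r), "AA": no_rec / (1.0 - r)},
    }


# Example with arbitrary recombination frequencies.
probs = qtl_genotype_probs(0.05, 0.10)
```

Under the no-interference assumption each row of the table sums to one, which gives a quick sanity check on the reconstructed entries.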

Table 2: Reproduced from Table 1 of Zeng (1994).

         Marker genotype
Group    i     i + 1    Pr(x∗ = 1)
1        +     +        1
2        +     −        1 with prob (1 − p); 0 with prob p
3        −     +        1 with prob p; 0 with prob (1 − p)
4        −     −        0

where x∗ is the unknown genotype in the marker interval and p = rL/r. Double recombination within the marker interval is ignored.


[Figure 1 image: two panels plotting Power and FDR against ν0, for ν0 from 5^−5 to 5^5. Legend: TB-MR, MB-EB, MB-Q, Q-ALL, MOM, SAM, LIMMA.]

Figure 1: Average power and FDR (over 20 simulated data sets) for each ν0, with 95% point-wise

confidence intervals.


[Figure 2 image: left panel, Average Posterior Probability against position (cM), legend MOM and pseudo-MOM; right panels, Mapping Transcripts against position (cM).]

Figure 2: The left panel gives the average posterior probability profiles calculated across mapping

transcripts. The ETL is marked by a vertical line. The right panels show the 500 mapping

transcripts (rows) from the left. For each transcript, genome locations inside (outside) the 96.8%

HPD region for pseudomarker-MOM are colored red (blue). Note that 96.8% is used to compare

with confidence intervals in Figure 3, which are based on existing methods.


[Figure 3 image: left panel, Average LOD Score against position (cM), legend TB-MR and IM; right panels, Mapping Transcripts against position (cM).]

Figure 3: The left panel gives the average LOD score calculated across mapping transcripts. The

ETL is marked by a vertical line. The right panels show the 500 mapping transcripts (rows). For

each transcript, genome locations within (outside) the 96.8% confidence region for the IM method

are colored red (blue).


[Figure 4 image: two simulations (top and bottom). Left panels, Average Posterior Probability against position (cM), legend MOM and pseudo-MOM; right panels, Mapping Transcripts against position (cM).]

Figure 4: The left panel gives the average posterior probability profiles calculated across mapping

transcripts for two simulations (top and bottom). Each ETL is marked by a vertical line. The

right panels show the 500 mapping transcripts (rows) from each simulation on the left. For each

transcript, genome locations inside (outside) the 96.8% HPD region for pseudomarker-MOM are

colored red (blue). Note that 96.8% is used to compare with confidence intervals in Figure 5, which

are based on existing methods.


[Figure 5 image: two simulations (top and bottom). Left panels, Average LOD Score against position (cM), legend TB-MR and IM; right panels, Mapping Transcripts against position (cM).]

Figure 5: The left panel gives the average LOD score calculated across mapping transcripts for

the two simulations shown in Figure 4. Each ETL is marked by a vertical line. The right panels

show the 500 mapping transcripts (rows). For each transcript, genome locations within (outside)

the 96.8% confidence region for the IM method are colored red (blue).


[Figure 6 image: 2-D MOM Scan. Axes: cM (QTL 1) and cM (QTL 2), each from 0 to 90; color scale from 0 to 0.08.]

Figure 6: Average posterior probability of pattern 3 from the 2-D MOM scan, for all marker pairs.

The Y axis is for the first ETL, and the X axis is for the second. The diagonal is the average

posterior probability from the 1-D MOM scan.


[Figure 7 image: 2-D Marker Regression Scan. Axes: cM (QTL 1) and cM (QTL 2), each from 0 to 90; color scale from 0 to 3.5.]

Figure 7: Average LOD scores from 2-D marker regression scan. Upper triangle is the average

LOD scores for epistasis; the lower triangle is the average LOD scores comparing the full model

with the null model; the diagonal is the average LOD score from a 1-D marker regression scan.

The contour lines are also shown.


[Figure 8 image: four panels showing posterior prob for data sets of 100, 1000, 5000 and 40000 transcripts.]

Figure 8: Posterior probabilities of a mixture component. The data consist of 10 animals. The numbers of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE transcripts.


[Figure 9 image: four panels showing posterior prob for data sets of 100, 1000, 5000 and 40000 transcripts.]

Figure 9: Posterior probabilities of a mixture component. The data consist of 60 animals. The numbers of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE transcripts.


[Figure 10 image: four panels showing posterior prob for data sets of 100, 1000, 5000 and 40000 transcripts.]

Figure 10: Posterior probabilities of a mixture component. The data consist of 100 animals. The numbers of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE transcripts.


[Figure 11 image (maps.out): marker positions at 0.0, 19.6, 25.5, 42.9, 50.1, 62.9, 69.6, 77.9, 85.4, 98.7, 110.5 and 121.6; pseudomarker grid from 2 to 116.]

Figure 11: Example of genotype sampling using the HMM. The upper thin panel consists of the

marker data for one animal. The lower panel has 20 samples of pseudomarkers. Blue indicates

AA, yellow for Aa and light blue for missing data.


Appendix A

EM in interval mapping

Let $z_t$ and $g_t$ denote the mapping location and QTL genotype for transcript $t$, $t = 1, \ldots, T$. $z_t$ takes values in $\{0, 1, \ldots, L\}$; $z_t = 0$ means that the transcript does not map anywhere in the genome. $g_t$ is a vector of length $n$ (the number of animals in the experiment). Denote $z = \{z_1, \ldots, z_T\}$, $g = \{g_1, \ldots, g_T\}$ and $y = \{y_1, \ldots, y_T\}$.

The complete joint likelihood of the data is:

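\[
\begin{aligned}
L_c(\theta) &= p(m)\,p(\theta) \prod_{t=1}^{T} p(y_t \mid z_t, g_t, \theta)\, p(z_t)\, p(g_t \mid m) \\
&= p(m)\,p(\theta) \prod_{t=1}^{T} \left( \left[p_0 f_0(y_t)\right]^{I(z_t = 0)} \prod_{l=1}^{L} \left[p_l f_l(y_t \mid g^l_t)\right]^{I(z_t = l)} p(g_t \mid m) \right)
\end{aligned}
\]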

The Q function can be written as:

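\[
\begin{aligned}
Q(\theta, \theta^{(i-1)}) &= E\left[\log(L_c(\theta)) \mid y, m, \theta^{(i-1)}\right] \\
&= \sum_{z_1} \cdots \sum_{z_T} \int_{g_1} \cdots \int_{g_T} \left[ \sum_{t=1}^{T} \log(p(z_t)) \right] p(z, g \mid y, m, \theta^{(i-1)}) \\
&\quad + \sum_{z_1} \cdots \sum_{z_T} \int_{g_1} \cdots \int_{g_T} \left[ \sum_{t=1}^{T} \log(p(g_t)) \right] p(z, g \mid y, m, \theta^{(i-1)}) \\
&\quad + \sum_{z_1} \cdots \sum_{z_T} \int_{g_1} \cdots \int_{g_T} \left[ \sum_{t=1}^{T} \log(p(y_t \mid z_t, g_t)) \right] p(z, g \mid y, m, \theta^{(i-1)}) + \text{const} \\
&= (I) + (II) + (III) + \text{const}
\end{aligned}
\]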

We can maximize $Q(\theta, \theta^{(i-1)})$ by maximizing (I), (II) and (III) individually. It can be shown that in the M-step, $\hat{p}(g_t) = p(g_t \mid y, m)$ and $\hat{p}(z_t) = \int p(z_t \mid y_t, g_t)\, p(g_t \mid y, m)\, dg_t$. (III) can be maximized using numerical methods in R.

The mixing proportion $p_l$, $l = 1, \ldots, L$, can thus be estimated as $\frac{1}{T} \sum_{t=1}^{T} \int p(z_t = l \mid y_t, g_t)\, p(g_t \mid y, m)\, dg_t$, and the estimate for $p_0$ can be expressed as $\frac{1}{T} \sum_{t=1}^{T} p(z_t = 0 \mid y_t)$, since it does not depend on the QTL genotype.

The posterior probability of transcript t mapping to the lth location can be written as:

\[
p(z^l_t = 1 \mid y, m) = \frac{p(z^l_t = 1)\, p(y_t \mid z^l_t = 1, m)}{p(y_t \mid m)} = \frac{p(z^l_t = 1) \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p(y_t \mid m)}
\]

where $p(z^l_t = 1)$ is the prior probability.
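The update for the mixing proportions and the posterior mapping probabilities can be illustrated with a small sketch. It simplifies the setting of this appendix by treating the QTL genotypes as known, so that the integral over the genotypes drops out and each component density is a single number. The function name `em_mixing_proportions` and the toy densities are hypothetical.

```python
def em_mixing_proportions(dens, n_iter=200):
    """EM for the mixing proportions p_0, ..., p_L of a finite mixture.

    dens[t][l] is the density of transcript t under component l
    (l = 0 is the null, non-mapping component). Simplifying assumption:
    genotypes are treated as known, so no integration over g is needed.
    Returns (p, post), where post[t][l] is the E-step posterior
    probability from the final iteration.
    """
    T, K = len(dens), len(dens[0])
    p = [1.0 / K] * K  # start from equal proportions
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each transcript.
        post = []
        for row in dens:
            joint = [p[l] * row[l] for l in range(K)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: each proportion is the average posterior probability.
        p = [sum(post[t][l] for t in range(T)) / T for l in range(K)]
    return p, post


# Toy example: 4 transcripts, a null component and L = 2 locations.
dens = [[1.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 3.0], [1.0, 0.2, 0.1]]
p_hat, post = em_mixing_proportions(dens)
```

In the full model each entry of `dens` would itself be an average over genotypes, which is what the pseudomarker and importance sampling schemes of Appendices B and C approximate.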

Appendix B

Multiple Imputation: Pseudomarker Algorithm

Sen and Churchill (2001) proposed the pseudomarker algorithm to implement Bayesian QTL

analysis. Suppose that quantitative traits are measured for n members of an inbred line cross.

Denote the traits by y = (y1, y2, . . . , yn)′ and denote the corresponding marker data by the n × M

matrix m where M denotes the total number of markers. Marker location and genetic distances

are assumed known, although in practice these quantities are estimated.

A genetic model H describes the way in which QTL genotypes determine phenotype. A model

is prescribed by the number of QTL, their locations, and the way in which they act and interact

to affect the phenotype. Let µ denote the parameters of the genetic model. Assuming there are p

QTL in the genetic model, let γ denote the p-dimensional vector of QTL locations and g denote

the n × p matrix of QTL genotypes. The authors observed that one can decompose the problem

into two parts conditional on the unknown QTL genotypes.

\[
p(y, m, g, \mu, \gamma) = \big( p(y \mid g, \mu)\, p(\mu) \big) \big( p(g \mid m, \gamma)\, p(m)\, p(\gamma) \big) \tag{6.10}
\]

Conditional on the QTL genotypes, the genetic part of the problem can be solved independently

from the linkage part.

Of primary interest is the posterior distribution of QTL locations, p(γ|y, m) given by:

\[
p(\gamma \mid y, m) \propto p(\gamma) \int p(y \mid g)\, p(g \mid m, \gamma)\, dg \tag{6.11}
\]

An exact evaluation of the above equation is computationally prohibitive, but a Monte Carlo

approximation can be obtained by first sampling multiple versions of the putative QTL genotypes

and then averaging the results. The detailed pseudomarker algorithm is as follows:

1. Select a regularly spaced grid G of pseudomarker locations and create q realizations of

the pseudomarkers by sampling from p(g|m). Assuming known genetic distances and no

crossover interference, a Markov chain sampling scheme can be used. Each realization of

pseudomarker genotypes is an n × G matrix.


2. For the assumed genetic model H, a p-dimensional vector of pseudomarker locations

corresponding to the QTL, γH , is prescribed; and the ith realization of pseudomarker

genotypes provides gi(γH), an n×p matrix of pseudomarker genotypes at the QTL locations.

3. For each realization, calculate the weight under the assumed genetic model H . The weight

for the ith realization is:

\[
W_H(g_i(\gamma_H)) = p(y \mid g = g_i(\gamma_H))\, p(\gamma = \gamma_H) \tag{6.12}
\]

If the prior on the QTL locations is uniform, then $p(\gamma = \gamma_H)$ can be dropped.

4. An average of these weights approximates (6.11):

\[
p(\gamma \mid y, m) \approx C \cdot \sum_{i=1}^{q} W_H(g_i(\gamma_H))
\]

for some constant of proportionality $C$.
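Steps 2 to 4 can be sketched for a single-QTL backcross model as follows. This is a minimal illustration under stated assumptions, not the implementation of Sen and Churchill (2001): the imputed genotypes are taken as given (step 1 is not shown), the prior on locations is uniform, and $p(y \mid g)$ is replaced by a plug-in normal likelihood rather than one integrated over the model parameters. The function names are hypothetical.

```python
import math


def plugin_likelihood(y, geno):
    """Stand-in for p(y | g): normal likelihood with plug-in group means and
    pooled variance (an assumption; a full Bayesian analysis would
    integrate over the model parameters instead)."""
    groups = {0: [], 1: []}
    for yi, gi in zip(y, geno):
        groups[gi].append(yi)
    rss = 0.0
    for vals in groups.values():
        if vals:
            mean = sum(vals) / len(vals)
            rss += sum((v - mean) ** 2 for v in vals)
    n = len(y)
    var = max(rss / n, 1e-12)  # guard against a zero variance
    return math.exp(-0.5 * n * (math.log(2.0 * math.pi * var) + 1.0))


def pseudomarker_posterior(y, realizations):
    """Approximate p(gamma | y, m) on a grid for a single-QTL model.

    realizations[i][j][k] is the genotype (0 or 1) of animal j at grid
    location k in the i-th imputation. A uniform prior on locations is
    assumed, so p(gamma) drops out of the weights.
    """
    n_loc = len(realizations[0][0])
    avg_weight = []
    for k in range(n_loc):
        weights = [plugin_likelihood(y, [animal[k] for animal in real])
                   for real in realizations]
        avg_weight.append(sum(weights) / len(weights))
    total = sum(avg_weight)  # plays the role of the constant C
    return [w / total for w in avg_weight]


# Toy data: 6 animals, 2 grid locations, 2 imputations.
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
real1 = [[0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]
real2 = [[0, 1], [0, 0], [0, 0], [1, 1], [1, 1], [1, 0]]
posterior = pseudomarker_posterior(y, [real1, real2])
```

In this toy example the genotypes at the first grid location separate the low and high trait values, so that location receives most of the posterior mass.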

Appendix C

Importance Sampling in pseudomarker-MOM

To fit pseudomarker-MOM, we need to sample all the unknown genotypes over the $L$ locations simultaneously. Denote $g = \{g^1, g^2, \ldots, g^L\}$, where $g^l$ is the unknown genotype vector at location $l$, $l = 1, \ldots, L$. Specifically, we want to sample from $p(g \mid y, m)$. Direct sampling is computationally prohibitive. One can get around this difficulty using importance sampling. Note that

p(g|y, m) ∝ p(y|g)p(g|m)

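One can instead sample from $p(g \mid m)$ and weight each sample by the importance weight $p(y \mid g)$, which can be obtained as $\int p(y \mid z_t, g)\, p(z_t)\, dz_t$. The posterior mapping probability is then approximated by

\[
\begin{aligned}
p(z^l_t \mid y_t, m) &= \int p(z^l_t \mid y_t, g)\, p(g \mid y_t, m)\, dg \\
&\approx \frac{\sum_{q=1}^{Q} p(z^l_t \mid y_t, g_q)\, p(y_t \mid g_q)}{\sum_{q=1}^{Q} p(y_t \mid g_q)} \\
&= \frac{p_l \sum_{q=1}^{Q} f_l(y_t \mid g^l_q)}{\sum_{q=1}^{Q} \left( p_0 f_0(y_t) + \sum_{l=1}^{L} p_l f_l(y_t \mid g^l_q) \right)}
\end{aligned}
\]

where $g_q = \{g^1_q, g^2_q, \ldots, g^L_q\}$ is the $q$th sample from $p(g \mid m)$.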
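The ratio above can be evaluated directly once the component densities are available for each sampled genotype. The following is a minimal sketch; the function name `mapping_posterior`, the argument layout and the toy numbers are hypothetical.

```python
def mapping_posterior(p, f0, f):
    """Importance-sampling estimate of Pr(z_t = l | y_t, m).

    p[0] is the null proportion p_0; p[l], l = 1..L, are the mapping
    proportions. f0 is the null density f_0(y_t). f[q][l - 1] is
    f_l(y_t | g_q^l) for the q-th genotype sample drawn from p(g | m).
    Returns a list of length L + 1; entry 0 is the non-mapping probability.
    """
    L = len(p) - 1
    Q = len(f)
    # Importance weight of sample q: p_0 f_0 + sum_l p_l f_l(y_t | g_q^l).
    weights = [p[0] * f0 + sum(p[l] * f[q][l - 1] for l in range(1, L + 1))
               for q in range(Q)]
    denom = sum(weights)
    post = [Q * p[0] * f0 / denom]  # the null density does not depend on g
    for l in range(1, L + 1):
        post.append(p[l] * sum(f[q][l - 1] for q in range(Q)) / denom)
    return post


# Toy example: L = 2 locations, Q = 3 genotype samples.
post = mapping_posterior([0.5, 0.25, 0.25], 0.1,
                         [[2.0, 0.1], [1.5, 0.2], [2.5, 0.1]])
```

By construction the returned probabilities sum to one over the null component and the $L$ locations.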


Appendix D

Theorem 1

Assumptions:

1. For each mapping transcript, there is only 1 ETL.

2. The prior probability of a transcript mapping to the ith component is known, and equal for

all the components.

3. The hyper-parameters are assumed to be known.

Result:

For each mapping transcript, the pattern that is closest to the true pattern will have the highest posterior probability among all the patterns fit in the MOM model.

Proof:

For algebraic simplicity, consider a backcross.

We assume the LNN model described in Kendziorski et al. (2003) and use that same notation. The

log predictive density of mRNA expression for transcript t, assuming the null pattern, is:

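\[
\begin{aligned}
\log(f_0(y_t)) = {} & -\frac{n}{2}\log(2\pi) - \frac{n - 1}{2}\log(\sigma^2) - \frac{1}{2}\log(\sigma^2 + n\tau_0^2) \\
& - \frac{\sum_{i=1}^{n}(y_{it} - \mu_0)^2}{2(\sigma^2 + n\tau_0^2)} + \frac{\tau_0^2\left[\left(\sum y_{it}\right)^2 - n \sum y_{it}^2\right]}{2\sigma^2(\sigma^2 + n\tau_0^2)}
\end{aligned}
\tag{A-1}
\]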

For non-null patterns, the log predictive density is $\log(f_1(y_t)) = \log(f_0(y^0_t)) + \log(f_0(y^1_t))$, where $y^0_t$ and $y^1_t$ denote the $t$th transcript intensities for the animals that have genotype 0 and 1, respectively.
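As a check on the form of (A-1), the null and non-null log predictive densities can be coded directly. This is a sketch for illustration only; the function names `log_f0` and `log_f1` are hypothetical, and the hyper-parameters are passed in as known values, in line with assumption 3.

```python
import math


def log_f0(y, mu0, sigma2, tau2):
    """Log predictive density (A-1) for one group of measurements y."""
    n = len(y)
    s = sum(y)                    # sum of the observations
    ss = sum(v * v for v in y)    # sum of squared observations
    shrunk = sigma2 + n * tau2
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * (n - 1) * math.log(sigma2)
            - 0.5 * math.log(shrunk)
            - sum((v - mu0) ** 2 for v in y) / (2.0 * shrunk)
            + tau2 * (s * s - n * ss) / (2.0 * sigma2 * shrunk))


def log_f1(y0, y1, mu0, sigma2, tau2):
    """Non-null pattern: the two genotype groups are modeled independently."""
    return log_f0(y0, mu0, sigma2, tau2) + log_f0(y1, mu0, sigma2, tau2)
```

For a single observation the expression reduces to the log density of a normal distribution with mean $\mu_0$ and variance $\sigma^2 + \tau_0^2$, which is a convenient sanity check.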

Suppose at the ETL, the segregating population has n1 and n2 animals having genotype 0 and 1,

respectively. Then, the log predictive density of pattern 1 at ETL can be written as

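\[
\begin{aligned}
\log(f_1^*(y_t)) = {} & -\frac{n_1}{2}\log(2\pi) - \frac{n_1 - 1}{2}\log(\sigma^2) - \frac{1}{2}\log(\sigma^2 + n_1\tau_0^2) \\
& - \frac{\sum_{i \in P^0}(y^0_{it} - \mu_0)^2}{2(\sigma^2 + n_1\tau_0^2)} + \frac{\tau_0^2\left[\left(\sum y^0_{it}\right)^2 - n_1 \sum (y^0_{it})^2\right]}{2\sigma^2(\sigma^2 + n_1\tau_0^2)} \\
& -\frac{n_2}{2}\log(2\pi) - \frac{n_2 - 1}{2}\log(\sigma^2) - \frac{1}{2}\log(\sigma^2 + n_2\tau_0^2) \\
& - \frac{\sum_{i \in P^1}(y^1_{it} - \mu_0)^2}{2(\sigma^2 + n_2\tau_0^2)} + \frac{\tau_0^2\left[\left(\sum y^1_{it}\right)^2 - n_2 \sum (y^1_{it})^2\right]}{2\sigma^2(\sigma^2 + n_2\tau_0^2)}
\end{aligned}
\tag{A-2}
\]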


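Since $n_1 + n_2$ is fixed for a given experiment, comparison of (A-2) is equivalent to comparison of

\[
-\frac{1}{2}\log(\sigma^2 + n_1\tau_0^2) - \frac{1}{2}\log(\sigma^2 + n_2\tau_0^2) + \frac{\tau_0^2\left(\sum y^0_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2}{2\sigma^2(\sigma^2 + n_1\tau_0^2)} + \frac{\tau_0^2\left(\sum y^1_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2}{2\sigma^2(\sigma^2 + n_2\tau_0^2)}
\tag{A-3}
\]

for $n_1$, $n_2$ and $y^0_{it}$, $y^1_{it}$ evaluated at each MOM component.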

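Now, consider a MOM pattern that is described according to a test location $l$ that has recombination frequency $r$ to the ETL. The assignment of $y^0_{it}$ and $y^1_{it}$ changes according to the genotype at that location; denote the new assignments by ${y'}^0_{it}$ and ${y'}^1_{it}$, respectively. (A-3) evaluated at $l$ becomes

\[
\begin{aligned}
& -\frac{1}{2}\log\left(\sigma^2 + (n_1(1 - r) + n_2 r)\tau_0^2\right) - \frac{1}{2}\log\left(\sigma^2 + (n_2(1 - r) + n_1 r)\tau_0^2\right) \\
& + \frac{\tau_0^2\left(\sum {y'}^0_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2}{2\sigma^2\left(\sigma^2 + (n_1(1 - r) + n_2 r)\tau_0^2\right)} + \frac{\tau_0^2\left(\sum {y'}^1_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2}{2\sigma^2\left(\sigma^2 + (n_2(1 - r) + n_1 r)\tau_0^2\right)}
\end{aligned}
\tag{A-4}
\]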

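We therefore have

\[
\sum_i {y'}^0_{it} + \frac{\mu_0\sigma^2}{\tau_0^2} \sim N\left(\mu_0(n_1(1 - r) + n_2 r) + \frac{\mu_0\sigma^2}{\tau_0^2},\; v_0^2\right)
\]

\[
\sum_i {y'}^1_{it} + \frac{\mu_0\sigma^2}{\tau_0^2} \sim N\left(\mu_0(n_2(1 - r) + n_1 r) + \frac{\mu_0\sigma^2}{\tau_0^2},\; v_1^2\right)
\]

where $v_0^2 = (\sigma^2 + \tau_0^2)(n_1(1 - r) + n_2 r) + \tau_0^2(n_1(n_1 - 1)(1 - r)^2 + n_2(n_2 - 1)r^2)$ and $v_1^2 = (\sigma^2 + \tau_0^2)(n_2(1 - r) + n_1 r) + \tau_0^2(n_2(n_2 - 1)(1 - r)^2 + n_1(n_1 - 1)r^2)$.

Thus,

\[
\frac{\left(\sum_i {y'}^0_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2}{v_0^2} \sim \chi^2_1(\delta)
\]

with noncentrality parameter $\delta = \left(\mu_0(n_1(1 - r) + n_2 r) + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2 / v_0^2$. Similarly for $\left(\sum_i {y'}^1_{it} + \frac{\mu_0\sigma^2}{\tau_0^2}\right)^2 / v_1^2$.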

Taking the expectation of (A-4), we are left with

\[
-\frac{1}{2}\log\left(\sigma^2 + (n_1(1 - r) + n_2 r)\tau_0^2\right) - \frac{1}{2}\log\left(\sigma^2 + (n_2(1 - r) + n_1 r)\tau_0^2\right) + (I) + (II)
\tag{A-5}
\]


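where

\[
(I) = \frac{\tau_0^2}{2\sigma^2} \cdot \frac{(\sigma^2 + \tau_0^2)(n_1(1 - r) + n_2 r) + \tau_0^2\left(n_1(n_1 - 1)(1 - r)^2 + n_2(n_2 - 1)r^2\right)}{\sigma^2 + \tau_0^2(n_1(1 - r) + n_2 r)} + \frac{\mu_0^2\left(\sigma^2 + \tau_0^2(n_1(1 - r) + n_2 r)\right)}{2\sigma^2\tau_0^2}
\]

\[
(II) = \frac{\tau_0^2}{2\sigma^2} \cdot \frac{(\sigma^2 + \tau_0^2)(n_2(1 - r) + n_1 r) + \tau_0^2\left(n_2(n_2 - 1)(1 - r)^2 + n_1(n_1 - 1)r^2\right)}{\sigma^2 + \tau_0^2(n_2(1 - r) + n_1 r)} + \frac{\mu_0^2\left(\sigma^2 + \tau_0^2(n_2(1 - r) + n_1 r)\right)}{2\sigma^2\tau_0^2}
\]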

We want to show that (A-5) is decreasing in $r$; in other words, that it reaches its maximum when $r = 0$. Taking the partial derivative with respect to $r$ of the first two terms, we have

\[
-\frac{1}{2} \cdot \frac{\tau_0^4 (n_1 - n_2)^2 (1 - 2r)}{\left(\sigma^2 + \tau_0^2(n_1(1 - r) + n_2 r)\right)\left(\sigma^2 + \tau_0^2(n_2(1 - r) + n_1 r)\right)},
\]

which is $\le 0$ since $0 \le r \le \frac{1}{2}$.

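The partial derivative of (I) + (II) with respect to $r$ is more involved. It can be shown that

\[
\frac{\partial (I)}{\partial r} + \frac{\partial (II)}{\partial r} = \frac{\tau_0^2}{2\sigma^2} \left[ \frac{rB + C}{\left(\sigma^2 + \tau_0^2(n_1(1 - r) + n_2 r)\right)^2 \left(\sigma^2 + \tau_0^2(n_2(1 - r) + n_1 r)\right)^2} \right]
\tag{A-6}
\]

where

\[
\begin{aligned}
B = {} & 2\tau_0^2 n_1(n_1 - 1)(\sigma^2 + n_2\tau_0^2)^2(2\sigma^2 + (n_1 + n_2)\tau_0^2) + 2\tau_0^2 n_2(n_2 - 1)(\sigma^2 + n_1\tau_0^2)^2(2\sigma^2 + (n_1 + n_2)\tau_0^2) \\
& - 2\sigma^2\tau_0^2(\sigma^2 + \tau_0^2)(n_2 - n_1)^2(2\sigma^2 + (n_1 + n_2)\tau_0^2)
\end{aligned}
\]

and $C = -\frac{B}{2}$.

When $r = \frac{1}{2}$, (A-6) $= 0$, and when $r = 0$, (A-6) $< 0$, since

\[
B = (2n_1 n_2 - n_1 - n_2)(n_1 n_2 \tau_0^4 + \sigma^2\tau_0^2(n_1 + n_2) + \sigma^4) > 0
\]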

Therefore, the partial derivative of (A-4) with respect to r is < 0. (A-2) is maximized in expectation

when r = 0. Assuming equal prior probabilities of patterns, we show that the posterior probability

of the pattern described by a test location that is closest to the ETL is the highest.
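The monotonicity claim can be spot-checked numerically by evaluating (A-5) on a grid of recombination frequencies. The sketch below is a numerical illustration rather than a proof; the function name `expected_score` and the parameter values are arbitrary choices made for this example.

```python
import math


def expected_score(r, n1, n2, sigma2, tau2, mu0):
    """Evaluate (A-5), the expectation of (A-4), at recombination frequency r."""
    def part(m, a, b):
        # m: expected group size at the test location;
        # a, b: group sizes at the ETL weighted by (1 - r)^2 and r^2.
        var = ((sigma2 + tau2) * m
               + tau2 * (a * (a - 1) * (1 - r) ** 2 + b * (b - 1) * r ** 2))
        shrunk = sigma2 + tau2 * m
        return (-0.5 * math.log(shrunk)
                + tau2 / (2.0 * sigma2) * var / shrunk
                + mu0 ** 2 * shrunk / (2.0 * sigma2 * tau2))
    m0 = n1 * (1 - r) + n2 * r
    m1 = n2 * (1 - r) + n1 * r
    return part(m0, n1, n2) + part(m1, n2, n1)


# Grid r = 0, 0.05, ..., 0.5 for a balanced backcross of 60 animals.
grid = [i / 20.0 for i in range(11)]
values = [expected_score(r, 30, 30, 1.0, 0.5, 2.0) for r in grid]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))
```

For these values the expected score falls steadily as $r$ moves from 0 to 1/2, in agreement with the result; other group sizes and hyper-parameters can be substituted to explore the behavior further.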
