Characterizing allelic associations from unphased diploid data by graphical modeling



Alun Thomas

Department of Medical Informatics and Center for High Performance Computing, University of Utah, Salt Lake City, Utah

A method for estimating a graphical model to describe allelic associations between genetic loci is extended to use diploid genotypes rather than haploid data. It also provides haplotype frequency estimates and estimates of phase for sampled individuals, and allows for missing or partial information, genotyping errors, and arbitrary penetrance functions. A data set of 688 unrelated individuals genotyped on 25 genetic loci is used to illustrate haplotype frequency estimation. The frequencies obtained are shown to be similar to those obtained by the PHASE program. We also illustrate how putative loci for traits can be included in the analysis in order to detect allele-phenotype associations. Haplotype reconstruction is illustrated on a standard set of 40 male X-chromosome haplotypes, randomly paired to simulate diploid genotype data. The novel method is shown to be comparable to most existing ones, though the PHASE method does consistently better on this problem. The graphical model approach is shown to have some advantages in terms of tractability and can be used to select informative subsets of loci and to map loci influencing phenotypes. Genet. Epidemiol. 29:23–35, 2005. © 2005 Wiley-Liss, Inc.

Key words: linkage disequilibrium; phase estimation; likelihood; triangulated graphs

Contract grant sponsor: NIH NIGMS; Contract grant number: R21 GM070710; NIH National Cancer Institute; Contract grant numbers: R01 CA90752, R01 CA89600; Contract grant sponsor: NIH; Contract grant number: N01-PC-67000; Contract grant sponsor: Utah Department of Health; Contract grant sponsor: University of Utah.

*Correspondence to: Alun Thomas, Genetic Epidemiology, 391 Chipeta Way, Suite D, Salt Lake City, UT 84108. E-mail: [email protected]

Received 16 November 2004; Accepted 6 February 2005

Published online 18 April 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/gepi.20076

INTRODUCTION

Thomas and Camp [2004] developed the use of estimated graphical models to describe the joint distribution of alleles at loci in linkage disequilibrium. In their report, it was assumed that perfectly observed haplotypes were available as a source on which to base estimation. In effect, this meant that the processes of haplotype reconstruction and estimation of linkage disequilibrium were separated. Although haplotype reconstruction methods such as PHASE [Stephens et al., 2001], HAPLOTYPER [Niu et al., 2002], and those of Lin et al. [2002], SNPHAP (SNPHAP Web site) and Thomas [2003a], are reported to work well in general, this separation of reconstruction and estimation is undesirable, particularly if there is a substantial amount of missing data or genotyping error.

We now describe a development of graphical modeling to include a layer representing the relationship between haplotypes and the observable data, either genotypes or phenotypes, at the individual loci. Using this, we can combine the two stages of the problem into a single iterative process: given a current estimate of the graphical model for allelic association and the observed data, we reconstruct the haplotypes; given a current haplotype reconstruction, we estimate a new graphical model for allelic association. The process can be viewed as sampling from the joint posterior distribution of models and haplotypes, or as a Monte Carlo version of the EM algorithm [Ceppellini et al., 1955; Kirkpatrick et al., 1982] where the expectation is evaluated by simulation, and maximization is done by simulated annealing. This iterative process is similar in principle to that used in the PHASE program of Stephens et al. [2001], the difference being that the ancestral recombination graph that PHASE uses as a prior is replaced by a graphical model. A potential disadvantage of graphical modeling is that the method is entirely empirical and has no underlying population genetic model, and thus cannot contribute directly to inferences on population history. However, if the purpose of estimating linkage disequilibrium is to enable efficient detection of allele-phenotype associations, or to choose informative subsets of loci for use in further studies, then what is needed is a flexible, accurate, and tractable representation of the joint distributions of alleles at associated loci of the sort that graphical modeling provides.

Two aims are considered. The first is to provide a sample of likely models and phase estimates that allows both analysis of the sensitivity of derived results to the particular model chosen, and integration of results over sampled models and reconstructions to properly account for model uncertainty. This approach is used below to estimate haplotype frequencies for a set of 25 loci genotyped on 688 unrelated individuals, and to detect association between alleles at these loci and incidence of prostate cancer. The second aim is to fit a single optimal model to describe the pattern of allelic association between genetic loci. Such a model can then be used, for example, to impute missing or partial information such as phase, failed genotypes, or possible genotype errors for each individual at each locus. This is illustrated below by applying the graphical model approach to the data set used by Lin et al. [2002] and Stephens and Donnelly [2003].

METHODS

GRAPHICAL MODELS

The joint distribution of a set of random variables can usefully be described by a graphical model when it factorizes into a product of terms that involve small subsets of the variables. For instance, a simple example is that observations $X_1, \ldots, X_n$ from a Markov chain have joint distribution

$$P(X_1, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_{i-1}).$$
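As a small illustration of this factorization, the following sketch evaluates the joint probability of an observed sequence as the initial probability times the one-step transition probabilities. The two-state chain, its probabilities, and the function name are hypothetical and chosen only for the example.

```python
import math

def markov_chain_log_prob(states, initial, transition):
    """Log of P(X1) * prod_{i>=2} P(Xi | Xi-1) for an observed sequence."""
    logp = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical two-state chain.
initial = {"A": 0.6, "B": 0.4}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}

seq = ["A", "A", "B", "B"]
# 0.6 * 0.9 * 0.1 * 0.8
prob = math.exp(markov_chain_log_prob(seq, initial, transition))
```

Each factor involves at most two variables, which is what makes the model cheap to evaluate whatever the length of the chain.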

A familiar example from genetics is the factorization of the joint distribution of the genotypes of individuals in a pedigree into terms that involve population frequencies for founders and Mendelian transmission probabilities for non-founders

$$P(X_1, \ldots, X_n) = \prod_{i \in F} P(X_i) \prod_{i \notin F} P(X_i \mid X_{f(i)}, X_{m(i)})$$

where f(i) and m(i) are the parents of individual i. This factorization enables the efficient computational methods of Elston and Stewart [1971] and Cannings et al. [1978], which are in fact early developments of graphical modeling methods before they were so called and before they were used more generally.

If we now create a graph in which vertices represent variables and edges connect any variables that appear as arguments to the same factor, the graph can be used to read off conditional independences defined by the factorization. In particular, the graph has the Markov property that given the states of its neighbours, or adjacent variables, a variable is conditionally independent of all other variables. This Markov graph can also be used to determine an efficient order for many calculations involving the joint distribution [Lauritzen and Spiegelhalter, 1988]. In genetic terms, it can be used to find a peeling sequence [Elston and Stewart, 1971; Cannings et al., 1978]. In pedigree analysis, the Markov graph can be obtained by allocating a vertex for each individual and connecting individuals to their parents and their parents to each other.
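The construction just described can be sketched in a few lines: every factor contributes edges between all pairs of variables in its scope. The pedigree encoding (a dictionary from individual to a parent pair, or `None` for founders) and the function names are assumptions made for illustration.

```python
from itertools import combinations

def markov_graph(factors):
    """Connect every pair of variables that appear together in some factor."""
    adj = {}
    for scope in factors:
        for v in scope:
            adj.setdefault(v, set())
        for u, v in combinations(scope, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def pedigree_factors(parents):
    """One factor per individual: P(Xi) for founders, P(Xi | Xf, Xm) otherwise."""
    factors = []
    for child, pa in parents.items():
        factors.append((child,) if pa is None else (child, pa[0], pa[1]))
    return factors

# Hypothetical nuclear family: two founders and two children.
pedigree = {"dad": None, "mom": None,
            "kid1": ("dad", "mom"), "kid2": ("dad", "mom")}
graph = markov_graph(pedigree_factors(pedigree))
```

In this example the two parents become adjacent because they share a factor with each child, while the two children are not adjacent: given their parents, their genotypes are conditionally independent.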

ESTIMATING A GRAPHICAL MODEL

Thus, while not always made explicit, graphical models are used in many applications familiar to statistical geneticists. Less well known is that given a sample of observations from a joint distribution, it is possible to estimate a graphical model and its implied pattern of dependences and conditional independences. In particular, the class of graphical models whose Markov graph is a triangulated graph [Golumbic, 1980] has a decomposition that makes calculation of the log likelihood and degrees of freedom of a model very straightforward, as outlined below. A triangulated graph is one in which there are no unchorded cycles of length greater than three.
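Whether a given graph is triangulated can be checked in linear-order time. The sketch below uses a maximum cardinality search followed by a check of the resulting ordering, in the spirit of Tarjan and Yannakakis; it is a simple illustrative version, not the HapGraph implementation, and the adjacency-dictionary representation is an assumption.

```python
def is_triangulated(adj):
    """Chordality test: maximum cardinality search, then check the ordering."""
    order, weight = [], {v: 0 for v in adj}
    unnumbered = set(adj)
    while unnumbered:
        # Visit next the vertex with the most already-visited neighbours.
        v = max(unnumbered, key=lambda u: weight[u])
        unnumbered.remove(v)
        order.append(v)
        for u in adj[v]:
            if u in unnumbered:
                weight[u] += 1
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [u for u in adj[v] if position[u] < position[v]]
        if not earlier:
            continue
        # The most recently visited earlier neighbour must be adjacent
        # to all the other earlier neighbours.
        last = max(earlier, key=lambda u: position[u])
        if any(u != last and u not in adj[last] for u in earlier):
            return False
    return True

# A four-cycle has an unchorded cycle of length four; adding a chord fixes it.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
chorded = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

Here `is_triangulated(square)` is false and `is_triangulated(chorded)` is true.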

Given that characterizing allelic associations can be viewed as estimating the dependences in the joint distribution of the alleles at a set of loci, graphical modeling has considerable intuitive appeal as an approach for addressing this problem. Moreover, as models with local dependences can explain long-range associations, such as in the Ising model [Ising, 1925], it is reasonable to expect estimated graphical models for allelic association to be tractable and require relatively few degrees of freedom. The method developed by Thomas and Camp [2004] addresses a simpler problem than encountered in most real situations, namely that of estimating allelic association when phase-known haplotypes are observed without error and without any missing data. However, indirect observation, error, and missing data can be readily incorporated into graphical models and it is this approach we now develop.

The method used by Thomas and Camp [2004] is an application of standard graphical model estimation similar to that implemented in the BIFROST program [Højsgaard and Thiesson, 1995]. The cliques $C_i$ and corresponding clique separators $S_i$ of the Markov graph defined by a decomposable graphical model are found using a maximum cardinality search [Tarjan and Yannakakis, 1984] and the joint distribution represented as a function of the clique and separator marginals

$$P(X \mid G) = \prod_{i} \frac{P(C_i)}{P(S_i)}.$$

This allows us to compute the log likelihood and degrees of freedom for the graphical model, G, from those of the cliques and separators

$$\log(L(G)) = \sum_{i} \log(L(C_i)) - \sum_{i} \log(L(S_i))$$

$$\mathrm{df}(G) = \sum_{i} \mathrm{df}(C_i) - \sum_{i} \mathrm{df}(S_i).$$

The clique and separator models are just contingency tables whose log likelihoods and degrees of freedom are straightforward to compute. Simulated annealing and Metropolis sampling were then used to maximize the penalized log likelihood or sample from the implied posterior distribution on the space of decomposable models.
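A minimal sketch of this computation from phase-known haplotypes follows. It assumes that each table is saturated, so that its maximized log likelihood is the multinomial one and its degrees of freedom are the number of cells minus one; the exact conventions used by HapGraph may differ. The cliques and separators are supplied by hand here, and the toy haplotypes are hypothetical.

```python
import math
from collections import Counter

def table_loglik(data, variables):
    """Maximized multinomial log likelihood of the marginal table on `variables`."""
    counts = Counter(tuple(row[v] for v in variables) for row in data)
    n = len(data)
    return sum(c * math.log(c / n) for c in counts.values())

def table_df(levels, variables):
    """Free parameters of a saturated table: number of cells minus one (assumed)."""
    return math.prod(levels[v] for v in variables) - 1

def decomposable_score(data, levels, cliques, separators):
    """Log likelihood and degrees of freedom from clique and separator tables."""
    loglik = (sum(table_loglik(data, c) for c in cliques)
              - sum(table_loglik(data, s) for s in separators))
    df = (sum(table_df(levels, c) for c in cliques)
          - sum(table_df(levels, s) for s in separators))
    return loglik, df

# Toy haplotypes at five diallelic loci (indexed 0..4), for illustration only.
haplotypes = [(0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1),
              (0, 0, 1, 1, 1), (1, 1, 0, 0, 0), (0, 1, 0, 0, 1), (1, 0, 1, 1, 0)]
levels = [2, 2, 2, 2, 2]
# Cliques {0,1}, {0,2}, {2,3,4} with separators {0} and {2}.
loglik, df = decomposable_score(haplotypes, levels,
                                [(0, 1), (0, 2), (2, 3, 4)], [(0,), (2,)])
```

For these cliques the degrees of freedom are 3 + 3 + 7 - 1 - 1 = 11, against 31 for the saturated model on all five loci.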

ESTIMATION USING UNPHASED DIPLOID GENOTYPES

The underlying concept involved in this extension is straightforward and illustrated in Figure 1. Figure 1a shows a graphical model for allelic association as might be estimated from haploid data using the method of Thomas and Camp [2004]. The vertices represent the observations of the alleles at the genetic loci, numbered in physical order along the chromosome. The nature of the interaction of the processes of recombination and mutation means that the structure of the graph is not derivable from the positions of the loci.

In Figure 1b the graphical model in Figure 1a has been applied in parallel to the alleles of an individual on both the maternal and paternal chromosomes. The graphical model for diploid observations is then completed by connecting the variable representing the observable phenotype or genotype at each locus to the unobservable ordered pair of alleles at the corresponding positions on each chromosome. We also include an error state variable that indicates whether or not a genotyping error occurred. The prior probability of a typing error is a parameter that can be set for the analysis, and can be set to zero if we do not want to allow for errors. The same graphical model is applied to each individual in the sample; only the observed states of the genotypes vary across individuals.

Let $Y_{i,j}$ be the observed data for the ith individual at the jth locus, coded in some standard way. Let n be the number of individuals and m the number of loci. Let $M_{i,j}$ and $F_{i,j}$ be, respectively, the maternal and paternal alleles of the ith individual at the jth locus. We use Y to denote the collection of all observations, and M and F to denote the collection of all alleles. Furthermore, let $M_i = \{M_{i,1}, \ldots, M_{i,m}\}$ and $F_i = \{F_{i,1}, \ldots, F_{i,m}\}$ be the maternal and paternal haplotypes for the ith individual. Let $E_{i,j}$ be the event that a typing error occurred for the ith person at the jth locus. The prior distribution for $E_{i,j}$ is a Bernoulli with probability set according to the nature of the data. For example, in the case of single nucleotide polymorphism loci a prior probability of about 1% would be appropriate [Butcher et al., 2004]. However, microsatellite repeat markers are generally more reliable and if the data have been checked and corrected, or subject to other analyses that might reveal errors, a smaller value, possibly 0, would be more appropriate. Although not done as part of the analyses below, posterior error probabilities can be obtained for each genotype call. These posteriors combine the prior information with that from genotypes called at loci that are estimated to be associated.

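Given some graphical model G for the joint distribution of alleles on any particular haplotype, the likelihood of the model is

$$P(Y \mid G) = \sum_{\text{all } E} \sum_{\text{all } M} \sum_{\text{all } F} \prod_{i=1}^{n} P(M_i \mid G) P(F_i \mid G) \times \prod_{j=1}^{m} P(Y_{i,j} \mid M_{i,j}, F_{i,j}, E_{i,j}) P(E_{i,j}). \qquad (1)$$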

The term $P(Y_{i,j} \mid M_{i,j}, F_{i,j}, E_{i,j})$ is a generalized penetrance function. Choosing this appropriately allows us to deal with error, missing data, and arbitrary modes of expression of phenotypes. This is discussed more fully in the following subsection.

As with the case of estimation from haploid data, we need to penalize the likelihood by a multiple of the degrees of freedom in order to avoid over-fitting. Thus, the function we seek to optimize with respect to choice of G is

$$\log(P(Y \mid G)) - \alpha\, \mathrm{df}(G) \qquad (2)$$

where $\alpha$ is some penalizing constant. We have chosen this, using Schwarz's information criterion, to be $\alpha = \frac{1}{2}\log(n)$ [Schwarz, 1978]. Although somewhat arbitrary, this worked well for estimating general graphical models from simulated data (data not shown). However, this is an issue we will return to below.
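To give a sense of scale for the penalty, the sketch below evaluates the penalized criterion with the penalizing constant set to half the log of the sample size. The function name and the log likelihood values are hypothetical; only the sample size of 688 individuals is taken from the data analyzed below.

```python
import math

def penalized_log_likelihood(loglik, df, n):
    """Log likelihood minus alpha * df, with alpha = log(n) / 2."""
    alpha = 0.5 * math.log(n)
    return loglik - alpha * df

n = 688
alpha = 0.5 * math.log(n)  # about 3.27 per degree of freedom

# Hypothetical comparison: an extra edge costing 2 degrees of freedom
# must raise the log likelihood by more than 2 * alpha to be preferred.
without_edge = penalized_log_likelihood(-6000.0, 40, n)
with_edge = penalized_log_likelihood(-5995.0, 42, n)
```

In this made-up comparison a gain of 5 log likelihood units does not pay for 2 extra degrees of freedom, so the sparser model scores higher.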

Our iterative process is suggested by the form of equation 1 and is as follows:

1. Set an initial state for the haplotypes, M and F, and of the error variables E.

2. Fixing M and F, sample a new value of G with probability proportional to

$$\left\{ \left[ \prod_{i=1}^{n} P(M_i \mid G) P(F_i \mid G) \right] e^{-\alpha\, \mathrm{df}(G)} \right\}^{1/\gamma} \qquad (3)$$

for decomposable graphical models G.

Fig. 1. a: An example of a graphical model for allelic association between 5 loci in haploids. This model corresponds to the factorization of the joint distribution of the alleles at each of the loci, $L_1, \ldots, L_5$, as follows

$$P(L_1, L_2, L_3, L_4, L_5) = \frac{P(L_1, L_2) P(L_1, L_3) P(L_3, L_4, L_5)}{P(L_1) P(L_3)}.$$

b: The same model applied to both of an individual's haplotypes, $(M_1, \ldots, M_5)$ and $(F_1, \ldots, F_5)$. The alleles at each locus then independently determine the observed genotypes, $G_1, \ldots, G_5$, subject to genotyping error indicators $E_1, \ldots, E_5$, to give the complete diploid graphical model. The parameters of the model that connects $M_i$, $F_i$, $G_i$ and $E_i$ are given in Table I.

3. Fixing G, for each individual i, sample new values for $M_i$, $F_i$, and $E_i$ with probabilities proportional to

$$P(M_i \mid G) P(F_i \mid G) \prod_{j=1}^{m} P(Y_{i,j} \mid M_{i,j}, F_{i,j}, E_{i,j}) P(E_{i,j}) \qquad (4)$$

4. Iterate steps 2 and 3.

Step 1 is straightforward and consists of randomly assigning the phase at each locus independently of the other loci assuming no error has occurred. This is equivalent to starting from the graphical model with no edges, which corresponds to an initial assumption of perfect equilibrium.

Step 2 is now an application of the haploid method for estimating a graphical model. The incumbent graphical model is perturbed by adding or removing edges and the likelihood given the reconstructed haplotypes is calculated, as are the degrees of freedom. The proposed graph is then either accepted or rejected with the appropriate Metropolis probability. This is repeated some fixed number of times. The parameter $\gamma$ is the annealing or temperature parameter. For Metropolis sampling, $\gamma$ is fixed at 1, while for optimization it is reduced from a high starting value gradually, typically geometrically, with each iteration. Full details of the implementation are given by Thomas and Camp [2004].
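The accept or reject decision in step 2 can be sketched as below, where the score of a graph is its penalized log likelihood and the target is proportional to the exponential of the score divided by the temperature. This sketch assumes a symmetric proposal, so that no proposal ratio appears; the function name and the `rng` argument are hypothetical, and the details in HapGraph may differ.

```python
import math
import random

def accept(current_score, proposed_score, gamma, rng=random):
    """Metropolis rule for a target proportional to exp(score / gamma)."""
    if proposed_score >= current_score:
        return True  # uphill moves are always accepted
    # Downhill moves are accepted with probability exp(difference / gamma).
    return rng.random() < math.exp((proposed_score - current_score) / gamma)
```

With the temperature at 1 this is ordinary Metropolis sampling from the posterior implied by the penalized likelihood. At a high temperature nearly every proposal is accepted: a drop of one unit in score is accepted with probability of about 0.98 when the temperature is 50, against about 0.37 when it is 1.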

Step 3 is the innovative element in the process. It is done as a single blocked Gibbs update [Jensen et al., 1995] following the method of Dawid [1992]. The simulations produced using this method for any fixed graphical model are independent of each other, thus a single update suffices at this step and no further iteration is necessary. It is also possible, in a similar process that requires the same computational time and storage, to find haplotypes of maximum probability given the current model and observed genotypes. The operations involved in this maximization are a generalized form of dynamic programming. The computational time and storage required in this step, whether sampling or maximizing, are exponential in the size of the largest clique in the triangulation of the Markov graph, such as shown in Figure 1b. If the largest clique in the Markov graph for the current model G is of size c, for diallelic loci we can make this update in time and storage proportional to $2^{2c}$. As can be seen below, this is not typically limiting.
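For intuition about what step 3 samples from, the sketch below enumerates every ordered pair of haplotypes consistent with one individual's genotypes and weights each pair by the product of its two haplotype probabilities. This brute-force version is not the blocked Gibbs method described above: it ignores the graph structure, assumes error-free and complete genotypes, and costs time exponential in the number of heterozygous loci rather than in the clique size. The haplotype probabilities and function names are hypothetical.

```python
import random
from itertools import product

def phase_distribution(genotypes, hap_prob):
    """Distribution over ordered haplotype pairs (M, F) given unordered genotypes.

    genotypes: one unordered allele pair per locus, e.g. (0, 1).
    hap_prob:  function returning P(haplotype | G) for a tuple of alleles.
    """
    per_locus = [[(a, b)] if a == b else [(a, b), (b, a)] for a, b in genotypes]
    weights = {}
    for choice in product(*per_locus):
        m = tuple(pair[0] for pair in choice)
        f = tuple(pair[1] for pair in choice)
        weights[(m, f)] = hap_prob(m) * hap_prob(f)
    total = sum(weights.values())  # assumed positive
    return {pair: w / total for pair, w in weights.items()}

def sample_phase(genotypes, hap_prob, rng=random):
    """Draw one ordered haplotype pair from the distribution above."""
    dist = phase_distribution(genotypes, hap_prob)
    pairs = list(dist)
    return rng.choices(pairs, weights=[dist[p] for p in pairs])[0]

# Hypothetical two-locus haplotype probabilities with strong association.
freqs = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
dist = phase_distribution([(0, 1), (0, 1)], lambda h: freqs.get(h, 0.0))
```

For a double heterozygote under these made-up frequencies, almost all of the probability falls on the two orderings that pair haplotype 00 with haplotype 11.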

PENETRANCE FUNCTIONS, ERRORS, AND MISSING DATA

A great deal of the flexibility of the methods developed here is due to the generalized penetrance function $P(Y_{i,j} \mid M_{i,j}, F_{i,j}, E_{i,j})$. A standard penetrance function is defined as the probability of any particular phenotype given the genotype at a genetic locus, and a genotype is an unordered pair of alleles. Such a function is used to encode diverse modes of phenotypic expression such as dominance, recessiveness, and dependence on sex, age, or other covariates. For marker loci, a penetrance function is not required as the genotype itself is observable and, unless there is the possibility of error, there is a one-to-one correspondence between the observation and the underlying state.

However, for multi-locus data the order of the alleles is of critical importance but the observed genotype does not give this information. The observation may also be subject to error. We, therefore, define the generalized penetrance function to be the probability of any consequent observation, phenotype, or genotype, given the state of the ordered pair of alleles at the locus and the state of an indicator for the presence or absence of error.

If $Y_{i,j}$ encodes a phenotype, a standard penetrance function can be specified in this format by ignoring the error indicator and making the function invariant to the allele order. The extension allows encoding phenotype misclassification and also genetic imprinting, which may be useful for some applications. The phenotype, thus, becomes an indirect observation of the underlying states of alleles at a putative genetic locus for the trait, which are imputed as part of the phase estimation stage of the algorithm. The model estimation stage of the algorithm treats this putative locus just like a marker locus and searches for associations between its alleles and marker alleles. In this way, associations between the phenotype and specific combinations of alleles can be found.

If $Y_{i,j}$ is a genotype, the generalized penetrance function encodes the loss of information about order, and allows for error. For example, Table I gives the generalized penetrance function for the case of a diallelic marker locus. We model the observation as being perfect if no error has occurred and to be completely uninformative if it has. Other models that give different types of error different probabilities are also possible.


If $Y_{i,j}$ is missing, for example if the individual was not available for phenotyping or if the genotyping assay failed at some locus, then we assume that the observation was missing at random and set $P(Y_{i,j} \mid M_{i,j}, F_{i,j}, E_{i,j}) = 1$ for all values of $M_{i,j}$, $F_{i,j}$, and $E_{i,j}$. In practice, to avoid unnecessary multiplication by unity the term is omitted.
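The generalized penetrance function for a codominant diallelic marker, as laid out in Table I together with the missing-data convention, can be written as a short function. The encoding of genotypes as allele pairs, the use of `None` for a missing observation, and the function name are assumptions made for this sketch.

```python
def penetrance(observed, m_allele, f_allele, error):
    """P(Y | M, F, E) for a codominant diallelic marker, following Table I.

    observed: unordered genotype as a pair such as (0, 1), or None if missing.
    m_allele, f_allele: the ordered maternal and paternal alleles (0 or 1).
    error: True if a genotyping error is taken to have occurred.
    """
    if observed is None:
        return 1.0        # missing at random: the term drops out
    if error:
        return 1.0 / 3.0  # an erroneous call is uninformative about the alleles
    true_genotype = tuple(sorted((m_allele, f_allele)))
    return 1.0 if tuple(sorted(observed)) == true_genotype else 0.0
```

Without error the two orderings 0,1 and 1,0 both give the heterozygous genotype with probability 1, which is exactly the loss of phase information. A phenotype with imprinting could be encoded in the same format by letting the value depend on which parent contributed which allele.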

IMPLEMENTATION

The Java program called HapGraph described by Thomas and Camp [2004] has been extended to implement the above process. If presented with a single input file, the program assumes that the input consists of haplotypes and proceeds as before, except that it has now been extended to allow for missing observations. If presented with two input files, however, it assumes that these are, respectively, a parameter data file and a pedigree data file in the same format as used by the LINKAGE program. Information about these input file formats can be found at the LINKAGE web page. From the parameter file, the program reads the number of loci and the number of alleles at each locus. From the pedigree file, it reads the individual genotypes. Currently, only loci specified as LINKAGE numbered allele or affection status loci can be input, but as these are the most convenient format for markers and phenotypes, this is not a serious restriction. Any information about the relationships between the individuals is ignored and they are assumed to be unrelated. A screen shot of the graphical user interface for HapGraph is shown in Figure 2. HapGraph can be obtained from Alun Thomas's web page, as can instructions for the program's use.

RESULTS

MODEL-SAMPLED ESTIMATES OF HAPLOTYPE FREQUENCIES

As an illustration, HapGraph was run on the same data that were used by Thomas and Camp [2004], which consist of 25 markers, 24 of them single nucleotide polymorphisms, in and around ELAC2 typed for 688 unrelated individuals taken from family studies of prostate, breast, and ovarian cancers in Utah. Because these data are from families, we have been able to compare the genotypes with those for relatives and correct obvious errors. They have also been subjected to other analyses that would reveal problems and inconsistencies [Camp et al., 2005]. We, therefore, decided not to model genotyping error and so the prior probability for errors was set at 0. We included in this analysis only individuals successfully genotyped at at least 12 of the 25 markers. However, there remains a substantial amount of missing data. On average, the number of missing genotypes is 2.4 per individual and 25% of individuals have 4 or more dropouts.

We first used these data to estimate the haplotype frequencies across the 25 loci. HapGraph was run with 10,000 Metropolis updates of the graphical model for each of 1,000 blocked Gibbs updates of the haplotypes. After each Gibbs update, we sampled the estimated haplotypes and used these to estimate haplotype frequencies. This took 1,362 seconds on the author's laptop computer.

The same data were also analyzed using PHASE version 2.0.2, using the default values as set at download. These were number of iterations = 100, thinning interval = 1, and burn in = 100. The run took 3,867 seconds. The frequencies estimated by the two methods are shown in Figure 3.

Figure 4 shows the progress of the penalized log likelihood through 5 different runs of the sampler for 5,000,000 iterations each. In all, 10 runs were made. The 5 not shown behaved similarly to the first run shown, moving to the equilibrium level almost immediately, and these have been omitted in order to simplify the figure. Several runs were also made in which simulated annealing was used to first move the sampler to a local mode before switching to Metropolis sampling. In every case, upon switching the level dropped back to what appears to be the equilibrium level in Figure 4, suggesting that there is no phase transition to a higher level waiting to happen. A typical such run is shown in Figure 4 (bottom right). For all these runs, the haplotypes were sampled, not maximized.

TABLE I. An extended penetrance function describing the dependence of observed genotype at a codominant, diallelic marker on the underlying allelic state and error indicator

| Observed genotype | Ordered alleles 0,0 | Ordered alleles 0,1 | Ordered alleles 1,0 | Ordered alleles 1,1 |
|---|---|---|---|---|
| Error = 0 | | | | |
| 00 | 1 | 0 | 0 | 0 |
| 01 | 0 | 1 | 1 | 0 |
| 11 | 0 | 0 | 0 | 1 |
| Error = 1 | | | | |
| 00 | 1/3 | 1/3 | 1/3 | 1/3 |
| 01 | 1/3 | 1/3 | 1/3 | 1/3 |
| 11 | 1/3 | 1/3 | 1/3 | 1/3 |

Figure 5 shows the progress of the penalized log likelihood for 4 simulated annealing runs for the same data. In each case, 10,000,000 iterations were made reducing the simulated annealing parameter $\gamma$ from 50 to 1 geometrically at rate 0.999999. For these runs, the haplotypes were reconstructed by dynamic programming to find a maximum likelihood reconstruction given the incumbent model for allelic association. Some initial experimentation showed that using an annealing temperature of less than 1 did not seem to affect the probability of visiting an optimum during these runs, so in order to preserve some mixing in the chain, we did not go below this value.
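The cooling schedule described here can be reproduced arithmetically. The sketch below assumes that the temperature is multiplied by the rate once per iteration and then held at the floor of 1; the function name and this reading of the schedule are assumptions.

```python
import math

def gamma_at(iteration, start=50.0, rate=0.999999, floor=1.0):
    """Geometric cooling from `start`, held at `floor` once it is reached."""
    return max(floor, start * rate ** iteration)

# Iterations needed for 50 * rate**k to fall to 1: roughly 3.9 million.
iterations_to_floor = math.ceil(math.log(1.0 / 50.0) / math.log(0.999999))
```

Under this reading the temperature reaches 1 after roughly 3.9 million of the 10,000,000 iterations, so the remaining 6 million or so iterations are run at a temperature of 1.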

ASSOCIATION WITH A PHENOTYPE

Of the 688 unrelated typed individuals used above, 231 were male and of these 8 had prostate cancer. As the number of cases was small, complex modeling of penetrance would be inappropriate. Thus, we modeled this as parsimoniously as possible, first as a male-specific dominant trait with full penetrance and no sporadic cases, and then as a male-specific recessive trait, again with full penetrance and no sporadic cases. The models were specified using LINKAGE format in the usual way with separate liability classes for males and females. This putative locus was then included with the marker loci in HapGraph runs.

For the dominant model in a run of 4,000,000 sampling iterations, we observed the prostate cancer locus linked to at least one of the ELAC2 markers 90% of the time. In 30% of the linked cases, the edge was between the disease vertex and vertex 2. The best scoring graphical model for this run is shown in the screen shot of Figure 2. The putative locus for prostate cancer is labeled "0." A contingency table showing the relationship between the disease and marker 2 for the 222 males successfully genotyped at this marker is given in Table II. As can be seen, the cases can all be explained by a dominant allele in complete disequilibrium with allele 1 at the marker. The hypothesis that allele 1 at locus 2 is itself a dominant allele causing prostate cancer is also consistent with the data. However, the numbers are very small and the result is not statistically significant: the p value was 0.13.

Fig. 2. A snapshot of the graphical user interface for HapGraph. The graph shown corresponds to the best model for linkage disequilibrium between 25 loci in ELAC2, labeled 1 to 25. Vertex 0 is a putative locus for prostate cancer modeled as a male-specific dominant trait and is seen to associate with locus 2, although this association is shown on further inspection not to be statistically significant.

For the recessive model, only 65% of the simulations showed a disease-gene link. Again, the vertex most often adjacent to the disease vertex was that for marker 2. However, this was less frequent at 20% of linked simulations. As is clear from Table II, the dominant model associates better with this marker.

[Figure 3: scatter plots in two panels, (a) and (b), with axes labeled HapGraph and PHASE.]

Fig. 3. a: The haplotype frequencies obtained by HapGraph against those obtained by PHASE. b: The same data on a log-log scale. In b, logs of estimated frequencies of zero have been arbitrarily set at −16, which is smaller than the log of any non-zero frequencies seen, rather than omit them from the plot.


POINT ESTIMATION OF RECONSTRUCTED PHASE

HapGraph was also used to estimate reconstructed phases on the X-chromosome haplotypes as used by Lin et al. [2002] and Stephens and Donnelly [2003]. These are haplotypes of between 45 and 117 single nucleotide polymorphisms in 8 genes ascertained for 40 males. These haplotypes were randomly paired up to form phase unknown autosomal-like data and then the reconstructed haplotypes were compared with the originals. Like Stephens and Donnelly [2003], we repeated this 100 times and measured how often a perfect reconstruction failed, the error rate. We also measured the switch error rate [Stephens and Donnelly, 2003], which scores partial reconstructions. These results are presented in Table III.
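The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the code used for Table III: haplotypes are represented as strings of allele symbols, the function names are hypothetical, and the switch error is taken to be the proportion of heterozygous sites whose phase is wrong relative to the previous heterozygous site, which is our reading of the Stephens and Donnelly [2003] definition.

```python
import random


def random_pairs(haplotypes, rng):
    """Shuffle known haplotypes and pair them into pseudo-diploid individuals."""
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)]


def genotype(pair):
    """Unphased genotype: the unordered pair of alleles at each locus."""
    return [tuple(sorted(alleles)) for alleles in zip(*pair)]


def is_exact(true_pair, est_pair):
    """True if both haplotypes were reconstructed perfectly (order ignored)."""
    return sorted(true_pair) == sorted(est_pair)


def switch_error(true_pair, est_pair):
    """Fraction of adjacent heterozygous sites with wrong relative phase.

    Assumes est_pair is consistent with the genotype of true_pair.
    Returns None when there are fewer than two heterozygous sites.
    """
    t0, t1 = true_pair
    e0 = est_pair[0]
    het = [i for i in range(len(t0)) if t0[i] != t1[i]]
    if len(het) < 2:
        return None
    # Orientation of the estimate at each heterozygous site relative to the truth.
    orient = [e0[i] == t0[i] for i in het]
    switches = sum(orient[j] != orient[j + 1] for j in range(len(orient) - 1))
    return switches / (len(het) - 1)


# Toy example (hypothetical data): one switch between the 2nd and 3rd sites.
TRUE_PAIR = ("0101", "1010")
EST_PAIR = ("0110", "1001")
```

In the toy example the estimate has the same genotype as the truth but is not an exact reconstruction, so it counts as a whole-haplotype error while incurring a switch error of only one in three.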

For each of the HapGraph runs, we made 500,000 iterations of sampling, 400,000 iterations of sampled reconstructions but optimized models, and finally 100,000 iterations of maximized reconstructions and maximized models. The best model over all 1,000,000 iterations was chosen and a final haplotype reconstruction made conditional on that. A single run for the gene ATR, which has the fewest loci, 45, took 130 seconds, while one for MECP2, which has the most loci, 117, took 346 seconds.

It was also possible for these data to compare the graphical models estimated from the haplotypes with those from the simulated genotypes. Figure 6 shows a typical result from such an analysis for the 45 loci in the ATR gene. This allows us to evaluate the effect of the loss of information we suffer from not knowing phase.

[Figure 4: panels plotting penalized log likelihood (vertical axis) against thousands of iterations (horizontal axis).]

Fig. 4. The penalized log likelihood ratio of the sequence of simulations for 5 runs of 5,000,000 iterations of the combined Metropolis and blocked Gibbs sampling Markov chain. One blocked Gibbs sample for the haplotypes was made for every 10,000 Metropolis updates of the linkage disequilibrium graphical model. The final frame shows the effect of using simulated annealing to first move near a local optimum before reverting to Metropolis and Gibbs sampling.


DISCUSSION

Figure 3 shows that the haplotype frequency estimates from PHASE and HapGraph match well, although there is some rearrangement in the order of the three most frequent haplotypes. Other runs of HapGraph using random starting points produced similar results. While HapGraph

[Figure 5: panels plotting penalized log likelihood (vertical axis) against thousands of iterations (horizontal axis).]

Fig. 5. The penalized log likelihood ratio of the sequence of simulations for 4 runs of 10,000,000 iterations of the combined simulated annealing and dynamic programming Markov chain. One dynamic program update was made to find an optimal haplotype for every 10,000 simulated annealing updates of the linkage disequilibrium graphical model. The annealing temperature was reduced from 50 to 1 geometrically with rate 0.999999. The horizontal dotted lines indicate the largest penalized likelihood encountered in each run.

TABLE II. A contingency table for the association between incidence of prostate cancer in 222 males typed for genotypes at marker 2 in the ELAC2 gene

              Genotype
Phenotype     11     12     22

Affected       4      4      0
Unaffected    45    103     66

TABLE III. A comparison of the results of haplotype reconstruction^a

              GLRA2  MAOA  KCND1  ATR   GLA   TRPC5  BRS3  MECP2

Error rate
LCZC          .79    .61   .54    .62   .89   .58    .72   .85
HAPLOTYPER    .89    .76   .72    .72   .79   .72    .79   .64
PHASE         .76    .54   .46    .45   .68   .58    .67   .77
HapGraph      .83    .60   .51    .43   .78   .71    .61   .84

Switch error
LCZC          .14    .10   .22    .29   .22   .13    .14   .23
HAPLOTYPER    .16    .12   .27    .32   .16   .20    .15   .19
PHASE         .10    .07   .13    .18   .11   .13    .10   .15
HapGraph      .14    .11   .17    .20   .20   .23    .12   .23

^a LCZC is the method of Lin et al. [2002], HAPLOTYPER is that of Niu et al. [2002], and PHASE is that of Stephens et al. [2001]. The results presented for these methods are taken directly from Stephens and Donnelly [2003].


sampled 816 haplotypes that PHASE did not, PHASE sampled only 54 haplotypes that HapGraph missed. Thus, neither method completely samples the state space of all possible haplotypes. The approximate gene counting method described by Thomas [2003b] was also used to estimate haplotype frequencies (data not shown) and these were found to be more similar to the results of PHASE than to those of HapGraph.
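The comparison of the two sets of frequency estimates can be sketched as below. The frequency tables are hypothetical toy values, not the ELAC2 estimates, and the function names are ours. Following the Fig. 3 caption, an estimated frequency of zero is plotted at −16 on the log scale; the use of natural logarithms is an assumption, since the base is not stated.

```python
import math

LOG_FLOOR = -16.0  # stand-in for log(0), as in the Fig. 3 caption


def support_differences(freq_a, freq_b):
    """Haplotypes with positive estimated frequency under one method only."""
    seen_a = {h for h, f in freq_a.items() if f > 0}
    seen_b = {h for h, f in freq_b.items() if f > 0}
    return seen_a - seen_b, seen_b - seen_a


def log_points(freq_a, freq_b):
    """Paired log frequencies for a log-log scatter plot.

    Natural logs are assumed; zero estimates are set to LOG_FLOOR.
    """
    def safe_log(f):
        return math.log(f) if f > 0 else LOG_FLOOR

    return {h: (safe_log(freq_a.get(h, 0.0)), safe_log(freq_b.get(h, 0.0)))
            for h in set(freq_a) | set(freq_b)}


# Hypothetical estimates over four loci, for illustration only.
FREQ_HAPGRAPH = {"0000": 0.60, "0101": 0.30, "0011": 0.10}
FREQ_PHASE = {"0000": 0.55, "0101": 0.35, "1111": 0.10}

ONLY_HAPGRAPH, ONLY_PHASE = support_differences(FREQ_HAPGRAPH, FREQ_PHASE)
POINTS = log_points(FREQ_HAPGRAPH, FREQ_PHASE)
```

Applied to the real estimates, the sizes of the two difference sets would correspond to the counts of 816 and 54 quoted above.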

This example also illustrates that missing data are easily handled, though with a corresponding loss of information. Missing data are imputed during the blocked Gibbs update stage of the method using information from called genotypes at associated loci. Thus, the Metropolis sampling stage still runs with complete phase known haplotypes. This process is handled very naturally within the graphical modeling framework and no extra programming is required to handle these cases.

Figure 4 shows that the mixing properties of the method when sampling is performed are

Fig. 6. Estimated graphical models for the 45 loci in the X-chromosome gene ATR. a: The model estimated from the observed haplotypes with penalty parameter α = ½ log(40). b: A typical result obtained from genotypes simulated by randomly pairing the observed haplotypes and α = ½ log(40). c: The same as b but with α = log(40).


reasonable. More than half of the runs performed moved to equilibrium levels of the score function almost immediately. Of the 10 runs, the last to move to the equilibrium phase took around 2,000,000 iterations. Figure 4 (bottom right) shows what happens if simulated annealing is initially used to move to a high level of penalized likelihood before switching to Gibbs-Metropolis sampling. When switched, the process reverts to sampling at the same range of levels as when sampling is used from the beginning. Similar behaviour was seen when this was repeated a number of times. The indication is that levels sampled are the equilibrium levels and that no further phase transition to a higher level of sampling is possible.

For optimization, however, the mixing was not as good. Even with long runs and slow cooling, as shown in Figure 5, different optimal scores were reached. However, although the penalized log likelihood scores attained differed, the estimated graphical models were similar. The difference in levels of penalized log likelihood seen when maximizing and when sampling is an indication of how scarce the high scoring states are.
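To give a sense of how slow the cooling in Figure 5 is, the sketch below works out the geometric schedule stated in its caption: a temperature reduced from 50 to 1 with rate 0.999999 per update. The acceptance rule shown is the standard Metropolis-style rule for simulated annealing when a score is being maximized; holding the temperature at 1 once it is reached is an assumption of this sketch, and all function names are hypothetical.

```python
import math

# Values taken from the Fig. 5 caption.
T_START, T_END, RATE = 50.0, 1.0, 0.999999


def steps_to_cool(t_start, t_end, rate):
    """Number of geometric cooling steps needed to go from t_start to t_end."""
    return math.ceil(math.log(t_end / t_start) / math.log(rate))


def temperature(step, t_start=T_START, rate=RATE, t_min=T_END):
    """Temperature after `step` updates, held at t_min once reached (assumed)."""
    return max(t_min, t_start * rate ** step)


def accept_probability(delta_score, temp):
    """Probability of accepting a move that changes the score by delta_score.

    Improvements are always accepted; a decrease is accepted with
    probability exp(delta_score / temp).
    """
    return 1.0 if delta_score >= 0 else math.exp(delta_score / temp)


N_COOLING_STEPS = steps_to_cool(T_START, T_END, RATE)
```

Under these assumptions the temperature reaches 1 after roughly 3.9 million updates, so cooling occupies a little over a third of each 10,000,000-iteration run.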

In the reconstruction of the male X-chromosome haplotypes, as shown in Table III, PHASE performed better than HapGraph by the switch error criterion for every gene and by the whole haplotype error rate criterion for all genes except ATR and BRS3. HapGraph was nevertheless generally better than the other two methods using the whole haplotype error rate criterion, with a mean rank of 2.125 over the 8 genes used. Overall, the method seems more reliable when sampling than when optimizing.
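The mean rank quoted above follows directly from the error rates in Table III. The short sketch below recomputes it; the ranking convention (rank 1 for the lowest error, ties sharing the better rank) is our assumption, and the function names are hypothetical.

```python
GENES = ["GLRA2", "MAOA", "KCND1", "ATR", "GLA", "TRPC5", "BRS3", "MECP2"]

# Whole-haplotype error rates copied from Table III.
ERROR_RATE = {
    "LCZC":       [.79, .61, .54, .62, .89, .58, .72, .85],
    "HAPLOTYPER": [.89, .76, .72, .72, .79, .72, .79, .64],
    "PHASE":      [.76, .54, .46, .45, .68, .58, .67, .77],
    "HapGraph":   [.83, .60, .51, .43, .78, .71, .61, .84],
}


def ranks_for(method, table):
    """Rank of `method` for each gene: 1 + number of methods strictly better."""
    own = table[method]
    return [1 + sum(other[i] < own[i]
                    for name, other in table.items() if name != method)
            for i in range(len(own))]


def mean_rank(method, table):
    """Average rank of `method` over all genes."""
    ranks = ranks_for(method, table)
    return sum(ranks) / len(ranks)


HAPGRAPH_RANKS = dict(zip(GENES, ranks_for("HapGraph", ERROR_RATE)))
HAPGRAPH_MEAN_RANK = mean_rank("HapGraph", ERROR_RATE)  # 2.125
```

HapGraph ranks first for ATR and BRS3, second for MAOA, KCND1 and GLA, and third for GLRA2, TRPC5 and MECP2, giving 17/8 = 2.125.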

Graphical models have some advantages in terms of tractability over the other methods. Although the ancestral recombination graph model used by PHASE allows for independences and conditional independences between loci, it does not take advantage of the computational savings that these allow. The adaptive window method of Excoffier et al. [2003] does respond to and exploit independences in the data. However, it considers jointly only contiguous sets of loci and hence may miss longer range associations and fail to take advantage of spanned loci that behave independently. By estimating a graphical model, and not enforcing constraints due to relative location, we see that small sets of loci are sufficient to describe the joint distribution. In our example, we find that sets of no more than 4 or 5 loci need to be considered jointly.
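The reason small sets of loci suffice is the standard factorization for decomposable (triangulated) graphical models: the joint probability of a haplotype is the product of the marginal probabilities over the cliques divided by the product over the separators. The sketch below illustrates this on a hypothetical three-locus chain with cliques {0,1} and {1,2} and separator {1}; the data, the graph, and the function names are invented for illustration and are not taken from the ELAC2 or X-chromosome analyses.

```python
from collections import Counter
from itertools import product


def marginal(haplotypes, loci):
    """Empirical frequency table of the alleles at the given loci."""
    counts = Counter(tuple(h[i] for i in loci) for h in haplotypes)
    n = len(haplotypes)
    return {alleles: c / n for alleles, c in counts.items()}


def decomposable_prob(h, cliques, separators, haplotypes):
    """P(h) = product of clique marginals / product of separator marginals."""
    prob = 1.0
    for loci in cliques:
        prob *= marginal(haplotypes, loci).get(tuple(h[i] for i in loci), 0.0)
    if prob == 0.0:
        return 0.0
    # Each separator is contained in a clique, so its marginal is positive here.
    for loci in separators:
        prob /= marginal(haplotypes, loci)[tuple(h[i] for i in loci)]
    return prob


# Hypothetical phase-known haplotypes at three biallelic loci.
HAPS = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
CLIQUES = [(0, 1), (1, 2)]   # locus 0 and locus 2 are independent given locus 1
SEPARATORS = [(1,)]

TOTAL = sum(decomposable_prob(h, CLIQUES, SEPARATORS, HAPS)
            for h in product([0, 1], repeat=3))
P_111 = decomposable_prob((1, 1, 1), CLIQUES, SEPARATORS, HAPS)
```

Only tables over pairs of loci are stored here, rather than one table over all three, and the savings grow quickly with the number of loci when the cliques stay small.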

Additionally, graphical modeling allows putative disease loci to be included in the analysis with arbitrary penetrance functions. For simplicity of illustration, and in view of the small number of cases, we chose to model prostate cancer as a male-specific, completely dominant or completely recessive trait. However, more realistic models that incorporate age dependence and other covariate information can also be used. When an appropriate model is available, this compares favourably with methods for detecting association that require marking a haplotype according to whether the carrying individual was affected or not. The number of prostate cancer cases in the example above is small and the result is not significant; nonetheless, this does illustrate the method’s ability to find the most interesting allele-phenotype associations (see Table II). With larger numbers of cases, not only could more complex penetrance functions be used, but also Bayesian methods could be used to estimate penetrances.

The particular choice of the penalty parameter α is still something of an open question. Figure 6 shows that in order to get results similar to those obtained using phase known data, when using phase unknown data the penalty needed to be increased. In this case, doubling the value had this effect. However, as the increase in likelihood is a function of how well the unphased data can reconstruct the phased data rather than just a function of the number of observations, making this a general rule is not justified. Fortunately, this sensitivity to the penalty parameter seems to be a problem only for small data sets such as the X-chromosome data, which has only 40 haplotypes. For the prostate cancer data, which has 688 individuals, or 1,376 haplotypes, doubling the penalty parameter had very little effect on the estimated models (data not shown). That similar results can be had from phase known and phase unknown data strongly suggests that any additional cost incurred in using laboratory techniques [McDonald et al., 2002] to observe the phase is not likely to be justified if the objective is to estimate population parameters. It may, however, be worthwhile if the information is relevant to genetic risks for an individual due, for instance, to genetic imprinting.
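The two penalty values compared in Figure 6 can be made concrete with a short calculation. The sketch assumes natural logarithms and a score of the familiar form log likelihood minus the penalty times the number of free parameters, in the spirit of Schwarz [1978]; the exact form of the score is not restated in this section, so both the form and the function names should be read as assumptions.

```python
import math


def penalty(n_observations, factor=0.5):
    """Penalty per free parameter: factor * log(n). factor=0.5 is the Schwarz choice."""
    return factor * math.log(n_observations)


def penalized_score(log_likelihood, n_parameters, penalty_value):
    """Assumed score: log likelihood minus penalty per parameter."""
    return log_likelihood - penalty_value * n_parameters


# The X-chromosome data have 40 haplotypes.
PENALTY_PHASED = penalty(40)               # used in Fig. 6 panels a and b
PENALTY_DOUBLED = penalty(40, factor=1.0)  # used in Fig. 6 panel c
```

Under these assumptions the two values are about 1.84 and 3.69 per parameter, so the doubled penalty makes each additional parameter twice as costly and therefore favours sparser graphs.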

Our current developments of these methods are in extending greatly the number of loci that can be considered in order to enable full chromosome analysis, and in applying the graphical models to the founder alleles in pedigree data with a view to enabling joint linkage and association analysis in pedigrees. While most of the programming is at an advanced stage, the mixing behavior of the


Markov chains involved is largely unexplored and will be the focus of future investigation.

ELECTRONIC DATABASE INFORMATION

The URLs for programs and data presented herein are as follows:

Alun Thomas’s Web site, http://bioinformatics.med.utah.edu/~alun (for HapGraph, the Java program implementing the simulated annealing search procedure for jointly fitting a graphical model to diploid observations and estimating haplotype frequency).

SNPHAP Web site, http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt (for D. Clayton’s SNPHAP program).

PHASE Web site, http://www.stat.washington.edu/stephens/phase.html (for M. Stephens’s PHASE program).

LINKAGE Web site, http://linkage.rockefeller.edu (for information about the LINKAGE program including the input file formats).

ACKNOWLEDGEMENTS

This work was supported by NIH NIGMS grant R21 GM070710 (to Alun Thomas), NIH National Cancer Institute grant R01 CA90752 (to L.A. Cannon Albright), a subcontract from Johns Hopkins University with funds provided by grant R01 CA89600 from the NIH National Cancer Institute (to L.A. Cannon Albright), and the Utah Cancer Registry supported by National Institutes of Health Contract N01-PC-67000 with additional support from the Utah Department of Health and the University of Utah. Genotyping was provided by Myriad Genetics Inc.

I thank David Cutler for letting me have the X chromosome data for the comparative analysis in Results, and Matthew Stephens for distributing the PHASE program and responding to my queries about its implementation.

REFERENCES

Butcher LM, Meaburn E, Liu L, Fernandes C, Hill L, Al-Chalabi A, Plomin R, Schalkwyk L, Craig IW. 2004. Genotyping pooled DNA on microarrays: A systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behaviour Genetics 34:549–555.

Camp NJ, Swensen J, Horne BD, Farnham JM, Thomas A, Cannon-Albright LA, Tavtigian SV. 2005. Haplotypes, linkage disequilibrium structure and mutation history in ELAC2 and associations with familial early-onset prostate cancer. Genet Epidemiol 28:232–243.

Cannings C, Thompson EA, Skolnick MH. 1978. Probability functions on complex pedigree. Ann Appl Prob 10:26–61.

Ceppellini R, Siniscalco M, Smith CAB. 1955. The estimation of gene frequencies in a random-mating population. Ann Hum Genet 20:97–115.

Dawid AP. 1992. Applications of a general propagation algorithm for probabilistic expert systems. Stat Comput 2:25–36.

Elston RC, Stewart J. 1971. A general model for the genetic analysis of pedigree data. Hum Hered 21:523–542.

Excoffier L, Laval G, Balding D. 2003. Gametic phase estimation over large genomic regions using an adaptive window approach. Hum Genom 1:7–19.

Golumbic MC. 1980. Algorithmic graph theory and perfect graphs. New York: Academic Press.

Højsgaard S, Thiesson B. 1995. BIFROST: block recursive models induced from relevant knowledge, observations, and statistical techniques. Comput Stat Data Anal 19:155–175.

Ising E. 1925. Beitrag zur Theorie des Ferromagnetismus. Z Physik 31:253–258.

Jensen CS, Kong A, Kjaerulff U. 1995. Blocking-Gibbs sampling in very large probabilistic expert systems. Int J Hum Comput Stud 42:647–666.

Kirkpatrick S, Gelatt CD Jr, Vecchi MP. 1982. Optimization by simulated annealing. Technical Report RC 9353. Yorktown Heights, NY: IBM.

Lauritzen SL, Spiegelhalter DJ. 1988. Local computations with probabilities on graphical structures and their applications to expert systems. J R Stat Soc B 50:157–224.

Lin S, Cutler DJ, Zwick ME, Chakravarti A. 2002. Haplotype inference in random population samples. Am J Hum Genet 71:1129–1137.

McDonald OG, Krynetski EY, Evans WE. 2002. Molecular haplotyping of genomic DNA for multiple single-nucleotide polymorphisms located kilobases apart using long-range polymerase chain reaction and intramolecular ligation. Pharmacogenetics 12:93–99.

Niu T, Qin ZS, Xu X, Liu JS. 2002. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 70:157–169.

Schwarz G. 1978. Estimating the dimension of a model. Ann Stat 6:461–464.

Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162–1169.

Stephens M, Smith NJ, Donnelly P. 2001. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68:978–989.

Tarjan RE, Yannakakis M. 1984. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM J Comput 13:566–579.

Thomas A. 2003a. Accelerated gene counting for haplotype frequency estimation. Ann Hum Genet 67:608–612.

Thomas A. 2003b. GCHap: fast MLEs for haplotype frequencies by gene counting. Bioinformatics 19:2002–2003.

Thomas A, Camp NJ. 2004. Graphical modeling of the joint distribution of alleles at associated loci. Am J Hum Genet 74:1088–1101.
