Haplotype reconstruction using perfect phylogeny …Haplotype reconstruction using perfect phylogeny...

Haplotype reconstruction using perfect

phylogeny and sequence data

Anatoly Efros1 and Eran Halperin1,2,3,0

1The Blavatnik School of Computer Science, Tel-Aviv University,

Tel-Aviv, Israel.2International Computer Science Institute, Berkeley, California.

3Department of molecular microbiology and Biotechnology,

Tel-Aviv, Israel.

May 16, 2012

Abstract

Haplotype phasing is a well studied problem in the context of geno-type data. Recent developments in high-throughput sequencing requirenew algorithms for low number of sequenced samples and low sequencingcoverage.

High-throughput sequencing technologies o�er new possibilities for theinference of haplotypes. Since each read originates from a single chromo-some, all the variant sites it covers must derive from the same haplotype.Moreover, the sequencing process yields much higher SNP density thanprevious methods, resulting in a higher correlation between neighboringSNPs. This thesis o�ers a new approach for haplotype phasing, lever-aging on these two properties. The suggested algorithm, called Perfect

Phylogeny Haplotypes from Sequencing (PPHS) uses a perfect phylogenymodel and models the sequencing errors explicitly.

This thesis evaluates the proposed method using both simulated andreal data. The results show that the suggested algorithm outperforms pre-vious algorithms when the sequencing error rate is high or when coverageis low.

1

Contents

1 Introduction 3

2 Background and Existing Methods 5

2.1 Phasing of Genotype Data . . . . . . . . . . . . . . . . . . . . . . 52.2 Phasing via Perfect Phylogeny . . . . . . . . . . . . . . . . . . . 72.3 Sequencing based phasing . . . . . . . . . . . . . . . . . . . . . . 9

3 Methods 9

3.1 Generation of haplotypes for each window . . . . . . . . . . . . . 133.2 Stitching of windows . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Results 15

4.1 Evaluation under Simulated data . . . . . . . . . . . . . . . . . . 154.2 Evaluation under real data . . . . . . . . . . . . . . . . . . . . . 17

5 Conclusions 18

References 20

2

1 Introduction

The etiology of complex diseases is composed of both environmental and geneticfactors. In the last few decades, there has been a tremendous e�ort to discoverthe genetic components responsible for a large number of common traits. Inother words, characterizing the heritability of these traits. In recent years, muchof this e�ort has been focused on genome-wide association studies (GWAS). Inthese studies the DNA of a population of cases (individuals carrying the stud-ied condition), and a population of controls (general population) is measuredand compared. These studies focus on measurements of single nucleotide poly-morphisms (SNPs). SNPs are positions in the genome which at some pointunderwent a mutation that became �xed in the population.

In recent years, genotyping technology, or the extraction of SNP informationfrom the genome, has advanced rapidly. Genome-wide association studies wereinfeasible until only a few years ago. Even today, the most modern genotypingtechnologies provide only a partial picture of the genome. With modern tech-nology the number of positions measured is typically less than a million, whilethere are more than three billion positions in the genome. Moreover, there aremany di�erent types of genetic variations that are not captured adequately bygenotyping technologies, particularly rare SNPs, and short deletions and inser-tions. Therefore, the next generation of genetic studies of diseases will surelyinclude the new high-throughput sequencing technologies, or next generationsequencing platforms (NGS). These technologies provide hundreds of millionsof short sequence reads of the sampled DNA within a single experiment.

The technical analysis of disease associated studies encountered a few com-putational challenges, some of which still exist when considering NGS basedstudies. One of the major obstacles in these studies has been the inferenceof haplotypes from the genotype data (phasing). As opposed to genotypes,haplotypes are sequences of alleles across a chromosome. Genotype technolo-gies provide information about the number of minor alleles occurring at eachposition, but does not include information about the relations between thesepositions. Mathematically, one can think of a haplotype as a binary string, andof a genotype as the sum of the two haplotypes of corresponding chromosomes.Many phasing algorithms have been suggested [10, 17, 27, 24, 14, 16, 3]. Thesealgorithms take a set of genotypes as an input, and use either the correlationsbetween neighboring SNPs, or the linkage disequilibrium (LD), to derive thehaplotypes.

The phasing problem is of a di�erent nature when applied to sequencing stud-ies. Four main di�erences are identi�ed. First, unlike genotyping technologies,sequence data allows us to consider both SNPs and short structural variations(e.g. short deletions). Second, in sequencing technologies the measured SNPsare closer to each other than in genotyping, resulting in a much higher LD andcorrelations between neighboring SNPs. Third, the short reads obtained withthe sequencing platform are always read from a single chromosome, and some ofthem may contain more than a single SNP, suggesting that partial haplotypesare provided. Finally, noise of NGS technologies is inherently di�erent from

3

Figure 1: A perfect phylogeny tree representing six di�erent haplotypes(00000,10000,01001,11000,11100,11110). Every edge represents a mutation ona single allele, and every node repersents a haplotype. Each node might alsohave a frequency value associated with it which denotes the frequency of thehaplotype in the population.

that of genotyping technologies. Sequence reads contain substantially more er-rors than genotyping, especially towards the end of the reads. In addition, the�nal errors made by the algorithms are highly dependent on speci�c parameterssuch as the coverage, which is the expected number of reads overlapping with aspeci�c position in the genome.

Sequencing allows for the possibility of phasing a single individual, unlikegenotyping that requires a population. Very recently, a few methods were sug-gested as solutions to the problem[15, 2]. These methods use the fact that eachsequenced read originates from a single chromosome, and that all variants cov-ered by the read must originate from the same haplotype. These sets of partialhaplotypes are then used to reconstruct a full haplotype.

Current methods can be divided into two main groups. The �rst groupincludes methods that work on genotype data of a population, and thereforeignore the information of partial haplotypes provided by the raw reads. Thesecond group includes methods that use the reads of only a single individual inthe phasing process. Both of these approaches require the coverage to be largein order to obtain high accuracy phasing.

In this thesis, a phasing method PPHS which takes into account low coverageand sequencing errors is proposed. This method models the data using a ge-nealogy model known as perfect phylogeny[12] (�gure 1). The perfect phylogenymodel assumes that the haplotypes had no recombination events throughouthistory, as well as no recurrent or backward mutations. The input to the sug-gested algorithm is a set of reads coming from a population of individuals. Thealgorithm then searches for the most likely perfect phylogeny tree that explainsthe reads. Methods for phasing via perfect phylogeny have been previously sug-gested in the context of genotype data[14, 26]. The main di�erences betweenthose approaches and the proposed method are that with the proposed methodthe errors in the SNP measurements are modeled and some of the reads contain

4

partial haplotypes.A large number of simulations were performed to evaluate the algorithm's

phasing accuracy in di�erent scenarios, both with the Coalescent model[18],and with data generated from the 1000 genomes project[25]. The algorithmshows high accuracy compared to state-of-the-art phasing methods such asBEAGLE[3] and HapCUT [2], especially in cases of low coverage and high errorrates.

2 Background and Existing Methods

For more than a decade, many di�erent algorithms have been suggested forphasing methods. These algorithms are usually aimed at phasing genotype dataof a population. Some recent algorithms suggest phasing methods for sequencedata originated from one individual.

2.1 Phasing of Genotype Data

Methods working with genotype data receive as input the genotype data of apopulation of individuals. We view a genotype as the sum of two haplotypeswhere each haplotype is a binary string. The problem of phasing is the inferenceof the two haplotype strings of each genotype. Clearly, this problem is impossibleto solve without prior knowledge, since there are multiple pairs of haplotypesthat may explain any given genotype. However, studies showed that in typicalshort region in the genome, there is a small number of haplotypes that accountfor most of the population[23, 6], and thus one can learn about the correct phaseof one individual based on the genotypes of the entire population. All currentmethods for phasing a set of unrelated individuals leverage on this fact.

Clark's method. Many phasing methods (hundreds) have been proposed inthe last 20 years. The �rst computational method for phasing was proposed byClark, and is known as Clark's algorithm [5]. This algorithm is motivated bythe fact that the number of haplotypes in the population is small. It follows agreedy approach, by repeatedly resolving haplotypes based on a set of alreadyknown haplotypes. More formally, the algorithm �rst considers individuals whoare heterozygous at no more than one SNP. The phase of these individualsis unambiguous, and the algorithm then adds the haplotypes used by thesegenotypes to a set of 'known haplotypes'. Following the creation of the initial listof known haplotypes the algorithm tries to resolve the remaining genotypes usingthe haplotypes from the list. For each genotype we select one haplotype fromthe list and a complementary haplotype. All the complementary haplotypesare than added back to the list of known haplotypes and the process continues.Clearly, the algorithm performance in terms of accuracy is highly dependent onthe order of resolved genotypes and on the initial haplotypes used for the listof known haplotypes. Moreover, the algorithm does not always converge sincesome haplotypes may not be explained by any of the known haplotypes.

5

Parsimony Clark's method is a greedy approach to minimizing the parsimonycriterion, in which one would like to minimize the number of haplotypes in thepopulation. It was shown that this problem is NP-hard[16], and therefore a fewheuristics have been suggested. Particularly, exponential size integer program-ming approaches were suggested and were shown to perform well in terms ofrunning time for small enough instances. However, the parsimony criterion didnot seem to result in the most accurate results compared to methods such asPHASE and the EM [13].

The EM algorithm. Shortly after Clark's method was suggested, Exco�erand Slatkin[10] suggested the use of the Expectation Maximization algorithmfor phasing. Expectation Maximization (EM) is a hill climbing method thataims at maximizing the likelihood of the data. The assumption of this algo-rithm is that there is an underlying haplotype distribution, F = {F1...Fn}, andthat each genotype is sampled from the distribution by sampling the two hap-lotypes independently, and then summing up the corresponding binary stringto generate a genotype. Given the genotypes, and a speci�c guess of the haplo-type frequencies, f1, . . . , fn, the likelihood function of the data can be written(i.e., the probability of observing the genotypes given the haplotypes frequenciesf1, . . . , fn):

L(f1, . . . , fn) =∏i

Pr(Gi | f1, . . . , fn) =∏i

∑(h1,h2)∈Ci

fh1fh2

where G is the genotype data and Ci is the set of haplotype pairs that canexplain the genotype Gi using a legal phase. In each iteration, the EM algorithmhas a speci�c guess for the haplotype frequencies, and it searches for anotherguess that is guaranteed to increase the likelihood. In the case of phasing, thenew guess is performed by computing the expected frequency of each haplotypebased on the previous guess. The algorithm's performance is superior to that ofClark's method, and it is model-based. However, recently it was shown[21] thatthe EM algorithm is relatively inaccurate compared to more modern algorithms.

PHASE. One possible explanation for the inaccuracy of the EM algorithm isthat the model used by the algorithm does not leverage the population history.In order to overcome this drawback, the PHASE [27] algorithm has been sug-gested; the PHASE algorithm starts by assigning a random phase. It than usesa Markov Chain Monte Carlo procedure (MCMC [11]) to obtain a set of samplesof haplotype phases given the genotype set. The MCMC sampling technique isperformed as follows. At each step we are given a speci�c sample (starting fromthe EM solution), and we are interested in suggesting another solution close toit. In the case of PHASE, it chooses a random individual. It then considersall other haplotypes in the population as �xed, inducing a distribution on thehaplotypes; the phase of the chosen individual is randomly picked based on thedistribution of haplotypes in the rest of the population, and using a prior which

6

is based on population history. Particularly, the prior is based on the Coalescentmodel. The new haplotypes are then given as the input for the next iteration.

The PHASE algorithm is extremely accurate, and arguably to date it is themost accurate phasing algorithm. However, the algorithm is prohibitively slow,and it would take many years to run the algorithm on a whole genome dataseteven in with a large cluster.

Hidden Markov Models Other probabilistic attempts have been made, es-pecially due to the slow running time of PHASE. Particularly, FastPHASE [24]is a method based on Hidden Markov Models. It assumes a �xed number ofclusters, where from each cluster a haplotype is drawn with cluster speci�c al-lele frequencies. The idea behind the algorithm is to capture the clusteringof haplotypes in tightly linked SNPs. Other methods with a �xed number ofcluster were also suggested. One of these methods is Gerbil[17]. The Gerbilalgorithm splits the genome into short regions and in each region it assumesa set of ancestral haplotypes. The HMM has transition probabilities betweenthe regions and emission probabilities which are probabilistic variations of theancestral haplotypes.

One of the problems with the HMM methods is that the number of statesin each position is �xed, thus not taking into account the fact that the char-acteristics of the haplotype distribution varies across the genome. Recently, amethod that uses a Markov model with a varying number of states has beensuggested[3]. The method, called BEAGLE, uses a localized haplotype clustermodel in order to make use of the LD structure. Localized haplotype clustermodel is a directed acyclic graph which follows several properties: every nodel has an associated level n and all nodes which have an incoming arc to l areat level n − 1 while all node that have and incoming arc from l have at leveln + 1. The graph has a root of level 0. Each arc in the graph is labeled withan allele, where two arcs leaving the same node cannot be have the same label.Using this graph we can de�ne a haplotype of length m as a path in the graphfrom the root to a node in the m − th level. Also using this graph we cande�ne a cluster of haplotypes for arc e as all the haplotypes which go through e(�gure 2). The BEAGLE algorithm works in iterations; in each iteration it usesphased data to create the model, using the model as a Hidden Markov Model(HMM) it samples haplotype out the genotypes and passes them again to thenext iteration until convergence.

The BEAGLE model is highly �exible and it scales well to large populationsand large genomes. Apart from its e�ciency, it has been shown that BEAGLEis highly accurate[3], and for this reason, BEAGLE is currently the most widelyused phasing method.

2.2 Phasing via Perfect Phylogeny

As explained in the previous section, most methods do not explicitly modelthe population history (apart from PHASE). PHASE is using the populationhistory by essentially sampling phasing solutions from the distribution de�ned

7

Figure 2: An example of a localized haplotype graph for four sites. The boldroute represents the haplotype 1111. Node e represents a cluster which de�nestwo haplotypes {1111,1001}.

by all possible coalescent histories. A related model for phasing, the perfectphylogeny model, has been suggested shortly after the introduction of PHASE.This model assumes that the chromosomes of each parent are transmitted tothe child with some possible mutations, but with no recombination events. Fur-thermore, it assumes that each mutation occurs once in history and that thereare no backward mutations. Formally, this model can be viewed as a rootedtree in which each edge corresponds to a SNP and each vertex corresponds toa haplotype; the haplotypes of the two nodes adjacent to the edge labeled by aspeci�c SNP di�er in exactly the position of the SNP. Moreover, there are notwo edges in the tree that are labeled with the same SNP (see Figure 2).

Some of the ideas behind the algorithm suggested in this work is based onone of the algorithms for phasing via perfect phylogney, HAP[14]. Under theperfect phylogeny model, it is shown in [7] that a simple polynomial algorithmcan be devised to �nd all phasing solution that are compatible with the perfectphylogeny. The HAP algorithm uses a top down approach by selecting the rootSNP as the SNP with the highest frequency and dividing all other SNPs intotwo sub trees, one that includes all descendants of the root, and another thatincludes the rest of the SNPs. The partition of the SNPs to subtrees is performedby building a set of constraints for every pair of SNPs based on their appearancein the data. Particularly, under the perfect phylogeny model, every pair of SNPsinduces a subtree which is also a perfect phylogeny tree, and therefore for eachpair of SNPs there are exactly three haplotypes (this is often referred to as thefour gamete property). The HAP algorithm uses the four gamete property todecide whether a SNP has to be a descendant of the root.

In practice, the perfect phylogeny model does not always hold, especiallyin cases where the SNPs are relatively far apart, in which case the correlationbetween the SNPs is weak. In case the data does not �t perfect phylogeny modelthe HAP algorithm constructs the tree based on looser restrictions, where thefour gamete test is 'almost' satis�ed.

Unlike BEAGLE, HAP divides the data into blocks of limited diversity andit performs the phasing in each of the blocks. This is a reasonable assumptionsince when the window length is too long the model may not hold. The windowsare then glued together using a dynamic programming approach[8]

8

2.3 Sequencing based phasing

The problem of phasing based on sequence data was �rst introduced as `hap-lotype assembly' problem [20] and was proven to be NP-hard even for reads oflength 2 [4]. During the last years many approximation algorithm appeared.

One of the �rst algorithms trying to solve this problem was the Greedyalgorithm [19]. It works by concatenating reads with minimum errors together.The greedy algorithm is highly e�cient but is inaccurate when the sequencingerror rate is high. Other more advanced algorithms have appeared over the lastfew years. One such method ,HapCut [2], it aims to �nd the solution whichminimizes the MEC (minimum error correction) score. It reduces the problemto a graph and then improves the MEC score in every iteration by solving aMax-Cut problem. The graph GX(H) is de�ned as a graph over SNP positionswhere an edge e exists between two positions if there exists a read which coversboth positions.

From here on a H is de�ned as a haplotype separation. The weight of edgein the graph is consider as the accuracy of the phase in this edge (in particularthe number of reads which agree with the phase minus the number of readscontradicting the phase). The algorithm then shows how using a cut in thegraph it is possible to build new haplotypes H∗ which have a better MEC scorethan H.

HapCut is not guaranteed to reach the optimal result, it was also impossibleto approximate the distance from optimum. Another algorithm was suggestedby Dan He[15], this algorithm in 99.98% of the cases returned an optimal result.The algorithm works by converting the problem to a MaxSAT problem and thenrunning a MaxSAT solver algorithm.

The haplotype assembly algorithms share several disadvantages, among thosethe fact that they need a high coverage and very long reads which does not al-ways exist. Also, they ignore the additional information gained by exploringpopulation data. This extra information can help eliminating sequencing er-rors. In our algorithm we use the extra information from reads covering morethan a single allele, to study and model the LD structure in the population.

3 Methods

The basic assumption of our algorithm is that within a short region, the his-tory of the genetic variants (SNPs or deletions) follows the perfect phylogenymodel, i.e., there are no recurrent mutations or recombinations. We denote byT = (V,E) the underlying directed perfect phylogeny tree, where each v ∈ Vcorresponds to a haplotype of length m (a string over {0, 1}m) and every edgecorresponds to a mutation. For every v ∈ V , let pv be the haplotype frequencyin the population.

Modeling the sequencing procedure. The haplotypes themselves are notgiven by the sequencing procedure, but instead, the sequencing procedure pro-

9

vides a large set of short reads, arguably sampled from random positions in thegenome and from a random copy of the chromosome (out of the two copies).The sequencing itself is a noisy procedure, which depends on several parame-ters, the coverage k, the read length l, and the sequencing error rate ε. We willassume each read is a copy of l bases extracted from the genome starting at arandom position. The copy is not an exact copy and we will assume that thereis a probability ε for the base to be read incorrectly. We note that in practice εdepends on the position of the base in the read, however when we consider allreads that overlap with a speci�c position in the genome, then the error ratedoes not depend on the position since the position itself is uniformly distributed.The coverage k refers to the expected number of reads overlapping with any sin-gle base (the actual coverage at a speci�c position is Poission distributed withparameter k).

Problem statement. The input for our algorithm is the sequence data, i.e.,the set of reads obtained from the sequencer, where each read is assumed to begenerated by randomly picking a position in the genome, randomly picking oneof the copies of the chromosome in that position, and adding noise using theparameter ε in each position of the read independently. We will denote the setof reads of individual i as Ri. Additionally, we will denote the two haplotypesof individual i as Hi = (h1i , h

2i ), where h

1ij , h

2ij are the alleles of the haplotypes

in SNP j, and h1i,S , h2i,S , are the haplotypes restricted to the set of SNPs S. If

Hi is known for each i, we can write the likelihood of the data as

Pr(R1, . . . , Rn | H1, . . . ,Hn) =

n∏i=1

∏r∈Ri

Pr(r | Hi),

where Pr(r | Hi) is the probability of observing read r given the two haplo-types of individual i. Particularly, if r spans the positions of a set of SNPs S,and if d(x, y) is the Hamming distance between two sequences x and y (restrictedto the set of SNPs), then

Pr(r | Hi) =1

2εd(h

1i,S ,r)(1− ε)|S|−d(h

1i,S ,r) +

1

2εd(h

2i,S ,r)(1− ε)|S|−d(h

2i,S ,r) (1)

Our algorithm aims at �nding a perfect phylogeny tree on the set of SNPs ina given window, and a corresponding haplotype assignment for each individual.The haplotypes assigned to each individual need to be taken from the tree, andthe objective is to optimize the likelihood of the reads. We will explain laterhow the perfect phylogeny assumption can be relaxed.

Tree reconstruction within a window Within a window, the tree recon-struction algorithm works in the following way. We �rst search for a SNP j∗

that is adjacent to the root of the tree. We then use a partitioning procedureto decide for every other SNP j, whether it is a descendent of j∗ in the tree.

10

This partitioning procedure splits the set of SNPs into two, and we recursivelycompute the two subtrees.

Throughout, we assume that the alleles of each SNP are represented by the{0, 1} notation, where 0 is the more common allele. As shown in [7], the set ofhaplotypes corresponds to a perfect phylogeny, if and only if it corresponds toa perfect phylogeny with the all zeros vector as the root. We thus assume thatthe root of the tree is the haplotype with all 0 values. Under these assumptions,it is clear that for two SNPs j and j′, SNP j cannot be a descendant of SNPj′ if their corresponding allele frequencies satisfy fj > fj′ . Therefore, we �rstestimate the minor allele frequency of each SNP and we choose the SNP j∗ asthe SNP with the largest estimated allele frequency, breaking ties arbitrarily.

For a SNP j and an individual i, we denote by Rij ⊆ Ri the set of readsthat overlap with SNP j. We write the likelihood of the minor allele frequencyas follows:

L(fj ;R1j , . . . , Rnj) =

n∏i=1

2∑g=0

fgj (1− fj)2−g2(2−g)g ∏r∈Rij

Pr(r | Gij = g)

,

where Gij = h1ij + h2ij is the genotype of individual i and position j. Note thatPr(r | Gij) is given in Equation (1), restricted to the case S = {j}. Now,Ai,j,g =

∏r∈Rij

Pr(r | Gij) can be computed in linear time, in a preprocessingprocedure, regardless of fg. Therefore, the log likelihood can be rewritten inthe following way:

log(L(fj ;R)) =n∑i=1

log

(2∑g=0

Ai,j,gfgj (1− fj)

2−g2(2−g)g)

),

where Cj does not depend on fj . We use an expectation-maximization [22](EM) algorithm to estimate fj . In this case, we denote the missing variables(i.e., the assignment of genotypes to individuals at SNP j) by Z. At eachiteration t of the EM, we have a guess f t, and we now compute the E-step andthe M-step:

E-step:

Q(f | f t) = EZ|ft [log(Pr(R1j , . . . , Rnj , Z | f)]

= C +

n∑i=1

2∑g=0

pti,j,g(g log(f) + (2− g) log(1− f))

= C +

n∑i=1

2∑g=1

pti,j,gg log f +

n∑i=1

1∑g=0

pti,j,g(2− g) log(1− f)

Here,

pti,j,g = Pr(Zi = g | f t, Rij) =2g(2−g)f tg(1− f t)2−gAi,j,g∑2k=0 2

k(2−k)f tk(1− f t)2−kAi,j,k.

11

M-step:

f t+1 = argmaxQ(f | f t) =∑ni=1

∑2g=1 p

ti,j,gg

2n

The partitioning procedure. Now that we have chosen j∗, the algorithmproceeds by partitioning the other SNPs into two sets, T1 and T2, where T1 cor-responds to the subtree of the root excluding the edge j∗ and its descendants,and T2 corresponds to the subtree located below j∗. In order to �nd this par-tition, for each SNP j, we calculate the likelihoods of j being in T1 verses thelikelihood of j being in T2, and we assign j to the tree for which the likelihoodis higher. We later refer to this partitioning method as PPHS-2.

Figure 3: One out of 3 di�erent options of Tree T ∗where j2 is the child of j1.

This approach is highly e�cient (the running time of each partition iterationis linear), however, the algorithm is highly sensitive to mistakes occurring earlyon in the process of the tree reconstruction, and it does not take into accountthe overall multivariable relations between j∗ and all other SNPs. For thisreason, we also consider an alternative partitioning approach, (PPHS-3), whichis based on the pairwise relations of pairs of SNPs with j∗. Formally, for eachj1, j2 ∈ {1, . . . ,m} \ {j∗}, we consider the four possible con�gurations, that is(1) j1, j2 ∈ T1, (2) j1, j2 ∈ T2, (3) j1,∈ T1, j2 ∈ T2, and (4) j1,∈ T2, j2 ∈ T1.Each of these con�gurations corresponds to a set of possible subtrees inducedby j∗, j1, j2. Particularly, in cases (1) and (2) there are three possible subtrees,while cases (3) and (4) correspond to a unique subtree (see Figure 3).

For each of these subtrees, T ∗, we guess the most likely haplotype frequencydistribution over the tree using EM. As before, pv corresponds to the frequencyof the haplotype represented by vertex v, and pt(v) corresponds to the guess ofthe EM algorithm at step t.

E-step

Q(p | pt) = C +

n∑i=1

∑(v1,v2)∈T∗

at(v1,v2),i log(pv1pv2),

12

where C is a constant that does not depend on p, and

atv1,v2,i =pt(v1)p

t(v2) Pr(Ri,{j1,j2,j∗} | Hi = (v1, v2))∑(u1,u2)∈T∗ p

t(u1)pt(u2) Pr(Ri,{j1,j2,j∗} | Hi = (u1, u2))

M-step

pt+1(v) =

∑ni=1

∑u∈T∗ a

tv,u,i

n.

In practice, we start the EM algorithm from a distribution p0v de�ned asfollows. Let s0 be the SNP corresponding to the parent edge of v, and let Sbe the set of SNPs corresponding to the children edges of v. We set p0v =ps0 −

∑s∈S ps, where ps is the estimation of the allele frequency of SNP s using

the EM as described above. It is easy to verify that if the allele frequency areaccurately estimated then the haplotype frequencies are correctly estimated aswell (restricted to the tree T ∗).

The EM algorithm provides us with a set of haplotype frequencies pT∗

h cor-responding to the tree T ∗. The likelihood of the subtree can now be writtenas

Pr(R | T ∗) =

n∏i=1

∑h1h2∈T∗

Pr(Ri,{j1,j2,j∗} | (h1, h2))pT∗

h1pT∗

h2

We now de�ne a set of four complete graphs Gab = (V,Eab) (a, b ∈ {0, 1})where V is the set of SNPs. In graph Gab, an edge ej1,j2 ∈ Eab has an associatedweight

wabej1,j2 =1

|{T ∗ | j1 ∈ Ta, j2 ∈ Tb}|∑

T∗∈{T∗j1∈Ta,j2∈Tb}

Pr(R | T ∗)

Let (S1, S2) be a partition of the graphs Gab. Let

wab(S1, S2) =∑

j1∈Sa,j2∈Sb

wab(j1,j2)

The algorithm proceeds by searching for the partition that maximizes∑a,b wab(S1, S2).

In order to �nd the best partition, for low values of m we enumerate over allpossible partitions, and for larger values of m we randomly pick subsets of thevertices, compute the partition for these subsets, and use a majority vote todecide on the overall partition.

3.1 Generation of haplotypes for each window

For every window, the recursive process described above returns the perfectphylogeny tree T which best �ts the data. Now the algorithm needs to select forevery individual i two haplotypes h1, h2. Since the data may not perfectly �t the

13

model, we add additional haplotypes to the pool of possible haplotypes. Usingthe tree T we create a list L of haplotypes, where each haplotype h correspondsto a speci�c node v ∈ T . From each haplotype in L we createm new haplotypes,where m is the number of SNPs in the window by adding all possible one-mutations to the haplotype. We call these haplotypes the haplotypes derivedfrom L.

For each individual i we select two haplotypes h1, h2 which maximizes the aposteriori probability:

Pr(h1, h2 | Ri) ∝ Pr(Ri | h1, h2) Pr(h1h2) = Pr(h1) Pr(h2)∏r∈Ri

Pr(r | h1, h2)

The priors Pr(h) are calculated based on the haplotypes assigned so far toother individuals, and based on the fact that the haplotypes in L are more likelythan the haplotypes derived from L. Particularly, when we consider individuali + 1, let Hi be the set of haplotypes assigned to the �rst i individuals, andfor each haplotype h, let nh be the total number of occurrences of h in Hi.Then, we set the vector of prior to be ~p = (ph1 , . . . , phk

) which maximizes themaximum a posteriori (MAP) under a Dirichelet prior with α1 weight in the Lhaplotypes and α2 weight in the haplotypes derived from L:

Pr(~p | Hi) ∝ Pr(Hi | ~p) Pr(~p) ∝∏h

pnh+αh

h ,

which is maximized for ph ∝ nh + αh.

3.2 Stitching of windows

The framework discussed so far assumes that the haplotypes are inferred withina window of m SNPs. The length of the window is mainly determined by theassumption of perfect phylogeny, and therefore we cannot apply our methodto long windows. When working on large number of SNPs we use BEAGLE'sresults as a guide that connects the di�erent windows.

Let h′

i1, h′

i2be the haplotypes returned by BEAGLE for individual i. Con-

sider a window w or length m spanned by SNPs j ∈ {w...w+m−1}. Let hwi1 , hwi2

be the haplotype returned by the above algorithm for window w and let hw′

i1, hw

′

i2the haplotypes returned by BEAGLE restricted to the same window w.

We calculate the Hamming distances d1, d2:

d1 = d(hwi1, hw′

i1 ) + d(hwi2, hw′

i2 )

d2 = d(hwi2, hw′

i1 ) + d(hwi1, hw′

i2 )

In case d1 < d2, d1 ≤ d∗ we use haplotypes hi1 , hi2 as the haplotypes for windoww, if d2 < d1, d2 ≤ d∗ we use haplotypes hi2 , hi1 as the haplotypes for window w.In all other cases we use the haplotypes h

′

i1, h′

i2which were returned by BEAGLE

14

Figure 4: An example of a perfect phylogeny tree reconstructed by our algorithmusing data simulated by MSMS. Each node contains two values, the �rst one isthe number of the SNP changed by this node, the second is the frequency of thesub tree de�ned by this node.

as the halotypes for window w. Put di�erently, we use the BEAGLE haplotyperesults in order to determine the ordering of our solution if our solution isrelatively close to BEAGLE's solution. If our solution is far from BEAGLE'ssolution we assume that the perfect phylogeny model does not hold and thereforewe simply use BEAGLE's solution in this case. We chose the threshold d∗ tobe m/2 based on empirical evaluation.

4 Results

In order to evaluate the performance of our methods, we �rst used simulateddata to generate a large number of trees, and we measured the accuracy ofthe tree reconstruction directly. We then considered a set of haplotypes, eitherrandomly generated from the simulated tree, or taken from real data, and weevaluated the phasing accuracy on these data. For the latter, we used an exten-sion of the switch distance error metric. Generally, there are two types of errors;the �rst type is regular switch errors (for example two haplotypes 11 and 00 arephased as 10 and 01 ), and the second type is mismatch errors (two haplotypes11,00 are phased as 11 and 01). We used the sum of all switch and mismatcherrors (S +M when S is the number of switch errors in the data and M is thenumber of mismatch errors). We refer to this as the SM error metric.

4.1 Evaluation under Simulated data

We used the MSMS software [9] to simulate data from a perfect phylogeny(see Figure 4). MSMS [9] uses the Coalescent process to create haplotypes andwhen the mutation rate is set low enough, the resulting haplotypes �t the perfectphylogeny model.

15

��

��

��

��

��

��

��

��

� � � � �

��

��

��

��

��

��

��

��

��

��

� ��

��

��

��

��

��

��

Figure 5: The tree reconstruction percentage as a function of the populationsize. The tests were done with 30 SNPs and a minimal haplotype frequencyof 3%. On the right, the sequencing error rate was 2%, with varying coveragevalues, and on the left, the coverage was 10, and the error rate was set to 4%.

Accuracy of tree reconstruction. Using MSMS, we generated a set of 100random haplotype groups over 30 SNPs, and for each group we randomly gen-erated a set of N genotypes, where N varies from 5 to 500. We then simulatedthe sequencing procedure on these genotypes (see Methods). For a given set ofgenotypes, we let PPHS reconstruct the tree, and the set of haplotypes obtainedcan be compared to the haplotypes of the original tree. In the true tree thereare 30 SNPs and therefore there are exactly 31 haplotypes, denoted by HTrue.We denote the haplotype frequencies of HTrue as p1, . . . , p31. The output ofthe algorithm results in another tree with a possibly di�erent set of haplotypes,HPPHS . Let S = HTrue ∩ HPPHS . Then, we measure the accuracy of thereconstruction as

∑i∈S pi. We observed (Figure 5) that the reconstruction of

the tree by our algorithm is highly accurate, and it converges to the correct treewhen the number of samples increases, even in the presence of high sequencingerrors.

Evaluation of phasing accuracy We compared the results of the PPHS al-gorithm to BEAGLE [3] (see Figure 6). In addition, in order to compare PPHSto the sequencing based algorithms for phasing, we implemented an algorithmMinSingleErr. In MinSingleErr we compute the likelihood for each genotype ateach site, and decide whether the site is heterozygous. If the site is heterozy-gous, it correctly guesses the phasing in this site. Thus, any haplotype assemblymethod that uses one sequence at a time (in contrast to a population of individ-uals) cannot do better than MinSingleErr, and the error rates and no-call rateof MinSingleErr is a lower bound on the no-call rate and error rate of methodssuch as HapCut and HapSAT [2, 15]. We observed that both algorithms showsimilar behavior for low error rates, however PPHS is more accurate than BEA-GLE as the error rate increases. As a function of the coverage, as expected, theperformance of all algorithms drastically decreases when the coverage is low.As we can see PPHS maintains a constant improvement over BEAGLE of 20%-

16

��

��

��

��

��

��

��

� � � � � �

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� � � � �

��

��

��

��

��

Figure 6: The window based SM error metric as a function of the sequencingerror rate (left) and as a function of the coverage (right). The tests were donewith haplotype windows of length was 5 over a population of 5 individuals. Thecoverage was set to 5 on the left, and the sequencing error rate was set to 3%on the right, each point is the average of 10000 windows.

40%. If we use the stitch function and compare the SM errors over the entirehaplotype our improvement over BEAGLE drops to 10%.

4.2 Evaluation under real data

In the above section we evaluated the performance of the algorithms underthe Coalescent model without recombinations. In practice, there are cases inwhich this model does not capture the empirical behavior of the data and itis therefore important to test the algorithm using real datasets. To do so, weused the EUR population data from the 1000 genome project [25] (taken fromthe BEAGLE website, 2010-08), in which the full sequences of 283 unrelatedindividuals of European ancestry are given. For each test we used haplotypesfrom human chromosome 22 which contains overall 162027 SNPs. We note thatthe genotype error rate in this data is quite high since the data used for the SNPcalling is of very low coverage (phase 1 of the 1000 genomes project). Therefore,it may be the case that in some cases the perfect phylogeny assumption doesnot hold while in fact the inconsistency is caused by the sequencing errors.

The genotype data provided by the 1000 Genomes project provides us witha realistic setting of the haplotype distribution in the population. Our experi-ments involved taking these haplotypes and simulating the sequencing process,as described in the Methods. We �rst measured the accuracy of the algorithmsas a function of the coverage and the sequencing errors with a set of �ve in-dividuals. We also compared our algorithm to haplotype assembly algorithms(HapCut [2]) which work on a single individual. We observed that only in thecases where the coverage is high and the error rate is low do the haplotype as-sembly algorithms have a chance of preforming better than PPHS and BEAGLE(assuming long reads).

We observed that with coverage of 4,6 PPHS is constantly better than BEA-GLE by 40%-100% when using the window based error function (see Figure 7).

17

Algorithm Test 1 Test 2 Test 3 Test 4

HapCut - No Calls 20.85 20.63 18.67 18.03HapCut - Errors 15.92 8.71 24.80 15.56

Beagle 2.76 1.81 2.20 1.09PPHS 2.43 1.20 1.71 0.24

Table 1: The performance of HapCut versus PPHS.All tests were done with 5 individuals, tests 1-3 had an expected coverage of 5while test 4 had an expected coverage of 20. Tests 1,3 had a sequencing errorrate of 5% and tests 2,4 had a sequencing error rate of 1%. The read length oftests 1,2 was 400 while for test 3,4 it was 2000 bases.

The advantage over BEAGLE drops to 10% when using the stitch function forPPHS and comparing the error rate over the entire haplotype.

We also compared our algorithm to HAPCUT, as can be seen in table 1, butsince HAPCUT requires long reads and high coverage to obtain quality results,its results were far worse than PPHS in our tests.

E�ects of read length and population size. As a mean to estimate thee�ect of the read length on the performance of the algorithms, we increased theread length but kept the coverage �xed. We observe that the performance ofthe PPHS algorithm improves when the read length increases and the coveragestays �xed (see Figure 8). This is expected since longer reads provide a largernumber of partial haplotypes. We also tested the e�ect of the populations sizeon the error rate (Figure 9). As can be seen in the �gure, BEAGLE error ratedrastically decreases for larger population sizes while PPHS error rate decreasesmore moderately. This suggests that the perfect phylogeny model is useful forphasing small population sizes; we note however, that this may be an artifact ofthe high error rate found in the 1000 genomes project, and future investigationof this issue may be possible when a higher quality dataset is released.

5 Conclusions

This work presents a new algorithm for phasing. This algorithm works byreconstructing the prefect phylogeny tree in every short region. Unlike previousmethods (e.g. HAP [14] ) we use raw read data to build the tree. Since rawread data includes sequencing errors, methods to overcome them were developed.The fact that a single read can contain more than one SNP is also used by thealgorithm. Some deviation from perfect phylogeny is allowed by permitting asingle recurrent mutation event in each haplotype.

The results demonstrate that the proposed algorithm works well and is im-mune to sequencing errors and small population sizes, making it robust. Inaddition to the solution for the phasing problem, the algorithm provides a newmethod to reconstruct perfect phylogeny under the condition of an error model.

18

��

��

��

��

��

��

��

� � � �

��

��

��

��

��

��

��

�

��

�

��

�

��

� � � � � �

��

��

��

��

��

��

�

��

��

��

��

�

��

��

��

��

�

� � � � �

��

��

��

��

��

��

Figure 7: The window based SM error metric as a function of the sequence errorrate for coverage values of 2,4,6. Read length was set to 400. The test was donewith a population of 5 individuals. The windows size which was used by PPHSis 5 SNPs.

��

��

��

��

��

��

��

��

�

��

��

��

��

��

��

Figure 8: The window based SM error metric as a function of the length of theread. The test was done with a windows size of 5 SNPs over a population of 5individuals. Coverage was 5 and the sequencing error rate was 3%.

19

��

��

��

�

��

��

��

��

� ��

��

��

��

��

Figure 9: The window based SM error metric as a function of the populationsize. The test was done with a windows size of 5 SNPs. Coverage was 5 andthe sequencing error rate was 3%. Read length was set to 400.

This approach, however, has a few limitations, which may be of interest forfurther research. The method is designed to work in windows. In order to stitchthe windows the BEAGLE results are used as a skeleton, and it is likely thatmore tailored methods (e.g. a variant of [Halperin, Sharan, Eskin] [1]) mayperform better.

An additional limitation of the algorithm is its performance while handlinglarge populations. As the results show, the performance of BEAGLE improvesrapidly with the size of the population, whereas the performance of PPHS im-proves at a slower rate. However, given simulated data (based on the Coalescentmodel), the algorithm's performance does improve considerably as the popula-tion size is increased. This might suggest that when using the 1000 genomesdata, the perfect phylogeny model breaks as the number of samples increasesdue to subtle population structure, or simply large number of errors within thatdata. If the former is true, there may be an optimization procedure that selectsthe length of the windows as a function of the linkage disequilibrium structurein the region. Such optimization of the parameters may result in better phasingalgorithms, and particularly in a better reconstruction of the trees.

References

[1] Eskin, E. and Sharan, R. and Halperin, E. A note on phasing long genomicregions using local haplotype predictions bioinformatics and computationalbiology, 4(3):639�648, 2006.

[2] V. Bansal and V. Bafna. Hapcut: an e�cient and accurate algorithm forthe haplotype assembly problem. Bioinformatics, 24(16):i153, 2008.

[3] S.R. Browning and B.L. Browning. Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies by use oflocalized haplotype clustering. The American Journal of Human Genetics,81(5):1084�1097, 2007.

20

[4] R. Cilibrasi, L. Van Iersel, S. Kelk, and J. Tromp. On the complexity ofseveral haplotyping problems. Algorithms in Bioinformatics, pages 128�139, 2005.

[5] A.G. Clark. Inference of haplotypes from pcr-ampli�ed samples of diploidpopulations. Molecular biology and evolution, 7(2):111, 1990.

[6] M.J. Daly, J.D. Rioux, S.F. Scha�ner, T.J. Hudson, and E.S. Lander. High-resolution haplotype structure in the human genome. Nature genetics,29(2):229�232, 2001.

[7] E. Eskin, E. Halperin, and R.M. Karp. E�cient reconstruction of hap-lotype structure via perfect phylogeny. INTERNATIONAL JOURNALOF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 1(1):1�20,2003.

[8] E. Eskin, E. Halperin, and R. Sharan. Optimally phasing long genomicregions using local haplotype predictions. In Proceedings of the SecondRECOMB Satellite Workshop on Computational Methods for SNPs andHaplotypes. Citeseer, 2004.

[9] G. Ewing and J. Hermisson. Msms: a coalescent simulation program includ-ing recombination, demographic structure and selection at a single locus.Bioinformatics, 26(16):2064, 2010.

[10] L. Exco�er and M. Slatkin. Maximum-likelihood estimation of molecu-lar haplotype frequencies in a diploid population. Molecular biology andevolution, 12(5):921, 1995.

[11] WR Gilks. Markov chain monte carlo. 1996.

[12] D. Gus�eld. Haplotyping as perfect phylogeny: conceptual frameworkand e�cient solutions. In Proceedings of the sixth annual internationalconference on Computational biology, pages 166�175. ACM, 2002.

[13] B. Halldórsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Is-trail. A survey of computational methods for determining haplotypes.Computational Methods for SNPs and Haplotype Inference, pages 613�614, 2004.

[14] E. Halperin and E. Eskin. Haplotype reconstruction from genotype datausing imperfect phylogeny. Bioinformatics, 20(12):1842�1849, 2004.

[15] D. He, A. Choi, K. Pipatsrisawat, A. Darwiche, and E. Eskin. Opti-mal algorithms for haplotype assembly from whole-genome sequence data.Bioinformatics, 26(12):i183, 2010.

[16] Y.T. Huang, K.M. Chao, and T. Chen. An approximation algorithm forhaplotype inference by maximum parsimony. Journal of ComputationalBiology, 12(10):1261�1274, 2005.

21

[17] G. Kimmel and R. Shamir. Gerbil: Genotype resolution and block identi-�cation using likelihood. Proceedings of the National Academy of Sciencesof the United States of America, 102(1):158, 2005.

[18] JFC Kingman. The coalescent. Stochastic processes and their applications,13(3):235�248, 1982.

[19] S. Levy, G. Sutton, P.C. Ng, L. Feuk, A.L. Halpern, B.P. Walenz, N. Ax-elrod, J. Huang, E.F. Kirkness, G. Denisov, et al. The diploid genomesequence of an individual human. PLoS biology, 5(10):e254, 2007.

[20] R. Lippert, R. Schwartz, G. Lancia, and S. Istrail. Algorithmic strategies forthe single nucleotide polymorphism haplotype assembly problem. Brie�ngsin bioinformatics, 3(1):23, 2002.

[21] J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin,S. Lin, Z.S. Qin, H.M. Munro, G.R. Abecasis, et al. A comparison of phas-ing algorithms for trios and unrelated individuals. The American Journalof Human Genetics, 78(3):437�450, 2006.

[22] T.K. Moon. The expectation-maximization algorithm. Signal ProcessingMagazine, IEEE, 13(6):47�60, 1996.

[23] N. Patil, A.J. Berno, D.A. Hinds, W.A. Barrett, J.M. Doshi, C.R. Hacker,C.R. Kautzer, D.H. Lee, C. Marjoribanks, D.P. McDonough, et al. Blocks oflimited haplotype diversity revealed by high-resolution scanning of humanchromosome 21. Science, 294(5547):1719, 2001.

[24] P. Scheet and M. Stephens. A fast and �exible statistical model for large-scale population genotype data: applications to inferring missing geno-types and haplotypic phase. The American Journal of Human Genetics,78(4):629�644, 2006.

[25] N. Siva. 1000 genomes project. Nature biotechnology, 26(3):256�256, 2008.

[26] S. Sridhar, G.E. Blelloch, R. Ravi, and R. Schwartz. Optimal imper-fect phylogeny reconstruction and haplotyping (ipph). In Computationalsystems bioinformatics: CSB2006 conference proceedings, Stanford CA,14-18 August 2006, page 199. Imperial College Pr, 2006.

[27] M. Stephens, N.J. Smith, and P. Donnelly. A new statistical method forhaplotype reconstruction from population data. The American Journal ofHuman Genetics, 68(4):978�989, 2001.

22

Haplotype reconstruction using perfect phylogeny …Haplotype reconstruction using perfect phylogeny...

Documents

Transcript of Haplotype reconstruction using perfect phylogeny …Haplotype reconstruction using perfect phylogeny...