SHARLEE CLIMER, ALAN R. TEMPLETON, AND WEIXIONG ZHANG ACM-BCB, NIAGARA FALLS AUGUST 2010...
SHARLEE CLIMER, ALAN R. TEMPLETON, AND WEIXIONG ZHANG
ACM-BCB, NIAGARA FALLS, AUGUST 2010
SplittingHeirs: Inferring Haplotypes by Optimizing Resultant Dense Graphs
Overview
Introduction
Definition of haplotype inference problem
Previous approaches
SplittingHeirs
Experimental results
Introduction
Only about 0.1% of human DNA varies between individuals
Most of this variation is due to Single Nucleotide Polymorphisms (SNPs)
Most SNPs have only two variants, or alleles, within a population
Broad definition of haplotype: a set of alleles for a given set of SNPs in relatively close proximity on a chromosome
Image source: http://www.dnabaser.com/articles/SNP/SNP-Single-nucleotide-polymorphism.png
Introduction
DNA is transcribed to produce RNA
RNA is translated, ultimately producing proteins
Variation in non-coding regions might have an effect on regulation
SNPs throughout the genome may be of interest
Image source: http://www.cytochemistry.net/cell-biology/ribosome.htm
Humans are diploid: chromosomes come in pairs
Common sequencing produces a meld of the two haplotypes, referred to as a genotype
Computational methods are used to infer the pair of haplotypes from a genotype; this is called phasing the genotype
[Figure: alleles at SNP1 and SNP2 on the two chromosomes: G C and T T]
Introduction
G T A C + C T A G
C T A C + G T A G?
Accuracy matters when inferring haplotypes from genotypes
Phasing is frequently an early step in expensive and vitally important studies
[Figure: example alleles at SNP1 and SNP2]
Introduction
It is possible to identify the separate haplotypes directly
Only feasible for very small studies
Useful for testing the accuracy of computational methods
Andres et al. [Genet. Epi. 2007] found that computational methods had poor accuracy and that their confidence levels were error prone
PHASE [Stephens et al., AJHG 2001]
fastPhase [Scheet and Stephens, AJHG 2006]
HAP [Halperin and Eskin, Bioinformatics 2004]
GERBIL [Kimmel and Shamir, PNAS 2005]
Errors in confidence levels suggest that the models might not fully capture biological properties
Problem Definition
Let ‘0’ and ‘1’ represent the two possible alleles for a given SNP
Haplotype represented by a string of binary values
Genotype for a pair of haplotypes, site by site:
‘0’ if both alleles are ‘0’
‘1’ if both alleles are ‘1’
‘2’ if heterozygous
Haplotype 1: G T A C = 1 1 0 0
Haplotype 2: C T A G = 0 1 0 1
Genotype:              2 1 0 2
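The encoding rule above can be sketched in a few lines of Python. This is only an illustration: the function name genotype_from_haplotypes is hypothetical, and haplotypes are assumed to be given as strings of ‘0’ and ‘1’ characters.

```python
def genotype_from_haplotypes(hap_a: str, hap_b: str) -> str:
    """Encode a pair of binary haplotypes as a 0/1/2 genotype string."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Homozygous sites keep their shared allele; heterozygous sites become '2'.
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


# The slide's example: G T A C = 1100 and C T A G = 0101.
example = genotype_from_haplotypes("1100", "0101")
print(example)
```

Swapping the two arguments gives the same genotype, which is why the phase cannot be read back from the genotype alone.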
Problem Definition
For k heterozygous sites, there are 2^(k-1) feasible solutions (the order of the two haplotypes in a pair does not matter)
It is not apparent which solution is more likely than another
Population-level characteristics
There tend to be relatively few unique haplotypes
There tend to be clusters of haplotypes that are similar to each other
Some haplotypes are relatively common
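The count of feasible solutions for a single genotype can be checked by brute-force enumeration. The sketch below is illustrative only: feasible_phasings is a hypothetical helper, genotypes are assumed to be strings over ‘0’, ‘1’, ‘2’, and exhaustive enumeration is practical only for small k.

```python
from itertools import product


def feasible_phasings(genotype: str) -> list[tuple[str, str]]:
    """List every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        # A fully homozygous genotype has exactly one resolution.
        return [(genotype, genotype)]
    pairs = []
    # Pin the first heterozygous site to '0' on the first haplotype so that
    # each unordered pair is produced exactly once.
    for bits in product("01", repeat=len(het_sites) - 1):
        first = list(genotype)
        second = list(genotype)
        for site, bit in zip(het_sites, ("0",) + bits):
            first[site] = bit
            second[site] = "1" if bit == "0" else "0"
        pairs.append(("".join(first), "".join(second)))
    return pairs


for pair in feasible_phasings("2102"):
    print(pair)
```

For a genotype with six heterozygous sites, such as 2222 2121, the same function returns 32 pairs, which shows how quickly the search space grows even before several individuals are considered jointly.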
Problem Definition
Given a set of genotypes drawn from a population:
1) Find the set of haplotypes that exist in the set
2) For each genotype, determine the pair of haplotypes that is most likely to exist in the given individual
Image source: http://www.samepoint.com/blog/wp-content/uploads/2009/04/blog_group_of_people_1.jpg
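A candidate answer to the two-part problem above can be represented as a mapping from each genotype to a haplotype pair. The sketch below checks that every pair is consistent with its genotype and collects the resulting haplotype set. The names resolves and haplotype_set and the toy four-site data are hypothetical; they are not taken from the talk.

```python
def resolves(genotype: str, hap_a: str, hap_b: str) -> bool:
    """True if the binary haplotype pair is consistent with the 0/1/2 genotype."""
    if not (len(genotype) == len(hap_a) == len(hap_b)):
        return False
    return all(
        (g == a == b) if g in ("0", "1") else (g == "2" and a != b)
        for g, a, b in zip(genotype, hap_a, hap_b)
    )


def haplotype_set(solution: dict) -> set:
    """Collect the unique haplotypes used by a genotype -> (hap, hap) mapping."""
    haps = set()
    for genotype, (hap_a, hap_b) in solution.items():
        if not resolves(genotype, hap_a, hap_b):
            raise ValueError(f"pair does not resolve genotype {genotype}")
        haps.update((hap_a, hap_b))
    return haps


# Toy data (assumed for illustration): three individuals, four SNP sites.
toy_solution = {
    "1100": ("1100", "1100"),
    "2102": ("1100", "0101"),
    "2100": ("1100", "0100"),
}
print(sorted(haplotype_set(toy_solution)))
```

The size of the returned set is the number of unique haplotypes in the candidate solution.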
Example
g1: 1111 0001
g2: 2212 0202
g3: 2220 2102
g4: 2222 2121
g5: 2022 0222
Example problem: 5 individuals, 8 SNP sites
Solutions are displayed as graphs
Each node represents a unique haplotype
Edge weight is a measure of the difference between two haplotypes, set equal to the number of sites at which they differ
Only the edges with the smallest distances are shown
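The graph view described above can be sketched as follows. The edge weight is the number of differing sites (the Hamming distance), as stated on the slide; the function names and the toy haplotypes are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(hap_a: str, hap_b: str) -> int:
    """Number of sites at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    return sum(a != b for a, b in zip(hap_a, hap_b))


def haplotype_graph(haplotypes) -> dict:
    """Complete graph over the unique haplotypes, keyed by node pair."""
    nodes = sorted(set(haplotypes))  # one node per unique haplotype
    return {(a, b): hamming(a, b) for a, b in combinations(nodes, 2)}


# Toy input (assumed): the duplicate '1100' collapses into a single node.
edges = haplotype_graph(["1100", "0101", "1101", "1100"])
for (a, b), weight in sorted(edges.items(), key=lambda item: item[1]):
    print(a, b, weight)
```

Sorting the edges by weight, as in the loop, makes it easy to keep only the smallest distances for display.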
Example
Solution found by:
Clark’s Subtraction Method [Mol. Biol. and Evol. 1990]
Pure Parsimony [Gusfield, CPM’03]
EM [Excoffier and Slatkin, Mol. Biol. Evol. 1995]
5 unique haplotypes
Haplotypes are not very similar to each other
Example
No Perfect Phylogeny solution
Solution found by HAP
6 unique haplotypes
Haplotypes are slightly more similar to each other
Example
Solution found by PHASE
9 unique haplotypes
Haplotypes are more similar to each other
Example
PHASE favors pair-wise similarities
Essentially evaluating a nearest-neighbor graph
SplittingHeirs
SplittingHeirs favors cluster-wide similarities, as well as reduced cardinality
Cast as a Mixed Integer Linear Program (MIP)
Minimize:
where
d_i = the weight of edge i
h = the cardinality of the haplotype set
u = a weighting factor
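The objective formula itself is not preserved in this transcript. Purely as an assumption, the sketch below combines the three quantities defined above additively: the sum of the edge weights d_i plus u times the cardinality h. The function name and the additive form are hypothetical and may differ from the model presented in the talk.

```python
def assumed_objective(edge_weights, num_haplotypes: int, u: float) -> float:
    """Score a candidate solution under an ASSUMED additive objective.

    ASSUMPTION: total edge weight plus u times the haplotype-set cardinality.
    The actual formula is missing from the transcript.
    """
    return sum(edge_weights) + u * num_haplotypes


# Toy numbers (assumed): three edges of weight 2, 1, 1 and three haplotypes.
score = assumed_objective([2, 1, 1], num_haplotypes=3, u=0.5)
print(score)
```

Under this assumed form, a larger u penalizes solutions with many unique haplotypes more heavily, while a smaller u puts more emphasis on similarity between haplotypes.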
SplittingHeirs
Enforce cluster-wide similarities by requiring a minimum density of edges in the graph
Additional constraint:
where
e = the number of edges
a = a configurable parameter
a can be decreased for a highly diverse sample
a can be increased for a sample with low diversity
Example
Solution found by SplittingHeirs
8 unique haplotypes
Haplotypes are quite similar to each other
Results
Tested on 7 sets of haplotype data for which the true phase is known
n is the number of individuals
m is the number of sites
# Ambiguous is the number of genotypes that have more than one feasible solution
Conclusions
Introduced a biologically intuitive model that optimizes cluster-wide similarities and reduced cardinality
Globally optimal solutions can be computed for small regions (e.g., candidate locus studies)
Future work
Speed up computation
Use the model to guide an approximation method
Image source: http://farm3.static.flickr.com/2268/2255581637_a59a956bfe.jpg