SHARLEE CLIMER, ALAN R. TEMPLETON, AND WEIXIONG ZHANG

ACM-BCB, NIAGARA FALLS, AUGUST 2010

SplittingHeirs: Inferring Haplotypes by Optimizing Resultant Dense Graphs

Overview

Introduction

Definition of haplotype inference problem

Previous approaches

SplittingHeirs

Experimental results

Introduction

Only 0.1% of human DNA has variation

Most of this variation is due to Single Nucleotide Polymorphisms (SNPs)

Most SNPs have only two variants, or alleles, within a population

Broad definition of haplotype: a set of alleles for a given set of SNPs in relatively close proximity on a chromosome

Image source: http://www.dnabaser.com/articles/SNP/SNP-Single-nucleotide-polymorphism.png

Introduction

DNA is transcribed to produce RNA

RNA is translated, ultimately producing proteins

Variation in non-coding regions might have an effect on regulation

SNPs throughout the genome may be of interest

Image source: http://www.cytochemistry.net/cell-biology/ribosome.htm

Humans are diploid: chromosomes come in pairs

Common sequencing produces a meld of the two haplotypes, referred to as a genotype

Computational methods are used to infer a pair of haplotypes from a genotype, a step known as phasing the genotype

[Figure: a chromosome pair carrying alleles G C and T T at SNP1 and SNP2]

Introduction

G T A C + C T A G

C T A C + G T A G?

Accuracy matters when inferring haplotypes from genotypes: phasing is frequently an early step in expensive and vitally important studies

[Figure: two panels showing alleles at SNP1 and SNP2]

Introduction

Possible to identify the separate haplotypes directly, but only feasible for very small studies

Useful for testing accuracy of computational methods

Andres et al. [Genet. Epi. 2007] found computational methods had poor accuracy and confidence levels were error prone:

PHASE [Stephens et al., AJHG 2001]

fastPHASE [Scheet and Stephens, AJHG 2006]

HAP [Halperin and Eskin, Bioinformatics 2004]

GERBIL [Kimmel and Shamir, PNAS 2005]

Errors in confidence levels suggest that the models might not fully capture biological properties

Problem Definition

Let ‘0’ and ‘1’ represent the two possible alleles for a given SNP

Haplotype represented by a string of binary values

Genotype for a pair of haplotypes:

‘0’ if both alleles are ‘0’

‘1’ if both alleles are ‘1’

‘2’ if heterozygous

Haplotypes: G T A C    C T A G

Encoded:    1 1 0 0    0 1 0 1

Genotype:   2 1 0 2
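The encoding rule on this slide can be written as a short function. This is a minimal sketch for illustration only; the function name `genotype_from_haplotypes` is hypothetical and is not taken from the SplittingHeirs software.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Meld two binary haplotype strings into a 0/1/2 genotype string.

    Hypothetical helper for illustration; not part of SplittingHeirs.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Homozygous sites keep their shared allele; heterozygous sites become '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The slide's example: haplotypes 1100 and 0101 give genotype 2102.
print(genotype_from_haplotypes("1100", "0101"))  # prints 2102
```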

Problem Definition

For k heterozygous sites, there are 2^(k-1) feasible solutions

Not apparent which solution is more likely than another
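To make the count of feasible solutions concrete, the sketch below enumerates every unordered haplotype pair consistent with a genotype. The function name `feasible_phasings` is an assumption made for this example. Fixing the allele at the first heterozygous site avoids listing each pair twice, which is where the exponent k-1 comes from.

```python
from itertools import product


def feasible_phasings(genotype: str) -> list[tuple[str, str]]:
    """List all unordered haplotype pairs consistent with a 0/1/2 genotype.

    Hypothetical helper for illustration; not part of SplittingHeirs.
    """
    het = [i for i, g in enumerate(genotype) if g == "2"]
    if not het:
        # A fully homozygous genotype has exactly one resolution.
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site to '0' in the first haplotype so that
    # each unordered pair is generated exactly once.
    for rest in product("01", repeat=len(het) - 1):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, allele in zip(het, ("0",) + rest):
            h1[site] = allele
            h2[site] = "1" if allele == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(feasible_phasings("2102"))            # two pairs, since k = 2
print(len(feasible_phasings("22222121")))   # prints 32, since k = 6
```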

Population-level characteristics:

There tend to be relatively few unique haplotypes

There tend to be clusters of haplotypes that are similar to each other

Some haplotypes are relatively common

Problem Definition

Given a set of genotypes drawn from a population:

1) Find the set of haplotypes that exist in the set

2) For each genotype, determine the pair of haplotypes that is most likely to exist in the given individual
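A candidate answer to this problem is one haplotype pair per genotype. The sketch below checks that such an answer is feasible and counts its unique haplotypes. The names `is_consistent` and `solution_cardinality` are hypothetical, and the code only verifies feasibility; it does not decide which solution is most likely.

```python
def is_consistent(genotype: str, h1: str, h2: str) -> bool:
    """Return True if the pair (h1, h2) resolves the 0/1/2 genotype."""
    if not (len(genotype) == len(h1) == len(h2)):
        return False
    for g, a, b in zip(genotype, h1, h2):
        if g == "2":
            # Heterozygous site: the two alleles must differ.
            if a == b:
                return False
        elif not (g == a == b):
            # Homozygous site: both alleles must equal the genotype value.
            return False
    return True


def solution_cardinality(genotypes, phasing) -> int:
    """Validate a phasing (one pair per genotype) and count unique haplotypes."""
    if len(genotypes) != len(phasing):
        raise ValueError("need exactly one haplotype pair per genotype")
    for g, (h1, h2) in zip(genotypes, phasing):
        if not is_consistent(g, h1, h2):
            raise ValueError(f"pair ({h1}, {h2}) does not resolve {g}")
    return len({h for pair in phasing for h in pair})


# Toy data (not from the talk): two genotypes explained by two haplotypes.
genotypes = ["1100", "2102"]
phasing = [("1100", "1100"), ("0101", "1100")]
print(solution_cardinality(genotypes, phasing))  # prints 2
```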

Image source: http://www.samepoint.com/blog/wp-content/uploads/2009/04/blog_group_of_people_1.jpg

Example

g1: 1111 0001
g2: 2212 0202
g3: 2220 2102
g4: 2222 2121
g5: 2022 0222

Example problem: 5 individuals, 8 SNP sites

Display solutions as graphs

Each node represents a unique haplotype

Edge weight: a measure of difference between haplotypes, set equal to the number of sites that differ between the haplotypes

Edges with smallest distances are shown
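The edge weight described here is the Hamming distance between two haplotype strings. The sketch below computes it for every pair in a small set. The three 8-site haplotypes are made up for illustration (only the first is implied by the homozygous genotype g1), and the function names are assumptions; the rule for which edges are drawn is not reproduced.

```python
from itertools import combinations


def hamming(h1: str, h2: str) -> int:
    """Number of SNP sites at which two equal-length haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return sum(a != b for a, b in zip(h1, h2))


def edge_weights(haplotypes):
    """Map each unordered pair of unique haplotypes to its edge weight."""
    unique = sorted(set(haplotypes))
    return {(a, b): hamming(a, b) for a, b in combinations(unique, 2)}


# Hypothetical haplotypes, not a solution reported in the talk.
for pair, weight in edge_weights(["11110001", "11010000", "00100101"]).items():
    print(pair, weight)
```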

Example

Solution found by:

Clark’s Subtraction Method [Mol. Biol. and Evol. 1990]

Pure Parsimony [Gusfield, CPM’03]

EM [Excoffier and Slatkin, Mol. Biol. Evol. 1995]

5 unique haplotypes

Haplotypes are not very similar to each other

Example

No Perfect Phylogeny solution

Solution found by HAP

6 unique haplotypes

Haplotypes are slightly more similar to each other

Example

Solution found by PHASE

9 unique haplotypes

Haplotypes are more similar to each other

Example

PHASE favors pair-wise similarities

Essentially evaluating a nearest-neighbor graph

SplittingHeirs

SplittingHeirs favors cluster-wide similarities, as well as reduced cardinality

Cast as a Mixed Integer Linear Program (MIP)

Minimize:

where d_i = the weight of edge i, h = the cardinality of the haplotype set, and u = a weighting factor

SplittingHeirs

Enforce cluster-wide similarities by requiring a minimum density of edges in the graph

Additional constraint:

where e = number of edges and a is a configurable parameter

a can be decreased for a highly diverse sample

a can be increased for a sample with low diversity

Example

Solution found by SplittingHeirs

8 unique haplotypes

Haplotypes are quite similar to each other

Results

Tested on 7 sets of haplotype data for which the true phase is known

n is the number of individuals

m is the number of sites

# Ambiguous is the number of genotypes that have more than one feasible solution

Conclusions

Introduced a biologically intuitive model that optimizes cluster-wide similarities and reduced cardinality

Globally optimal solutions can be computed for small regions, such as those in candidate locus studies

Future work:

Speed up computation

Use model to guide an approximation method

Image source: http://farm3.static.flickr.com/2268/2255581637_a59a956bfe.jpg

Acknowledgments

Olin Fellowship

NIH grants P50-GM065509, R01-GM087194A2, U01-GM063340

NSF grants IIS-053557, DBI-0743797

Alzheimer’s Association grant

Thanks to: Taylor Maxwell, Gerold Jaeger