Pure Parsimony

RECOMB SNPs Workshop/Jan 28, 2007 How Accurate is Pure Parsimony Haplotype Inferencing? Sharlee Climer Department of Computer Science and Engineering Department of Biology Washington University in Saint Louis [email protected] www.climer.us Joint work with Weixiong Zhang and Gerold Jaeger

Transcript of Pure Parsimony

Page 1: Pure Parsimony

RECOMB SNPs Workshop/Jan 28, 2007

How Accurate is Pure Parsimony Haplotype Inferencing?

Sharlee Climer

Department of Computer Science and Engineering

Department of Biology

Washington University in Saint Louis

[email protected] www.climer.us

Joint work with Weixiong Zhang and Gerold Jaeger

Page 2: Pure Parsimony

Pure Parsimony

• Pure Parsimony Haplotype Inferencing (PPHI)

– Find smallest set of unique haplotypes that can resolve a set of genotypes

• Suggested by Earl Hubbell in 2000

• Cast as an Integer Linear Program (IP) by Dan Gusfield [CPM’03]

• Great research interest
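
To make the PPHI objective concrete, here is a minimal brute-force sketch in Python. It is not the IP formulation cited above; it simply tries haplotype subsets in order of increasing size. The genotype encoding ('0' and '1' for homozygous sites, '2' for a heterozygous site) and the function names `resolving_pairs` and `pure_parsimony` are assumptions made for illustration, and the search is only practical for toy instances.

```python
from itertools import combinations, product

# Assumed encoding (not from the slides): a genotype is a string over
# '0', '1', '2', where '2' marks a heterozygous site.

def resolving_pairs(genotype):
    """Return every unordered haplotype pair that resolves one genotype."""
    het = [i for i, c in enumerate(genotype) if c == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"  # the two haplotypes differ here
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

def pure_parsimony(genotypes):
    """Brute force: a smallest haplotype set that resolves every genotype."""
    options = [resolving_pairs(g) for g in genotypes]
    candidates = sorted(set().union(*(p for opts in options for p in opts)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # Every genotype needs at least one resolving pair inside `chosen`.
            if all(any(p <= chosen for p in opts) for opts in options):
                return chosen
    return set()

print(sorted(pure_parsimony(["02", "20", "22"])))
```

In this toy instance the first two genotypes each have a single heterozygous site, so their haplotypes are forced, and the third genotype can reuse two of them.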

Page 3: Pure Parsimony

Overview

• Biological forces

• Haplotypes with low frequency

• Define haplotype classes

• Data sets

• Characteristics of real data

Page 4: Pure Parsimony

Biological forces

Page 5: Pure Parsimony

Biological forces

Page 6: Pure Parsimony

Biological forces

Page 7: Pure Parsimony

Biological forces

Page 8: Pure Parsimony

Biological forces

Page 9: Pure Parsimony

Biological forces

Page 10: Pure Parsimony

Biological forces

Page 11: Pure Parsimony

Biological forces

• Relatively few unique haplotypes

• Subset of haplotypes with low frequency

• Problems for PPHI

– Large number of optimal solutions

– True biological solution might not be parsimonious

• What are structural characteristics of optimal solutions?

Page 12: Pure Parsimony

Classes of haplotypes

• Set of possible haplotypes is exponentially large

• Partition similar to Traveling Salesman Problem

• Backbone haplotypes

– Appear in every optimal solution

• Fat haplotypes

– Do not appear in any optimal solution

• Fluid haplotypes

– Appear in some, but not all, optimal solutions
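
The three classes can be expressed with plain set operations once the optimal solutions are known. The sketch below assumes the optimal solutions have already been enumerated by some other means; the function name `classify_haplotypes` and the labels `h1` to `h5` are hypothetical.

```python
def classify_haplotypes(optimal_solutions, candidates):
    """Partition candidate haplotypes into backbone, fluid, and fat classes.

    optimal_solutions: non-empty list of haplotype sets, all of minimum size
    candidates: every haplotype that could take part in some resolution
    """
    solutions = [set(s) for s in optimal_solutions]
    backbone = set.intersection(*solutions)  # in every optimal solution
    used = set.union(*solutions)             # in at least one optimal solution
    return {
        "backbone": backbone,
        "fluid": used - backbone,            # in some, but not all
        "fat": set(candidates) - used,       # in none
    }

# Hypothetical example: two optimal solutions over five candidates.
classes = classify_haplotypes(
    [{"h1", "h2", "h3"}, {"h1", "h2", "h4"}],
    {"h1", "h2", "h3", "h4", "h5"},
)
print(classes)
```

Enumerating every optimal solution can be expensive, so this is only a way to read the definitions, not a practical procedure for large instances.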

Page 13: Pure Parsimony

Backbone haplotypes

• Implicit backbones

– All haplotypes that resolve unambiguous genotypes

• Explicit backbones

– Can identify by solving at most one IP for each haplotype in solution that isn’t implicit backbone
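
One way to read the two bullets above is sketched below, under several assumptions. Genotypes are strings over '0', '1', '2' with '2' marking a heterozygous site, a genotype with at most one heterozygous site is treated as unambiguous, and a brute-force `solve` function stands in for the IP. The explicit test forbids one haplotype at a time and re-solves: if the optimum grows or no solution remains, the haplotype is a backbone. Skipping haplotypes that are already missing from an alternative optimum is one possible interpretation of "at most one IP"; the slides do not spell it out. All function names are hypothetical.

```python
from itertools import combinations, product

def resolving_pairs(genotype):
    """All unordered haplotype pairs that resolve one genotype."""
    het = [i for i, c in enumerate(genotype) if c == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, "10"[int(b)]
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

def solve(genotypes, forbidden=frozenset()):
    """Brute-force stand-in for the IP: a smallest resolving set, or None."""
    options = [{p for p in resolving_pairs(g) if not (p & forbidden)}
               for g in genotypes]
    candidates = sorted(set().union(*(p for opts in options for p in opts)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(p <= chosen for p in opts) for opts in options):
                return chosen
    return None

def implicit_backbones(genotypes):
    """Haplotypes forced by genotypes with at most one heterozygous site."""
    found = set()
    for g in genotypes:
        if g.count("2") <= 1:
            found |= next(iter(resolving_pairs(g)))  # the only resolving pair
    return found

def backbones(genotypes):
    """Return (implicit, explicit) backbone haplotypes."""
    best = solve(genotypes)
    implicit = implicit_backbones(genotypes)
    undecided = best - implicit
    explicit = set()
    for h in sorted(undecided):
        if h not in undecided:
            continue  # already ruled out by an earlier alternative optimum
        alt = solve(genotypes, forbidden={h})
        if alt is None or len(alt) > len(best):
            explicit.add(h)   # no optimal solution avoids h
        else:
            undecided &= alt  # haplotypes absent from alt are not backbones
    return implicit, explicit

print(backbones(["00", "22"]))
print(backbones(["22"]))
```

In the first call the homozygous genotype forces one haplotype, and the ambiguous genotype can only reuse it by adding its complement, which makes that complement an explicit backbone. In the second call a single re-solve rules out both haplotypes of the initial solution.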

Page 14: Pure Parsimony

Backbone haplotypes

h3 h7 h15 h27 h39 h50 h55 h79 h91

bb bb bb bb

Page 15: Pure Parsimony

Backbone graph

Page 16: Pure Parsimony

Backbone graph

Page 17: Pure Parsimony

An optimal solution

Page 18: Pure Parsimony

Low frequency haplotype

Page 19: Pure Parsimony

Low frequency haplotype

Page 20: Pure Parsimony

Low frequency haplotype

Page 21: Pure Parsimony

Low frequency haplotype

Page 22: Pure Parsimony

Data sets

• 7 true haplotype data sets

– Orzack et al. [Genetics, 2003]

• 80 genotypes

• 9 sites

• ApoE

– Andres et al. [Genet. Epi., in press]

• 6 sets of complete data

• 39 genotypes

• 5 to 47 sites

• KLK13 and KLK14

Page 23: Pure Parsimony

Data sets

• HapMap data [Nature 2003, 2005]

– Phase unknown

– Random instance generator

– 20 unique genotypes

– 20 sites

– Three populations

• CEU

• YRI

• JPT+CHB

– 22 chromosomes

Page 24: Pure Parsimony

Size of haplotype backbone

Percentage of haplotypes that are backbones

[Chart. Vertical axis: 0 to 1.2. Horizontal axis: instances BF, HG, BV, and the HapMap instances ceu2 to ceu20, yri3 to yri21, and jpt+chb1 to jpt+chb22. Legend: Implicit backbones, hBB, Total.]

Page 25: Pure Parsimony

Number of fluid haplotypes in each solution

[Chart. Vertical axis: 0 to 20. Horizontal axis: instances, labeled 1 to 75.]

Page 26: Pure Parsimony

Number of optimal solutions

[Chart. Vertical axis: 1 to 1000, with ticks at 1, 10, 100, 1000. Horizontal axis: instances 1 to 76.]

Page 27: Pure Parsimony

Number of fluid haplotypes and solutions

[Chart with two vertical axes. One axis: number of fluid haplotypes required, 0 to 20. Other axis: number of solutions, 0 to 1200. Horizontal axis: instances 1 to 76. Legend: # fluid haplotypes, # of solutions.]

Page 28: Pure Parsimony

Biological correctness

Data set  # gen.  # sites  # BB hap.  # fluid hap.  # opt. sols.  Avg. distance to real
A         30      9        15         0             1             8
B         10      5        7          0             1             0
C         18      17       9          3             16            7.5
D         10      8        6          1             4             2.5
E         23      26       9          7             >1000         4.33
F         26      22       12         5             630           28.24
G         35      47       12         16            >1000         10.95

Page 29: Pure Parsimony

Biological correctness

Data set  Parsimony # of haplotypes  True # of haplotypes
A         15                         17
B         7                          7
C         12                         12
D         7                          7
E         16                         16
F         17                         18
G         28                         32
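
The counts on the two "Biological correctness" slides can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the slides. It assumes that "# fluid hap." is the number of fluid haplotypes in each optimal solution, so that backbone plus fluid counts should equal the parsimony size. The variable names are illustrative.

```python
# Counts copied from the two "Biological correctness" slides.
# Tuple order: (# BB hap., # fluid hap., parsimony # of haplotypes,
#               true # of haplotypes)
counts = {
    "A": (15, 0, 15, 17),
    "B": (7, 0, 7, 7),
    "C": (9, 3, 12, 12),
    "D": (6, 1, 7, 7),
    "E": (9, 7, 16, 16),
    "F": (12, 5, 17, 18),
    "G": (12, 16, 28, 32),
}

# Assumed reading: each optimal solution = all backbones + some fluid haplotypes.
consistent = all(bb + fluid == pars for bb, fluid, pars, _ in counts.values())

# How many haplotypes the parsimony solution falls short of the true set.
gap = {name: true - pars for name, (_, _, pars, true) in counts.items()}

print(consistent)
print(gap)
```

Under that reading the two slides agree with each other. The gap is nonzero only for data sets A, F, and G, where the true haplotype set is larger than the parsimonious one.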

Page 30: Pure Parsimony

Biological correctness

• Accuracy of backbone haplotypes

• Two data sets (F and G) had errors

– One parsimony backbone haplotype not in real solution

Page 31: Pure Parsimony

Number of solutions vs. number of genotypes

[Chart with two vertical axes. One axis: number of haplotypes, 0 to 18. Other axis: number of optimal solutions, 0 to 700. Legend: # of haplotypes, # of solutions.]

Page 32: Pure Parsimony

Conclusions

• Biological forces tend to minimize cardinality, but also create low frequency haplotypes

• Low frequency in unique genotypes might not be low frequency in full set

• Low frequency haplotypes

– Large number of optimal solutions

– True solution not necessarily parsimonious

– Combinatorial nature can lead to errors in backbones

• Parsimony combined with other biological clues