Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers...
![Page 1: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/1.jpg)
Haplotyping algorithms and structure of human variation
EECS 458 CWRU
Fall 2004
Readings: see papers on the course website
![Page 2: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/2.jpg)
Roadmap
• Definition: haplotype and haplotype inference
• Why infer haplotypes
• Infer haplotypes from pedigree data
– Most probable haplotype configurations
– Haplotype configurations with minimum recombinations
• Infer haplotypes from population data
– Combinatorial: Clark’s, Perfect Phylogeny
– Statistical methods: EM, Bayesian (MCMC)
• Infer haplotypes from pooled samples
• Haplotype block partition
• Tag SNP selection
![Page 3: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/3.jpg)
Genotype and Haplotype
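[Figure: a pair of homologous chromosomes labeled Paternal and Maternal, with sequences ...AATGCCGCAA...GTC... and ...AGTGCCGCAA...TAC...; the genotype at each of the two highlighted markers is {1 2}.]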
![Page 4: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/4.jpg)
Typical Genotype Data
• Two alleles for each individual
– Chromosome origin for each allele is unknown
• Multiple haplotype pairs can fit observed genotype
• Molecular haplotyping is expensive

Observation:
Marker1: A C
Marker2: G A
Marker3: T C

Possible haplotypes (one haplotype per column):
Pair 1    Pair 2    Pair 3    Pair 4
A C       A C       A C       A C
G A       G A       A G       A G
T C       C T       T C       C T
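The four haplotype pairs above can be reproduced mechanically. The following is a minimal sketch, assuming a genotype is given as a list of unordered allele pairs; the function name `haplotype_pairs` and the input encoding are illustrative choices, not part of the lecture.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele, allele) tuples, one per marker.
    """
    # Heterozygous markers are the only ones whose phase is ambiguous.
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    # Keep the first heterozygous marker fixed so that each unordered
    # pair is listed once rather than twice (h1/h2 and h2/h1).
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

observed = [("A", "C"), ("G", "A"), ("T", "C")]
pairs = haplotype_pairs(observed)
for h1, h2 in pairs:
    print(h1, "/", h2)
```

With m' heterozygous markers this lists 2^(m'-1) unordered pairs, which is why the number of candidates grows so quickly with the number of markers.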
![Page 5: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/5.jpg)
Haplotypes are important!
• Phase may determine phenotype
• Phase helps exploit linkage disequilibrium
– Infer state of neighboring alleles
• Phase clarifies identity-by-descent status
![Page 6: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/6.jpg)
Common Uses of Haplotypes
• Linkage disequilibrium studies
– Summarize genetic variation
• Selecting markers to genotype
– Identify haplotype tag SNPs
• Candidate gene association studies
– Test haplotype associations
– Help interpret single marker associations
• Understanding evolution of human populations
![Page 7: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/7.jpg)
The problem…
• Haplotypes are hard to measure directly
– X-chromosome in males
– Sperm typing
– Other molecular techniques
• Often, statistical or combinatorial methods are required for reconstruction
![Page 8: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/8.jpg)
Haplotype Inference on population data
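m = 6 markers, m’ = 4 heterozygous markers
Genotype: {1 2} {1 1} {1 2} {1 2} {1 2} {2 2}
Candidate phasings (2^m’ in total):
1|2 1|1 1|2 1|2 1|2 2|2
2|1 1|1 1|2 1|2 1|2 2|2
1|2 1|1 2|1 1|2 1|2 2|2
2|1 1|1 2|1 1|2 1|2 2|2
2|1 1|1 2|1 2|1 2|1 2|2
……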
![Page 9: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/9.jpg)
Information on Relatives
• Number of ambiguous individuals increases rapidly with number of markers
• Family information can help, but many ambiguities remain
![Page 10: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/10.jpg)
Haplotype Inference on Pedigrees, Mendelian Law
{1 2} {1 1}
{1 1} {1 2} {2 2}
2|1 1|1
1|1 2|1 2 2
{1 2} {1 2}
{1 2} {1 2} {1 2}
{1 2} {1 *}
{1 1} {1 2} {1 2}
![Page 11: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/11.jpg)
Haplotype inference on pooled samples
• The input contains n pools
• Each pool contains k individuals, thus 2k haplotypes, typed at m markers
• At each marker, we are given the allele count (between 0 and 2k) over the k individuals of each pool
• The goal is to find the haplotype frequencies
• Example: n=3, k=2, m=5 (one row per pool, one column per marker)
2 4 3 2 2
0 2 3 1 2
1 2 2 2 3
![Page 12: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/12.jpg)
Haplotyping pedigree data: statistical formulation
• Statistical formulation: find the most probable haplotype configuration
• Need to calculate the probability of the pedigree for every haplotype configuration
• Recall that for linkage analysis we need to calculate the probability of a pedigree, which sums over all possible haplotype configurations
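In symbols, and only as a sketch of the contrast drawn above (the notation $G$ for the observed genotypes and $H$ for a haplotype configuration is assumed here, not taken from the slides): haplotyping keeps the single largest term of the sum that linkage analysis computes.

$$\hat{H} = \arg\max_{H} P(H \mid G) = \arg\max_{H} P(H, G), \qquad P(G) = \sum_{H} P(H, G)$$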
![Page 13: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/13.jpg)
Haplotyping pedigree data: statistical formulation
• Thus linkage programs such as Genehunter, Allegro and Merlin can compute the most probable haplotypes
• But it is time consuming…
• In addition to exact computation, there are some approximation algorithms, mainly based on importance sampling, e.g. SimWalk.
• Still very time consuming; may consider many configurations with very small probabilities
![Page 14: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/14.jpg)
Recombination and combinatorial formulation
[Figure: a four-member pedigree with two-locus genotypes ({1 2}{1 2}, {1 1}{1 2}, {1 2}{1 2}, {1 2}{1 2}), shown unphased, partially phased, and under two alternative full phasings that differ in one member (1|2 1|2 versus 1|2 2|1).]
![Page 15: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/15.jpg)
MRHC Problem
Find a minimum recombinant haplotype configuration from a given pedigree with genotype data.
Assumptions:
• Mendelian law (no mutations);
• Recombination events are rare.
Well supported by real data.

Input (genotypes of the pedigree members):
{1 2}{1 2}{1 2} …
{1 2}{1 2}{2 2} …
{1 1}{1 2}{2 2} ...
{1 2}{1 2}{1 2} ...
{1 2}{2 2}{2 2} …
{1 1}{1 2}{2 2} …
{1 2}{1 2}{1 2} ...
{1 1}{1 2}{2 2} ...
![Page 16: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/16.jpg)
MRHC Problem (cont’d)
• PS: parental source of the two alleles at the locus (i.e. phase)
• GS: grandparental source of an allele
• Haplotype configuration = assignment of PS and GS values.

Output (phased haplotypes of the pedigree members):
1|2 1|2 2|1 …
1|2 2|1 2|2 …
1|1 1|2 2|2 ...
1|2 1|2 2|1 ...
1|2 2|2 2|2 …
1|1 2|1 2|2 …
1|2 1|2 2|1 ...
1|1 1|2 2|2 ...
[Figure annotations: PS=0, PS=1, GS2=1, GS2=0; two members are labeled A and B.]
![Page 17: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/17.jpg)
Previous Results
• Genotype elimination (O’Connell’00).
– For data requiring no recombinant, exhaustive elimination.
• Genetic algorithm (Tapadar et al.’00).
– Time consuming.
• MRH (Qian & Beckmann’02).
– Six step rule-based algorithm.
– Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers.
![Page 18: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/18.jpg)
Thm. MRHC is NP-Hard.
Idea: Reduction from a variant of set cover.
First complexity result.
Remains hard for two loci.
Remains hard when no loops.
Li & Jiang’03, Doi, Li & Jiang’03
![Page 19: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/19.jpg)
Block-Extension Algorithm
Iterative, heuristic, five steps. Rules are derived from Mendelian law, MR principle, block concept and some greedy ideas based on the following observations:
• Block structures are common in haplotypes.
• Double recombination events are rare.
• Common haplotype blocks shared in siblings.
• …
Advantages/Disadvantages: time complexity (BE: O(dmn) / MRH: O(2^d m^3 n^2))
Li & Jiang’03
![Page 20: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/20.jpg)
Block-Extension Algorithm
[Figure: block-extension example on a six-member pedigree (members 1-6) typed at four loci with alleles 1-4; unknown alleles are shown as * and are filled in over three successive panels.]
![Page 21: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/21.jpg)
Block-Extension Algorithm
[Figure: the same six-member example continued over two panels; resolved loci are written as phased pairs such as 1|2 or 2|3, each followed by a pair of values such as (-1,0) or (1,-1).]
![Page 22: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/22.jpg)
Dynamic Programming Algorithms
• Locus-based dynamic programming algorithm
– Linear time in the number of the members
– Applicable to only tree pedigrees
• Member-based dynamic programming algorithm
– Linear time in the number of the loci
– Applicable to general pedigrees with small sizes
Doi, Li & Jiang’03
![Page 23: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/23.jpg)
Locus-Based Dynamic Programming
[Figure: a tree pedigree with members 1-8, drawn once as a pedigree and once as a rooted tree (node labeled "root").]
![Page 24: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/24.jpg)
Constraint-Finding Algorithm
• Assumptions:
– No missing alleles, no errors.
– Zero recombinants.
• Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degrees of freedom in the PS/GS assignment.
Li & Jiang’03
![Page 25: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/25.jpg)
Four Levels of Constraints
[Figure: the phased pedigree (1|2 1|2 2|1 …, 1|2 2|1 2|2 …, 1|1 1|2 2|2 ..., …) with PS=0 and GS2=1 annotated and two members labeled A and B.]
Based on Mendelian law (on a single locus):
• Level 1: GS constraint
• Level 2: PS constraint
Based on 0-recombinant (for a pair of loci):
• Level 3: Haplotype constraint
• Level 4: Grouping constraint
![Page 26: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/26.jpg)
Level 3 and Level 4 Constraints
[Figure: a six-member pedigree (members 1-6) with two-locus genotypes built from {1 2} and {1 1}, followed by candidate phasings (1 2 or 2 1) for members 4, 5 and 6.]
![Page 27: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/27.jpg)
Level 3 and Level 4 Constraints
The variables represent PS values and the equations are over Z2 (arithmetic modulo 2).
![Page 28: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/28.jpg)
Analysis of Constraint-Finding Algorithm
Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa.
• Steps:
– find all constraints, in the form of linear equations over Z2
– solve the equations by Gaussian elimination
– enumerate all feasible haplotype configurations
• Exact polynomial time (O(n^3 m^3); genotype elimination: exponential)
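The middle step, Gaussian elimination over Z2, can be sketched as follows. This is a generic solver, not the constraint-finding algorithm itself: the function name `solve_gf2` and the example equations are assumptions for illustration, and deriving the actual PS constraints from a pedigree is not shown.

```python
def solve_gf2(rows, rhs):
    """Gaussian elimination over Z2 (addition is XOR).

    rows: list of equations, each a list of 0/1 coefficients.
    rhs:  list of 0/1 right-hand sides.
    Returns (particular_solution, free_columns), or None if inconsistent.
    """
    n_vars = len(rows[0]) if rows else 0
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]
    pivots = []  # pivot column of each reduced row, in order
    r = 0
    for c in range(n_vars):
        # Find a row at or below r with a 1 in column c.
        p = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if p is None:
            continue  # no pivot: column c is a free variable
        aug[r], aug[p] = aug[p], aug[r]
        # Clear column c from every other row.
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivots.append(c)
        r += 1
    # Remaining rows have all-zero coefficients; a right-hand side of 1
    # there means 0 = 1, i.e. no feasible assignment.
    if any(row[-1] for row in aug[r:]):
        return None
    solution = [0] * n_vars  # free variables set to 0
    for i, c in enumerate(pivots):
        solution[c] = aug[i][-1]
    free = [c for c in range(n_vars) if c not in pivots]
    return solution, free

# Hypothetical constraints: x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 (mod 2)
result = solve_gf2([[1, 1, 0], [0, 1, 1], [1, 0, 1]], [1, 0, 1])
print(result)
```

Each free variable doubles the number of solutions, so enumerating all feasible configurations amounts to trying both values of every free column.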
![Page 29: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/29.jpg)
Integer Linear Programming
• Combines missing data imputation and haplotype inference.
• Regardless of the pedigree structure and the number of recombinants, the number of variables is linear in the problem size.
• Implicitly checks the Mendelian consistency of pedigree genotype data with missing alleles, which is itself an NP-complete problem.
• Could find all possible optimal solutions.
• Solved by a branch-and-bound algorithm.
• Effective for practical size problems in terms of time efficiency.
• Accurate in terms of missing allele imputation and haplotype inference.
Li & Jiang’04a
![Page 30: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/30.jpg)
ILP for MRHC with Missing Data
1. Define variables.
2. Define linear constraints.
3. Define a linear objective function of the variables.
4. Preprocess constraints.
5. Apply branch-and-bound strategy to find solutions (a partial order relationship and some other special relationships).
6. Estimate bounds.
7. Apply a maximum likelihood approach to multiple optimal solutions.
![Page 31: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/31.jpg)
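Formulation
$M_j := \{m_k\}$ is the set of all possible alleles at marker locus $j$, and $t_j = |M_j|$. Example: $M_1 = \{1, 2\}$, $M_2 = \{1, 2\}$.

[Figure: pedigree with members 1-4 and two-locus genotypes {1 2}{1 2}, {1 1}{1 2}, {1 2}{1 0}, {1 0}{1 2}; 0 denotes a missing allele.]

Define $t_j$ f vars for each paternal allele and $t_j$ m vars for each maternal allele at locus $j$ of individual $i$:
$f^j_{i,k},\ m^j_{i,k} \quad (1 \le k \le t_j)$
$f^j_{i,k} = 1$ iff the paternal allele is $m_k$.

Individual 4:
$f^1_{4,1},\ f^1_{4,2},\ m^1_{4,1},\ m^1_{4,2}$
$f^2_{4,1},\ f^2_{4,2},\ m^2_{4,1},\ m^2_{4,2}$
$f^1_{4,1} + f^1_{4,2} = 1$
…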
![Page 32: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/32.jpg)
Formulation: Variables
Define 2 g vars, one for the paternal allele and one for the maternal allele, at locus $j$ of individual $i$:
$g^j_{i,1},\ g^j_{i,2}$
Var g1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g2 defined similarly.
Define r vars:
$r^j_{i,1},\ r^j_{i,2} \quad (1 \le j \le m-1)$
$r^j_{i,1} = 1$ iff $g^j_{i,1} \ne g^{j+1}_{i,1}$
![Page 33: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/33.jpg)
Formulation: Objective Function
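Objective function:
$\min \sum_{i \in \text{Non-Founders}} \sum_{j=1}^{m-1} \left( r^j_{i,1} + r^j_{i,2} \right)$

Subject to genotype constraints (one case per observed genotype of individual $i$ at locus $j$):
$\{0, 0\}$: $\sum_{k=1}^{t_j} f^j_{i,k} = 1,\ \sum_{k=1}^{t_j} m^j_{i,k} = 1$
$\{m^j_r, 0\}$: $\sum_{k=1}^{t_j} f^j_{i,k} = 1,\ \sum_{k=1}^{t_j} m^j_{i,k} = 1,\ f^j_{i,r} + m^j_{i,r} \ge 1$
$\{m^j_r, m^j_r\}$: $f^j_{i,r} = m^j_{i,r} = 1$
$\{m^j_r, m^j_s\}$: $f^j_{i,r} + f^j_{i,s} = m^j_{i,r} + m^j_{i,s} = f^j_{i,r} + m^j_{i,r} = f^j_{i,s} + m^j_{i,s} = 1$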
![Page 34: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/34.jpg)
Formulation: Constraints
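Mendelian law of inheritance constraints (a child $i$ and its father $f$):
$f^j_{i,k} - f^j_{f,k} - g^j_{i,1} \le 0$
$f^j_{i,k} - m^j_{f,k} + g^j_{i,1} \le 1$

Constraints for r vars ($l = 1, 2$):
$g^j_{i,l} - g^{j+1}_{i,l} - r^j_{i,l} \le 0$
$g^{j+1}_{i,l} - g^j_{i,l} - r^j_{i,l} \le 0$
$g^j_{i,l} + g^{j+1}_{i,l} + r^j_{i,l} \le 2$
$r^j_{i,l} - g^j_{i,l} - g^{j+1}_{i,l} \le 0$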
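The r variables are meant to equal 1 exactly when the g values at adjacent loci differ. A standard way to express such an exclusive-or with linear inequalities uses four constraints with right-hand sides 0, 0, 2 and 0. The sketch below assumes that textbook linearization (it may be ordered or written differently on the slide) and checks by brute force that it leaves a single feasible r for every pair of g values; `satisfies` is a hypothetical helper name.

```python
from itertools import product

def satisfies(g_j, g_next, r):
    """Check the four linear constraints tying r to (g_j, g_next)."""
    return (g_j - g_next - r <= 0 and      # r >= g_j - g_next
            g_next - g_j - r <= 0 and      # r >= g_next - g_j
            g_j + g_next + r <= 2 and      # r <= 2 - g_j - g_next
            r - g_j - g_next <= 0)         # r <= g_j + g_next

# For every binary (g_j, g_next), list the r values that remain feasible.
feasible = {(a, b): [r for r in (0, 1) if satisfies(a, b, r)]
            for a, b in product((0, 1), repeat=2)}
for key, values in feasible.items():
    print(key, values)
```

Because the objective minimizes the sum of the r variables, the two lower-bound constraints alone would already force the right value at an optimum; the upper bounds make r an exact recombination indicator in every feasible solution.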
![Page 35: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/35.jpg)
A Partial Order Relationship
Denote:
0 1
1
y
yy
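Inequalities with 2 variables:
$y_i \le y_j$
[Figure: a directed graph on nodes 1-11 representing the two-variable inequalities, and a reduced graph on nodes 1, 2, 3’, 8, 9, 10, 11.]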
![Page 36: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/36.jpg)
Forced Variables
• Rule 1:
• Rule 2:
• Rule 3:
ncyInconsiste, 10 Syy
1)()(
0)()(1
1
jjiji
ijiji
yyyyy
yyyyy
01 iii yyy
![Page 37: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/37.jpg)
Lower and Upper Bounds
• Lower bounds
– Linear relaxation.
– Summation of the number of recombinants in each nuclear family.
– Effective for data with large number of recombinants.
• Upper bound
– Obtained by block-extension algorithm.
– Effective for data with small number of recombinants.
![Page 38: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/38.jpg)
Statistical Assessment
• E-M algorithm to estimate haplotype frequencies for data that consist of multiple pedigrees.
N
hhf i
i 2
)()(ˆ
11
$P(G, H \mid \hat{f}) = \prod_{\text{founder } i} \hat{f}(h_{i1})\, \hat{f}(h_{i2}) \prod_{\text{non-founder } i} P(h_i \mid h_{f(i)}, h_{m(i)})$
![Page 39: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/39.jpg)
PedPhase software
• Simulated data were generated to compare our algorithms, as well as MRH, in terms of efficiency and accuracy.
• Three different pedigree structures.
• Multiallelic and biallelic data.
• Numbers of loci: 10, 25 and 50.
• Number of recombinants: 0-4.
• 100 runs per data set.
![Page 40: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/40.jpg)
Pedigree Structures
![Page 41: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/41.jpg)
Accuracy Results of BE Algorithm
![Page 42: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/42.jpg)
Efficiency Results
![Page 43: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/43.jpg)
More Results from ILP
![Page 44: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/44.jpg)
Real Data Analysis
Data set (Gabriel et al.’02): 93 members in 12 pedigrees (each with 7-8 members); chromosome 3, 4 regions, each region with 1-4 blocks.
![Page 45: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/45.jpg)
Common Haplotypes & Frequencies
![Page 46: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/46.jpg)
Results From ILP on the Whole Dataset
3.82 4.00 0.45 0.034
![Page 47: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/47.jpg)
What if there are no relatives?
• Rely on linkage disequilibrium
• Assume that population consists of small number of distinct haplotypes
• Haplotypes tend to be similar
![Page 48: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/48.jpg)
Clark’s Haplotyping Algorithm
• Clark (1990) Mol Biol Evol 7:111-122
• One of the first haplotyping algorithms
– Computationally efficient
– Very fast
• Today, more accurate alternatives are often available
![Page 49: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/49.jpg)
Clark’s Haplotyping Algorithm
• Find homozygous individuals
– Initialize a list of known haplotypes
• Resolve ambiguous individuals
– If possible, use two haplotypes from list
– Otherwise, use one known haplotype and augment list
• If unphased individuals remain
– Assign phase randomly to one individual
– Augment haplotype list and continue from previous step
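The first two steps above can be sketched in a few lines. This is a simplified, deterministic illustration rather than Clark's published procedure: genotypes are assumed to be encoded per site as 0 or 1 (homozygous) and 2 (heterozygous), individuals with at most one heterozygous site are treated as unambiguous, and the final random-phase step is left out, so individuals that cannot be resolved are simply reported. The names `complement` and `clark` are illustrative.

```python
def complement(genotype, hap):
    """Haplotype that pairs with hap to give genotype, or None if none."""
    other = []
    for g, h in zip(genotype, hap):
        if g == 2:
            other.append(1 - h)      # heterozygous: partner carries the other allele
        elif g == h:
            other.append(g)          # homozygous and compatible
        else:
            return None              # hap contradicts a homozygous site
    return tuple(other)

def clark(genotypes):
    known, phased = [], {}
    # Step 1: unambiguous individuals seed the list of known haplotypes.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 2) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = complement(g, h1)
            phased[idx] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: resolve ambiguous individuals until no more progress is made.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in phased:
                continue
            cands = [(h, complement(g, h)) for h in known]
            cands = [(h, o) for h, o in cands if o is not None]
            if not cands:
                continue
            # Prefer a resolution that uses two haplotypes already on the list.
            both_known = [c for c in cands if c[1] in known]
            h, other = (both_known or cands)[0]
            phased[idx] = (h, other)
            if other not in known:
                known.append(other)
            progress = True
    unresolved = [i for i in range(len(genotypes)) if i not in phased]
    return phased, unresolved

phased, unresolved = clark([(0, 1, 0), (2, 2, 0), (2, 1, 2)])
print(phased, unresolved)
```

The result depends on the order in which individuals and known haplotypes are visited, which is one reason more accurate alternatives are often preferred today.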
![Page 50: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/50.jpg)
Haplotyping via Perfect Phylogeny - Model,
Algorithms, Empirical studies
Dan Gusfield, Ren Hua Chung
U.C. Davis
Cocoon 2003
![Page 51: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/51.jpg)
The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a rooted tree over sites 1-5 with the ancestral haplotype 00000 at the root, site mutations 1-5 on the edges, and the extant haplotypes 10100, 10000, 01011, 00010 and 01010 at the leaves.]
![Page 52: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/52.jpg)
The Perfect Phylogeny Model
We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
![Page 53: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/53.jpg)
Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny.
A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: the merge of 0 1 and 1 0 explains 2 2.

Genotype matrix S (rows: genotypes a, b, c; columns: sites 1, 2):
1 2
a 2 2
b 0 2
c 1 0
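The merge operation can be written directly from this definition. The sketch below assumes the 0/1/2 encoding used in the genotype matrix (2 marks a heterozygous site); the function names `merge` and `explains` are illustrative.

```python
def merge(h1, h2):
    """Merge two binary haplotypes into a genotype over {0, 1, 2}."""
    # Equal alleles keep their value; differing alleles become 2.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) explains the genotype."""
    return merge(h1, h2) == tuple(genotype)

print(merge((0, 1), (1, 0)))             # genotype a
print(explains((0, 0), (0, 1), (0, 2)))  # genotype b
print(explains((1, 0), (1, 0), (1, 0)))  # genotype c
```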
![Page 54: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/54.jpg)
The PPH Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny
1 2
a 2 2
b 0 2
c 1 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
![Page 55: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/55.jpg)
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny
1 2
a 2 2
b 0 2
c 1 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
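[Figure: the perfect phylogeny for this explanation. The root is 00; sites 1 and 2 mutate on separate edges; the leaves carry 10 (c, c, a), 01 (a, b) and 00 (b).]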
![Page 56: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/56.jpg)
The Alternative Explanation
1 2
a 2 2
b 0 2
c 1 0
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
![Page 57: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/57.jpg)
Efficient Solutions to the PPH problem - n genotypes, m sites
• Reduction to a graph realization problem (GPPH) - build on Bixby-Wagner or Fujishige solution to graph realization - O(nm alpha(nm)) time.
• Reduction to graph realization - build on Tutte’s graph realization method - O(nm^2) time.
• Direct, from scratch combinatorial approach - O(nm^2) time, Bafna et al.
• Berkeley (EHK) approach - specialize the Tutte solution to the PPH problem - O(nm^2) time.
![Page 58: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/58.jpg)
The DPPH Method
• Bafna et al. O(nm^2) time
• Based on deeper combinatorial observations about the PPH problem.
• A matrix-centric approach (rather than tree-centric), although a graph is used in the algorithm.
First, we need to understand why some sets of haplotypes have a perfect phylogeny, and some do not.
![Page 59: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/59.jpg)
When does a set of haplotypes fit a perfect phylogeny?
Arrange the haplotypes in a matrix, two haplotypes for each individual. Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all three pairs:
0,1 and 1,0 and 1,1
This is the 3-Gamete Test
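The test is easy to state in code. The sketch below checks every pair of columns of a binary haplotype matrix for the three forbidden pairs; the function name is illustrative, and the two inputs are the haplotype sets from the tree explanation and the alternative explanation shown on the earlier slides.

```python
from itertools import combinations

def passes_three_gamete_test(haplotypes):
    """True if no two columns contain all of 0,1 and 1,0 and 1,1."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

# Rows are the two haplotypes of a, then of b, then of c.
tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_three_gamete_test(tree_explanation))
print(passes_three_gamete_test(alternative))
```

The pair 0,0 is deliberately not part of the check: with an all-0 root, the ancestral combination is always available.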
![Page 60: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/60.jpg)
The Alternative Explanation
1 2
a 2 2
b 0 2
c 1 0
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
![Page 61: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/61.jpg)
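The Tree Explanation Again
1 2
a 2 2
b 0 2
c 1 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
[Figure: the perfect phylogeny for this explanation. The root is 0 0; sites 1 and 2 mutate on separate edges; the leaves are labeled c, c, a (under site 1), a, b (under site 2, haplotype 0 1) and b (haplotype 0 0).]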
![Page 62: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/62.jpg)
PPH: The Combinatorial Problem
Input: A ternary matrix (0,1,2) M with 2N rows partitioned into N pairs of rows, where the two rows in each pair are identical.
Def: If a pair of rows (r,r’) in the partition have entry values of 2 in a column j then positions (r,j) and (r’,j) are called Mates.
![Page 63: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/63.jpg)
Output: A binary matrix M’ created from M by replacing each 2 in M with either 0 or 1, such that
a) A position is assigned 0 if and only if its Mate is assigned 1.
b) M’ passes the 3-Gamete Test, i.e., does not contain a 3x2 submatrix (after row and column permutations) with all three combinations 0,1; 1,0; and 1,1
![Page 64: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/64.jpg)
Initial Observations
If two columns of M contain one pair of mates 2 0 / 2 0 and another pair of mates 0 2 / 0 2, then M’ will contain a row with 1 0 and a row with 0 1 in those columns. This is a forced expansion.
![Page 65: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/65.jpg)
Initial Observations
Similarly, if two columns of M contain the mates 2 1 / 2 1, then M’ will contain a row with 1 1 and a row with 0 1 in those columns.
This is a forced expansion.
![Page 66: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/66.jpg)
If a forced expansion of two columns creates the rows 0 1 and 1 0 in those columns, then any mate pair 2 2 / 2 2 in those columns must be set to 0 1 / 1 0. We say that the two columns are forced out-of-phase.
If a forced expansion of two columns creates the row 1 1 in those columns, then any mate pair 2 2 / 2 2 in those columns must be set to 1 1 / 0 0. We say that the two columns are forced in-phase.
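The forced-phase rules for a pair of columns can be sketched in a few lines. This is an illustration under stated assumptions: M is given with one row per genotype (the duplicated mate row is implied), and the function name and the "conflict" return value are hypothetical labels, not part of the slides.

```python
def forced_phase(m, i, j):
    """Classify columns i and j of ternary matrix m as 'in', 'out', or None.

    Collects every row that any expansion must contain in columns i and j,
    skipping rows with 2 in both columns (those are the ones still to decide).
    """
    created = set()
    for row in m:
        a, b = row[i], row[j]
        if a == 2 and b == 2:
            continue
        # A single 2 expands to both 0 and 1; a 0 or 1 is copied as is.
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                created.add((x, y))
    has_11 = (1, 1) in created
    has_01_and_10 = {(0, 1), (1, 0)} <= created
    if has_11 and has_01_and_10:
        return "conflict"  # the 3-Gamete Test already fails
    if has_11:
        return "in"
    if has_01_and_10:
        return "out"
    return None
```

With rows a = 1 2 2, b = 2 0 2 and e = 2 2 0, the function classifies columns 1 and 2 as in-phase and columns 2 and 3 as out-of-phase (columns are 0-indexed in the code).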
![Page 67: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/67.jpg)
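Example:

| Row | 1 | 2 | 3 |
|---|---|---|---|
| a | 1 | 2 | 2 |
| a | 1 | 2 | 2 |
| b | 2 | 0 | 2 |
| b | 2 | 0 | 2 |
| c | 1 | 2 | 2 |
| c | 1 | 2 | 2 |
| d | 1 | 2 | 2 |
| d | 1 | 2 | 2 |
| e | 2 | 2 | 0 |
| e | 2 | 2 | 0 |

Forced expansions:
Columns 1, 2: rows a, a become 1 0 and 1 1; rows b, b become 0 0 and 1 0.
Columns 1, 3: rows a, a become 1 0 and 1 1; rows e, e become 1 0 and 0 0.
Columns 2, 3: rows b, b become 0 0 and 0 1; rows e, e become 1 0 and 0 0.

Columns 1 and 2, and 1 and 3, are forced in-phase. Columns 2 and 3 are forced out-of-phase.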
![Page 68: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/68.jpg)
Overview of Bafna et al. algorithm
First, represent the forced phase relationships, and the needed decisions, in a graph G.
![Page 69: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/69.jpg)
Graph G (figure: nodes 1 to 7)
Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns.
The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
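The construction of the graph can be sketched as follows. This is a minimal illustration, assuming M is a list of ternary rows and columns are numbered from 0; the function name is hypothetical.

```python
from itertools import combinations

def build_column_graph(m):
    """Return the edge set of the graph: one node per column of m,
    and an edge (i, j) with i < j whenever some row has 2 in both columns."""
    edges = set()
    for row in m:
        twos = [j for j, value in enumerate(row) if value == 2]
        edges.update(combinations(twos, 2))
    return edges
```

Each edge can then be tested for a forced in-phase or out-of-phase relationship by comparing the two columns pairwise.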
![Page 70: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/70.jpg)
Graph Gc (figure: nodes 1 to 7)
Each red edge indicates that the columns are forced in-phase.
Each blue edge indicates that the columns are forced out-of-phase.
Let Gf be the subgraph of Gc defined by the red and blue edges.
![Page 71: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/71.jpg)
Graph Gf has three connected components. (Figure: nodes 1 to 7.)
![Page 72: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/72.jpg)
The Central Theorem
There is a solution to the PPH problem for M if and only if there is a coloring of the dashed edges of Gc (the edges that are not forced) with the following property:
For any triangle (i,j,k) in Gc where there is one row containing 2’s in all three columns i, j and k (any triangle containing at least one dashed edge will be of this type), the coloring makes either 0 or 2 of the edges blue (out-of-phase).
Nice, but how do we find such a coloring?
![Page 73: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/73.jpg)
Triangle Rule
Theorem 1: If there are any dashed edges whose ends are in the same connected component of Gf, at least one such edge is in a triangle where the other edges are not dashed, and in every PPH solution it must be colored so that the triangle has an even number of blue (out-of-phase) edges. This is an “inferred” coloring.
(Figure: graph Gf, nodes 1 to 7.)
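The triangle rule suggests a simple propagation loop: whenever an uncolored edge closes a triangle whose other two edges are already colored, its color is determined by parity. The sketch below is an illustration only, not the published algorithm; the edge representation (frozensets of two node labels) and the function name are assumptions.

```python
def infer_colors(edges, forced):
    """Propagate inferred colorings with the triangle rule.

    edges  : set of frozenset({u, v}) for all edges of the graph
    forced : dict mapping some edges to 'red' (in-phase) or 'blue' (out-of-phase)
    Returns a new dict that also contains every color that could be inferred.
    """
    colors = dict(forced)
    nodes = set().union(*edges) if edges else set()
    changed = True
    while changed:
        changed = False
        for e in edges:
            if e in colors:
                continue
            u, v = tuple(e)
            for w in nodes - e:
                e1, e2 = frozenset((u, w)), frozenset((v, w))
                if e1 in colors and e2 in colors:
                    blues = (colors[e1] == "blue") + (colors[e2] == "blue")
                    # Keep the number of blue edges in the triangle even.
                    colors[e] = "blue" if blues == 1 else "red"
                    changed = True
                    break
    return colors
```

Edges whose ends lie in different connected components of Gf are left uncolored by this loop; those are the choices that remain free.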
![Page 74: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/74.jpg)
![Page 75: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/75.jpg)
![Page 76: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/76.jpg)
![Page 77: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/77.jpg)
Corollary
Inside any connected component of Gf, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pairwise column comparisons, or by triangle-based inferred colorings.
Hence, the phase relationships of all the columns in a connected component of Gf are INVARIANT over all the solutions to the PPH problem.
![Page 78: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/78.jpg)
Comparing the programs - R.H. Chung
• All three are fast and practical (under one second) on problem instances of size 50 x 30.
• DPPH is the fastest, followed by HPPH and GPPH.
• HPPH encounters memory problems with large input.
![Page 79: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/79.jpg)
| Sites | Individuals | GPPH | DPPH | HPPH |
|---|---|---|---|---|
| 30 | 50 | 0.65 | 0.0206 | 0.0215 |
| 300 | 150 | 9.3 | 3.0 | 4.49 |
| 500 | 250 | 36 | 11.5 | 21.5 |
| 2000 | 1000 | 2331 | 640 | 1866 |

Times shown are in seconds on an 800 MHz machine.
![Page 80: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/80.jpg)
A Phase-Transition
Problem: as the ratio of sites to genotypes changes, how does the probability that the PPH solution is unique change?
For greatest utility, we want genotype data where the PPH solution is unique.
Intuitively, as the ratio of genotypes to sites increases, the probability of uniqueness increases.
![Page 81: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/81.jpg)
Extension
• With recombination
• The papers: See wwwcsif.cs.ucdavis.edu/~gusfield
![Page 82: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/82.jpg)
The E-M Haplotyping Algorithm
• Excoffier and Slatkin (1995) Mol Biol Evol 12:921-927
• Provide a clear outline of how the algorithm can be applied to genetic data
• Combination of two strategies
– E-M statistical algorithm for missing data
– Counting algorithm for allele frequencies
![Page 83: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/83.jpg)
E-M Algorithm For Haplotyping
1. “Guesstimate” haplotype frequencies
2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes
3. Estimate frequency of each haplotype by counting
4. Repeat steps 2 and 3 until frequencies are stable
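The four steps above can be sketched as a short program. This is a minimal illustration, assuming biallelic markers coded 0/1 (homozygous) and 2 (heterozygous), a uniform starting guess, and random mating; the function names, the tolerance and the iteration cap are assumptions, not taken from Excoffier and Slatkin.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(g):
    """All unordered haplotype pairs consistent with genotype g."""
    het = [i for i, v in enumerate(g) if v == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, max_iter=100, tol=1e-9):
    expansions = [compatible_pairs(g) for g in genotypes]
    haps = {h for pairs in expansions for pair in pairs for h in pair}
    # Step 1: initial guess, here uniform over all candidate haplotypes.
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(max_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # Step 2: fractional count of each phased genotype, proportional
            # to the product of its haplotype frequencies (the factor 2 for
            # heterozygous pairs is shared by all pairs and cancels).
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # Step 3: re-estimate frequencies by counting.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        # Step 4: stop once the estimates are stable.
        done = max(abs(new_freq[h] - freq[h]) for h in haps) < tol
        freq = new_freq
        if done:
            break
    return freq
```

The enumeration in `compatible_pairs` doubles with every additional heterozygous marker, which is why the cost of this approach grows quickly with the number of markers.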
![Page 84: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/84.jpg)
E-M Algorithm for Haplotyping
• Cost grows rapidly with number of markers
• Typically appropriate for < 25 SNPs
– Fewer microsatellites
• More accurate than Clark’s method
• Fully or partially phased individuals contribute most of the information
![Page 85: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/85.jpg)
Enhancements to E-M
• List only haplotypes present in sample
– Gradually expand subset of markers under consideration, eliminating haplotypes with low estimated frequency from consideration at each stage
• SNPHAP [Clayton (2001)]
• HAPLOTYPER [Qin et al. (2002)]
![Page 86: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/86.jpg)
Divide-And-Conquer Approximation
• No. of potential haplotypes increases exponentially
– Actual no. of haplotypes doesn’t
• Approximation
– Successively divide marker set
– Run E-M assuming segments associate randomly
– Proceed, ignoring composites of segments with zero frequency
• Order: ~ m log m
• Exact E-M is order ~ 2^m
![Page 87: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/87.jpg)
Other Recent Developments …
• Newer methods try to further improve haplotype estimation by favoring sets of similar haplotypes
• Stephens et al. (2001) Am J Hum Genet 68:978-89
• Genealogical approach, which implies haplotypes are similar to each other…
![Page 88: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/88.jpg)
Method based on Gibbs sampler
• MCMC method
– Stochastic, random procedure
– Improves solution gradually
• Given initial set of haplotypes
• Sample haplotypes for one individual at a time, assuming other haplotypes are true
• Repeat a few million times…
![Page 89: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/89.jpg)
Update Procedure I
• Pick individual U to update at random
• Calculate haplotype frequencies F in all other individuals
– Since everyone is “phased”, this is done by counting
• Sample new haplotypes for U from conditional distribution of U’s haplotypes given F
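One pass of this update can be sketched as below. It is an illustration under stated assumptions: genotypes are coded 0/1/2 as before, `phases` holds the current haplotype pair of each individual, and a small pseudocount (an addition not mentioned on the slide) keeps the sampling weights positive when a haplotype is absent from the other individuals. All names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def compatible_pairs(g):
    """All unordered haplotype pairs consistent with genotype g."""
    het = [i for i, v in enumerate(g) if v == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def gibbs_update(genotypes, phases, rng, pseudocount=0.01):
    """Resample the haplotype pair of one randomly chosen individual in place."""
    # Pick individual U to update at random.
    u = rng.randrange(len(genotypes))
    # F: haplotype counts in all other individuals (everyone is phased).
    counts = Counter(h for k, pair in enumerate(phases) if k != u for h in pair)
    # Sample U's pair with weight proportional to the product of the counts.
    pairs = compatible_pairs(genotypes[u])
    weights = [(counts[h1] + pseudocount) * (counts[h2] + pseudocount)
               for h1, h2 in pairs]
    phases[u] = rng.choices(pairs, weights=weights, k=1)[0]
    return u
```

A full run would start from any compatible phasing, for example `phases = [compatible_pairs(g)[0] for g in genotypes]`, and call `gibbs_update` repeatedly with a `random.Random` instance.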
![Page 90: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/90.jpg)
Update Procedure I
• This procedure would produce an estimate of haplotype frequencies that is equivalent to the E-M algorithm…
• Stephens et al (2001) suggested an alternative estimate of F…
![Page 91: Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649e865503460f94b88c9d/html5/thumbnails/91.jpg)
Update Procedure II
• Estimate F from the other individuals
• Construct F* to include the haplotypes in F and also other similar haplotypes (possibly differing at a few sites, due to mutations)
• Update U’s haplotypes conditional on F*