Two Solutions in Search of Killer Apps.
DIMACS Workshop on Algorithms in Human Population Genomics
Dan Gusfield
UC Davis
Two Algorithmic Topics
We have new algorithmic tools for a) computing the Minimum Mosaic of a set of recombinants, and for b) multi-state Perfect Phylogeny with missing data, that should be of use in Population Genomics and Phylogenetics. These tools were developed `on spec’:
We have `hand-waving’ arguments for their utility, but no actual (biological data-set) applications.
Suggestions wanted.
Topic I: Improved Algorithms for Inferring the Minimum Mosaic of a
Set of Recombinants
Yufeng Wu and Dan Gusfield
UC Davis
From CPM 2007
Recombination
• Recombination: one of the principal genetic forces shaping sequence variations within species.
• Two equal length sequences generate a new equal length sequence.
110001111111001
000110000001111
Prefix
Suffix
11000 0000001111
Breakpoint
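The single-breakpoint recombination pictured above can be sketched in a few lines of Python. This is only an illustration: the function name `recombine` and the convention that the breakpoint is the prefix length are assumptions, not part of the talk.

```python
def recombine(first: str, second: str, breakpoint: int) -> str:
    """Join the prefix of `first` (of length `breakpoint`) to the suffix of `second`."""
    if len(first) != len(second):
        raise ValueError("sequences must have equal length")
    if not 0 <= breakpoint <= len(first):
        raise ValueError("breakpoint out of range")
    return first[:breakpoint] + second[breakpoint:]


# The two parent sequences from the slide, with the breakpoint after position 5.
child = recombine("110001111111001", "000110000001111", 5)
print(child)  # prefix 11000 followed by suffix 0000001111
```

The result has the same length as the two parents, as the slide requires.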
Founders and Mosaic
• Current sequences are descendants of a small number of founders.
– A current sequence is composed of blocks from the founders, due to recombination.
– No mutations since formation of founders.
000000
111111
000000
111111
001111
000000
111111
001111
111100
Breakpoint
Founders
Sampled sequences in current population
000000
001111
111100
011100
Mosaic
000000
001111
111100
011100
The Minimum Mosaic Problem
• Given a set of aligned binary sequences in the current population, and assuming the number of founders is known to be Kf, find a set of founders and the mosaic with the minimum number of breakpoints.
1101101
1010001
0111111
0110100
1100011
Assume Kf =3
1101101
1010001
0111111
0110100
1100011
1101111
1010001
0110100
Three Founders
Four breakpoints: the minimum over all possible sets of three founders
Status of the Minimum Mosaic Problem
• First studied by E. Ukkonen (WABI 2002). Later WABI 2007.
– Dynamic programming method. Not practical when the number of rows is more than 20 and Kf > 2.
• No polynomial-time algorithm was known even when Kf is small. No NP-completeness result is known.
• Our results:
– A simple polynomial-time algorithm for the Kf = 2 case.
– Exact and practical method for data of medium range for Kf ≥ 3.
Three or More Founders: Assuming Known Founders
1101101
1010001
0111111
0110100
1100011
Three Founders
1101101
1101111
1010001
0110100
With known founders, can minimize breakpoints for each sequence, and thus also minimize the total number of breakpoints.
For each input sequence, starting from the left, insert a breakpoint at the end of the longest segment matching one founder.
Founder mapping: for each position c in any input sequence s, the founder from which s[c] takes its value.
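The greedy left-to-right rule for known founders can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `min_breakpoints` and the representation of sequences as strings are assumptions.

```python
def min_breakpoints(seq, founders):
    """Greedy scan: follow the founder with the longest match, then break.

    Returns (number of breakpoints, founder mapping as a list of founder indices).
    """
    m = len(seq)
    pos, breaks, mapping = 0, 0, []
    while pos < m:
        best_len, best_f = 0, None
        for f_idx, f in enumerate(founders):
            length = 0
            while pos + length < m and f[pos + length] == seq[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_f = length, f_idx
        if best_f is None:
            raise ValueError(f"no founder matches position {pos}")
        if pos > 0:
            breaks += 1  # a new segment that starts after column 0 costs one breakpoint
        mapping.extend([best_f] * best_len)
        pos += best_len
    return breaks, mapping


# The three founders and five input sequences from the Kf = 3 example.
founders = ["1101111", "1010001", "0110100"]
sequences = ["1101101", "1010001", "0111111", "0110100", "1100011"]
total = sum(min_breakpoints(s, founders)[0] for s in sequences)
print(total)
```

On the example data the per-sequence counts are 1, 0, 1, 0 and 2, which agrees with the four breakpoints reported for these three founders.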
Breakpoint!
Input Sequences
1101101
Founder 1 Founder 2
Founder Mapping
Enumerating Founders for Founder-Unknown Case
In reality, founders are not known. A straightforward way is to simply enumerate all possible sets of founders, and then run the previous method to find the minimum mosaic.
100
001
011
101
110
010
At each column, there are 2^Kf − 2 founder settings.
Let m be the number of columns. Fully enumerating all possible sets of founders takes O(2^(m·Kf)) time. Infeasible when m or Kf is large.
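A short sketch of the per-column enumeration, for illustration only. It assumes that the two excluded settings are the all-0 and all-1 columns, which is consistent with the six settings listed for three founders; the name `column_settings` is hypothetical.

```python
from itertools import product


def column_settings(kf):
    """All 0/1 assignments to kf founders at one column, excluding all-0 and all-1."""
    all_settings = ("".join(bits) for bits in product("01", repeat=kf))
    return [s for s in all_settings if "0" in s and "1" in s]


settings = column_settings(3)
print(len(settings), sorted(settings))
```

Choosing one setting independently for each of the m columns is what drives the exponential growth of full enumeration.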
Need more ideas to develop a practical method. First, we do the enumeration in the form of search paths in a search tree.
Search Paths and Search Tree
It works, but with an exponential blowup of the search paths!
Obvious idea to reduce search space: branch and bound (compute a lower bound and …).
But we found a different idea to be more useful.
[Figure: search tree over founder settings for columns c1, c2, c3. Each node shows the founder setting at that column (for example 001, 011, 010) and the total number of breakpoints up to the current column.]
On-line computation:
Compute partial solution up to the current column for speedup.
010
001
Founder settings up to column 3
The founder-known method can be run with partially-known founders!
Assume three founders
Dropping Search Paths that are Beaten by Another Search Path
001
0
011
6
P1 and P2 are two search paths up to column 2.
Can we say P1 is better than P2? Not really, because maybe P2 can lead to fewer breakpoints later on.
But, suppose the number of input sequences is 5. We can then say P1 beats P2 (and so drop P2). Why?
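One reading of the argument, offered here as an assumption rather than as the talk's stated rule: any continuation of P2 can also be followed from P1, at the cost of at most one extra breakpoint per input sequence at the junction. So if P1's count plus the number of sequences does not exceed P2's count, P2 can never end up strictly better. The helper `beats` is a hypothetical name for that test.

```python
def beats(cost_p1, cost_p2, num_sequences):
    """True if path P1 can safely replace path P2 at the same column.

    Switching from P1's founders to P2's continuation costs at most one
    extra breakpoint per input sequence.
    """
    return cost_p1 + num_sequences <= cost_p2


# P1 has 0 breakpoints, P2 has 6, and there are 5 input sequences.
print(beats(0, 6, 5))
```

With fewer than 5 breakpoints separating the two paths, this simple test would not allow P2 to be dropped.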
P1
P2
<=39<= 5 bkpts
>= 0 bkpts
An optimal search path following P2
40
Assume three founders
011
101
Founder Config.
A more powerful beating rule
We use a more powerful, but more complex, rule to identify paths that will be beaten, and we use rules that avoid generating beaten paths and redundant, symmetric paths.
These methods reduce the enumeration enormously, allowing practical computation of the optimal.
How Practical Is Our Method?
Source of data and image: UNC Chapel Hill
Five founders
20 rows, 36 columns
UNC’s heuristic solution: 54 breakpoints
Enumerating 2^180 founder states is impossible!
Our method takes 5 minutes to find the optimal solution: 53 breakpoints. It is also practical for a 50x50 matrix with four founders.
Another example
The data from Ukkonen’s 2007 WABI paper (4 founders, 20 sequences, 40 sites) was solved in 5 secs and used one fewer crossover than used in that paper.
Applications?
• Founders on an island
• Founders in microbial communities
• ???
Topic II: Multi-State Perfect Phylogeny
with Missing and Removable Data
To appear in Recomb 2009, May 09
Dan Gusfield
The Perfect Phylogeny Model for binary sequences
[Figure: a perfect phylogeny for binary sequences over sites 1 to 5. The root carries the ancestral sequence 00000, site mutations label the edges, and the extant sequences are at the leaves.]
The tree derives the set M: 10100, 10000, 01011, 01010, 00010.
Only one mutation per site is allowed (infinite sites).
What is a Perfect Phylogeny for k-state characters?
• Input consists of n sequences M with m sites (characters) each, where each site can take one of k > 2 states.
• In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.
• T has n leaves, one for each sequence in M, labeled by that sequence.
• Arbitrarily root T at some node, and direct all the edges away from the root. Then, for any character C with b states, there are at most b-1 edges where character C mutates, and for any state s of C, there is at most one edge where character C mutates to state s. This reflects the infinite alleles model more than the infinite sites model.
Example: A perfect phylogeny for input M
M (n = 5, m = 3, k = 3):
   A B C
1  3 2 1
2  2 3 2
3  3 2 3
4  1 1 3
5  1 2 3
[Figure: a perfect phylogeny T for M with node labels (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3), (3,2,3); one node is marked Root.]
A more standard definition
• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C, form a connected subtree of T.
• It follows that the subtrees for any C are node-disjoint. This condition is called the convexity requirement.
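The convexity requirement is easy to test on a labeled tree. The sketch below is illustrative only: the function name, the dictionary representation and the small six-node tree are assumptions, and the check does not verify that the edges form a tree or that the leaves match M.

```python
from collections import defaultdict


def is_convex(labels, edges):
    """labels: node -> tuple of states; edges: (node, node) pairs of a tree.

    Returns True if, for every character-state pair, the nodes carrying
    that state form a connected subtree.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    num_chars = len(next(iter(labels.values())))
    for c in range(num_chars):
        by_state = defaultdict(set)
        for node, label in labels.items():
            by_state[label[c]].add(node)
        for nodes in by_state.values():
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:  # search restricted to nodes with this state
                u = stack.pop()
                for v in adj[u]:
                    if v in nodes and v not in seen:
                        seen.add(v)
                        stack.append(v)
            if seen != nodes:
                return False
    return True


# Hypothetical tree: four leaves A-D hanging off two internal nodes X and Y.
labels = {"A": (0, 0, 2), "B": (0, 1, 0), "C": (1, 1, 1), "D": (1, 2, 2),
          "X": (0, 1, 2), "Y": (1, 1, 2)}
edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("Y", "C"), ("Y", "D")]
print(is_convex(labels, edges))
```

Swapping the labels of X and Y breaks convexity for the first character, since leaf A would then be cut off from the other nodes in state 0.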
Example
M (n = 5 taxa, m = 3 sites, k = 3 states):
   A B C
1  3 2 1
2  2 3 2
3  3 2 3
4  1 1 3
5  1 2 3
[Figure: the perfect phylogeny T for M with node labels (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3), (3,2,3), highlighting the subtree for State 2 of Character B.]
Perfect Phylogeny Problems
Existence Problem: Given M, is there a Perfect Phylogeny for M?
Missing Data Problem: For a given k, if there are cells in M without values, can values less than or equal to k be imputed so that the resulting matrix M’ has a perfect phylogeny?
Handling missing data extends the utility of the perfect-phylogeny model.
Status of the Existence Problem
Poly-time algorithm for 3 states, Dress-Steel 1991
Poly-time algorithm for 3 or 4 states, Kannan-Warnow
Poly-time algorithm for any fixed number of states - polynomial in n and m, but exponential in k, Agarwala and Fernandez-Baca
Speed up of the method by Kannan-Warnow
When k is not fixed, the existence problem is NP-hard
Status of missing data problem
NP-complete even for k = 2; effective integer programming approaches for k = 2.
Polynomial-time methods for a `directed’ variant of k = 2.
No literature on the missing data problem for k > 2.
New work here: specialized ILP methods for k = 3, 4, 5 and a general ILP solution for any fixed k. In this talk I will only discuss the general solution.
New approach to existence and missing data
Based on an old theorem and newer techniques.
Old theorem: Buneman’s theorem relating Perfect-Phylogeny to chordal graphs.
Newer techniques and theorems: Minimal triangulations of a non-chordal graph to make it chordal.
Chordal Graphs
A graph G is called Chordal if every cycle of length four or more contains a chord.
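A small chordality test, included as an illustration. It relies on the standard fact that a graph is chordal exactly when simplicial vertices (vertices whose neighbours form a clique) can be removed one at a time until nothing is left. The function name and the adjacency-dict representation are assumptions, and this simple version is not the linear-time method used in practice.

```python
def is_chordal(adj):
    """adj: dict mapping each vertex to the set of its neighbours (undirected)."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    while g:
        simplicial = None
        for v, nbrs in g.items():
            # v is simplicial if every two of its neighbours are adjacent
            if all(b in g[a] for a in nbrs for b in nbrs if a != b):
                simplicial = v
                break
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle remains
        for u in g.pop(simplicial):
            g[u].discard(simplicial)
    return True


square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
square_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordal(square), is_chordal(square_with_chord))
```

The four-cycle fails the test, and adding the single chord 1-3 makes it pass.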
Another Classic (1970s) Characterization
A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S.
a
b c
d
e f
g
{b,c}
{b,c,d}
{c,d,e,g}
{a,e} {e,f,g}
T
{a,e,g}
G
Relation to Perfect Phylogeny
In a perfect phylogeny T for a table M, for any character C and any state s of character C, the sub-forest of T induced by the nodes labeled (C,s) forms a single, connected subtree of T.
So, there is a natural set of subtrees of T induced by M, which hints at the relationship of perfect phylogeny to chordal graphs. Buneman’s theorem makes this precise.
Buneman’s Approach to Perfect Phylogeny
3
2 1
2 3 2
3 2 3
1 1 3
1 2 3
1 1 1
2 2 2
3 3 3
Each row of table M induces a clique in Partition-intersection graph G(M).
Table M
Partition-Intersection Graph G(M) has one node for each character-state pair in M, and an edge between two nodes if and only if there is a row in M with both those character-state pairs.
G(M)
1 2 3 Character-states
Note that if table M has m columns, then G(M) is an m-partite graph. Nodes in the same class of G(M) are said to have the same color. Two nodes with the same color are called a mono-chromatic pair.
An edge (u,v) not in G(M) is legal if u and v are in different classes of the partition, i.e. are not a mono-chromatic pair.
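The construction of G(M) and the legality test for a missing edge can be sketched as below. The function names, the (character, state) tuple encoding with characters numbered from 1, and the use of "?" for a missing cell are assumptions made for illustration. The table is the four-row example with rows A to D used in the talk.

```python
from itertools import combinations


def partition_intersection_graph(rows, missing="?"):
    """Return (nodes, edges) of G(M); nodes are (character, state) pairs."""
    nodes, edges = set(), set()
    for row in rows:
        # only cells with known values contribute nodes and edges
        known = [(c + 1, s) for c, s in enumerate(row) if s != missing]
        nodes.update(known)
        for u, v in combinations(known, 2):
            edges.add(frozenset((u, v)))  # each row induces a clique
    return nodes, edges


def is_legal_edge(u, v, edges):
    """A non-edge is legal if its endpoints belong to different characters."""
    return u[0] != v[0] and frozenset((u, v)) not in edges


M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A, B, C, D
nodes, edges = partition_intersection_graph(M)
print(len(nodes), len(edges))
print(is_legal_edge((2, 1), (3, 2), edges))
```

Because missing cells are simply skipped, the same function also builds G(M) from the known data when some entries are "?".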
Buneman’s Theorem
There is a perfect phylogeny for M if and only if legal edges can be added to graph G(M) to make it chordal.
If there is such a chordal graph, denote it G’(M).
Theorem (Buneman 1971?)
G’(M) is called a legal triangulation of G(M).
From Chordal Graph to Perfect Phylogeny
Fact: Given a legal triangulation G’(M), a perfect phylogeny for M can be constructed in linear time.
The algorithms are based on `perfect elimination orders’ and `clique trees’. Many citations.
Example
M (characters 1 2 3):
A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G’(M)
A legal triangulation
M (characters 1 2 3):
A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G’(M)
X Y
Yields a Perfect Phylogeny
M (characters 1 2 3):
A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2
B C
A D
One node in T for each maximal clique in G’(M)
X Y
002
010 111
122
012 112
What about Missing Data?
If M is missing data, build the partition intersection graph G(M) using the known data in M. Buneman’s theorem still holds:
There is a perfect phylogeny for some imputation of missing data in M, if and only if there is a legal triangulation of G(M).
The legal triangulation gives a perfect phylogeny T for M with some imputed data, and then the imputed M’ can be obtained from T.
Example
M (characters 1 2 3):
A: 0 0 2
B: 0 ? 0
C: 1 1 1
D: 1 2 2
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G(M)
The Key Problem
So the key problem in this approach to both the Existence and the Missing Data problems is how to find a legal triangulation, if there is one.
But, there is a robust and still expanding literature on efficient algorithms to find a minimal triangulation of a non-chordal graph.
Some triangulation problems are NP-hard (Tree-width,Minimum fill-in).
Minimal triangulation
A triangulation of a non-chordal graph G is minimal if no subset of added edges is a triangulation of G.
Clearly, if there is a legal triangulation G’(M) of G(M), then there is one which is a minimal triangulation.
So we can take advantage of the minimal triangulation technology. The minimal vertex separators are the key objects.
Minimal vertex separators
A set of vertices S whose removal separates vertices u and v is called a u,v separator. S is a minimal u,v separator if no subset of S is a u,v separator.
S is a minimal separator if it is a minimal u,v separator for some u,v.
Minimal separator S crosses minimal separator S’ if S separates some pair of nodes in S’.
Crossing is a symmetric relation for minimal separators.
Example
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G(M)
{(2,1), (3,2)} and {(1,0), (1,1)} are crossing minimal separators.
{(2,1), (1,1)} and {(1,0), (3,2)} are non-crossing minimal separators.
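The separator definitions can be checked on this small example by brute force. The sketch below is not one of the efficient algorithms from the literature. It uses the standard characterization that S is a minimal separator exactly when removing S leaves at least two components that each have a neighbour of every vertex in S; all function names are hypothetical.

```python
from itertools import combinations

ROWS = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A-D of the example table


def build_graph(rows):
    """Adjacency dict of G(M) over (character, state) nodes."""
    adj = {}
    for row in rows:
        clique = [(c + 1, s) for c, s in enumerate(row)]
        for u in clique:
            adj.setdefault(u, set()).update(v for v in clique if v != u)
    return adj


def components(adj, removed):
    """Connected components after deleting the vertices in `removed`."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def is_minimal_separator(adj, s):
    full = [c for c in components(adj, s) if all(adj[x] & c for x in s)]
    return len(full) >= 2


def minimal_separators(adj):
    verts = sorted(adj)
    return [frozenset(s) for r in range(1, len(verts) - 1)
            for s in combinations(verts, r)
            if is_minimal_separator(adj, frozenset(s))]


def crosses(adj, s, t):
    """True if removing S puts two nodes of T into different components."""
    return sum(1 for c in components(adj, s) if c & t) >= 2


adj = build_graph(ROWS)
seps = minimal_separators(adj)
print(len(seps))
print(crosses(adj, frozenset({(2, 1), (3, 2)}), frozenset({(1, 0), (1, 1)})))
print(crosses(adj, frozenset({(2, 1), (1, 1)}), frozenset({(1, 0), (3, 2)})))
```

Enumerating all vertex subsets is only feasible for very small graphs such as this one with eight nodes.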
The structure of a minimal triangulation in G
Completing a minimal separator K means adding all the missing edges to make K a clique.
Capstone Theorem (P,S 1997): Every minimal triangulation of G is obtained by completing each minimal separator in a maximal set of pairwise non-crossing minimal separators of G. Conversely, completing every minimal separator in a maximal set of pairwise non-crossing minimal separators yields a triangulation of G.
A legal triangulation
M (characters 1 2 3):
A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G’(M)
X Y
There are 6 minimal separators. 5 are pairwise non-crossing.
Back to Perfect Phylogeny
A minimal separator S in the partition intersection graph G(M) is called legal if it does not contain two nodes of the same color and illegal if it does.
P,S Theorem can be used to prove the Main New Results
Theorem 1: There is a perfect phylogeny for M, even if M is missing data, if and only if there is a set of pairwise non-crossing legal minimal separators in G(M) that separate every mono-chromatic pair of nodes in G(M).
A legal triangulation
M (characters 1 2 3):
A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2
3,0 2,1 3,1
1,0 1,1
2,0 3,2 2,2
B C
A D
G’(M)
X Y
Corollaries
Cor 1: If there is a mono-chromatic pair of nodes in G(M) that is not separated by any legal minimal separator, then M has no perfect phylogeny.
Cor 2: If G(M) has no illegal minimal separators, then M has a perfect phylogeny.
Cor 3: If every mono-chromatic pair of nodes is separated by some legal minimal separator, and no legal minimal separators cross, then M has a perfect phylogeny.
How to solve the existence and missing data problems
Given M, find all minimal separators in G(M); determine which are legal and which are illegal; for each legal minimal separator, determine which mono-chromatic pairs of nodes it separates.
Determine if any of the Corollaries hold.
If not, set up and solve an integer linear program to find a set Q of pairwise non-crossing legal minimal separators that separate every mono-chromatic pair of nodes in G(M).
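The whole procedure can be sketched end to end for tiny tables. This is an illustration under stated assumptions, not the talk's software: minimal separators are found by brute force, the ILP is replaced by an exhaustive search over subsets of legal separators (which also covers Corollary 3), mono-chromatic pairs that already lie in different components of G(M) are assumed to need no separator, and every function name is hypothetical.

```python
from itertools import combinations


def build_graph(rows, missing="?"):
    """Adjacency dict of G(M), built from the known cells only."""
    adj = {}
    for row in rows:
        clique = [(c + 1, s) for c, s in enumerate(row) if s != missing]
        for u in clique:
            adj.setdefault(u, set()).update(v for v in clique if v != u)
    return adj


def components(adj, removed=frozenset()):
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def minimal_separators(adj):
    """Brute force: S is minimal iff G - S has two or more full components."""
    verts, found = sorted(adj), []
    for r in range(1, len(verts) - 1):
        for s in map(frozenset, combinations(verts, r)):
            full = [c for c in components(adj, s) if all(adj[x] & c for x in s)]
            if len(full) >= 2:
                found.append(s)
    return found


def separated_pairs(adj, s, pairs):
    """Pairs whose endpoints land in different components of G - S."""
    where = {v: i for i, c in enumerate(components(adj, s)) for v in c}
    return {(u, v) for u, v in pairs
            if u in where and v in where and where[u] != where[v]}


def has_perfect_phylogeny(rows):
    adj = build_graph(rows)
    seps = minimal_separators(adj)
    legal = [s for s in seps if len({c for c, _ in s}) == len(s)]
    mono = [(u, v) for u, v in combinations(sorted(adj), 2) if u[0] == v[0]]
    # Assumption: pairs in different components of G(M) are already separated.
    mono = [p for p in mono if not separated_pairs(adj, frozenset(), [p])]
    covers = {s: separated_pairs(adj, s, mono) for s in legal}
    if any(all(p not in covers[s] for s in legal) for p in mono):
        return False  # Corollary 1
    if len(legal) == len(seps):
        return True   # Corollary 2
    crossing = {(s, t) for s, t in combinations(legal, 2)
                if separated_pairs(adj, s, list(combinations(sorted(t), 2)))}
    for r in range(1, len(legal) + 1):  # exhaustive search standing in for the ILP
        for q in combinations(legal, r):
            if any(pair in crossing for pair in combinations(q, 2)):
                continue
            if all(any(p in covers[s] for s in q) for p in mono):
                return True
    return False


print(has_perfect_phylogeny([(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]))
print(has_perfect_phylogeny([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The second table contains all four combinations of two binary characters, so both of its minimal separators are illegal and Corollary 1 rejects it. A real implementation would hand the final covering step to an ILP solver, as described above.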
Conceptually nice, but
Does it work in practice?
It works surprisingly (shockingly) well
Simulations based on ms, with datasets of sizes that are characteristic of many current applications in phylogenetics and population genetics - but not genomic scale or tree-of-life scale.
Surprising empirical results
The minimal separators are found quickly by existing algorithms.
The numbers of minimal separators are small.
There are few crossing minimal separators.
Until the percentage of missing data becomes large, most problems are solved by the Corollaries, without the need for an ILP.
The created ILPs are tiny.
The ILPs solve quickly - all have solved in 0.00 CPLEX-reported seconds (CPLEX 11 on a 2.5 GHz machine). Most solve in the CPLEX pre-processor.
So although the chordal graph approach may at first seem impractical, it works on a large range of data, with sizes and degrees of missing data that are typical of current phylogenetics problems.
So missing data can be handled.
But what are the biological applications of Multi-State Perfect Phylogeny?
Application Requirements
• Must be multi-state data - ubiquitous
• The probability of mutating to any given state more than once must be very small - less common.
Possible applications
• Micro-satellite data
• Transposable elements as characters and positions of elements as states
• Discretized quantitative traits
• Infinite alleles model
All software to replicate these results will be available on my website by the time of Recomb 2009