
Two Solutions in Search of Killer Apps.

DIMACS workshop on Algorithms in Human Population Genomics

Dan Gusfield

UC Davis

Two Algorithmic Topics

We have new algorithmic tools for a) computing the Minimum Mosaic of a set of recombinants, and for b) multi-state Perfect Phylogeny with missing data, that should be of use in Population Genomics and Phylogenetics. These tools were developed `on spec’:

We have `hand-waving’ arguments for their utility, but no actual (biological data-set) applications.

Suggestions wanted.

Topic I: Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants

Yufeng Wu and Dan Gusfield

UC Davis

From CPM 2007

Recombination

• Recombination: one of the principal genetic forces shaping sequence variations within species.

• Two equal length sequences generate a new equal length sequence.

110001111111001 (first sequence: its prefix is kept)

000110000001111 (second sequence: its suffix is kept)

11000 0000001111 (recombinant: the gap marks the breakpoint)
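The single-crossover operation on this slide can be sketched in a few lines of Python. The function name `recombine` and the parameter `cut` (the number of leading positions taken from the first sequence) are illustrative names, not part of the talk.

```python
def recombine(first: str, second: str, cut: int) -> str:
    """Single crossover: prefix of `first` up to `cut`, then suffix of `second`.

    `cut` is a hypothetical parameter name; it counts the positions
    copied from `first` before the breakpoint.
    """
    if len(first) != len(second):
        raise ValueError("sequences must have equal length")
    if not 0 <= cut <= len(first):
        raise ValueError("cut must lie within the sequence")
    return first[:cut] + second[cut:]


# The two sequences from the slide, with the breakpoint after position 5.
child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

The result has the same length as its two parents, as the slide requires.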

Founders and Mosaic

• Current sequences are descendants of a small number of founders.

– A current sequence is composed of blocks from the founders, due to recombination.

– No mutations since formation of founders.

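[Figure: founders 000000 and 111111 recombine over generations, producing sequences such as 001111 and 111100, each with a breakpoint.]

Founders: 000000, 111111

Sampled sequences in current population: 000000, 001111, 111100, 011100

Mosaic: each sampled sequence is shown as blocks copied from the founders.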

The Minimum Mosaic Problem

• Given a set of aligned binary sequences in the current population, and assuming the number of founders is known to be Kf, find a set of founders and the mosaic with the minimum number of breakpoints.

Input sequences:

1101101

1010001

0111111

0110100

1100011

Assume Kf = 3

Three Founders:

1101111

1010001

0110100

Four breakpoints: minimum for all possible three founders

Status of the Minimum Mosaic Problem

• First studied by E. Ukkonen (WABI 2002). Later WABI 2007.

– Dynamic programming method. Not practical when the number of rows is more than 20 and Kf > 2.

• No polynomial-time algorithm was known even when Kf is small. No NP-completeness result is known.

• Our results:

– A simple polynomial-time algorithm for the Kf = 2 case.

– Exact and practical method for data of medium range for Kf ≥ 3.


Three or More Founders: Assuming Known Founders

1101101

1010001

0111111

0110100

1100011

Three Founders

1101101

1101111

1010001

0110100

With known founders, can minimize breakpoints for each sequence, and thus also minimize the total number of breakpoints.

For each input sequence, starting from the left, insert a breakpoint at the end of the longest segment matching one founder.

Founder mapping: at each position c in any input sequence s, which founder s[c] takes its value from.
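The left-to-right rule for known founders can be sketched as follows. The function name `min_breakpoints` and the return format (breakpoint count plus a per-position founder index) are illustrative assumptions, and the sketch checks itself against the three-founder example from the talk.

```python
def min_breakpoints(seq, founders):
    """Greedy scan: repeatedly take the longest segment matching one founder.

    Returns (number of breakpoints, founder mapping), where the mapping
    gives, for each position, the index of the founder it is copied from.
    """
    m = len(seq)
    pos, breaks, mapping = 0, 0, []
    while pos < m:
        best_len, best_founder = 0, None
        for idx, founder in enumerate(founders):
            length = 0
            while pos + length < m and founder[pos + length] == seq[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_founder = length, idx
        if best_founder is None:
            raise ValueError(f"no founder matches the sequence at column {pos}")
        if pos > 0:
            breaks += 1  # a new segment after the first one means a breakpoint
        mapping.extend([best_founder] * best_len)
        pos += best_len
    return breaks, mapping


inputs = ["1101101", "1010001", "0111111", "0110100", "1100011"]
founders = ["1101111", "1010001", "0110100"]
per_sequence = [min_breakpoints(s, founders)[0] for s in inputs]
print(per_sequence, sum(per_sequence))
```

On the example data the per-sequence counts add up to the four breakpoints reported on the earlier slide.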

[Figure: founder mapping of the input sequence 1101101, with a breakpoint where the mapping switches from Founder 1 to Founder 2.]

Enumerating Founders for Founder-Unknown Case

In reality, founders are not known. A straightforward way is to simply enumerate all possible sets of founders, and then run the previous method to find the minimum mosaic.

100

001

011

101

110

010

At each column, there are 2^Kf – 2 founder settings.

Let m be the number of columns. Fully enumerating all possible sets of founders takes on the order of 2^(m*Kf) time. Infeasible when m or Kf is large.
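The per-column count can be checked with a short enumeration. This sketch assumes that the count on the slide excludes the two uniform settings (all founders 0, all founders 1), which is what matches the six settings listed for three founders. The name `column_settings` is illustrative.

```python
from itertools import product


def column_settings(kf):
    """All 0/1 assignments of one column to kf founders, excluding the
    all-0 and all-1 assignments (an assumption read off the slide's count)."""
    return ["".join(bits) for bits in product("01", repeat=kf)
            if len(set(bits)) > 1]


settings = column_settings(3)
print(settings)

# Size of the full enumeration for the 36-column, five-founder data set
# mentioned later in the talk: one bit per (column, founder) cell.
print(2 ** (36 * 5) == 2 ** 180)
```

For three founders this yields the same six settings as the list above: 001, 010, 011, 100, 101, 110.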

Need more ideas to develop a practical method. First, we do the enumeration in the form of search paths in a search tree.

Search Paths and Search Tree

It works but exponential blowup of the search paths!

Obvious idea to reduce search space: branch and bound (compute a lower bound and …).

But we found a different idea is more useful.

[Figure: search tree over founder settings, assuming three founders. Each node shows the founder setting chosen at a column (c1, c2, c3, ...) together with the total number of breakpoints up to the current column.]

On-line computation:

Compute partial solution up to the current column for speedup.

010

001

Founder settings up to column 3

The founder-known method can be run with partially-known founders!

Assume three founders

Dropping Search Paths that are Beaten by Another Search Path

001

0

011

6

P1 and P2 are two search paths up to column 2.

Can we say P1 is better than P2? Not really, because maybe P2 can lead to fewer breakpoints later on.

But, suppose the number of input sequences is 5. We can then say P1 beats P2 (and so drop P2). Why?

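[Figure: P1 (0 breakpoints so far) and P2 (6 breakpoints so far) at column 2, assuming three founders. An optimal search path following P2 has 40 breakpoints in total. P1 can switch to the founder configuration of P2 with at most 5 extra breakpoints (one per input sequence) and then follow the same path, for a total of at most 39.]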
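One reading of the simple beating rule is sketched below. It assumes both paths end at the same column, and that a path can adopt the other path's continuation at a cost of at most one extra breakpoint per input sequence. The function name `beats` is illustrative, and this is not the more powerful rule used in the actual method.

```python
def beats(bkpts_p1: int, bkpts_p2: int, n_sequences: int) -> bool:
    """True if path P1 makes path P2 unnecessary under the simple rule.

    Assumption: P1 can follow any continuation of P2 by paying at most
    one extra breakpoint per input sequence at the switch. If P1 is still
    no worse after paying that worst case, P2 can be dropped without
    losing the optimal breakpoint count.
    """
    return bkpts_p1 + n_sequences <= bkpts_p2


# The example from the slide: P1 has 0 breakpoints, P2 has 6, 5 sequences.
print(beats(0, 6, 5))
```

With five sequences, a gap of four breakpoints would not be enough to drop the worse path under this rule.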

A more powerful beating rule

We use a more powerful, but more complex, rule to identify paths that will be beaten, and we use rules that avoid generating beaten paths and redundant, symmetric paths.

These methods reduce the enumeration enormously, allowing practical computation of the optimal.

How Practical Is Our Method?

Source of data and image: UNC Chapel Hill

Five founders

20 rows, 36 columns

UNC’s heuristic solution: 54 breakpoints

Enumerating 2^180 founder states is impossible!

Our method takes 5 minutes to find the optimal solutions: 53 breakpoints. It is also practical for 50x50 matrix with four founders.

Another example

The data from Ukkonen’s 2007 WABI paper (4 founders, 20 sequences, 40 sites) was solved in 5 secs and used one fewer crossover than used in that paper.

Applications?

• Founders on an island

• Founders in microbial communities

• ???

Topic II: Multi-State Perfect Phylogeny with Missing and Removable Data

To appear in Recomb 2009, May 09

Dan Gusfield

The Perfect Phylogeny Model for binary sequences

[Figure: a perfect phylogeny for binary sequences over sites 1 to 5. The root is the ancestral sequence 00000, each site mutation labels one edge, and the extant sequences are at the leaves.]

The tree derives the set M: 10100, 10000, 01011, 01010, 00010

Only one mutation per site allowed (infinite sites)

What is a Perfect Phylogeny for k-state characters?

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k > 2 states.

• In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.

• T has n leaves, one for each sequence in M, labeled by that sequence.

• Arbitrarily root T at some node, and direct all the edges away from the root. Then, for any character C with b states, there are at most b-1 edges where character C mutates, and for any state s of C, there is at most one edge where character C mutates to state s. This reflects the infinite alleles model more than the infinite sites model.

Example: A perfect phylogeny for input M

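M    A  B  C
1    3  2  1
2    2  3  2
3    3  2  3
4    1  1  3
5    1  2  3

n = 5, m = 3, k = 3

[Figure: a perfect phylogeny for M with one node marked Root. The leaves are labeled (3,2,1), (2,3,2), (3,2,3), (1,1,3) and (1,2,3); the two internal nodes are labeled (3,2,3) and (1,2,3).]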

A more standard definition

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C, form a connected subtree of T.

• It follows that the subtrees for any C are node-disjoint. This condition is called the convexity requirement.

Example

M    A  B  C
1    3  2  1
2    2  3  2
3    3  2  3
4    1  1  3
5    1  2  3

n = 5 number of taxa, m = 3 number of sites, k = 3 number of states

[Figure: the perfect phylogeny for M, with nodes labeled (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3) and (3,2,3), highlighting the subtree for State 2 of Character B.]

Perfect Phylogeny Problems

Existence Problem: Given M, is there a Perfect Phylogeny for M?

Missing Data Problem: For a given k, if there are cells in M without values, can values less than or equal to k be imputed so that the resulting matrix M’ has a perfect phylogeny?

Handling missing data extends the utility of the perfect-phylogeny model.

Status of the Existence Problem

Poly-time algorithm for 3 states, Dress-Steel 1991

Poly-time algorithm for 3 or 4 states, Kannan-Warnow

Poly-time algorithm for any fixed number of states - polynomial in n and m, but exponential in k, Agarwala and Fernandez-Baca

Speed up of the method by Kannan-Warnow

When k is not fixed, the existence problem is NP-hard

Status of missing data problem

NP-complete even for k = 2; effective integer programming approaches for k = 2.

Polynomial-time methods for a `directed’ variant of k = 2.

No literature on the missing data problem for k > 2.

New work here: specialized ILP methods for k = 3, 4, 5 and a general ILP solution for any fixed k. In this talk I will only discuss the general solution.

New approach to existence and missing data

Based on an old theorem and newer techniques.

Old theorem: Buneman’s theorem relating Perfect-Phylogeny to chordal graphs.

Newer techniques and theorems: Minimal triangulations of a non-chordal graph to make it chordal.

Chordal Graphs

A graph G is called Chordal if every cycle of length four or more contains a chord.
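The definition can be tested on small graphs with a short sketch. It relies on a standard fact that is not stated on the slide: a graph is chordal exactly when its vertices can be removed one at a time, each being simplicial (its remaining neighbours form a clique) at the moment of removal. The name `is_chordal` and the adjacency-dictionary input format are illustrative choices.

```python
def is_chordal(adj):
    """adj maps each node to the set of its neighbours (undirected graph).

    Repeatedly delete a simplicial vertex; the graph is chordal if and
    only if this process removes every vertex.
    """
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a in ns for b in ns if a != b)),
            None,
        )
        if simplicial is None:
            return False  # every remaining vertex lies on a chordless cycle
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True


# A cycle of length four has no chord; adding the chord 1-3 fixes that.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
square_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordal(square), is_chordal(square_with_chord))
```

This simple version is quadratic or worse; linear-time recognition algorithms exist but are not needed for slide-sized examples.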

Another Classic (1970s) Characterization

A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S.

a

b c

d

e f

g

{b,c}

{b,c,d}

{c,d,e,g}

{a,e} {e,f,g}

T

{a,e,g}

G

Relation to Perfect Phylogeny

In a perfect phylogeny T for a table M, for any character C and any state s of character C, the sub-forest of T induced by the nodes labeled (C,s) forms a single, connected subtree of T.

So, there is a natural set of subtrees of T induced by M, which hints at the relationship of perfect phylogeny to chordal graphs. Buneman’s theorem makes this precise.

Buneman’s Approach to Perfect Phylogeny

3

2 1

2 3 2

3 2 3

1 1 3

1 2 3

1 1 1

2 2 2

3 3 3

Each row of table M induces a clique in Partition-intersection graph G(M).

Table M

Partition-Intersection Graph G(M) has one node for each character-state pair in M, and an edge between two nodes if and only if there is a row in M with both those character-state pairs.

G(M)

1 2 3 Character-states

Note that if table M has m columns, then G(M) is an m-partite graph. Nodes in the same class of G(M) are said to have the same color. Two nodes with the same color are called a mono-chromatic pair.

An edge (u,v) not in G(M) is legal if u and v are in different classes of the partition, i.e., are not a mono-chromatic pair.
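A minimal sketch of building G(M) from a table follows. Nodes are written as (character, state) pairs with characters numbered from 1, and `None` marks a missing cell; both conventions, and the function names, are assumptions made for illustration. The table used is the four-row example that appears later in the talk.

```python
from itertools import combinations


def partition_intersection_graph(rows, missing=None):
    """Return (nodes, edges) of G(M); each row contributes a clique
    over its known (character, state) pairs."""
    nodes, edges = set(), set()
    for row in rows:
        present = [(c + 1, s) for c, s in enumerate(row) if s != missing]
        nodes.update(present)
        edges.update(frozenset(pair) for pair in combinations(present, 2))
    return nodes, edges


def is_legal_edge(u, v, edges):
    """A non-edge is legal if its endpoints have different colors."""
    return u[0] != v[0] and frozenset((u, v)) not in edges


M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A, B, C, D
nodes, edges = partition_intersection_graph(M)
print(len(nodes), len(edges))
print(is_legal_edge((2, 1), (3, 2), edges), is_legal_edge((1, 0), (1, 1), edges))
```

Cells marked as missing simply contribute no node and no edges, which is how the graph is built in the missing-data setting.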

Buneman’s Theorem

There is a perfect phylogeny for M if and only if legal edges can be added to graph G(M) to make it chordal.

If there is such a chordal graph, denote it G’(M).

Theorem (Buneman 1971?)

G’(M) is called a legal triangulation of G(M).
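For very small tables, the theorem can be tried out directly by brute force: add subsets of legal edges, smallest first, and stop at the first chordal result. This is only an exponential-time illustration under stated assumptions, not the method of the talk. The names `is_chordal` and `legal_triangulation`, and the use of `None` for missing cells, are illustrative.

```python
from itertools import combinations


def is_chordal(adj):
    """Chordality test by repeatedly deleting a simplicial vertex."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        v = next((v for v, ns in g.items()
                  if all(b in g[a] for a in ns for b in ns if a != b)), None)
        if v is None:
            return False
        for n in g[v]:
            g[n].discard(v)
        del g[v]
    return True


def legal_triangulation(rows):
    """Smallest set of legal edges making G(M) chordal, or None."""
    adj = {}
    for row in rows:
        cells = [(c + 1, s) for c, s in enumerate(row) if s is not None]
        for u in cells:
            adj.setdefault(u, set()).update(v for v in cells if v != u)
    nodes = sorted(adj)
    # Legal candidates: non-adjacent pairs with different colors.
    candidates = [(u, v) for u, v in combinations(nodes, 2)
                  if u[0] != v[0] and v not in adj[u]]
    for size in range(len(candidates) + 1):
        for added in combinations(candidates, size):
            trial = {v: set(ns) for v, ns in adj.items()}
            for u, v in added:
                trial[u].add(v)
                trial[v].add(u)
            if is_chordal(trial):
                return list(added)
    return None


print(legal_triangulation([(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]))
# Two binary characters showing all four combinations: no legal chord exists.
print(legal_triangulation([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The second call shows the negative case: the only chords available for the four-cycle join nodes of the same color, so no legal triangulation exists.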

From Chordal Graph to Perfect Phylogeny

Fact: Given a legal triangulation G’(M), a perfect phylogeny for M can be constructed in linear time.

The algorithms are based on `perfect elimination orders’ and `clique trees’. Many citations.

Example

A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2

1 2 3

M

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G’(M)

A legal triangulation

A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2

1 2 3

M

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G’(M)

X Y

Yields a Perfect Phylogeny

A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2

1 2 3

M

B C

A D

One node in T for each maximal clique in G’(M)

X Y

002

010 111

122

012 112

What about Missing Data?

If M is missing data, build the partition intersection graph G(M) using the known data in M. Buneman’s theorem still holds:

There is a perfect phylogeny for some imputation of missing data in M, if and only if there is a legal triangulation of G(M).

The legal triangulation gives a perfect phylogeny T for M with some imputed data, and then the imputed M’ can be obtained from T.

Example

A: 0 0 2
B: 0 ? 0
C: 1 1 1
D: 1 2 2

1 2 3

M

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G(M)

The Key Problem

So the key problem in this approach to both the Existence and the Missing Data problems is how to find a legal triangulation, if there is one.

But, there is a robust and still expanding literature on efficient algorithms to find a minimal triangulation of a non-chordal graph.

Some triangulation problems are NP-hard (Tree-width, Minimum fill-in).

Minimal triangulation

A triangulation of a non-chordal graph G is minimal if no proper subset of the added edges is a triangulation of G.

Clearly, if there is a legal triangulation G’(M) of G(M), then there is one which is a minimal triangulation.

So we can take advantage of the minimal triangulation technology. The minimal vertex separators are the key objects.

Minimal vertex separators

A set of vertices S whose removal separates vertices u and v is called a u,v separator. S is a minimal u,v separator if no proper subset of S is a u,v separator.

S is a minimal separator if it is a minimal u,v separator for some u,v.

Minimal separator S crosses minimal separator S’, if S separates some pair of nodes in S’.

Crossing is a symmetric relation for minimal separators.
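These definitions can be explored by brute force on a small graph. The sketch uses a standard characterization that is not on the slide: S is a minimal separator exactly when removing S leaves at least two components in which every vertex of S has a neighbour. The function names are illustrative, and the graph is G(M) for the four-row example table from the talk, with nodes written as (character, state).

```python
from itertools import combinations


def components(adj, removed):
    """Connected components of the graph after deleting `removed`."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def minimal_separators(adj):
    """All minimal separators, by exhaustive search over vertex subsets."""
    found, nodes = [], sorted(adj)
    for size in range(1, len(nodes) - 1):
        for cand in combinations(nodes, size):
            s = set(cand)
            full = [c for c in components(adj, s) if all(adj[v] & c for v in s)]
            if len(full) >= 2:
                found.append(frozenset(s))
    return found


def crosses(adj, s, t):
    """True if removing s puts two nodes of t into different components."""
    rest = set(t) - set(s)
    return sum(1 for c in components(adj, s) if c & rest) >= 2


rows = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]
adj = {}
for row in rows:
    cells = [(c + 1, s) for c, s in enumerate(row)]
    for u in cells:
        adj.setdefault(u, set()).update(v for v in cells if v != u)

seps = minimal_separators(adj)
print(len(seps))
print(crosses(adj, {(2, 1), (3, 2)}, {(1, 0), (1, 1)}))
print(crosses(adj, {(2, 1), (1, 1)}, {(1, 0), (3, 2)}))
```

The exhaustive search is only meant for graphs of this size; the talk relies on existing efficient algorithms for listing minimal separators.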

Example

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G(M)

{(2,1), (3,2)} and {(1,0), (1,1)} are crossing minimal separators.

{(2,1), (1,1)} and {(1,0), (3,2)} are non-crossing minimal separators.

The structure of a minimal triangulation in G

Completing a minimal separator K means adding all the missing edges to make K a clique.

Capstone Theorem (P,S 1997): Every minimal triangulation of G is obtained by completing each minimal separator in a maximal set of pairwise non-crossing minimal separators of G. Conversely, completing every minimal separator in a maximal set of pairwise non-crossing minimal separators yields a triangulation of G.


A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2

1 2 3

M

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G’(M)

X Y

There are 6 minimal separators. 5 are pairwise non-crossing.

Back to Perfect Phylogeny

A minimal separator S in the partition intersection graph G(M) is called legal if it does not contain two nodes of the same color and illegal if it does.

P,S Theorem can be used to prove the Main New Results

Theorem 1: There is a perfect phylogeny for M, even if M is missing data, if and only if there is a set of pairwise non-crossing legal minimal separators in G(M) that separate every mono-chromatic pair of nodes in G(M).


A: 0 0 2
B: 0 1 0
C: 1 1 1
D: 1 2 2

1 2 3

M

3,0 2,1 3,1

1,0 1,1

2,0 3,2 2,2

B C

A D

G’(M)

X Y

Corollaries

Cor 1: If there is a mono-chromatic pair of nodes in G(M) that is not separated by any legal minimal separator, then M has no perfect phylogeny.

Cor 2: If G(M) has no illegal minimal separators, then M has a perfect phylogeny.

Cor 3: If every mono-chromatic pair of nodes is separated by some legal minimal separator, and no legal minimal separators cross, then M has a perfect phylogeny.

How to solve the existence and missing data problems

Given M, find all minimal separators in G(M); determine which are legal and which are illegal; for each legal minimal separator, determine which mono-chromatic pairs of nodes it separates.

Determine if any of the Corollaries hold.

If not, set up and solve an integer linear program to find a set Q of pairwise non-crossing legal minimal separators that separate every mono-chromatic pair of nodes in G(M).
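The whole procedure can be sketched end to end for tiny tables, with an exhaustive search over sets of separators standing in for the integer linear program. Everything here is an illustrative assumption rather than the talk's software: the function names, the use of `None` for missing cells, the brute-force separator listing, and the choice to consider only mono-chromatic pairs that lie in the same connected component of G(M).

```python
from itertools import combinations


def build_graph(rows):
    adj = {}
    for row in rows:
        cells = [(c + 1, s) for c, s in enumerate(row) if s is not None]
        for u in cells:
            adj.setdefault(u, set()).update(v for v in cells if v != u)
    return adj


def components(adj, removed):
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(comp)
    return comps


def minimal_separators(adj):
    nodes = sorted(adj)
    return [frozenset(c) for k in range(1, len(nodes) - 1)
            for c in combinations(nodes, k)
            if sum(all(adj[v] & comp for v in c)
                   for comp in components(adj, set(c))) >= 2]


def separates(adj, s, u, v):
    if u in s or v in s:
        return False
    return not any(u in c and v in c for c in components(adj, s))


def crosses(adj, s, t):
    return any(separates(adj, s, u, v) for u, v in combinations(sorted(t - s), 2))


def has_perfect_phylogeny(rows):
    """Search for a set Q as in Theorem 1 (brute force instead of an ILP)."""
    adj = build_graph(rows)
    legal = [s for s in minimal_separators(adj)
             if len({color for color, _ in s}) == len(s)]
    comp_of = {v: i for i, c in enumerate(components(adj, set())) for v in c}
    mono = [(u, v) for u, v in combinations(sorted(adj), 2)
            if u[0] == v[0] and comp_of[u] == comp_of[v]]
    for size in range(len(legal) + 1):
        for q in combinations(legal, size):
            if any(crosses(adj, s, t) for s, t in combinations(q, 2)):
                continue
            if all(any(separates(adj, s, u, v) for s in q) for u, v in mono):
                return True
    return False


print(has_perfect_phylogeny([(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]))
print(has_perfect_phylogeny([(0, 0, 2), (0, None, 0), (1, 1, 1), (1, 2, 2)]))
print(has_perfect_phylogeny([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The first two calls use the example table with and without the missing cell; the third is a two-character binary table containing all four combinations, which has no perfect phylogeny.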

Conceptually nice, but

Does it work in practice?

It works surprisingly (shockingly) well

Simulations based on ms, with datasets of sizes that are characteristic of many current applications in phylogenetics and population genetics - but not genomic scale or tree-of-life scale.

Surprising empirical results

The minimal separators are found quickly by existing algorithms.

The numbers of minimal separators are small.

There are few crossing minimal separators.

Until a large percentage of missing data, most problems are solved by the Corollaries, without the need for an ILP.

The created ILPs are tiny.

The ILPs solve quickly - all have solved in 0.00 CPLEX-reported seconds (CPLEX 11 on a 2.5 GHz machine). Most solve in the CPLEX pre-processor.

So although the chordal graph approach may at first seem impractical, it works on a large range of data of sizes, and degrees of missing data, that are typical of current phylogenetics problems.

So missing data can be handled.

But what are the biological applications of Multi-State Perfect Phylogeny?

Application Requirements

• Must be multi-state data - ubiquitous

• The probability of mutating to any given state more than once must be very small - less common.

Possible applications

• Micro-satellite data

• Transposable elements as characters and positions of elements as states

• Discretized quantitative traits

• Infinite alleles model

All software to replicate these results will be available on my website by the time of Recomb 2009