DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal...

Post on 15-Dec-2015

DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents

Zhaoming Yin

Bader-Polo Joint Group Meeting, Nov 11, 2013

Contribution

• Research Aspect

- A framework for solving the maximum parsimony tree problem when the input genomes have unequal contents.

- Proved that adequate subgraph theory is applicable to unequal-content data, which reduces the search space.

- Provides a benchmark for the HPC community.

• Engineering Aspect

- Implemented software with many state-of-the-art features, such as the supertree method, the GAS initialization method, and spectral partition.

- The software can produce a tree that shows not only the topology but also the type and number of the different evolution events (visualization!).

Why Is the Phylogenetic Tree Problem Hard?

• For N genomes, there are (2N-5)!! possible unrooted binary tree topologies.

• For each topology, we need to compute at least one median; the number of possible median orders is (g-2)!!, where g is the number of genes.

• Validating each candidate median is NP-hard if the gene content has duplications.

• So the complexity of computing the MP tree for genomes with unequal contents is:

NP-hard over NP-hard over NP-hard!
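To give a feel for the first bullet, here is a minimal sketch that evaluates the standard count of unrooted binary tree topologies on N labeled leaves, (2N-5)!!. The function names are illustrative and not part of DCJUC.

```python
def double_factorial(k):
    """Return k!! = k * (k-2) * (k-4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_topologies(n_leaves):
    """Number of unrooted binary tree topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


for n in (4, 8, 12):
    print(n, unrooted_topologies(n))
```

Already at 12 genomes (the size of the Drosophila example) the count is in the hundreds of millions, before any median is computed.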

Phylogenetic Tree

This picture presents the phylogeny of the “12 Drosophila.”

From http://insects.eugenes.org/species

Maximum Parsimony Concept

[Figure: alternative tree topologies over leaves 1-6]

Of all possible topologies, the maximum parsimony tree is the one with the minimum total tree length.
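The selection rule can be sketched as follows. The candidate topologies and edge lengths are made-up illustrative numbers, not data from the talk, and the internal node names x and y are hypothetical.

```python
def tree_length(edge_lengths):
    """Total tree length: the sum of the evolutionary distances on all edges."""
    return sum(edge_lengths.values())


def most_parsimonious(candidates):
    """Return the name of the candidate topology with the smallest total length."""
    return min(candidates, key=lambda name: tree_length(candidates[name]))


# Hypothetical edge lengths for the three topologies on four genomes.
candidates = {
    "((1,2),(3,4))": {("1", "x"): 1, ("2", "x"): 2, ("x", "y"): 1, ("3", "y"): 1, ("4", "y"): 2},
    "((1,3),(2,4))": {("1", "x"): 2, ("3", "x"): 2, ("x", "y"): 1, ("2", "y"): 2, ("4", "y"): 2},
    "((1,4),(2,3))": {("1", "x"): 2, ("4", "x"): 3, ("x", "y"): 1, ("2", "y"): 2, ("3", "y"): 2},
}

best = most_parsimonious(candidates)
print(best, tree_length(candidates[best]))
```

In practice the edge lengths are not given. They come from the median computations at the internal nodes, which is where the cost lies.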

Genome Rearrangement

http://ai.stanford.edu/~serafim/CS374_2006/presentations/lecture17.ppt

Genome Rearrangement

In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar in sequence, yet these surprisingly similar gene sequences differed in gene order. This study helped pave the way for analyzing genome rearrangements in molecular evolution.

Original: 1 2 3 4 5 6 7 8 9 10

Inversion: 1 2 -6 -5 -4 -3 7 8 9 10

Transposition: 1 2 7 8 3 4 5 6 9 10

Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
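The three operations on this slide can be reproduced on a signed gene order stored as a Python list. This is a minimal sketch. The function names and the half-open index convention are choices made here, not taken from DCJUC.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the block genome[i:j] so that it follows the block genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved block is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))
print(inversion(genome, 2, 6))                  # [1, 2, -6, -5, -4, -3, 7, 8, 9, 10]
print(transposition(genome, 2, 6, 8))           # [1, 2, 7, 8, 3, 4, 5, 6, 9, 10]
print(inverted_transposition(genome, 2, 6, 8))  # [1, 2, 7, 8, -6, -5, -4, -3, 9, 10]
```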

Genome Median Computation

[Figure: trees with nodes labeled 1-6]

Genome Median Computation

[Figure: tree with nodes labeled 1-6]

Input genomes: 1,2,3; 1,-3,-2; -2,-1,3

Candidate medians: 1,2,3 = 2 moves; 2,-1,3 = 5 moves; …
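The score of a candidate median is the sum of its distances to the three input genomes. The sketch below scores candidates for the genomes 1,2,3 / 1,-3,-2 / -2,-1,3 from this slide by brute force. It assumes signed inversions are the only allowed move and finds distances by breadth-first search, which is only feasible for toy sizes. The talk itself uses DCJ distances, so scores for other candidates need not match the slide.

```python
from collections import deque
from itertools import permutations, product


def neighbors(genome):
    """All genomes reachable from `genome` by one signed inversion."""
    n = len(genome)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield genome[:i] + tuple(-g for g in reversed(genome[i:j])) + genome[j:]


def inversion_distance(src, dst):
    """Minimum number of inversions turning src into dst (BFS, toy sizes only)."""
    src, dst = tuple(src), tuple(dst)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("genomes do not share the same gene set")


def median_score(candidate, genomes):
    """Sum of distances from the candidate median to every input genome."""
    return sum(inversion_distance(candidate, g) for g in genomes)


def signed_permutations(genes):
    """Every gene order with every choice of orientations."""
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(order)):
            yield tuple(s * g for s, g in zip(signs, order))


genomes = [(1, 2, 3), (1, -3, -2), (-2, -1, 3)]
print(median_score((1, 2, 3), genomes))  # 2
best = min(signed_permutations((1, 2, 3)), key=lambda m: median_score(m, genomes))
print(best, median_score(best, genomes))
```

Exhaustive enumeration like this grows factorially with the number of genes, which is why the following steps rely on branch-and-bound and heuristic search.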

Step 1: Spectral Partition
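As a rough illustration of spectral partitioning in general, the sketch below bisects a small graph by the sign of its Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). It assumes the genomes are nodes of a connected, unweighted similarity graph, and it uses plain power iteration to stay dependency-free. How DCJUC actually builds and splits its graph is not stated on the slide.

```python
def fiedler_partition(n, edges, iterations=500):
    """Split nodes 0..n-1 of a connected graph by the sign of the Fiedler vector."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # shift exceeds the largest Laplacian eigenvalue (at most 2 * max degree),
    # so power iteration on (shift*I - L) favors the smallest eigenvalues of L.
    shift = 2 * max(len(a) for a in adj) + 1
    # Deterministic start vector with zero mean, i.e. orthogonal to the
    # constant eigenvector that belongs to eigenvalue 0.
    x = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iterations):
        # (L x)[u] = deg(u) * x[u] - sum of x over the neighbors of u
        y = [shift * x[u] - (len(adj[u]) * x[u] - sum(x[v] for v in adj[u]))
             for u in range(n)]
        mean = sum(y) / n
        y = [val - mean for val in y]  # project out the constant vector again
        norm = sum(val * val for val in y) ** 0.5
        x = [val / norm for val in y]
    left = {u for u in range(n) if x[u] < 0}
    return left, set(range(n)) - left


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(fiedler_partition(6, edges))
```

On this toy graph the cut falls on the bridge, separating the two triangles. A production code would use a sparse eigensolver instead of power iteration.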

Step 2: Compute MP Tree for Each Sub-Disk

Step 2-1: How to Compute Median (BNB)

[Figure: eight frames of the branch-and-bound search over a graph with vertices 1-8]

Step 2-2: How to Compute Median (LK)

[Figure: sequence of LK search steps, ending at a stop condition]

Step 2-2: How to Evaluate Median

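[Figure: a median node (med) joined to three leaf genomes 1, 2, 3. Genomes shown:]

1, 2, 3, 3, 4, 6, 5

1, 2, 3, 4, 3, 6, 5

1, 2, 3, 4, 6, 3, 5

1, 2, 5, 4, 6, 3, 3

Score of a candidate median m: Dis(m,1) + Dis(m,2) + Dis(m,3)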

Step 2-2: How to Evaluate Median

1, 2, 3, 3, 4, 6, 5

1, 2, 3, 4, 3, 5

Find a mapping first (NP-hard): dis = 1

1, 2, 3, 3, 4, 6, 5

-2, -1, 3, 3, 4, 5

Complete the loss (polynomial): dis = 2

1, 2, 3, 4, 6, 5

-2, -1, 3, 4, 6, 5

Compute DCJ (polynomial): dis = 3

1, 2, 3, 4, 6, 5

1, 2, 3, 4, 6, 5
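The last stage, DCJ distance between genomes with equal contents, is the polynomial part. A minimal sketch follows, assuming both genomes are single circular chromosomes with no duplicated genes. In that case the distance is n minus the number of cycles in the adjacency graph. The slide does not say whether its genomes are circular or linear, so this is an illustration of the technique rather than a reproduction of the slide's numbers.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbor in a circular signed genome."""
    def left(g):   # extremity met first when reading the gene
        return (abs(g), "t" if g > 0 else "h")

    def right(g):  # extremity met last when reading the gene
        return (abs(g), "h" if g > 0 else "t")

    adj = {}
    for i, g in enumerate(genome):
        nxt = genome[(i + 1) % len(genome)]
        a, b = right(g), left(nxt)
        adj[a] = b
        adj[b] = a
    return adj


def dcj_distance(genome_a, genome_b):
    """DCJ distance n - c for two circular genomes with the same gene set."""
    if sorted(map(abs, genome_a)) != sorted(map(abs, genome_b)):
        raise ValueError("equal gene content required")
    adj_a, adj_b = adjacencies(genome_a), adjacencies(genome_b)
    seen, cycles = set(), 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Walk the cycle, alternating an adjacency of A with an adjacency of B.
        while x not in seen:
            seen.add(x)
            y = adj_a[x]
            seen.add(y)
            x = adj_b[y]
    return len(genome_a) - cycles


print(dcj_distance([1, 2, 3, 4, 6, 5], [-2, -1, 3, 4, 6, 5]))  # 1
print(dcj_distance([1, 2, 3, 4, 6, 5], [1, 2, 3, 4, 6, 5]))    # 0
```

The first pair differs by a single inversion of the block 1, 2, which one DCJ operation can perform.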

Step 3: Merge Disks

Decomposition of the disks

Construct a tree for each disk

Merge the trees using a specific consensus method: strict, majority, etc.

Disambiguation
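The strict and majority rules can be sketched on trees encoded as sets of bipartitions (splits). This is a simplified illustration that assumes all trees share the same leaf set, which sidesteps the supertree aspect of merging disks. Each split is written as the side that does not contain leaf A, a convention chosen here.

```python
from collections import Counter


def strict_consensus(trees):
    """Splits that appear in every tree."""
    return set.intersection(*(set(t) for t in trees))


def majority_consensus(trees):
    """Splits that appear in more than half of the trees."""
    counts = Counter(split for tree in trees for split in set(tree))
    return {split for split, c in counts.items() if c > len(trees) / 2}


# Hypothetical trees on leaves A-E, given by their non-trivial splits.
t1 = {frozenset("CDE"), frozenset("DE")}  # ((A,B),C,(D,E))
t2 = {frozenset("CDE"), frozenset("CE")}  # ((A,B),D,(C,E))
t3 = {frozenset("BDE"), frozenset("DE")}  # ((A,C),B,(D,E))

print(strict_consensus([t1, t2, t3]))
print(majority_consensus([t1, t2, t3]))
```

Here no split is shared by all three trees, so the strict consensus is unresolved, while the majority rule keeps the two splits supported by two of the three trees.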

Step 4: Initialization

[Figure: tree with nodes 1-6, an inserted node X, and edges labeled b-e]

Init by insertion, which is local.

Init by prospection, which is global.

Step 5: Iterative Refinement

[Figure: subtree with nodes 1-4 and edges a, b]

Review

• Step 1: Spectral partition

• Step 2: Subtree construction

• Step 3: Supertree merge

• Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method.

• Step 5: Iterative refinement until the complete tree converges.

Result—Simulated Data

seed#Theta+#gamma+#phi operations

We know the total number of evolution events in the model tree

We grow our own tree

Result – Accuracy

% of duplication: 0.1

% of loss: 0.1

Theta is the % of inversion

There are 8 species, so 2*8-3 = 13 edges. So the average accuracy is ~90%.
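The edge arithmetic is easy to check: an unrooted binary tree on n leaves has 2n-3 edges. The sketch below also shows one plausible reading of accuracy, the fraction of model-tree edges that are recovered. That reading and the count of 12 correct edges are assumptions for illustration, not figures from the talk.

```python
def edge_count(num_species):
    """Edges in an unrooted binary tree with num_species leaves."""
    return 2 * num_species - 3


def edge_accuracy(correct_edges, num_species):
    """Fraction of the tree's edges that were inferred correctly."""
    return correct_edges / edge_count(num_species)


print(edge_count(8))                   # 13
print(round(edge_accuracy(12, 8), 3))  # 0.923
```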

Result – Real Data

SCRaMbLE Matrix

• We can represent a SCRaMbLEd strain by its vector.

• The sign gives the orientation.

• The color encodes the position in the synthetic chromosome.

Result – Real Data

#inversion:#insertion/deletion:#duplication

Parallel Method [Bader 05]

Parallel search

Load Balancing

Experimental Results (Parallel)

Why Many-core BnB?

• There are already many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR).

• Load balance in distributed BnB relies heavily on the ramp-up phase; run-time load balancing is not efficient.

• But today's petaflop machines are mostly hybrid systems (distributed + many-core or accelerators).

Experimental Results (Intel Phi knapsack)
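For readers unfamiliar with the benchmark problem, here is a minimal sequential branch-and-bound sketch for 0/1 knapsack, using the fractional relaxation as the upper bound. It is a generic textbook formulation, not the many-core implementation measured in the talk, and the item data is illustrative.

```python
def knapsack_bnb(values, weights, capacity):
    """Best total value of a 0/1 knapsack, found by depth-first branch and bound."""
    # Sort by value density so the fractional bound is tight and easy to compute.
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    n = len(items)
    best = 0

    def bound(idx, value, room):
        """Upper bound: fill the remaining room greedily, allowing one fractional item."""
        for v, w in items[idx:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    def search(idx, value, room):
        nonlocal best
        best = max(best, value)
        # Prune when no completion of this node can beat the incumbent.
        if idx == n or bound(idx, value, room) <= best:
            return
        v, w = items[idx]
        if w <= room:
            search(idx + 1, value + v, room - w)  # branch: take the item
        search(idx + 1, value, room)              # branch: skip the item

    search(0, 0, capacity)
    return best


print(knapsack_bnb([60, 100, 120], [10, 20, 30], 50))  # 220
```

The two recursive calls are independent subproblems that share only the incumbent `best`, which is the property parallel BnB frameworks exploit when distributing the search.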