DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents
Zhaoming Yin
Bader-Polo Joint Group Meeting, Nov 11, 2013
Contribution
• Research Aspect
-A framework to solve the maximum parsimony tree problem for input genomes with unequal contents.
-Proved that adequate subgraph theory is applicable to unequal-content data, which reduces the search space.
-Provide a benchmark for the HPC community.
• Engineering Aspect
-Implement software with many state-of-the-art features such as the supertree method, the GAS initialization method, spectral partitioning, etc.
-The software can produce a tree with not only the topology but also the type and number of evolutionary events (visualization!).
Why Is the Phylogenetic Tree Problem Hard?
• For N genomes, there are (2N-5)!! possible unrooted tree topologies.
• For each topology, we need to compute at least one median; the number of possible median orders is (g-2)!!, where g is the number of genes.
• To validate each possible median, if the gene content has duplications, it's NP-hard.
• So the complexity of computing the MP tree for genomes with unequal contents is: NP-hard over NP-hard over NP-hard!
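To get a feel for how quickly the topology count grows, here is a minimal Python sketch. It assumes unrooted binary trees on N labeled leaves, for which the standard count is (2N-5)!!; the function names are illustrative and not part of DCJUC.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ...; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_topologies(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, num_unrooted_topologies(n))
```

Even for the 8 species used later in the talk there are already 10395 topologies, which is why exhaustive enumeration is not an option.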
Phylogenetic Tree
This picture presents the phylogeny of the “12 Drosophila.”
From http://insects.eugenes.org/species
Maximum Parsimony Concept
[Figure: candidate tree topologies over genomes 1-6]
Of all possible topologies, the most parsimonious tree is the one with the minimum total tree length.
Genome Rearrangement
http://ai.stanford.edu/~serafim/CS374_2006/presentations/lecture17.ppt
In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar, yet these surprisingly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution.
Original: 1 2 3 4 5 6 7 8 9 10
Inversion: 1 2 -6 -5 -4 -3 7 8 9 10
Transposition: 1 2 7 8 3 4 5 6 9 10
Inverted Transposition: 1 2 7 8 -6 -5 -4 -3 9 10
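The three operations can be sketched in Python on a signed gene order stored as a list. This is only an illustration: the function names and the half-open index convention (i, j, k) are assumptions, not taken from the DCJUC code.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just before original index k (k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + segment + genome[k:]


if __name__ == "__main__":
    original = list(range(1, 11))
    print(inversion(original, 2, 6))
    print(transposition(original, 2, 6, 8))
    print(inverted_transposition(original, 2, 6, 8))
```

Applied to the gene order 1..10 with the segment 3 4 5 6, these calls reproduce the three example rows on the slide.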
Genome Median Computation
[Figure: genome median computation on a tree over genomes 1-6]
Genome Median Computation
[Figure: tree over genomes 1-6]
Leaf genomes: 1,2,3; 1,-3,-2; -2,-1,3
Candidate medians: 1,2,3 = 2 moves; 2,-1,3 = 5 moves; …
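As a rough illustration of how a candidate median is scored, the sketch below sums the distances from the candidate to three leaf genomes. It assumes the leaves are 1,2,3 / 1,-3,-2 / -2,-1,3 and that a "move" is a signed inversion on a linear genome, found by brute-force breadth-first search. The slides may count moves under a different model (for example DCJ), so totals for other candidates can differ from the ones shown there. All names are hypothetical.

```python
from collections import deque


def neighbors(genome):
    """Yield every genome reachable from `genome` by one signed inversion."""
    n = len(genome)
    for i in range(n):
        for j in range(i + 1, n + 1):
            flipped = tuple(-g for g in reversed(genome[i:j]))
            yield genome[:i] + flipped + genome[j:]


def inversion_distance(src, dst):
    """Minimum number of signed inversions from src to dst (BFS, tiny inputs only)."""
    src, dst = tuple(src), tuple(dst)
    if sorted(map(abs, src)) != sorted(map(abs, dst)):
        raise ValueError("genomes must contain the same genes")
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        genome, dist = queue.popleft()
        if genome == dst:
            return dist
        for nxt in neighbors(genome):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("destination not reachable")


def median_score(candidate, leaves):
    """Sum of distances from the candidate median to each leaf genome."""
    return sum(inversion_distance(candidate, leaf) for leaf in leaves)


if __name__ == "__main__":
    leaves = [(1, 2, 3), (1, -3, -2), (-2, -1, 3)]
    print(median_score((1, 2, 3), leaves))
```

The search space has 2^g * g! states, so this brute force is only usable for a handful of genes; it is meant to make the scoring idea concrete, not to replace the median solvers discussed next.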
Step 1: Spectral Partition
Step 2: Compute MP Tree for Each Sub-Disk
Step 2-1: How to Compute Median (BNB)
[Figure: successive branch-and-bound steps on a graph with vertices 1-8]
Step 2-2: How to Compute Median (LK)
[Figure: sequence of LK search steps, repeated until "stop"]
Step 2-2: How to Evaluate Median
med: 1, 2, 3, 3, 4, 6, 5
1: 1, 2, 3, 4, 3, 6, 5
2: 1, 2, 3, 4, 6, 3, 5
3: 1, 2, 5, 4, 6, 3, 3
Score: Dis(m,1)+Dis(m,2)+Dis(m,3)
Step 2-2: How to Evaluate Median
1, 2, 3, 3, 4, 6, 5
1, 2, 3, 4, 3, 5
Find a mapping first (NP-hard): dis = 1
1, 2, 3, 3, 4, 6, 5
-2, -1, 3, 3, 4, 5
Complete the loss (polynomial): dis = 2
1, 2, 3, 4, 6, 5
-2, -1, 3, 4, 6, 5
Compute DCJ (polynomial): dis = 3
1, 2, 3, 4, 6, 5
1, 2, 3, 4, 6, 5
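The last, polynomial stage can be sketched as follows. This is a simplified illustration, not the DCJUC implementation: it assumes both genomes are single circular chromosomes with identical gene content and no duplicates, in which case the DCJ distance is n - c, where n is the number of genes and c is the number of cycles formed by alternating between the adjacencies of the two genomes. Function names are hypothetical.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour in a circular signed genome.

    An extremity is (gene, 't') for the tail or (gene, 'h') for the head.
    """
    partner = {}
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        right_of_a = (abs(a), 'h' if a > 0 else 't')
        left_of_b = (abs(b), 't' if b > 0 else 'h')
        partner[right_of_a] = left_of_b
        partner[left_of_b] = right_of_a
    return partner


def dcj_distance(g1, g2):
    """DCJ distance between two circular genomes with equal, duplicate-free content."""
    genes = sorted(map(abs, g1))
    if genes != sorted(map(abs, g2)) or len(set(genes)) != len(genes):
        raise ValueError("genomes must share the same duplicate-free gene set")
    p1, p2 = adjacencies(g1), adjacencies(g2)
    seen = set()
    cycles = 0
    for start in p1:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate: adjacency of genome 1, then adjacency of genome 2.
        while x not in seen:
            seen.add(x)
            y = p1[x]
            seen.add(y)
            x = p2[y]
    return len(g1) - cycles


if __name__ == "__main__":
    print(dcj_distance([1, 2, 3, 4, 6, 5], [-2, -1, 3, 4, 6, 5]))
```

For the pair 1, 2, 3, 4, 6, 5 and -2, -1, 3, 4, 6, 5 from the slide, this gives a distance of 1, matching the step from dis = 2 to dis = 3. Handling duplicated or lost genes requires the mapping and completion stages first, which this sketch does not attempt.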
Step 3: Merge Disks
Decomposition of the disks
Construct a tree for each disk
Merge the trees using a specific consensus method: strict, majority, etc.
Disambiguation
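To make the strict and majority options concrete, here is a small Python sketch of split-based consensus. It is a simplification under stated assumptions: every input tree is given as a set of splits over the same leaf set, and each split is stored as the frozenset of leaves on the side that excludes a fixed reference leaf. Merging overlapping disks into a supertree needs more machinery than this; the representation and names are hypothetical.

```python
from collections import Counter


def consensus_splits(trees, strict=False):
    """Return the splits kept by strict or majority-rule consensus.

    trees:  list of sets of frozensets (one set of splits per tree).
    strict: keep only splits found in every tree; otherwise keep splits
            found in more than half of the trees.
    """
    if not trees:
        raise ValueError("need at least one tree")
    counts = Counter(split for tree in trees for split in tree)
    needed = len(trees) if strict else len(trees) // 2 + 1
    return {split for split, count in counts.items() if count >= needed}


if __name__ == "__main__":
    # Three trees on leaves A-E; splits list the side that excludes leaf E.
    t1 = {frozenset("AB"), frozenset("ABC")}
    t2 = {frozenset("AB"), frozenset("ABD")}
    t3 = {frozenset("AB"), frozenset("ABC")}
    print(consensus_splits([t1, t2, t3]))
    print(consensus_splits([t1, t2, t3], strict=True))
```

In this example the majority rule keeps both {A, B} and {A, B, C}, while the strict rule keeps only {A, B}, so the strict tree is less resolved.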
Step 4: Initialization
[Figure: initialization example on a tree over genomes 1-6]
Init by insertion, which is local.
Init by prospection, which is global.
Step 5: Iterative Refinement
[Figure: iterative refinement example]
Review
• Step 1: Spectral partition
• Step 2: Subtree construction
• Step 3: Supertree merge
• Step 4: Initialization of the complete tree using the General Adequate Subgraph (GAS) method.
• Step 5: Iterative refinement until the complete tree converges.
Result – Simulated Data
seed; #Theta + #gamma + #phi operations
We know the total number of evolutionary events in the model tree.
We grow our own tree
Result – Accuracy
% of duplication: 0.1; % of loss: 0.1; Theta is the % of inversion.
There are 8 species, so 2*8-3 = 13 edges. So the average accuracy is ~90%.
Result – Real Data
SCRaMbLE Matrix
• We can represent a SCRaMbLEd strain by its vector.
• The sign gives the orientation.
• The color encodes the position in the synthetic chromosome.
Result – Real Data
#inversion:#insertion/deletion:#duplication
Parallel Method [Bader 05]
Parallel search
Load Balancing
Experimental Results (Parallel)
Why Many-core BnB?
• There are already many distributed-memory MIP BnB frameworks (PICO, PEBBL, ALPS, COIN-OR).
• Load balance in distributed BnB relies heavily on ramp-up; run-time load balancing is not efficient.
• But nowadays petaflops machines are mostly hybrid systems (distributed + many-core or accelerators).
Experimental Results (Intel Phi knapsack)