1
Computer Science Department, Technion – Israel Institute of Technology
Genomic Sorting with Length-Weighted Reversals
Ron Y. Pinter, Technion
Steve Skiena, SUNY Stony Brook
2
Genome Rearrangement
• events
  – duplication
  – translocation
  – reversal (inversion)
• occur primarily during reproduction
• allow large-scale genomic comparisons
3
Sorting by Reversals
• genome represented as a permutation on 1, 2, …, n
  – n = # homologous genes among species
• assumptions
  – can identify genes
  – genes are distinct
• operation: reversal of a subsequence (of genes)
  – models inversion (occurs during crossover)
• one of the permutations can be 1, 2, …, n
  – appropriately relabel others
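The relabeling step and the reversal operation can be sketched as follows. This is a minimal illustration, not code from the talk: the function names `relabel` and `reverse_segment` and the single-letter gene names are hypothetical.

```python
def relabel(reference, other):
    """Rename genes so that `reference` becomes 1, 2, ..., n,
    and express `other` in the same labels."""
    new_name = {gene: i for i, gene in enumerate(reference, start=1)}
    return [new_name[gene] for gene in other]


def reverse_segment(perm, i, j):
    """Return a copy of perm with positions i..j (1-based, inclusive) reversed."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


# Hypothetical gene orders for two species sharing four distinct genes.
species_a = ["c", "a", "d", "b"]
species_b = ["a", "c", "b", "d"]

identity = relabel(species_a, species_a)   # species_a becomes 1, 2, 3, 4
perm = relabel(species_a, species_b)       # species_b in the same labels
step = reverse_segment(perm, 1, 2)         # reverse the first two genes
```

Because genes are assumed distinct and identifiable, the dictionary lookup is well defined, and sorting `perm` to the identity is the same problem as transforming one gene order into the other.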
4
Example
• 6 reversals
• in our model (for f(l) = l): cost = 18
4 3 2 8 7 1 5 6 11 10 9
4 3 2 1 7 8 5 6 9 10 11
1 2 3 4 8 7 6 5 9 10 11
1 2 3 4 5 6 7 8 9 10 11
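The counts on this slide can be checked mechanically. The sketch below assumes the six reversals are the ones implied by comparing consecutive rows (the slide does not list the positions explicitly); `apply_reversals` is a hypothetical helper name.

```python
def apply_reversals(perm, reversals, f=lambda l: l):
    """Apply reversals given as 1-based inclusive (i, j) pairs.
    Returns the resulting permutation and the total cost sum of f(length)."""
    perm = list(perm)
    cost = 0
    for i, j in reversals:
        perm[i - 1:j] = perm[i - 1:j][::-1]
        cost += f(j - i + 1)
    return perm, cost


start = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]
# Positions inferred from the rows of the example (an assumption):
reversals = [(4, 6), (9, 11),           # row 1 -> row 2
             (1, 4), (5, 6), (7, 8),    # row 2 -> row 3
             (5, 8)]                    # row 3 -> row 4

sorted_perm, linear_cost = apply_reversals(start, reversals)
_, unit_cost = apply_reversals(start, reversals, f=lambda l: 1)
```

With f(l) = l the lengths 3 + 3 + 4 + 2 + 2 + 4 sum to 18, and with unit cost the same sequence costs 6, one per reversal.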
5
Our Model
• unsigned
• cost of reversal of subsequence of length l is f(l)
• total sorting cost (or distance) is $\sum_{j} f(\mathrm{length}(s_j))$, where the $s_j$ are the reversed subsequences
6
Cost Functions
• additive: f(x+y) = f(x) + f(y)
• subadditive: f(x+y) < f(x) + f(y)
• superadditive: f(x+y) > f(x) + f(y)
• other
  – e.g. bitonic
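To make the three classes concrete, a candidate cost function can be tested numerically against the definitions above. This is only an empirical check over a finite range, not a proof; `classify`, `max_len`, and the sample bitonic function are hypothetical choices for illustration.

```python
def classify(f, max_len=50):
    """Compare f(x+y) with f(x) + f(y) for all x, y >= 1 with x + y <= max_len."""
    signs = set()
    for x in range(1, max_len):
        for y in range(1, max_len - x + 1):
            lhs, rhs = f(x + y), f(x) + f(y)
            signs.add((lhs > rhs) - (lhs < rhs))   # -1, 0, or +1
    if signs == {0}:
        return "additive"
    if signs == {-1}:
        return "subadditive"
    if signs == {1}:
        return "superadditive"
    return "other"


linear = classify(lambda l: l)                      # f(l) = l
unit = classify(lambda l: 1)                        # unit cost
quadratic = classify(lambda l: l * l)               # f(l) = l^2
bitonic = classify(lambda l: (l - 10) ** 2 + 1)     # decreasing, then increasing
```

The sample bitonic function falls into "other" because it satisfies the subadditive inequality for short lengths and the superadditive one for long lengths.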
7
Problems
• algorithm to sort any permutation
  – worst-case min cost
• approximate min cost for a given permutation
8
Extremal Costs
• highly subadditive: e.g. unit cost, f(l) = 1
  – NP-complete [Caprara, ’97]
  – series of approximation ratios: 2, 1.75, 1.375
• highly superadditive: $f(l) > l^2$
  – essentially bubblesort
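The "essentially bubblesort" remark can be illustrated by sorting with reversals of length 2 only, which are the cheapest moves when long reversals are heavily penalized. The sketch below is an illustration under that reading of the slide, not the authors' construction; the function name and the default f(l) = l^2 are assumptions.

```python
def bubble_sort_by_reversals(perm, f=lambda l: l * l):
    """Sort using only reversals of two adjacent elements.
    Returns (sorted permutation, number of reversals, total cost)."""
    perm = list(perm)
    count = 0
    for end in range(len(perm) - 1, 0, -1):
        for i in range(end):
            if perm[i] > perm[i + 1]:
                perm[i:i + 2] = perm[i:i + 2][::-1]   # reversal of length 2
                count += 1
    return perm, count, count * f(2)


result, swaps, cost = bubble_sort_by_reversals([4, 3, 2, 1])
```

Each length-2 reversal removes exactly one inversion, so the number of reversals equals the number of inversions in the input and the total cost is that count times f(2).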
9
Our Results
• additive cost function– specifically f(l) = l
• QuickSort-like algorithm for worst-case– complexity: O(n lg2n)
• min cost approximation ratio of O(lg2n)
10
MedianEject(a,b)
• find r maximal blocks of wrong-sided elements with respect to median
• for lg r rounds do: flip every other pair consisting of a wrong-sided block and its adjacent block
• move wrong-sided blocks to median boundary
• reverse left and right blocks
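The first step, locating the maximal blocks of wrong-sided elements, can be sketched as below. The slide does not fix the details, so several points are assumptions: positions a..b (1-based) hold exactly the values a..b, the median is taken as floor((a+b)/2), and a run that crosses the median boundary is split into two blocks. The name `wrong_sided_blocks` is hypothetical.

```python
from itertools import groupby


def wrong_sided_blocks(perm, a, b):
    """Return maximal blocks (start, end), 1-based inclusive, of elements in
    positions a..b that sit on the wrong side of the median."""
    mid = (a + b) // 2   # assumption: positions a..mid form the left half

    def label(pos):
        on_left = pos <= mid
        wrong = on_left != (perm[pos - 1] <= mid)
        return (on_left, wrong)

    blocks = []
    # groupby yields maximal runs of positions sharing the same label
    for (on_left, wrong), run in groupby(range(a, b + 1), key=label):
        run = list(run)
        if wrong:
            blocks.append((run[0], run[-1]))
    return blocks


blocks = wrong_sided_blocks([4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9], 1, 11)
```

On the permutation from the Example slide this finds the block 8 7 on the left of the median and the block 5 6 on the right; the remaining steps of MedianEject then exchange such blocks across the boundary.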
11
Sample Run
complexity: O((b-a) lg r)
12
ReversalSort(a,b)
MedianEject(a, b);
ReversalSort(a, (a+b)/2);
ReversalSort((a+b)/2, b);
Complexity
$T(n) = 2\,T(n/2) + O(f(n) \lg n) \;\Rightarrow\; T(n) = O(f(n) \lg^2 n) = O(n \lg^2 n)$ for $f(n) \sim n$
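The recursive structure can be tried out with the runnable sketch below. It is not the authors' MedianEject: the partition step is replaced by a simple divide-and-conquer stand-in that partitions both halves and then performs one reversal to bring the small elements together. All names (`reverse`, `partition`, `reversal_sort`, `lengths`) are hypothetical, and indices are 0-based and half-open.

```python
def reverse(perm, i, j, lengths):
    """Reverse perm[i:j] in place and log the reversal length."""
    perm[i:j] = perm[i:j][::-1]
    lengths.append(j - i)


def partition(perm, lo, hi, pivot, lengths):
    """Move values <= pivot to the front of perm[lo:hi] using reversals only.
    Returns the index of the first value > pivot. Stand-in for MedianEject."""
    if hi - lo == 1:
        return lo + 1 if perm[lo] <= pivot else lo
    mid = (lo + hi) // 2
    k1 = partition(perm, lo, mid, pivot, lengths)   # small | large
    k2 = partition(perm, mid, hi, pivot, lengths)   # small | large
    if k1 < mid < k2:
        # layout is S1 L1 S2 L2; reversing "L1 S2" gives S1 S2' L1' L2
        reverse(perm, k1, k2, lengths)
    return k1 + (k2 - mid)


def reversal_sort(perm, lo, hi, lengths):
    """Sort perm[lo:hi], assumed to hold exactly the values lo+1..hi."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    partition(perm, lo, hi, mid, lengths)   # values <= mid go to perm[lo:mid]
    reversal_sort(perm, lo, mid, lengths)
    reversal_sort(perm, mid, hi, lengths)


perm = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]
lengths = []
reversal_sort(perm, 0, len(perm), lengths)
total_cost = sum(lengths)   # cost under f(l) = l
```

In this stand-in, partitioning n elements uses reversals of total length O(n lg n), so the recurrence has the same shape as the one on the slide for f(n) ~ n.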
13
Algorithmic Improvements
I simplify “short” phases
II merge 2 last steps of MedianEject when possible (2p+q vs. 3p+q)
III apply II recursively
[Figure: blocks labeled p, q, p]
14
Approximation Ratio
• M(p) is the maximal total distance between pairs of out-of-order elements
Lemma 4: min cost is $\Omega(M(p))$
but
Lemma 6: # of out-of-order elements < 3 M(p)
+
Lemma 7: MedianEject touches only elements within linear range from out-of-order elements
yields:
• each round of MedianEject takes $O(M(p) \lg^2 n)$
• ReversalSort costs $O(M(p) \lg^2 n)$
• ReversalSort is at most $O(\lg^2 n)$ times optimal
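The final bullet follows by dividing the upper bound on ReversalSort by the lower bound on the minimum cost. A short worked version, reading Lemma 4 as a lower bound and using hypothetical constants $c_1, c_2 > 0$ that do not appear on the slide:

```latex
\[
\mathrm{OPT}(p) \;\ge\; c_1\, M(p),
\qquad
\mathrm{cost}_{\mathrm{ReversalSort}}(p) \;\le\; c_2\, M(p)\, \lg^2 n
\]
\[
\Longrightarrow\quad
\frac{\mathrm{cost}_{\mathrm{ReversalSort}}(p)}{\mathrm{OPT}(p)}
\;\le\; \frac{c_2\, M(p)\, \lg^2 n}{c_1\, M(p)}
\;=\; \frac{c_2}{c_1}\, \lg^2 n
\;=\; O(\lg^2 n).
\]
```

The quantity M(p) cancels, which is why the ratio depends only on n.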
15
Bioinformatic “Validation”
• use our cost (= distance) to build phylogenetic trees
• 4 plants (chloroplastic genes)
  – Cyanophora, Cyanidium, Guillardia, Porphyra
• consistent with [Martin et al., PNAS Sept. ’02]
• work in progress [M. Shoham]
16
Open Problems: Algorithmic
• weighted genes
• tighter approximation ratio
  – close to O(lg n)
  – can get to constant?
• other cost functions (incl. bitonic)
• the signed case
17
Open Problems: Modeling
• chromosomal ordering
• what is the right cost function?
  – consider $\mathrm{cost}(l) = l^d$
• combine with constraint-based models
  – restricted regions
  – “undesired” reversal sequences
• deal with duplication and translocation events