Sorting by Cuts, Joins and Whole Chromosome Duplications

Ron Zeira and Ron Shamir

Combinatorial Pattern Matching 2015

30.6.15

Genome rearrangements

Motivation I: evolution

Human genome project

Motivation II: cancer

NCI, 2001

[Figure: normal karyotype vs. MCF-7 breast cancer cell line]

Definitions: gene

• A gene is an oriented segment.

• A gene has two extremities: head and tail.

• Positive: tail→head; Negative: head→tail.

Definitions: chromosome

• A chromosome is a sequence of consecutive genes.

• Two consecutive extremities form an adjacency.

• A telomere is an extremity that is not part of an adjacency.

• A circular chromosome has no telomeres. A linear chromosome has 2 telomeres.

Definitions: genome

• A genome is a set of chromosomes.

• Equivalently, a genome is a set of adjacencies.

• An ordinary genome has one copy of each gene. Otherwise, the genome is duplicated.

$\Pi = \{\{a_h, b_h\}, \{b_t, c_h\}, \{d_h, f_h\}, \{f_t, e_t\}\}$
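The definitions above (genes with head and tail extremities, adjacencies, telomeres, a genome as a set of adjacencies) can be sketched in Python. This is only an illustration: the names `extremities`, `adjacencies` and `telomeres`, and the choice of `(gene, "h"/"t")` tuples, are assumptions made here and do not come from the talk. The example genome assumes the slide's Π consists of the chromosomes a, -b, -c and d, -f, e, which yields the adjacencies {a_h, b_h}, {b_t, c_h}, {d_h, f_h}, {f_t, e_t}.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Extremity = Tuple[str, str]        # (gene name, "h" for head or "t" for tail)
Adjacency = FrozenSet[Extremity]   # unordered pair of consecutive extremities


def extremities(gene: str, positive: bool) -> Tuple[Extremity, Extremity]:
    """Return the (left, right) extremities of a gene as read along the chromosome."""
    tail, head = (gene, "t"), (gene, "h")
    # Positive orientation reads tail -> head, negative reads head -> tail.
    return (tail, head) if positive else (head, tail)


def adjacencies(chromosome: List[Tuple[str, bool]], circular: bool = False) -> Set[Adjacency]:
    """Adjacencies of one chromosome given as a list of (gene, is_positive) pairs."""
    ends = [extremities(gene, positive) for gene, positive in chromosome]
    adjs = {frozenset((ends[i][1], ends[i + 1][0])) for i in range(len(ends) - 1)}
    if circular and ends:
        # A circular chromosome also joins its last gene back to its first.
        adjs.add(frozenset((ends[-1][1], ends[0][0])))
    return adjs


def telomeres(genes: Iterable[str], adjs: Set[Adjacency]) -> Set[Extremity]:
    """Extremities of the given genes that take part in no adjacency."""
    used = {ext for adj in adjs for ext in adj}
    return {(gene, side) for gene in genes for side in "ht"} - used


# Assumed reading of the example genome: chromosomes (a, -b, -c) and (d, -f, e).
pi = adjacencies([("a", True), ("b", False), ("c", False)]) | adjacencies(
    [("d", True), ("f", False), ("e", True)]
)
```

Storing each adjacency as a `frozenset` keeps it unordered, so the genome can be compared as a plain set, matching the "set of adjacencies" view.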

GR distance problem

• Distance $d_{op}(\Pi,\Sigma)$: the minimal number of operations needed to transform genome Π into genome Σ.

• Operations:
– Reversals
– Translocations
– Transpositions
– Others…

The SCJ model

• SCJ: Single Cut or Join (Feijão, Meidanis 11):
– Cut an adjacency into 2 telomeres.
– Join 2 telomeres into an adjacency.

• Simple and practical model.

• Reflects evolutionary distance (Biller et al. 13).

[Figure: a cut and a join]
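The two SCJ operations can be sketched on a genome stored as a set of adjacencies. For two ordinary genomes over the same genes, the SCJ distance of Feijão and Meidanis is the size of the symmetric difference of the adjacency sets, and a scenario of that length cuts every adjacency missing from the target and then joins every adjacency missing from the source. The function names and the `"a_h"` string encoding of extremities below are illustrative assumptions, not notation from the talk.

```python
from typing import FrozenSet, List, Set, Tuple

Adjacency = FrozenSet[str]   # e.g. frozenset({"a_h", "b_t"})
Genome = Set[Adjacency]      # ordinary genome as a set of adjacencies


def cut(genome: Genome, adj: Adjacency) -> Genome:
    """Cut an existing adjacency, turning its two extremities into telomeres."""
    if adj not in genome:
        raise ValueError("adjacency is not present in the genome")
    return genome - {adj}


def join(genome: Genome, x: str, y: str) -> Genome:
    """Join two telomeres into a new adjacency."""
    used = {ext for adj in genome for ext in adj}
    if x in used or y in used:
        raise ValueError("both extremities must be telomeres")
    return genome | {frozenset((x, y))}


def scj_distance(src: Genome, dst: Genome) -> int:
    """SCJ distance between ordinary genomes over the same gene set."""
    return len(src - dst) + len(dst - src)


def scj_sort(src: Genome, dst: Genome) -> Tuple[Genome, List[Tuple[str, Adjacency]]]:
    """Return the final genome and one shortest scenario: all cuts, then all joins."""
    genome, ops = set(src), []
    for adj in sorted(src - dst, key=sorted):
        genome = cut(genome, adj)
        ops.append(("cut", adj))
    for adj in sorted(dst - src, key=sorted):
        x, y = sorted(adj)
        genome = join(genome, x, y)
        ops.append(("join", adj))
    return genome, ops
```

Doing all cuts before any join guarantees that every extremity needed by a join is already a telomere, which is why the simple two-pass order is valid.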

Models with multiple gene copies

• Most models with multiple gene copies are NP-hard.

• Not many models allow duplications or deletions.

• Many normal and cancer genomes have multiple gene copies.

The SCJD model

• A duplication takes a linear chromosome and produces an additional copy of it.

• An SCJD operation is either a cut, a join, or a duplication.

abc → abc, abc
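A minimal sketch of the duplication operation, assuming a genome is kept as a list of chromosomes so that identical copies can coexist. The `Chromosome` type and the `duplicate` function are hypothetical names introduced here for illustration; the only rule taken from the slide is that a duplication applies to a linear chromosome and adds one more copy of it.

```python
from typing import List, NamedTuple, Tuple


class Chromosome(NamedTuple):
    genes: Tuple[str, ...]   # signed genes, e.g. ("a", "-b", "c")
    circular: bool = False


def duplicate(genome: List[Chromosome], index: int) -> List[Chromosome]:
    """Return a new genome with an additional copy of the chromosome at `index`."""
    chrom = genome[index]
    if chrom.circular:
        # The model on the slide duplicates linear chromosomes only.
        raise ValueError("only a linear chromosome can be duplicated")
    return genome + [chrom]


# The slide's example: abc becomes abc, abc.
before = [Chromosome(("a", "b", "c"))]
after = duplicate(before, 0)
```

A list is used instead of a set because a duplicated genome contains the same chromosome more than once.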

The SCJD distance

• The minimal number of SCJD operations that transform an ordinary genome into a duplicated genome.

Results outline

• Characterize optimal solution structure.

• Give a distance optimization function.

• Solve the optimization problem.

• Study the number of duplications in an optimal scenario.

SCJD optimal scenario structure

• Theorem: There exists an optimal SCJD sorting scenario, consisting, in this order, of
– SCJ operations on single-copy genes.
– Duplications.
– SCJ operations acting on duplicated genes.

$\Gamma \xrightarrow{\text{SCJs}} \Gamma' \xrightarrow{\text{duplications}} 2\Gamma' \xrightarrow{\text{SCJs}} \Delta$

Proof outline

• An SCJ operation acts on extremities of either 2 duplicated genes or 2 unduplicated genes.

• Moving SCJ operations on unduplicated genes to the front keeps the sorting scenario valid.

• Move duplications earlier as long as the scenario remains valid.

Corollary: SCJD distance

• Write the distance as a function of Γ’.

• Find Γ’ that minimizes the distance.

η: higher score for adjacencies in Γ and Δ
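The corollary's formula did not survive in this transcript. Read together with the three-phase structure of the theorem, one plausible form is the decomposition below, offered as a sketch rather than the authors' exact statement. Here Γ is the ordinary source genome, Δ the duplicated target, and N(Γ') is a symbol introduced here (an assumption) for the number of chromosomes of Γ', each of which is duplicated once.

```latex
d_{SCJD}(\Gamma, \Delta)
  \;=\; \min_{\Gamma'} \Bigl[\, d_{SCJ}(\Gamma, \Gamma')
  \;+\; N(\Gamma')
  \;+\; d_{SCJ}(2\Gamma', \Delta) \,\Bigr]
```

The three terms count, in order, the SCJ operations on single-copy genes, the duplications, and the SCJ operations on duplicated genes.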

Distance optimization solution

• The following genome maximizes H:

$\Gamma' = \{\alpha \mid \eta(\alpha) > 0\}$

• If Γ’ is not linear, remove an adjacency with η=1 from each circular chromosome in Γ’ to obtain Γ’’.

• Theorem: The SCJD distance is computable in linear time.

Controlling the number of duplications

• Duplications are more “radical” events than cut or join.

• Lemma: Our algorithm gives an optimal sorting scenario with a maximum number of duplications.

Optimal solutions can have different numbers of duplications

Minimizing duplications is hard

• Theorem: Finding an optimal SCJD sorting scenario with a minimum number of duplications is NP-hard.

• Reduction from the Hamiltonian path problem on directed graphs with in- and out-degree 2.

Proof outline

• For a 2-digraph G and two vertices x, y, there is an Eulerian path P: x→y.

• Create a duplicated genome Σ from P and an empty genome Π.

• Add auxiliary genes and k copies of Σ, Π.

• There is a Hamiltonian path x→y in G iff there is an optimal sorting scenario with k duplications.

Summary

• Genome rearrangements are important.

• Problems with multiple gene copies are hard.

• SCJD allows SCJ and duplications:
– Linear algorithm for the SCJD distance.
– Study the number of duplications in an optimal solution.

• We hope to generalize the model and apply it to cancer data.

Thank You!