Genome Rearrangements

CIS 667 April 13, 2004

Genome Rearrangements

• We have seen how differences in genes at the sequence level can be used to infer evolutionary relations among species
  - Differences in the sequences of one or more genes result from point mutations (insertions, deletions, substitutions)
  - These are not the only types of change that can occur in a genome

Genome Rearrangements

• Repair of broken chromosomes is an important process
  - Mistakes can occur, however

• Mistakes can also occur during crossover

• These mistakes cause changes in gene order
  - A large piece of a chromosome can be moved or copied to another location
  - It can also move from one chromosome to another
  - We call these movements genome rearrangements

Crossover

[Figure not available]

Chromosome Repair

[Figure not available]

Genome Rearrangements

• These have important (usually fatal) consequences for the organism and its evolution

• Alignments do not capture genome rearrangements
  - Two species may have nearly the same gene sequences, but in a different order (why would the two species then be different?)

Genome Rearrangements

• We need some other way to compare entire genomes (i.e. compare at a higher level)

• Rather than by simple point mutations, one genome is obtained from another by a number of rearrangements of a special kind: reversals
  - Use the number of reversals needed to transform one genome into another as a measure of evolutionary distance

The Method

• Use combinatorial optimization techniques to infer a most economical sequence of rearrangement operations that accounts for the differences among the genomes
  - Compare with character-based methods for phylogenetics (parsimony)

Reversals

• Consider the genome of a species as a sequence of blocks
  - A block is some stretch of the genome (possibly containing more than one gene) that is transcribed as a unit
  - Blocks are oriented, since they can be transcribed from either strand of DNA
  - Give homologous blocks the same label

Reversals

• Relation between chloroplast genomes of alfalfa and garden pea:

Reversals

• Reversal operation for oriented blocks:
  - Inverts the order of the affected blocks and changes their orientation (arrow)
  - Affects a contiguous segment of blocks

• What sequence of reversal operations could have changed alfalfa into garden pea?
  - We would like a polynomial-time algorithm to find the shortest such sequence

Genome Comparison vs. Gene Comparison

• In the late 1980s, J. Palmer and his colleagues studied the mitochondrial genomes of cabbage and turnip
  - The gene sequences are very similar (some genes are 99% identical)
  - Gene order, however, differs dramatically
  - Genome rearrangements are now considered to be a common mode of molecular evolution

Genome Comparison vs. Gene Comparison

• Extreme conservation of genes on the X chromosome across mammalian species provides an opportunity to study the evolutionary history of the X chromosome independently of the rest of the genome

• According to Ohno's law, the gene content of the X chromosome has barely changed throughout mammalian evolution in the last 125 million years

• However, the order of genes on the X chromosome has been disrupted several times

Human and Mouse X Chromosomes

Human and Mouse X Chromosomes

-4 -6 1 7 2 -3 5 8

3 2 7 -1 6 4 5 8

1 7 2 -3 6 4 5 8

1 2 7 -3 6 4 5 8

1 2 7 3 -6 4 5 8

1 2 -5 -4 -3 6 7 8

1 2 3 4 5 6 7 8

Genome Comparison vs. Gene Comparison

• The traditional molecular evolutionary technique is gene comparison, used to construct a phylogenetic tree

• In the "cabbage and turnip" case this is hardly suitable, since the rate of point mutations in their mitochondrial genes is so low that the genes are almost identical

• Genome comparison (i.e. comparison of gene orders) is the method of choice for very slowly evolving genomes

• It is also useful when genomes evolve very rapidly (the genes are then not very similar)

Genome Comparison

• Only about 178 ± 39 genome rearrangements have happened since human and mouse diverged 80 million years ago
  - The mouse and human genomes can be viewed as a collection of about 200 fragments that are shuffled in mice as compared to humans
  - A comparative mouse-human genetic map gives the position of a human gene given the location of a related mouse gene

Man-Mouse Comparative Physical Map

Definitions

• A signed permutation over the set of labels $L = \{1, 2, \ldots, n\}$ is a permutation $\alpha$ such that $\alpha(i) = +a$ or $-a$, where $a \in L$

• Example: +3, –2, –1 is a signed permutation over $L = \{1, 2, 3\}$
  - Note that no label may appear twice in the permutation

• A reversal $[i, j]$ is an operation that transforms one signed permutation into another by reversing the order of a contiguous portion and flipping the signs of its elements

Definitions

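• $\alpha' = \alpha[i, j] = (\alpha(1), \ldots, \alpha(i-1), -\alpha(j), \ldots, -\alpha(i), \alpha(j+1), \ldots, \alpha(n))$

• We are interested in the problem of sorting by reversals: given two signed permutations $\alpha$ and $\beta$, find the minimum number of reversals $\rho_1, \ldots, \rho_t$ that transform $\alpha$ into $\beta$, that is, $\alpha\rho_1 \cdots \rho_t = \beta$

• The reversal distance is $d(\alpha) = t$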
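As a small illustration of the definition above, the reversal $[i, j]$ can be sketched in Python. The function name `apply_reversal` and the choice of a plain list of signed integers are assumptions made for this example; they do not come from the slides.

```python
def apply_reversal(perm, i, j):
    """Apply the reversal [i, j] (1-based, inclusive) to a signed permutation.

    The affected segment is reversed and every label in it changes sign.
    """
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= n")
    segment = [-x for x in reversed(perm[i - 1:j])]
    return perm[:i - 1] + segment + perm[j:]


alpha = [+3, -2, -1]
step1 = apply_reversal(alpha, 1, 3)   # [1, 2, -3]
step2 = apply_reversal(step1, 3, 3)   # [1, 2, 3]
print(step1, step2)
```

Two reversals sort this particular permutation, so its reversal distance to the identity is at most 2.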

Definitions

• Note that the reversal operation does not directly correspond to the biological operations (inversion, translocation, fission, fusion)

• Given $\alpha$ and $\beta$, can we always transform $\alpha$ into $\beta$ using only the reversal operation? If so, how many reversals are required in the worst case?
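For very small $n$ these questions can be explored by brute force. The sketch below is not the polynomial-time method developed in the rest of the lecture; it is a plain breadth-first search over signed permutations, under the hypothetical name `reversal_distance_bfs`, and its state space grows as $2^n n!$.

```python
from collections import deque


def reversal_distance_bfs(alpha, beta):
    """Exact reversal distance from alpha to beta by breadth-first search."""
    start, goal = tuple(alpha), tuple(beta)
    n = len(start)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        if perm == goal:
            return dist[perm]
        for i in range(n):
            for j in range(i, n):
                # Reverse positions i..j (0-based, inclusive) and flip the signs.
                flipped = tuple(-x for x in reversed(perm[i:j + 1]))
                nxt = perm[:i] + flipped + perm[j + 1:]
                if nxt not in dist:
                    dist[nxt] = dist[perm] + 1
                    queue.append(nxt)
    return None  # beta uses different labels, so it is unreachable


print(reversal_distance_bfs([+3, -2, -1], [1, 2, 3]))  # 2
print(reversal_distance_bfs([+2, +1], [1, 2]))         # 3
```

A search like this is mainly useful as a reference answer when checking bounds or faster algorithms on small inputs.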

Breakpoints

• A breakpoint is a point between consecutive labels in the initial permutation that must necessarily be separated by at least one reversal to reach the target permutation
  - The two consecutive labels are not consecutive in the target, or their relative orientation is not the same as in the target

Breakpoints

• To formalize the idea of a breakpoint, we introduce the extended version of $\alpha$

• Let $\alpha = (\alpha(1), \ldots, \alpha(n))$

• Then the extended version of $\alpha$ is $(L, \alpha(1), \ldots, \alpha(n), R)$

• For example, let extended $\alpha$ be (L, –2, –3, +1, +6, –5, –4, R) and let extended $\beta$ be (L, +1, +2, +3, +4, +5, +6, R)

• The breakpoints are: (L, –2), (–2, –3), (–3, +1), (+1, +6), (+6, –5), (–4, R)

Breakpoints

• The number of breakpoints of a permutation $\alpha$ is denoted by $b(\alpha)$
  - In the example, $b(\alpha) = 6$

• Can you characterize the situations where L is involved in a breakpoint? When R is involved in a breakpoint?
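The breakpoint definition can be checked mechanically. In the sketch below, an adjacent pair $(x, y)$ of the extended $\alpha$ is a breakpoint unless $(x, y)$ or its flipped form $(-y, -x)$ is adjacent in the extended $\beta$. The helper names and the use of the strings `"L"` and `"R"` for the frame markers are assumptions made for this example.

```python
def extended(perm):
    """Add the frame markers L and R around a signed permutation."""
    return ["L"] + list(perm) + ["R"]


def adjacencies(perm):
    """All adjacent pairs of the extended permutation, read from either strand."""
    ext = extended(perm)
    pairs = set()
    for x, y in zip(ext, ext[1:]):
        pairs.add((x, y))
        # A pair of labels read on the opposite strand appears as (-y, -x).
        if isinstance(x, int) and isinstance(y, int):
            pairs.add((-y, -x))
    return pairs


def breakpoints(alpha, beta):
    """Adjacent pairs of extended alpha that are not adjacent in extended beta."""
    target = adjacencies(beta)
    ext = extended(alpha)
    return [(x, y) for x, y in zip(ext, ext[1:]) if (x, y) not in target]


alpha = [-2, -3, +1, +6, -5, -4]
beta = [1, 2, 3, 4, 5, 6]
bps = breakpoints(alpha, beta)
print(len(bps))  # 6
```

The only adjacency of this $\alpha$ that is not a breakpoint is (–5, –4), which is the pair (+4, +5) of $\beta$ read from the other strand.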

A Lower Bound

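• A reversal can remove at most two breakpoints
  - It cuts the permutation in exactly two places
  - So, if $\alpha\rho_1 \cdots \rho_t = \beta$, then
    $b(\alpha) - b(\alpha\rho_1) \le 2$
    $b(\alpha\rho_1) - b(\alpha\rho_1\rho_2) \le 2$
    …
    $b(\alpha\rho_1 \cdots \rho_{t-1}) - b(\alpha\rho_1 \cdots \rho_t) \le 2$
  - Adding these inequalities and using $b(\beta) = 0$ gives $b(\alpha) \le 2t$. If $t = d(\alpha)$, then $b(\alpha)/2 \le d(\alpha)$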

Reality and Desire Diagram

• The lower bound just found is not very tight

• We can derive a better lower bound based on a structure called the reality-desire diagram of a permutation with respect to another

• To draw the diagram, we represent +a by the tuple (–a +a) and –a by the tuple (+a –a)
  - The orientation is given by the rightmost member of the tuple

Reality and Desire Diagram

• A permutation is a sequence of adjacent tuples: +3, –2, –1, +4, –5 can be represented as:

L---(–3 +3)---(+2 –2)---(+1 –1)---(–4 +4)---(+5 –5)---R

• The identity +1, +2, +3, +4, +5 is represented as:

L---(–1 +1)---(–2 +2)---(–3 +3)---(–4 +4)---(–5 +5)---R

Reality and Desire Diagram

• Now we will draw a graph to represent the extended permutation (L, +3, -2, -1, +4, -5, R)
  - The reality diagram:

L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R

Reality and Desire Diagram

• Suppose that $\beta$ is the identity (L, +1, +2, +3, +4, +5, R)
  - We will add desire edges to the previous graph to represent $\beta$

L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R

Reality and Desire Diagram

• $\alpha$ is the reality

• $\beta$ is what is desired

• The diagram (a multigraph) shows both reality and desire
  - Call it $RD(\alpha)$

• We can rearrange the nodes of the graph to make it easier to understand

Reality and Desire Diagram

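[Figure: the reality-desire diagram drawn on a circle from L to R, with a legend for reality and desire edges; image not available]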

Properties of $RD(\alpha)$

• Each vertex has degree 2
  - Each node is incident to one edge from A, the set of reality edges, and one edge from B, the set of desire edges

• The connected components of the graph are alternating cycles (edges alternate between reality, drawn in blue, and desire, drawn in red)

• Each cycle has an even number of edges, half reality and half desire

Properties of $RD(\alpha)$

• The number of cycles of $RD(\alpha)$ is denoted by $c(\alpha)$

• Note that $c(\beta) = n + 1$, since $\beta$ has no breakpoints
  - All cycles consist of two parallel edges between the same pair of nodes
  - We have $2n + 2$ nodes, so $n + 1$ cycles
  - This is the only permutation for which $c = n + 1$
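The construction of $RD(\alpha)$ and the cycle count $c(\alpha)$ can be sketched as follows. Each label is expanded into its tuple, consecutive tuples are joined by an edge, and cycles are traced by alternating reality and desire edges. The function names and the dictionary representation are assumptions made for this example.

```python
def node_sequence(perm):
    """Expand a signed permutation into L, tuple nodes, R.

    +a becomes (-a, +a) and -a becomes (+a, -a), as in the diagram slides.
    """
    seq = ["L"]
    for x in perm:
        seq += [-x, x]
    seq.append("R")
    return seq


def edge_pairs(perm):
    """Edges joining consecutive tuples: nodes 0-1, 2-3, ... of the sequence."""
    seq = node_sequence(perm)
    return list(zip(seq[0::2], seq[1::2]))


def count_cycles(alpha, beta):
    """Number of alternating cycles of the reality-desire diagram."""
    reality, desire = {}, {}
    for u, v in edge_pairs(alpha):
        reality[u], reality[v] = v, u
    for u, v in edge_pairs(beta):
        desire[u], desire[v] = v, u
    seen, cycles = set(), 0
    for start in reality:
        if start in seen:
            continue
        cycles += 1
        node = start
        while node not in seen:
            other = reality[node]      # follow a reality edge
            seen.update((node, other))
            node = desire[other]       # then a desire edge
    return cycles


alpha = [+3, -2, -1, +4, -5]
beta = [1, 2, 3, 4, 5]
print(edge_pairs(alpha))          # [('L', -3), (3, 2), (-2, 1), (-1, -4), (4, 5), (-5, 'R')]
print(count_cycles(alpha, beta))  # 3
print(count_cycles(beta, beta))   # 6
```

For the identity the count is $n + 1 = 6$, matching the property stated above.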

Properties of $RD(\alpha)$

• So transforming $\alpha$ into $\beta$ can be seen as transforming $RD(\alpha)$ into a graph with as many cycles as possible, namely $n + 1$

• Now we need to see how a reversal affects the cycles of $RD(\alpha)$
  - Note that a reversal is characterized by the two points where it cuts the current permutation, each of which corresponds to a reality edge

Reversals and $RD(\alpha)$

• Let $\rho$ be a reversal defined by two reality edges (s,t) and (u,v); then $RD(\alpha\rho)$ differs from $RD(\alpha)$ as follows:
  - Reality edges (s,t) and (u,v) are replaced by (s,u) and (t,v)
  - Vertices u, …, t are reversed

• Desire edges remain unchanged

• See the example on the following slide

Example

[Figure: the diagram before and after a reversal, each drawn from L to R; some nodes/edges omitted; image not available]

Orientation of Cycles

• How many cycles are affected by a reversal?

• First we define convergent and divergent edges
  - Two reality edges on the same cycle converge if they are traversed in the same direction (clockwise or counterclockwise on the circle in the diagram) when walking along the cycle
  - Otherwise they diverge

Orientation of Cycles

Convergent: (+3,+2) and (-1,-4)

Divergent: (L,-3) and (+3,+2)

Reversals and #Cycles

• Let $\rho$ be a reversal acting on two reality edges e and f
  - If e and f belong to different cycles, $c(\alpha\rho) = c(\alpha) - 1$
  - If e and f belong to the same cycle and converge, $c(\alpha\rho) = c(\alpha)$
  - If e and f belong to the same cycle and diverge, $c(\alpha\rho) = c(\alpha) + 1$

First Case

• If e and f belong to different cycles, $c(\alpha\rho) = c(\alpha) - 1$

Second Case

• If e and f belong to the same cycle and converge, $c(\alpha\rho) = c(\alpha)$

Third Case

• If e and f belong to the same cycle and diverge, $c(\alpha\rho) = c(\alpha) + 1$
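The three cases can be observed empirically by applying every possible reversal to a permutation and tallying how the cycle count changes. This is only a numerical check on one input, not a proof; the helper names are hypothetical and the cycle counter is a compact restatement of the diagram construction.

```python
from collections import Counter


def pairing(perm):
    """Map each node to its neighbour along the edges between consecutive tuples."""
    seq = ["L"]
    for x in perm:
        seq += [-x, x]
    seq.append("R")
    pairs = {}
    for u, v in zip(seq[0::2], seq[1::2]):
        pairs[u], pairs[v] = v, u
    return pairs


def count_cycles(alpha, beta):
    """Count alternating reality/desire cycles."""
    reality, desire = pairing(alpha), pairing(beta)
    seen, cycles = set(), 0
    for start in reality:
        if start not in seen:
            cycles += 1
            node = start
            while node not in seen:
                seen.update((node, reality[node]))
                node = desire[reality[node]]
    return cycles


def cycle_changes(alpha, beta):
    """Tally c(alpha rho) - c(alpha) over all n(n+1)/2 reversals rho."""
    n, base = len(alpha), count_cycles(alpha, beta)
    changes = Counter()
    for i in range(n):
        for j in range(i, n):
            rho = alpha[:i] + [-x for x in reversed(alpha[i:j + 1])] + alpha[j + 1:]
            changes[count_cycles(rho, beta) - base] += 1
    return changes


changes = cycle_changes([+3, -2, -1, +4, -5], [1, 2, 3, 4, 5])
print(sorted(changes))  # every change lies in {-1, 0, +1}
```

Reversals counted under +1 are the ones acting on two divergent edges of the same cycle.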

Reversals and #Cycles

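• Note that the number of cycles changes by at most one with each reversal
  - Use that to find another lower bound for reversal distance
  - Suppose we have $\alpha\rho_1 \cdots \rho_t = \beta$; we know that $c(\beta) = n + 1$, and we have:
    $c(\alpha\rho_1) - c(\alpha) \le 1$
    $c(\alpha\rho_1\rho_2) - c(\alpha\rho_1) \le 1$
    …
    $c(\alpha\rho_1 \cdots \rho_t) - c(\alpha\rho_1 \cdots \rho_{t-1}) \le 1$
  - Adding and cancelling terms we get $n + 1 - c(\alpha) \le t$

• If $\rho_1, \ldots, \rho_t$ is optimal then $t = d(\alpha)$, so $n + 1 - c(\alpha) \le d(\alpha)$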

Interleaving Graph

• This new lower bound is better than the old one, $b(\alpha)/2$
  - For most signed permutations it is close to the actual distance; however, it is not always attained (we can't always choose two divergent edges)

• We can classify the cycles of $RD(\alpha)$ as good or bad:
  - A cycle is good if it has two divergent reality edges
  - Otherwise it is bad

Interleaving Graph

• The classification only applies to proper cycles (those with at least four edges)
  - Those with two edges don't need to be touched, since reality = desire

• If a permutation has only good cycles, then the lower bound previously given is an equality
  - We sort, increasing the number of cycles by one per reversal

Interleaving Graph

• If a desire edge from one cycle crosses some desire edge from another cycle, we say that the two cycles interleave
  - Interleaved cycles allow us to change a bad cycle into a good one while breaking another cycle
  - This good cycle can then be broken in the next step
  - To find interleaving cycles, we construct an interleaving graph

Interleaving Graph

Interleaving Graph

• Nodes in the interleaving graph are cycles

• There is an edge between two nodes if the corresponding cycles interleave

• A connected component of the graph is called a bad component if it consists entirely of bad cycles

• Otherwise it is a good component
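The definitions of good and bad cycles, interleaving, and components can be combined in one sketch. It assumes the nodes are laid out in the order of the extended $\alpha$, records the direction in which each reality edge is traversed (two different directions on one cycle mean a divergent pair), and treats desire edges as chords that cross when their endpoints alternate. All names are hypothetical, and only proper cycles are placed in the interleaving graph.

```python
from itertools import combinations


def node_sequence(perm):
    seq = ["L"]
    for x in perm:
        seq += [-x, x]
    return seq + ["R"]


def pairing(perm):
    seq = node_sequence(perm)
    pairs = {}
    for u, v in zip(seq[0::2], seq[1::2]):
        pairs[u], pairs[v] = v, u
    return pairs


def proper_cycles(alpha, beta):
    """Proper cycles with their desire chords (as position pairs) and a good flag."""
    pos = {node: k for k, node in enumerate(node_sequence(alpha))}
    reality, desire = pairing(alpha), pairing(beta)
    seen, cycles = set(), []
    for start in pos:
        if start in seen:
            continue
        node, chords, directions = start, [], set()
        while node not in seen:
            other = reality[node]
            seen.update((node, other))
            directions.add(pos[other] > pos[node])  # travel direction on this reality edge
            nxt = desire[other]
            chords.append(tuple(sorted((pos[other], pos[nxt]))))
            node = nxt
        if len(chords) >= 2:  # proper cycle: at least four edges
            cycles.append({"chords": chords, "good": len(directions) == 2})
    return cycles


def crosses(p, q):
    (a, b), (c, d) = p, q
    return a < c < b < d or c < a < d < b


def components(cycles):
    """Connected components of the interleaving graph, as lists of cycle indices."""
    parent = list(range(len(cycles)))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i, j in combinations(range(len(cycles)), 2):
        if any(crosses(p, q) for p in cycles[i]["chords"] for q in cycles[j]["chords"]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(cycles)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


cycles = proper_cycles([+3, +2, +1], [1, 2, 3])
print([c["good"] for c in cycles])  # [False, False]
print(components(cycles))           # [[0, 1]]
```

For $(+3, +2, +1)$ the two proper cycles are both bad and interleave, so they form a single bad component.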

Interleaving Graph

• What is the interleaving graph of the previous example?

• Suppose that F and C are good cycles. Which components of the interleaving graph are good and which are bad?

Sorting Good Components

• We need to choose two divergent edges in the same cycle to define a reversal that increases the number of cycles

• Example

• A reversal characterized by two divergent edges of the same cycle is a sorting reversal if and only if it does not lead to the creation of bad components

Bad Components

• Having used this criterion to sort all of the good components, we must now sort the bad ones

• We define a hierarchy of bad components

• We say a component B separates components A and C if all chords in $RD(\alpha)$ that link a terminal in A to a terminal in C cross a desire edge of B

Diagram with no Good Components

Bad Components

• A reversal through reality edges in different components A and C will result in every component B that separates A and C being twisted
  - A bad component becomes good when twisted
  - A good component can stay good or become bad when twisted
  - So twist only when there are no good components

Hierarchy of Bad Components

• A hurdle is a bad component that does not separate any other two bad components
  - If a bad component separates others, then it is a nonhurdle

• A hurdle A protects a nonhurdle B when removal of A would cause B to become a hurdle
  - B is protected by A when every time B separates two bad components, A is one of them

Hierarchy of Bad Components

• A hurdle A is called a superhurdle if it protects some nonhurdle B
  - Otherwise it is called a simple hurdle

• Bad components
  - Nonhurdles
  - Hurdles
    - Simple hurdles
    - Superhurdles

Fortress

• A signed permutation $\alpha$ is called a fortress iff $RD(\alpha)$ has an odd number of hurdles and all of them are superhurdles

Reversal Distance

• The reversal distance of oriented permutations is given by:

• $d(\alpha) = n + 1 - c(\alpha) + h(\alpha) + f(\alpha)$
  - $c(\alpha)$: number of cycles (proper and non-proper)
  - $h(\alpha)$: number of hurdles
  - $f(\alpha)$: 1 if $\alpha$ is a fortress, 0 otherwise

• Interpretation of the terms:
  - $n + 1 - c(\alpha)$: good components, and bad components which become good during the sort
  - $h(\alpha)$: bad components (hurdles) require an extra reversal
  - $f(\alpha)$: one extra reversal for a fortress
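As a worked instance of the formula, take the permutation $\alpha = (+3, -2, -1, +4, -5)$ from the diagram slides, with $\beta$ the identity. Tracing the alternating cycles by hand gives three cycles. The values of $h$ and $f$ below rest on a hand check, for this example only, that both proper cycles are good and do not interleave, so there are no bad components.

```latex
\begin{align*}
n &= 5, \qquad c(\alpha) = 3, \qquad h(\alpha) = 0, \qquad f(\alpha) = 0 \\
d(\alpha) &= n + 1 - c(\alpha) + h(\alpha) + f(\alpha) = 5 + 1 - 3 + 0 + 0 = 3
\end{align*}
```

Three reversals do suffice: $(+3, -2, -1, +4, -5) \to (+1, +2, -3, +4, -5) \to (+1, +2, +3, +4, -5) \to (+1, +2, +3, +4, +5)$.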

Algorithm

• If we don't have a good cycle, we must use either a reversal on two convergent edges or a reversal on edges in different cycles
  - In the first case, the number of cycles stays constant
  - In the second case, the number of cycles decreases by one

• Choose case one on a hurdle
  - This transforms the bad component into a good one
  - The number of cycles remains constant

Algorithm

• Getting rid of a nonhurdle doesn't change the number of hurdles or the fortress status, so the distance remains the same

• If we reverse a superhurdle, the nonhurdle it protects becomes a hurdle, so $h(\alpha)$ remains constant

• A reversal on some cycle in a hurdle is called hurdle cutting

Algorithm

• In order not to increase $f(\alpha)$, use hurdle cutting only when $h(\alpha)$ is odd

• Using a reversal on edges in two different cycles decreases $c(\alpha)$ by one
  - However, $d(\alpha)$ will still decrease if we can decrease $h(\alpha)$ by two
  - Choose edges from two different hurdles: this is called hurdle merging
  - The two hurdles, as well as any nonhurdle separating them, become good components

Algorithm

• We have to be careful that hurdle merging doesn't transform a nonhurdle into a hurdle
  - A and B are called opposite hurdles when we find the same number of hurdles walking the circle clockwise from A to B as we do walking counterclockwise
  - This can only happen if $h(\alpha)$ is even
  - By choosing opposite hurdles, we don't turn a nonhurdle into a hurdle

Algorithm

• To avoid creating a fortress where we don't have one, we choose opposite hurdles when they exist

• If $h(\alpha)$ is odd and we have a simple hurdle, do hurdle cutting to avoid a fortress

• If neither case is possible, we already have a fortress, so $f(\alpha)$ doesn't increase with any hurdle merging

Algorithm

Algorithm Sorting Reversal
input: distinct permutations $\alpha$ and $\beta$
output: a sorting reversal for $\alpha$ with target $\beta$

if there is a good component in $RD(\alpha)$ then
    pick 2 divergent edges e and f in this component, making sure the
    corresponding reversal does not create any bad components
    return the reversal characterized by e and f
else if $h(\alpha)$ is even then
    return merging of two opposite hurdles
else if $h(\alpha)$ is odd and there is a simple hurdle then
    return a reversal cutting this hurdle
else // fortress
    return merging of any two hurdles
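The case analysis of the algorithm can be written as a small dispatcher. This sketch covers only the decision logic: it assumes that the component analysis (good components, hurdle count, presence of a simple hurdle) has already been done elsewhere, and the function name, parameters, and returned labels are all hypothetical.

```python
def choose_reversal_kind(has_good_component, hurdles, has_simple_hurdle):
    """Select which kind of sorting reversal to apply next.

    has_good_component: True if the diagram has a good component
    hurdles:            number of hurdles h
    has_simple_hurdle:  True if at least one hurdle is a simple hurdle
    """
    if has_good_component:
        # Two divergent edges whose reversal creates no bad component.
        return "divergent-edge reversal"
    if hurdles % 2 == 0:
        return "merge two opposite hurdles"
    if has_simple_hurdle:
        return "cut a simple hurdle"
    # Odd number of hurdles, all of them superhurdles: a fortress.
    return "merge any two hurdles"


print(choose_reversal_kind(False, 3, False))  # merge any two hurdles
```

Finding the actual edges for each case is the substantial part of the algorithm and is not attempted here.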