
15. Lecture WS 2004/05

Bioinformatics III

V15: genome rearrangement – current status

* Genome comparison mouse – human: syntenic regions

* Breakpoint analysis

* Breakpoint reusage

* heuristic MGR algorithm

* Comparison genomes mouse – rat – human

* microsyntenies - macrosyntenies

V16: reversal distance problem (Hannenhalli, Tesler, Pevzner) versus using conserved intervals (Bergeron, Stoye)



Processes of Genome Evolution

Two genomes may have many genes in common, but the genes may be arranged in a different order or be moved between chromosomes. Such differences in gene order are the result of rearrangement events that are common in molecular evolution (frequency: only ca. 1 event per million years!).

- Substitution

- Insertion

- Deletion

- Translocation

- Inversion/ Reversal

- Duplication



What is a reversal = inversion ?

Break and Invert

A T G C C T G T A C T A

T A C G G A C A T G A T

A T G T A C A G G C T A

T A C A T G T C C G A T

• Purines (A, G) and Pyrimidines (C, T) switch strands

• Many organisms have highly similar genes but very different gene orders.

• Very prominent in prokaryotes, mitochondrial DNA and the mammalian X-chromosome.
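The break-and-invert picture above can be reproduced with a few lines of code. This is a minimal sketch: the function names and the 0-based, end-exclusive indexing are assumptions made for illustration, and the bottom strand is written left to right underneath the top strand, as on the slide.

```python
# Minimal sketch of "break and invert" on a double-stranded sequence.
# Names and index convention (0-based, end exclusive) are illustrative assumptions.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def invert_segment(top, start, end):
    """Cut out top[start:end], rotate the double-stranded piece by 180 degrees
    and paste it back. On the top strand this replaces the segment by its
    reverse complement."""
    segment = top[start:end]
    inverted = complement(segment)[::-1]
    new_top = top[:start] + inverted + top[end:]
    return new_top, complement(new_top)


top = "ATGCCTGTACTA"
new_top, new_bottom = invert_segment(top, 3, 9)
print(complement(top))  # TACGGACATGAT
print(new_top)          # ATGTACAGGCTA
print(new_bottom)       # TACATGTCCGAT
```

The inverted piece is CCTGTA; after the inversion the bases that used to sit on the bottom strand appear on the top strand, which is why a reversal changes the sign (orientation) of every gene it contains.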



Types of Genome Rearrangements

In unichromosomal genomes, the most common rearrangement events are reversals, in which a contiguous interval of genes is put into the reverse order.

For multichromosomal genomes, the most common rearrangement events are reversals, translocations, fissions, and fusions.

The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome to another via these rearrangement events.

Genomic distance: the number of inversions and translocations needed to transform one genome into another. Fissions and fusions may be included as a special case of translocations in which one of the input or output chromosomes is empty.



Representation of a genome

We consider a unichromosomal genome to be a sequence of n genes. The genes are represented by numbers 1, 2, ..., n.

The two orientations of gene i are represented by i and -i.

A genome is represented as a signed permutation of the numbers 1, 2, ..., n.

For example, a unichromosomal genome with n = 5 genes is 5 -3 4 2 -1



Multichromosomal genome

A multichromosomal genome consists of n genes spread over m chromosomes. We represent it as a signed permutation of 1, 2, ..., n, with delimiters "$" or ";" inserted between the chromosomes. For example, a genome with 12 genes spread over 3 chromosomes is

7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $

The order of the chromosomes and the direction of the chromosomes do not matter in the multichromosomal algorithms. Thus, we could represent this same genome by flipping the first chromosome (reverse the order of its entries and negate them) and then moving the last chromosome to the beginning:

11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $
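Because chromosome order and chromosome direction do not matter, two representations describe the same genome exactly when they agree after a normalisation step. The sketch below is one possible normalisation (the name `canonical` and the choice of the lexicographically smaller direction are assumptions for illustration, not part of GRIMM).

```python
# Sketch: compare two multichromosomal genomes up to chromosome order and flips.
# A genome is a list of chromosomes; a chromosome is a list of signed integers.

def flip(chromosome):
    """Reverse the gene order and negate every gene."""
    return tuple(-gene for gene in reversed(chromosome))


def canonical(genome):
    """Pick one fixed direction per chromosome, then sort the chromosomes."""
    return sorted(min(tuple(chrom), flip(chrom)) for chrom in genome)


a = [[7, -2, 8, 3], [5, 9, -6, -1, 12], [11, 4, 10]]
b = [[11, 4, 10], [-3, -8, 2, -7], [5, 9, -6, -1, 12]]
print(canonical(a) == canonical(b))  # True
```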



Unichromosomal genomes: sorting by reversal

A reversal in a signed permutation is an operation that takes an interval in a permutation, reverses the order of the numbers, and changes all their signs. For example,

5 1 3 2 -9 7 -4 6 8

5 1 -7 9 -2 -3 -4 6 8

The reversal distance between two genomes is the minimum number of reversals it takes to get from one genome to the other.

For a given pair of genomes, the reversal distance is unique, but there are usually many possible reversal scenarios with this distance.

However, it is (of course) possible that this mathematical notion of reversal distance can underestimate the actual number of steps that occurred biologically.
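A signed reversal, and a brute-force reversal distance for very small genomes, can be sketched as follows. The breadth-first search is an assumption chosen for clarity; it is exponential in n and is not the polynomial-time algorithm used by real tools. Positions are 1-based and inclusive.

```python
from collections import deque


def reversal(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) and change all their signs."""
    p = list(perm)
    p[i - 1:j] = [-gene for gene in reversed(p[i - 1:j])]
    return tuple(p)


def reversal_distance(source, target):
    """Minimum number of reversals from source to target (brute-force BFS).
    Only feasible for small n; shown for illustration."""
    source, target = tuple(source), tuple(target)
    n = len(source)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                nxt = reversal(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    raise ValueError("target is not a signed permutation of the source genes")


print(reversal((5, 1, 3, 2, -9, 7, -4, 6, 8), 3, 6))
# (5, 1, -7, 9, -2, -3, -4, 6, 8)
print(reversal_distance((3, 1, 2), (-2, -1, -3)))  # 1
```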



Multichromosomal genomes: rearrangement operations

We treat four elementary rearrangement events in multichromosomal genomes: reversals, translocations, fusions, and fissions.

Reversal: An interval within a single chromosome may be reversed in the same fashion as a reversal acts in the unichromosomal case:

Before:

7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $

After reversing the interval -6 -1 12 on chromosome 2:

7 -2 8 3 $
5 9 -12 1 6 $
11 4 10 $

Note: When the GRIMM program is run in unichromosomal mode, the genomes 3 1 2 and -2 -1 -3 are considered different (one reversal apart, distance = 1), while in multichromosomal mode, those same genomes are considered equivalent (distance = 0) because we have simply flipped an entire chromosome, which gives an equivalent genome in the multichromosomal mode.



Translocation

Translocation: Two chromosomes "A B" and "C D" may be rearranged into "A D" and "C B". (The letters A, B, C, D stand for sequences of genes.)

Because flipping chromosomes does not alter a genome (only its representation is altered), "A -C" and "-B D" is another possible translocation. (-B means to reverse the order of the genes in sequence B and negate each one.)

For example, a translocation on chromosomes 1 and 3 is

Before:

7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $

After:

7 -2 8 -4 -11 $
5 9 -6 -1 12 $
-3 10 $



Fusion & Fission

Fusion: Two chromosomes may be fused together into a single chromosome. Due to chromosome flippings, there are 4 distinct fusions possible between each pair of chromosomes. Here is one of the fusions between chromosomes 1 and 3:

Before:

7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $

After:

7 -2 8 3 -10 -4 -11 $
5 9 -6 -1 12 $

Fission: A chromosome may be broken into two chromosomes between any pair of genes:

Before:

7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $

After:

7 -2 8 3 $
5 9 $
-6 -1 12 $
11 4 10 $
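The translocation, fusion and fission examples of this section can be checked with a small sketch. The function names, the cut-position arguments and the `crossed` / `flip_*` flags are assumptions introduced for illustration; chromosomes are plain lists of signed integers.

```python
# Sketch of the multichromosomal operations on lists of signed genes.

def flip(chrom):
    """Reverse the gene order and negate every gene."""
    return [-gene for gene in reversed(chrom)]


def translocate(x, y, cut_x, cut_y, crossed=False):
    """x = A B and y = C D, cut after cut_x and cut_y genes.
    Returns (A D, C B), or (A -C, -B D) if crossed is True."""
    a, b = x[:cut_x], x[cut_x:]
    c, d = y[:cut_y], y[cut_y:]
    if crossed:
        return a + flip(c), flip(b) + d
    return a + d, c + b


def fuse(x, y, flip_x=False, flip_y=False):
    """Join two chromosomes end to end, optionally flipping either one first."""
    left = flip(x) if flip_x else list(x)
    right = flip(y) if flip_y else list(y)
    return left + right


def fission(x, cut):
    """Break a chromosome into two after the first `cut` genes."""
    return x[:cut], x[cut:]


chr1, chr2, chr3 = [7, -2, 8, 3], [5, 9, -6, -1, 12], [11, 4, 10]
print(translocate(chr1, chr3, 3, 2, crossed=True))  # ([7, -2, 8, -4, -11], [-3, 10])
print(fuse(chr1, chr3, flip_y=True))                # [7, -2, 8, 3, -10, -4, -11]
print(fission(chr2, 2))                             # ([5, 9], [-6, -1, 12])
```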



Signed and unsigned genomes

Most comparative mapping techniques determine the physical locations and relative order of genes in each chromosome, but do not determine which of two orientations each gene has.

Current sequencing methods do provide the orientations. It turns out that the genome rearrangement problem (uni- and multichromosomal) for unsigned permutations is NP-hard, but the same problems for signed data can be done in polynomial time.

Fortunately, with many genomes currently being sequenced, it is likely that many comparative maps (corresponding to unsigned permutations) will soon be replaced by sequencing data (corresponding to signed permutations).




For example, to turn the unsigned genome

1 2 3 4 5

into the unsigned genome

1 4 3 2 5

requires one unsigned reversal. Signs can be assigned to the source and destination genomes so that the signed reversal scenario requires this same number of steps. Here, we get

1 2 3 4 5

1 -4 -3 -2 5

which also takes one step. Note that there may be other sign assignments taking this minimum number of steps.




It is possible that correctly signed data would have increased the number of steps:

1 2 3 4 5

1 -4 -3 -2 5

1 -4 3 -2 5

If the data collection method did not determine signs, it is impossible to know mathematically whether the one step or two step scenario is more biologically accurate; the mathematical problem the genome rearrangement programs solve is to find the signs giving the minimum possible distance.
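The "find the signs giving the minimum possible distance" problem can be illustrated by brute force on the five-gene example. This sketch assumes the source genome is the all-positive identity and tries every sign assignment of the destination; the exhaustive search is only meant for tiny n and the function names are hypothetical.

```python
from collections import deque
from itertools import product


def reversal(perm, i, j):
    """Signed reversal of positions i..j (1-based, inclusive)."""
    p = list(perm)
    p[i - 1:j] = [-gene for gene in reversed(p[i - 1:j])]
    return tuple(p)


def distances_from(source):
    """BFS over all signed permutations: genome -> reversal distance from source."""
    source = tuple(source)
    n = len(source)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                nxt = reversal(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    return dist


def best_signings(unsigned_target):
    """Minimum distance over all sign assignments, and the signings reaching it."""
    n = len(unsigned_target)
    dist = distances_from(range(1, n + 1))
    best, winners = None, []
    for signs in product((1, -1), repeat=n):
        signed = tuple(s * g for s, g in zip(signs, unsigned_target))
        d = dist[signed]
        if best is None or d < best:
            best, winners = d, [signed]
        elif d == best:
            winners.append(signed)
    return best, winners


print(best_signings((1, 4, 3, 2, 5)))  # (1, [(1, -4, -3, -2, 5)])
```

For this target only one signing reaches the minimum; the differently signed genome 1 -4 3 -2 5 from the example above is two reversals away from the identity.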



A biological model case

8 7 6 5 4 3 2 1 11 10 9

4 3 2 8 7 1 5 6 11 10 9

cabbage

turnip

Palmer and Herbon found that the mitochondrial genomes in cabbage and turnip had very similar gene sequences, but fairly different gene orders. How can one design a "transformation" of cabbage into turnip? The mitochondrial DNA of cabbage and turnip is composed of five conserved blocks of genes that are shuffled in cabbage as compared to turnip. Every conserved block has a direction that is shown by a + or - sign.



Inversion, Transposition and inverted Transposition

inversion

transposition

inverted transposition



Sorting by Reversals

8 7 6 5 4 3 2 1 11 10 9

8 2 3 4 5 6 7 1 11 10 9

8 2 3 4 5 1 7 6 11 10 9

4 3 2 8 5 1 7 6 11 10 9

4 3 2 8 7 1 5 6 11 10 9

Cabbage

Turnip
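The scenario on this slide can be replayed step by step. In the sketch below the four intervals were read off by comparing consecutive rows; they are an interpretation of the figure, not values stated in the lecture. Positions are 1-based and inclusive, and the permutations are unsigned, as on the slide.

```python
# Sketch: replay the four unsigned reversals of the slide.

def reverse_block(perm, i, j):
    """Reverse the order of positions i..j (1-based, inclusive), no signs."""
    p = list(perm)
    p[i - 1:j] = p[i - 1:j][::-1]
    return p


start = [8, 7, 6, 5, 4, 3, 2, 1, 11, 10, 9]
goal = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]
scenario = [(2, 7), (6, 8), (1, 4), (5, 7)]  # intervals inferred from the rows

current = start
for i, j in scenario:
    current = reverse_block(current, i, j)
    print(*current)

print(current == goal)  # True
```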



Permutation (π): an ordered arrangement of the set {1, 2, …, n}.

Reversal ρ(i,j): a rearrangement that inverts a block of a permutation, e.g. {3 4 7 6 1 5 2} · ρ(3,6) = {3 4 5 1 6 7 2}.

Signed permutation: a permutation whose elements are oriented; a reversal also switches the orientation of every element in the block, e.g. {+3 -4 +7 -6 +1 -5 +2} · ρ(3,6) = {+3 -4 +5 -1 +6 -7 +2}.



easy to do by eye ...

8 7 6 5 4 3 2 1 11 10 9

8 2 3 4 5 6 7 1 11 10 9 (after ρ1)

8 2 3 4 5 1 7 6 11 10 9 (after ρ1 ρ2)

4 3 2 8 5 1 7 6 11 10 9 (after ρ1 ρ2 ρ3)

4 3 2 8 7 1 5 6 11 10 9 (after ρ1 ρ2 ρ3 ρ4)

In general: π · ρ1 ρ2 … ρt = σ, and equivalently σ · ρt … ρ2 ρ1 = π.



Formal Approach: Sorting by Reversals

The order of genes in 2 organisms is represented by permutations π = π_1 π_2 ... π_n and σ = σ_1 σ_2 ... σ_n.

A reversal ρ(i,j) of an interval [i,j] is the permutation

1 2 ... i-1 i i+1 ... j-1 j j+1 ... n
1 2 ... i-1 j j-1 ... i+1 i j+1 ... n

ρ(i,j) has the effect of reversing the order of π_i π_{i+1} ... π_j and transforming

π_1 ... π_{i-1} π_i ... π_j π_{j+1} ... π_n into π·ρ(i,j) = π_1 ... π_{i-1} π_j ... π_i π_{j+1} ... π_n.

Given permutations π and σ, the reversal distance problem is to find a series of reversals ρ_1, ρ_2, ..., ρ_t such that π·ρ_1·ρ_2 ... ρ_t = σ and t is minimal.

t is called the reversal distance between π and σ.



Reconstruction of phylogenetic trees from WG data

1 Phylogeny reconstruction as an optimization problem? Attempt to reconstruct an evolutionary scenario with a minimum number of permitted evolutionary events (e.g. duplications, insertions, deletions, inversions, transpositions) on a tree. All known approaches are NP-hard. Also, no automated tool exists so far.

2 Estimate leaf-to-leaf distances (based on some metric) between all genomes. Then use a standard distance-based method such as neighbour-joining to construct the tree. Such approaches are quite fast but cannot recover the ancestral gene order.

2a Breakpoint phylogeny (Blanchette & Sankoff) for the special case in which the genomes all have the same set of genes, and each gene appears once. Use the breakpoint distance to fill the distance matrix.



Reversal distance problem

The reversal distance for a pair of genomes can be computed in polynomial time (Hannenhalli & Pevzner 1999 and others, also see Bioinformatics 1 lecture).

However, its use in studies of multiple genome rearrangements was somewhat limited since it was not clear how to combine pairwise rearrangement scenarios into a multiple rearrangement scenario.

In particular, Caprara (1999) demonstrated that even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard.

Therefore, this line of research was abandoned for a while in favor of the breakpoint analysis approach (see Blanchette & Sankoff). The existing tools BPAnalysis and GRAPPA use the so-called breakpoint distance to derive rearrangement scenarios.



Breakpoint phylogeny

When each genome has the same set of genes and each gene appears exactly once, a genome can be described by a (circular or linear) ordering (= permutation) of these genes. Each gene has either positive (g_i) or negative (-g_i) orientation.

Given 2 genomes G and G' on the same set of genes, a breakpoint in G is defined as an ordered pair of genes (g_i, g_j) such that g_i and g_j appear consecutively in that order in G, but neither (g_i, g_j) nor (-g_j, -g_i) appears consecutively in that order in G'.

The breakpoint distance between two genomes is simply the number of breakpoints between that pair of genomes.

The breakpoint score of a tree in which each node is labelled by a signed ordering of genes is then the sum of the breakpoint distances along the edges of the tree.
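The breakpoint distance for signed, linear gene orders can be sketched in a few lines. This is an illustrative version under stated assumptions: linear chromosomes, no special treatment of the chromosome ends, and a pair (a, b) is regarded as conserved if the other genome contains (a, b) or its reverse-strand reading (-b, -a). It is not the BPAnalysis or GRAPPA implementation.

```python
# Sketch: breakpoint distance between two signed linear gene orders.

def adjacencies(genome):
    """All consecutive pairs of a genome, read from either strand."""
    pairs = set()
    for a, b in zip(genome, genome[1:]):
        pairs.add((a, b))
        pairs.add((-b, -a))  # the same adjacency seen from the other strand
    return pairs


def breakpoint_distance(g, h):
    """Number of consecutive pairs of g that are not adjacent in h."""
    conserved = adjacencies(h)
    return sum(1 for pair in zip(g, g[1:]) if pair not in conserved)


g = [1, 2, 3, 4, 5]
h = [1, -4, -3, -2, 5]  # g after one reversal of the block 2 3 4
print(breakpoint_distance(g, h))  # 2
```

A single reversal cuts the chromosome in two places, so it creates at most two breakpoints, as in this example.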



Breakpoint Graph

Sorting a permutation is a hard problem. Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor (1984), and correlations were noticed between the reversal distance and the number of breakpoints.

Let i ~ j if |i - j| = 1. Extend a permutation π = π_1 π_2 ... π_n by adding π_0 = 0 and π_{n+1} = n + 1. We call a pair of elements (π_i, π_{i+1}), 0 ≤ i ≤ n, of π an adjacency if π_i ~ π_{i+1}, and a breakpoint if π_i ≁ π_{i+1}.

2 3 1 4 6 5 7

0 2 3 1 4 6 5 7 8

adjacencies

breakpoints

As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints.

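The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance satisfies d(π) ≥ b(π) / 2, where b(π) is the number of breakpoints in π. However, this bound is not tight: assuming that every reversal removes 2 breakpoints clearly overestimates what reversals can achieve, so b(π) / 2 usually underestimates d(π).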
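Counting breakpoints of an unsigned permutation, and the resulting lower bound on the reversal distance, can be sketched as follows (the function names are illustrative assumptions).

```python
import math


def breakpoints(perm):
    """Breakpoints of an unsigned permutation, after extending it by 0 and n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) != 1]


def reversal_lower_bound(perm):
    """Every reversal removes at most 2 breakpoints, so d >= b / 2."""
    return math.ceil(len(breakpoints(perm)) / 2)


pi = [2, 3, 1, 4, 6, 5, 7]
print(breakpoints(pi))           # [(0, 2), (3, 1), (1, 4), (4, 6), (5, 7)]
print(reversal_lower_bound(pi))  # 3
```

For the example permutation 2 3 1 4 6 5 7 there are 5 breakpoints and 3 adjacencies, so at least 3 reversals are needed.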



Breakpoint Graph

The breakpoint graph of a permutation π is an edge-colored graph G(π) with n + 2 vertices {π_0, π_1, ..., π_n, π_{n+1}} = {0, 1, ..., n, n+1}. We join vertices π_i and π_{i+1} by a black edge for 0 ≤ i ≤ n. We join vertices π_i and π_j by a gray edge if π_i ~ π_j.

[Figure: black path, gray path and their superposition, drawn over the vertices 0 2 3 1 4 6 5 7 8.]

A breakpoint graph is obtained by superposing a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation and a gray path traversing the vertices in the order given by the identity permutation.
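The two edge sets of the breakpoint graph, as defined on this slide, can be listed directly. This sketch only builds the black and gray edges of the unsigned graph; the cycle analysis announced for next week is not attempted, and the function name is an assumption.

```python
# Sketch: black and gray edges of the breakpoint graph of an unsigned permutation.

def breakpoint_graph(perm):
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    # Black path: consecutive elements of the extended permutation.
    black = [frozenset(edge) for edge in zip(ext, ext[1:])]
    # Gray path: consecutive integers, i.e. the identity permutation.
    gray = [frozenset((v, v + 1)) for v in range(n + 1)]
    return black, gray


black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5, 7])
shared = set(black) & set(gray)  # edges in both paths are the adjacencies
print(len(black), len(gray), len(shared))  # 8 8 3
```

The black edges that are not shared with the gray path are exactly the breakpoints, here 8 - 3 = 5.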

more next week ...



Comparison of mouse and man at genome level

Key findings:

* the mouse genome is about 14% smaller than the human genome. The difference probably reflects a higher rate of deletion in mouse.

* over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny (segments in which the gene order in the most recent common ancestor has been conserved in both species)

* at the nucleotide level, ca. 40% of the human genome can be aligned to the mouse genome. These sequences seem to represent most of the orthologous sequences that remain in both lineages from the common ancestor. The rest was probably deleted in one or both genomes.

* the neutral substitution rate has been roughly half a nucleotide substitution per site since the divergence of the species. About twice as many of these substitutions have occurred in mouse as in human.

See the paper of the Mouse Genome Sequencing Consortium, "Initial sequencing and comparative analysis of the mouse genome", Nature 420, 520-562 (5.12.2002). Excellent paper! Very readable!



Comparison of mouse and man at genome level

Key findings:

* the proportion of small (50-100 bp) segments in the mammalian genome that is under (purifying) selection is ca. 5%, i.e. much higher than can be explained by protein-coding sequences alone. The genome therefore contains many additional features (UTRs, regulatory elements, non-protein-coding genes, chromosomal structural elements) under selection for biological function!

* the mammalian genome is evolving in a non-uniform manner, various measures of divergence showing substantial variation across the genome.

* mouse and human genomes each seem to contain ca. 30,000 protein-coding genes. The proportion of mouse genes with a single identifiable orthologue in the human genome is ca. 80%. The proportion of mouse genes without any homologue currently detectable in the human genome (and vice versa) is < 1%.



The mouse genome. Nature 420, 520 - 562

Conservation of synteny between human and mouse

Starting from a common ancestral genome approximately 75 million years ago, the human and mouse genomes have each been shuffled by chromosomal rearrangements. The rate of these changes is low enough that local gene order remains largely intact (ca. 3.2 chromosomal rearrangements per million years in mouse, and 1.6 per Myr in human).

In their pioneering paper, Nadeau and Taylor, 1984 estimated that the mouse and human genomes could be parsed into roughly 180 syntenic regions –

a surprisingly small number.

Random-breakage model.

Today, gene-based synteny maps define about 200 - 500 syntenic regions depending on the minimal segment length considered.




Detect syntenic regions with PatternHunter

- perform sequence comparison of the entire mouse and human genome sequences to identify regions with a high similarity score > 40 (corresponding to a 40-base perfect match, with penalties for mismatches and gaps)

- also require that each sequence is the other's unique match above this threshold. Such regions probably reflect orthologous sequence pairs.

About 558,000 pairs found! Mean spacing of 4.4 kb; N50 length of ca. 500 kb. Together they make up 7.5% of the mouse genome.

But there may be many more that have evolved too quickly to be detected.

Use RepeatMasker to remove repeats (breakpoint analysis requires unique matches between genomes).




Identify regions of conserved synteny

Syntenic segment: maximal region in which a series of landmarks occur in the same order on a single chromosome in both species.

Syntenic block: one or more syntenic segments that are all adjacent on the same chromosome in human and on the same chromosome in mouse; may otherwise be shuffled with respect to order and orientation.

(only consider regions > 300 kb)

Each genome could be parsed into a total of 342 conserved syntenic segments. On average, each landmark resides in a segment containing 1600 other landmarks. Segments vary greatly in length: 303 kb - 64.9 Mb. About 90.2% of the human and 93.3% of the mouse genome unambiguously reside within conserved syntenic segments.




Conservation of synteny between human and mouse

A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocal unique matches in the two genomes.

The cyan bars represent sequence coverage in each of the two genomes for the regions.

In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size.




Correspondence of syntenic regions

Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome. Each colour corresponds to a particular human chromosome. The 342 segments are separated from each other by thin, white lines within the 217 blocks of consistent colour.




Dot plots of conserved syntenic segments

For each of three human (a-c) and mouse (d-f) chromosomes, the positions of orthologous landmarks are plotted along the x axis and the corresponding position of the landmark on chromosomes in the other genome is plotted on the y axis. Different chromosomes in the corresponding genome are differentiated with distinct colours. In a remarkable example of conserved synteny, human chromosome 20 (a) consists of just three segments from mouse chromosome 2 (d), with only one small segment altered in order. Human chromosome 17 (b) also shares segments with only one mouse chromosome (11) (e), but the 16 segments are extensively rearranged. However, most of the mouse and human chromosomes consist of multiple segments from multiple chromosomes, as shown for human chromosome 2 (c) and mouse chromosome 12 (f). Circled areas and arrows denote matching segments in mouse and human.




Size distribution of elements with conserved synteny

Size distribution of segments and blocks with synteny conserved between mouse and human. a, b, The number of segments (a) and blocks (b) with synteny conserved between mouse and human in 5-Mb bins (starting with 0.3–5 Mb) is plotted on a logarithmic scale. The dots indicate the expected values for the exponential curve of random breakage given the number of blocks and segments, respectively.




Genome rearrangement?

Using the Pevzner & Tesler algorithms one can compute the minimal number of rearrangements needed to "transform" one genome into the other.

When applied to the 342 syntenic segments, the most parsimonious (= shortest) path has 295 rearrangements. The analysis suggests that chromosomal breaks may have a tendency to reoccur in certain regions.

With only two species, however, it is not yet possible to recover the ancestral chromosomal order or reconstruct the precise pathway of rearrangements.

This is only possible when more than 2 mammalian species are considered.



Genome Rearrangements: Synteny

(a) Human and mouse synteny blocks of conserved gene order. Every block corresponds to a rectangle, with a diagonal showing whether the arrangements of anchors in human and mouse (within the synteny block) are the same or reversed.

(b) Combining anchors into clusters by the GRIMM-Synteny algorithm at G = 100 kb. The edges in the anchor graph connect the closest ends of the anchors. The anchors are color-coded by the resulting clusters. At G = 1 Mb, this forms a single cluster, which in turn forms a synteny block (the lower right block in the human 18/mouse 17 rectangle in a).

Pevzner, Tesler, Genome Res 13, 37 (2003)



From Anchors to Breakpoint Graphs

X-chromosome: from local similarities, to synteny blocks, to breakpoint graph, to rearrangement scenario. (a) Dot-plot of anchors. Anchors are enlarged for visibility. (b) Clusters of anchors. (c) Rectified clusters. (d) Synteny blocks. (e) Synteny blocks (symbolic representation as genome rearrangement units; after rescaling, each block has the same length on the x- and y-axis). (f) 2D breakpoint graph superimposed on synteny blocks. The projections of the 2D graph onto the human and mouse axes form the conventional breakpoint graphs. (g) 2D breakpoint graph. The four cycles in the breakpoint graph are shown by different colors. (h) A most parsimonious rearrangement scenario for human and mouse X-chromosomes.




Genome Rearrangements

Construction of the breakpoint graph from synteny blocks. (a) Solid path through human. (b) Dotted path through mouse. (c) Superposition of paths. (d) Remove blocks to obtain cycles.




Multichromosomal breakpoint graph

Multichromosomal breakpoint graph of the whole human and mouse genomes. The conventional chromosome order and orientation are not suitable for such graphs; an optimal chromosome order and orientation were determined by the algorithm in Tesler (2002).

Three "null chromosomes," N1, N2, N3, were added to mouse to equalize the number of chromosomes in the two genomes.




Multiple Genome Rearrangement Problem

Find a phylogenetic tree describing the most "plausible" rearrangement scenario for multiple species.

The genomic distance in the case of genome rearrangement is defined in terms of (1) reversals, (2) translocations, (3) fusions, and (4) fissions, which are the most common rearrangement events in multichromosomal genomes.

The special case of three genomes (m = 3) is called the Median Problem. Given the gene order of three unichromosomal genomes G1, G2, and G3, find the ancestral genome A which minimizes the total reversal distance

\[ d(A, G_1) + d(A, G_2) + d(A, G_3) \]



Multiple Genome Rearrangement Problem

New approach: Given a set of m permutations (existing genomes) of order n, find a tree T with the m permutations as leaf nodes and assign permutations (ancestral genomes) to internal nodes such that D(T) is minimized, where

\[ D(T) = \sum_{(\pi, \gamma) \in T} d(\pi, \gamma) \]

is the sum of reversal distances over all edges (π, γ) of the tree.

The breakpoint analysis attempts to solve the Median Problem by minimizing the breakpoint distance instead of the reversal distance. However, the breakpoint distance, in contrast to the reversal distance, does not correspond to a minimum number of rearrangement events! As a result, the breakpoint median recovered by breakpoint analysis rarely corresponds to the ancestral median, the genome that minimizes the overall number of rearrangements in the evolutionary scenario.



New algorithm

Aim: Among all possible reversals for each of the three genomes, identify good reversals.

A good reversal in a genome G1 is a reversal that brings G1 closer to the ancestral genome. But since the ancestral genome is unknown, it is unclear how to find good reversals directly.

Instead: assume that reversals that reduce both the reversal distance between G1 and G2 and the reversal distance between G1 and G3 are likely to be good reversals.

With Δ(ρ) as the overall reduction in the reversal distances,

\[ \Delta(\rho) = \bigl[ d(G_1, G_2) + d(G_1, G_3) \bigr] - \bigl[ d(G_1 \cdot \rho, G_2) + d(G_1 \cdot \rho, G_3) \bigr], \]

the reversal ρ is good if Δ(ρ) = 2.
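The good-reversal test can be illustrated on tiny unichromosomal genomes, where all reversal distances are available from a brute-force breadth-first search. This is a sketch under that assumption (MGR itself relies on the polynomial-time distance computation); the function names and the example genomes are hypothetical.

```python
from collections import deque


def reversal(perm, i, j):
    """Signed reversal of positions i..j (1-based, inclusive)."""
    p = list(perm)
    p[i - 1:j] = [-gene for gene in reversed(p[i - 1:j])]
    return tuple(p)


def distances_to(target):
    """BFS table: genome -> reversal distance to target (small n only).
    Reversals are their own inverses, so the distance is symmetric."""
    target = tuple(target)
    n = len(target)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                nxt = reversal(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    return dist


def good_reversals(g1, g2, g3):
    """Reversals (i, j) on g1 that reduce d(g1, g2) + d(g1, g3) by exactly 2."""
    d2, d3 = distances_to(g2), distances_to(g3)
    g1 = tuple(g1)
    before = d2[g1] + d3[g1]
    n = len(g1)
    good = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            after = reversal(g1, i, j)
            if before - (d2[after] + d3[after]) == 2:
                good.append((i, j))
    return good


print(good_reversals((1, -3, -2, 4), (1, 2, 3, 4), (1, 2, 3, -4)))  # [(2, 3)]
```

In this example G1 is one reversal away from G2 and two away from G3; undoing the reversal of the block 2 3 brings G1 one step closer to both, so it is the only good reversal.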



New algorithm

Iteratively carry on these good rearrangements until the genomes G1, G2, and G3 are transformed into an identical genome, hoping that this is the most likely "ancestral median".

When we are dealing with multichromosomal genomes and with four different types of rearrangements, ambiguous situations may occur too.



Ambiguities again possible

E.g. G1 = 1 2 3 4 5

G2 = 1 2 -5 -4 -3

G3 = 1 2 $ 3 4 5 (two chromosomes)

The parsimony principle does not allow us to unambiguously reconstruct the evolutionary scenario. If the ancestor coincides with G1, then a reversal occurred on the way to G2, and a fission occurred on the way to G3.

One can just as well start with G2 or G3 as the ancestor, because in this case d(G1,G2) = d(G1,G3) = d(G2,G3) = 1.

This kind of ambiguity does not exist for unichromosomal genomes because, there, it is impossible to find 3 genomes that would all be within one reversal of each other.
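That the three genomes of the example are pairwise one step apart can be checked directly. The sketch below assumes the list-of-lists representation and compares genomes up to chromosome order and flips; all function names are illustrative.

```python
# Sketch: G1, G2, G3 from the example are pairwise one rearrangement apart.

def flip(chrom):
    return tuple(-gene for gene in reversed(chrom))


def canonical(genome):
    """Normal form that ignores chromosome order and chromosome direction."""
    return sorted(min(tuple(chrom), flip(chrom)) for chrom in genome)


def reverse_interval(chrom, i, j):
    """Signed reversal of positions i..j (1-based, inclusive) of one chromosome."""
    c = list(chrom)
    c[i - 1:j] = [-gene for gene in reversed(c[i - 1:j])]
    return c


def fission(chrom, cut):
    """Break a chromosome after its first `cut` genes."""
    return [chrom[:cut], chrom[cut:]]


g1 = [[1, 2, 3, 4, 5]]
g2 = [[1, 2, -5, -4, -3]]
g3 = [[1, 2], [3, 4, 5]]

# G1 -> G2: one reversal; G1 -> G3: one fission; G2 -> G3: one fission
# (the fragment -5 -4 -3 is the flipped form of 3 4 5).
print(canonical([reverse_interval(g1[0], 3, 5)]) == canonical(g2))  # True
print(canonical(fission(g1[0], 2)) == canonical(g3))                # True
print(canonical(fission(g2[0], 2)) == canonical(g3))                # True
```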



Strategy for choosing reversals

Therefore one has to select carefully among the good rearrangements. Observe that in most genomes of interest reversals and translocations are more common than fusions and fissions.

Therefore, as a rule, always select reversals/translocations before fusions/fissions.

Often, the list of good reversals contains nonoverlapping reversals, and the order in which these reversals are performed is often irrelevant. Compute for each good reversal ρ the number of good reversals n(ρ) that will be available if ρ is carried out. Then choose the good reversal with the maximal n(ρ) to be carried out.

If we run out of good reversals before reaching a solution, the best reversal to be taken will be the result of a depth k search minimizing the total pairwise rearrangement distances.



How good a measure is the reversal distance?

The authors claim that the reversal distance is a good approximation of the true distance for many biologically relevant cases.

Let π be a genome that evolved from a genome γ by k reversals, i.e. the true distance between π and γ is k.

We say that π and γ form a valid pair if d(π, γ) = k. Otherwise we say that d(π, γ) underestimates the true distance.

Typically two genomes form a valid pair if the number of rearrangements between them is relatively small – exactly the case in a number of genome rearrangement studies.



Reversal distance vs. True distance

Reversal distance, d(π, γ), versus the actual number of reversals performed to transform γ into π, where π is a genome/permutation that evolved from the identity permutation γ = 1, 2, ..., 100 by k random reversals. The simulations were repeated 10 times for every k. Shown is the average difference between the reversal distance and the actual number of reversals performed (k).

For a genome with n = 100 markers, the reversal distance approximates the true distance very well as long as the number of reversals remains below 0.4 n. This is the case in many biologically relevant cases.

Bourque, Pevzner, Genome Res (2002)
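The idea behind this experiment can be mimicked on a toy scale. The sketch below is not the simulation of Bourque & Pevzner: it assumes a tiny genome (n = 5 instead of 100) so that exact reversal distances can be tabulated by brute-force breadth-first search, and the random-reversal model (uniform choice among all intervals) and all names are illustrative assumptions.

```python
import random
from collections import deque


def reversal(perm, i, j):
    """Signed reversal of positions i..j (1-based, inclusive)."""
    p = list(perm)
    p[i - 1:j] = [-gene for gene in reversed(p[i - 1:j])]
    return tuple(p)


def distances_from(source):
    """BFS table: genome -> reversal distance from source (small n only)."""
    source = tuple(source)
    n = len(source)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for i in range(1, n + 1):
            for j in range(i, n + 1):
                nxt = reversal(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    return dist


def simulate(n=5, max_k=8, repeats=10, seed=0):
    """Average of (k - reversal distance) after k random reversals."""
    rng = random.Random(seed)
    identity = tuple(range(1, n + 1))
    dist = distances_from(identity)
    intervals = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    averages = {}
    for k in range(max_k + 1):
        gaps = []
        for _ in range(repeats):
            genome = identity
            for _ in range(k):
                genome = reversal(genome, *rng.choice(intervals))
            gaps.append(k - dist[genome])  # 0 means a valid pair
        averages[k] = sum(gaps) / repeats
    return averages


for k, gap in simulate().items():
    print(k, gap)
```

The gap is never negative, because the reversal distance cannot exceed the number of reversals actually performed; how quickly it grows with k depends on n.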



Test on simulated data

Start from the identity permutation A with n genes/markers (n = 30 or 100). k reversals were performed to get genome G1, k to get genome G2, and k to get genome G3.

Use these as input to MGR-MEDIAN and GRAPPA. Check whether the programs reconstruct the ancestral identity permutation.

The simulations were repeated 10 times for every ratio r = #reversals/#markers = 3k/n.



Comparison of MGR-MEDIAN and GRAPPA

(a) and (b) show the average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 3k). (c) and (d) show the average reversal distance between the solution recovered and the actual ancestor.

GRAPPA and MGR-MEDIAN produce very similar solutions for r < 0.20. As the ratio r increases, GRAPPA starts making errors. MGR-MEDIAN sometimes finds solutions that have even fewer reversals than the actual tree. Reason: for increasing r, the assumption that the ancestor corresponds to the most parsimonious scenario sometimes fails.




Tests on simulated data: non-equidistant genomes

The genomes G1, G2, and G3 are obtained by k, k, and 2k reversals, respectively, from the ancestral identity permutation 1 2 ... n (n = 30 and n = 100). The simulations were repeated 10 times for every ratio #reversals/#markers = 4k/n.

Figs (a) - (d) have the same meaning as in the previous figure. The same behavior is found.

Also tested for 4 - 10 genomes. GRAPPA can't do more than 10 genomes because the tree space is too large.




Herpesvirus Data

Herpes simplex virus (HSV), Epstein-Barr virus (EBV), and Cytomegalovirus (CMV) gene orders (Hannenhalli et al. 1995 ) as well as the ancestral gene order (A) and optimal evolutionary scenario recovered by MGR-MEDIAN.

MGR finds solution with 7 reversals, GRAPPA finds 8 reversals.

Here, the ratio r of #reversals / #markers is 7/25 = 0.28.




mtDNA of human, fruit fly, and sea urchin

Human, sea urchin, and fruit fly mitochondrial gene order taken from Sankoff et al. (1996). A is the ancestral gene order suggested by MGR-MEDIAN.

The solution found is different from that of Sankoff et al., but the total reversal distance (39) is the same. Here, the ratio of #reversals / #markers is 39/33 = 1.18, marking this as a difficult problem. Running GRAPPA on these genomes gives a solution with a total reversal distance of 43.




Metazoan mtDNA data

Data (36 common genes) of 11 metazoan genomes that were studied before by BPA.

Shown here: Phylogeny reconstructed by MGR. The genomes come from 6 major metazoan groupings: nematodes (NEM), annelids (ANN), mollusks (MOL), arthropods (ART), echinoderms (ECH), and chordates (CHO). Numbers show the number of reversals (150 in total).

The tree is very similar to that of Blanchette et al., which was constructed in a semiautomated fashion. After 48 CPU hours, GRAPPA finds three optimal trees with 175 reversals and 200 breakpoints.




Campanulaceae cpDNA data

Campanulaceae chloroplast data with 13 cpDNAs and 105 markers. The tree space for 13 genomes cannot be searched exhaustively by GRAPPA. Therefore, trees were constrained by Moret et al. (2001). They found 216 trees with a total of 67 reversals.

MGR (without using constraints) gives a tree with 65 reversals. The tree topology corresponds to the GRAPPA tree, but the labelling of internal nodes differs.




Nadeau & Taylor model (1984)

- suggested the presence of conserved segments (i.e., segments with preserved gene orders without disruption by rearrangements)

- estimated that there are ca. 180 conserved segments in human and mouse

- provided convincing evidence that the random breakage model of genomic evolution postulated by Ohno is correct. The model assumes a random (i.e., uniform and independent) distribution of chromosome rearrangement breakpoints and is supported by the observation that the lengths of synteny blocks shared by human and mouse are well fitted by the exponential distribution predicted by the random breakage model,

\[ f(x) = \frac{1}{L} e^{-x/L}, \]

where L is the average length of segments.

- the model has become widely accepted

- new studies of significantly larger datasets confirmed that newly discovered synteny blocks still fit the predicted exponential distribution very well.



Breakpoint reusage

Two different most parsimonious scenarios that transform the order of the 11 synteny blocks on the mouse X chromosome into the order on the human X chromosome. The arrangement of synteny blocks in the ancestor is unspecified (and is assumed to coincide with one of intermediate arrangements) because it cannot be inferred without availability of a third genome. Breakpoint uses are shown as short vertical yellow lines, and breakpoint region reuses are shown as double yellow lines. In the first scenario (Left) the breakpoint reuses are located in human in breakpoint regions (3,4), (4,5), and (5,6), whereas in the second one (Right) they are located in (5,6), (6,7), and after block 11. In the second scenario, a potential hidden block is shown as a black dot; it restricts the set of possible most parsimonious scenarios, and it separates two breakpoint uses that would have been a breakpoint region reuse. Our theory implies that any rearrangement scenario based on these 11 blocks has at least three reuses of breakpoint regions (possibly including chromosome ends).

Pevzner, Tesler, PNAS 100, 7672 (2003)



Breakpoint reusage

Extension of the Hannenhalli–Pevzner theory implies that any rearrangement scenario based on these 11 blocks has at least three reuses of breakpoint regions (although one cannot unambiguously infer where these breakpoint reuses happened).

This indicates that there are at least three more "hidden" synteny blocks in addition to our 11 "large" synteny blocks. Some of these blocks may be detected by lowering the threshold for synteny block detection, whereas others may escape such detection.

The analysis further reveals at least 190 breakpoint region reuses over the whole genome on the evolutionary path from mouse to human.

Pevzner, Tesler, PNAS 100, 7672 (2003)



Length of synteny blocks

(Left) Histogram of synteny block lengths in human for Nb = 281 synteny blocks of length at least 1 Mb, fitted by an exponential distribution with mean block length L = Gb/Nb = 9.6 Mb, where Gb = 2,707 Mb is the overall length of syntenic blocks. The bin size is 2.5 Mb. (Center) The same histogram superimposed with the 190 hidden synteny blocks revealed by genome rearrangement analysis, under the assumption that all hidden blocks are short, i.e., <1 Mb in length. (Right) Histogram of breakpoint region lengths in the human genome (bin size is 100 kb). Most breakpoint regions are very short, with 109 of 258 regions being <100 kb. However, there is a small number of long breakpoint regions: 17 regions are between 1 and 2.5 Mb, and 15 are >2.5 Mb (shown by a single bar at the right end).

The rearrangement analysis confirms the existence of many short breakpoints. Their existence immediately implies that an exponential distribution is not a good fit to reality, thus pointing to limitations of the random breakage model.

Pevzner, Tesler, PNAS 100, 7672 (2003)
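The exponential fit quoted in this caption can be turned into expected histogram counts. The sketch below uses only the numbers given above (Nb = 281, Gb = 2,707 Mb, bin size 2.5 Mb) and the density f(x) = (1/L) e^(-x/L) of the random breakage model. As a simplifying assumption it ignores that only blocks of at least 1 Mb were counted, and the function name and number of bins are hypothetical.

```python
import math


def expected_counts(n_blocks, mean_length, bin_size, n_bins):
    """Expected number of blocks per length bin under f(x) = (1/L) exp(-x/L).
    The probability of a length in [a, b) is exp(-a/L) - exp(-b/L)."""
    counts = []
    for k in range(n_bins):
        a, b = k * bin_size, (k + 1) * bin_size
        prob = math.exp(-a / mean_length) - math.exp(-b / mean_length)
        counts.append(n_blocks * prob)
    return counts


n_blocks = 281                          # Nb from the caption
total_length = 2707.0                   # Gb in Mb
mean_length = total_length / n_blocks   # L = Gb / Nb, about 9.6 Mb

print(round(mean_length, 1))  # 9.6
for k, count in enumerate(expected_counts(n_blocks, mean_length, 2.5, 8)):
    print(f"{k * 2.5:4.1f}-{(k + 1) * 2.5:4.1f} Mb: {count:5.1f}")
```

Under this model the expected counts fall off by the same factor from one bin to the next, which is why an excess of very short blocks shows up as a misfit in the first bin.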



Rat – mouse – man

Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004)

Aligned portions and origins of sequencesin rat, mouse and human genomes.




The Rat Sequencing Consortium, Nature 428, 493 (2004)




The Rat Sequencing Consortium, Nature 428, 493 (2004)

X chromosome on each pair. GRIMM synteny for 16 orthologous pairs. Arrangement of the 16 blocks: 15 rearrangement events necessary. Shown is one of a number of most parsimonious inversion scenarios. The last common ancestor of human, mouse and rat should be on the evolutionary path between the median ancestor and human.




Rat – mouse – man – example 2





Full evolutionary model (all chromosomes)



Summary

Breakpoint analysis (BPA) is a robust technique for small rearrangement problems. It suffers from ambiguity between different optimal solutions. Although complexity could be dramatically reduced by algorithmic improvements (e.g. GRAPPA), the method is still too expensive for more than 10 genomes.

The heuristic MGR algorithm by Bourque & Pevzner minimizes the reversal distance instead of the breakpoint distance. (Half the number of breakpoints, b/2, is not a tight lower bound for the reversal distance.) It runs more efficiently, can be applied to much larger problems, and provides only one or a few solutions.

MGR algorithm: analogy to conformational search in some energy landscape ...

What is the correct way to identify the biologically correct = true evolutionary trees: by minimizing the breakpoint distance or the reversal distance or something else?



Summary II

Joint analysis of > 2 genomes allows common ancestors to be identified. But just 3 genomes are not sufficient to identify a unique most parsimonious evolutionary path.

Right now we don't understand what to do with the repeats. Why throw them away (using RepeatMasker)? Following duplication history should be as powerful as following genome rearrangement.
