Identifying conserved spatial patterns in genomes

105
Identifying conserved spatial patterns in genomes Rose Hoberman Computer Science Department Carnegie Mellon University University of Chicago 0ct 20, 2006

description

Identifying conserved spatial patterns in genomes. Rose Hoberman Computer Science Department Carnegie Mellon University. University of Chicago 0ct 20, 2006. My focus: Spatial Comparative Genomics. - PowerPoint PPT Presentation

Transcript of Identifying conserved spatial patterns in genomes

Page 1: Identifying conserved spatial patterns  in genomes

Identifying conserved spatial patterns

in genomes

Rose HobermanComputer Science Department

Carnegie Mellon University

University of Chicago0ct 20, 2006

Page 2: Identifying conserved spatial patterns  in genomes

My focus: Spatial Comparative Genomics

Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and

evolves.

Page 3: Identifying conserved spatial patterns  in genomes

A simple model of a genome

an ordered list of genes

4 53 71 62 8 9 11 12 1310 14 15 1716 19 20183

Page 4: Identifying conserved spatial patterns  in genomes

Genomic Change

speciation

Sequence Mutation + Chromosomal Rearrangements

species 2

species 1

Ancestral genome

Page 5: Identifying conserved spatial patterns  in genomes

Types of Genomic Rearrangements

4 53 71 62

8 97 11 12104 531 62 13 14 15 1716 19 2018

8 9 11 12 1310 14 15 1716 19 2018

Inversions

431 2

Duplications/Insertions

3 89111213 10141517 161920 18

20191813 14 15 1716

Loss

Page 6: Identifying conserved spatial patterns  in genomes

8 97 11 12104 531 62 431 2 2013 14 15 1716

Fissions and fusions

4 53 71 2 3 13141517 161920 18 3 891112 10

Types of Genomic Rearrangements

InversionsDuplications/InsertionsLoss

Page 7: Identifying conserved spatial patterns  in genomes

An Essential Task forSpatial Comparative Genomics

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

3

431 2

891112 10

Species 2

Species 1

Identify chromosomal regions that descended from the same region in the genome of the common ancestor

Page 8: Identifying conserved spatial patterns  in genomes

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Page 9: Identifying conserved spatial patterns  in genomes

Pevzner, Tesler. Genome Research 2003

Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution

Reconstruct history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications

Page 10: Identifying conserved spatial patterns  in genomes

Identification of homologous chromosomal segments is a key task in comparative genomics

Guillaume Bourque et al. Genome Research 2004

1. Genome evolution Reconstruct

history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications

Page 11: Identifying conserved spatial patterns  in genomes

Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution

Reconstruct history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications

Whole genome duplication

chromosome 2chromosome 1

Ancestral chromosome

chr 2

chr 1

Page 12: Identifying conserved spatial patterns  in genomes

Identification of homologous chromosomal segments is a key task in comparative genomics1. Genome evolution

Reconstruct history of chromosomal rearrangements

Infer ancestral genetic map

Phylogeny reconstruction

Identify ancient whole genome duplications McLysaght et al

Nature Genetics, 2002.

Page 13: Identifying conserved spatial patterns  in genomes

Infer functional associations

Predict operons

Identify horizontal transfers

2. Understand gene function and regulation in bacteria

Identification of conserved chromosomal structure is a key task in comparative genomics

Loss

Insertion

Inversions

Page 14: Identifying conserved spatial patterns  in genomes

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Page 15: Identifying conserved spatial patterns  in genomes

Closely related genomes

Related regions are easy to identify

4 53 71 2

8 97 11 12104 531 62 431 2

13141517 161920 18

2013 14 15 1716

3

431 2

891112 10Species 1

Species 2

Page 16: Identifying conserved spatial patterns  in genomes

Five hundred million years...

Page 17: Identifying conserved spatial patterns  in genomes

Homologous regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly

preserved

More Distantly Related Genomes

45

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

18

2013 1415 17 16

3

4 1

18

7

11

Page 18: Identifying conserved spatial patterns  in genomes

Gene clusters Similar gene content Neither gene content nor order is perfectly

45

3 71 2

89 11 12104 51 62 31 2

89 11121310 141517 161920

18

2013 1415 17 16

3

4 1

18

7

11

The signature of diverged regions

Page 19: Identifying conserved spatial patterns  in genomes

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster”

3. Design an algorithm to find clusters

4. Statistically verify clusters

Page 20: Identifying conserved spatial patterns  in genomes

Why Validate Clusters Statistically?

After sufficient time has passed, gene order will become randomized

Uniform random data tends to be “clumpy”

Some genes will end up close together in both genomes simply by chance

43 71 2

89 11 12104 51 62 31 2

1310 141517 161920

2013 1415 17 16

3

4 1

18

7

11

Page 21: Identifying conserved spatial patterns  in genomes

Cluster Statistics

r-windows

max-gap

2 regionsDurand & Sankoff,

2003my work

3 regions my work unsolved

Cluster Models

Page 22: Identifying conserved spatial patterns  in genomes

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Page 23: Identifying conserved spatial patterns  in genomes

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster”

3. Design an algorithm to find clusters

4. Statistically verify clusters

Page 24: Identifying conserved spatial patterns  in genomes

Gene Homology

Identification of homologous gene pairs generally based on sequence similarity conserved genomic context is also informative

Assumptions matches are binary (similarity scores are

discarded) each gene is homologous to at most one other

gene in the other genome

Page 25: Identifying conserved spatial patterns  in genomes

A Framework for Identifying Gene Clusters

1. Find homologous genes Formally define a “gene cluster”

3. Devise an algorithm to identify clusters

4. Statistically verify clusters

Page 26: Identifying conserved spatial patterns  in genomes

Where are the gene clusters?

Intuitive notion: pairs of regions that are dense with homologs

How can we formalize this intuition?

Page 27: Identifying conserved spatial patterns  in genomes

The r-window cluster definition

Two windows of size r that share at least m homologous gene pairs

r is a user-specified parameter

r = 6

(Calvacanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)

Page 28: Identifying conserved spatial patterns  in genomes

A max-gap chain

The distance or “gap” between genes is equal to the number of intervening genes

A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never

greater than g (a user-specified parameter)

g= 2 gap= 3

Page 29: Identifying conserved spatial patterns  in genomes

The max-gap cluster definition

g= 2 g= 3gap= 3

A set of genes form a max-gap cluster in two genomes if

1. the genes forms a max-gap chain in each genome

2. the cluster is maximal (i.e. not contained within a larger cluster)

Page 30: Identifying conserved spatial patterns  in genomes

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster” Devise an algorithm to identify

clusters

4. Statistically verify clusters

Page 31: Identifying conserved spatial patterns  in genomes

Max-gap search algorithms

Many genomic studies use a max-gap criteria Each group designs their own search algorithm These are often greedy algorithms, but greedy

algorithms miss disordered max-gap clusters

Hoberman et al, RECOMB Comp Genomics 2005

Page 32: Identifying conserved spatial patterns  in genomes

Greedy, Agglomerative Algorithms

1. initialize a cluster as a single homologous pair

2. search for a gene in proximity on both chromosomes

3. either extend the cluster and repeat, or terminate

g = 2

Page 33: Identifying conserved spatial patterns  in genomes

Greedy Algorithms Impose Order Constraints

g = 2

A max-gap cluster of size four

A greedy, agglomerative algorithm will not find this cluster since there is no max-gap cluster of size

two

Page 34: Identifying conserved spatial patterns  in genomes

Algorithms and Definition Mismatch

Greedy, bottom-up algorithms will not find all max-gap clusters

There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)

Cluster statistics depend on the search space, which depends on which algorithm is used

Page 35: Identifying conserved spatial patterns  in genomes

A Framework for Identifying Gene Clusters

1. Find homologous genes

2. Formally define a “gene cluster”

3. Devise an algorithm to identify clusters

Statistically verify clustersAn example…

Page 36: Identifying conserved spatial patterns  in genomes

Example: Whole genome self-comparison to detect duplicated blocks

Chose g=30

Compared all human chromosomes to all other chromosomes to find max-gap gene clusters

McLysaght et al. Nature Genetics, 2002.

Page 37: Identifying conserved spatial patterns  in genomes

How can we use statistical analysis?

Could two regions display this degree of similarity simply by chance?

Is g=30 a reasonable choice of gap size? Are larger clusters less likely to occur by

chance? How large does a cluster have to be before we

are surprised to observe it?

McLysaght et al. Nature Genetics, 2002.

3 genes duplicatedout of ~25 13 genes

29 genes

10 genes duplicatedout of ~100

Chr 3

Chr 17

Page 38: Identifying conserved spatial patterns  in genomes

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Page 39: Identifying conserved spatial patterns  in genomes

Cluster Statistics

r-windows

max-gap

2 regionsDurand & Sankoff,

2003my work

3 regions my work unsolved

Cluster Models

Page 40: Identifying conserved spatial patterns  in genomes

The max-gap definition is the most widely used cluster definition in genomic analyses

Yet there is no formal statistical model for max-gap clusters

Overbeek et al 1999: inferring functional coupling of genes in bacteriaVision et al 2000: origins of genomic duplications in ArabidopsisFriedman and Hughes 2001: gene duplication and structure of eukaryotic

genomesTamames 2001: evolution of gene order conservation in prokaryotesVandepoele et al 2002: microcolinearity between Arabidopsis and riceMcLysaght et al 2002: genomic duplication during early chordate

evolutionSimillion et al 2002: hidden duplications in Arabidopsis Blanc et al 2003: recent polyploidy in ArabidopsisLuc et al 2003: gene teams for comparative genomicsChen et al 2004: operon prediction in newly sequenced bacteriaBourque et al 2005: comparison of mammalian and chicken genome

architectures …

Page 41: Identifying conserved spatial patterns  in genomes

Is this cluster biologically meaningful? Could it have occurred in a comparison of

two random genomes?

The QuestionSuppose two whole genomes were

compared, and this max-gap cluster was identified:

Page 42: Identifying conserved spatial patterns  in genomes

Statistical Testing

Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order

Discard clusters that could have arisen under the null model

Determine the probability of observing a similar cluster under the null hypothesis

Page 43: Identifying conserved spatial patterns  in genomes

Given an allowed gap size of g, what is the probability of observing a max-gap cluster

containing exactly h matching gene pairs?

How do we calculate this probability?

h=4

The Problem

Page 44: Identifying conserved spatial patterns  in genomes

The Inputs

n: number of genes in each genome

m: number of matching genes pairs

g: the maximum gap allowed in a cluster

h: number of matching genes in the cluster

h=4

m=6 n=22

g=2

Page 45: Identifying conserved spatial patterns  in genomes

The probability when m = n

…the probability of a max-gap cluster is 1

(regardless of the allowed gap size)

If gene content is identical…

Page 46: Identifying conserved spatial patterns  in genomes

Probability of a cluster of size h when m < n

h genes m genesm-h genes

1. Create chains of h genes in both

genomes

2. Place m-h remaining genes so they do not

extend the cluster

3. Normalize to get a probability

*

Basic approachEnumerate all ways to:

Page 47: Identifying conserved spatial patterns  in genomes

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

All configurations of m gene pairs in two genomes of size n

Page 48: Identifying conserved spatial patterns  in genomes

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

All configurations of m gene pairs in two genomes of size n

Page 49: Identifying conserved spatial patterns  in genomes

Number of ways to place h genes in two genomes so they form a cluster

h genes m genesm-h genes

Choose h genes to compose

the cluster

Assign each gene to a

selected spot in each genome

Select h spots in each genome, so they form a

max-gap chain

Hoberman, Sankoff, Durand RECOMB Comparative Genomics, 2004

Page 50: Identifying conserved spatial patterns  in genomes

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

All configurations of m gene pairs in two genomes of size n

Page 51: Identifying conserved spatial patterns  in genomes

Approach: design a rule specifying where the genes can

be placed so that the cluster is not extended count the positions that satisfy the rule

Counting the number of ways to place m-h genes outside the cluster

g = 1

h = 3

Page 52: Identifying conserved spatial patterns  in genomes

Rule 1: A gene can go anywhere except in the cluster (the white box).

g = 1

gaps ≤ 1

Counting the number of ways to place m-h genes outside the cluster

Too lenient

overcounts

Page 53: Identifying conserved spatial patterns  in genomes

Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

g = 1

g + 1

g +1

Counting the number of ways to place m-h genes outside the cluster

Too strict

undercounts

Page 54: Identifying conserved spatial patterns  in genomes

g = 1

h = 3 gap > 1

gap > 1

Counting the number of ways to place m-h genes outside the cluster

Rule 2: Every gene must be at least g+1 positions from

the cluster (outside the grey box).

Too strict

undercounts

Page 55: Identifying conserved spatial patterns  in genomes

g = 1

gap > 1

Counting the number of ways to place m-h genes outside the cluster

Rule 3: At most one member of each gene pair can be in

the grey box.

Too lenient

overcounts

Page 56: Identifying conserved spatial patterns  in genomes

Rule 3: At most one member of each gene pair can be in

the grey box.

g = 1

Counting the number of ways to place m-h genes outside the clustergaps ≤ 1

Too lenient

overcounts

Page 57: Identifying conserved spatial patterns  in genomes

g = 1

Acceptable positions for a gene depend on the positions of the remaining genes

Use strict and lenient rules to calculate upper and lower bounds on G

Counting the number of ways to place m-h genes outside the cluster

Page 58: Identifying conserved spatial patterns  in genomes

Estimating G

Upper bound: Erroneously

enumerates this configuration

Lower bound: Fails to enumerate this configuration

Page 59: Identifying conserved spatial patterns  in genomes

Probability of observing a cluster of size h

number of ways to place h genes so they form a chain

in both genomes

number of ways to place m-h remaining genes so they do not

extend the cluster

Hoberman, Sankoff, Durand Journal of Computational Biology, 2005

Page 60: Identifying conserved spatial patterns  in genomes

What can we learn from this statistical result? Are we less likely to observe a large cluster

(containing more gene pairs) than a small cluster?

How large does a cluster have to be before we are surprised to observe it?

How do we choose the maximum allowed gap value?

Page 61: Identifying conserved spatial patterns  in genomes

Clu

ster

Pro

babi

lity

g=10

Whole-genome comparison cluster statistics

g=20

n=1000, m=250

h (cluster size)

With a significance

threshold of 10-4, any cluster

containing 8 or more genes is

significant.

Page 62: Identifying conserved spatial patterns  in genomes

E. coli vs B. Subtilis

Under null hypothesis, by g=25 all genes should form a single cluster

Completecluster doesn’t form until g=110

Typical operonsizes

clusters above the orange line are significant at the .001 level

Algorithm:Bergeron et al, 02

Statistics:Hoberman et al, 05

By g=9, probabilities no longer decrease monotonically with cluster size

Page 63: Identifying conserved spatial patterns  in genomes

Statistical analysis of max-gap gene clusters

1. Data analysis: Identifies statistically significant max-gap clusters

2. Search: Suggests a more principled approach for choosing a gap size that will yield significant clusters

3. Modeling: Provides insight into criteria for cluster definitions larger clusters should be less likely to occur by

chance

Page 64: Identifying conserved spatial patterns  in genomes

Outline

Introduction and Motivation Evolution of spatial organization Applications: why identify related genomic

regions?

Problem Background Why is this challenging? Introduction to cluster finding

Results Statistics for pairwise clusters Statistics for three-way clusters

Page 65: Identifying conserved spatial patterns  in genomes

Cluster Statistics

r-windows

max-gap

2 regionsDurand & Sankoff,

2003my work

3 regions my work unsolved

Cluster Models

Joint work with Narayanan Raghupathy

Page 66: Identifying conserved spatial patterns  in genomes

Duplicated segments can be particularly difficult to identify

Whole genome duplication

chromosome 2

chromosome 1

Ancestral chromosome

chr 2

chr 1

Page 67: Identifying conserved spatial patterns  in genomes

Duplicated segments can be particularly difficult to identify

Ancestral chromosome

chr 2

chr 1

Whole genome duplication

Reciprocal gene loss

Page 68: Identifying conserved spatial patterns  in genomes

Comparisons of multiple genomes offer significantly more power to detect highly diverged homologous segments

Arabidopsis thaliana region 1

Rice chr 2

Arabidopsis thaliana region 2

Simillion et al, PNAS 2002

Rice

At 2At 1

5 3

0

1

20 22

22

126 26

At 2At 1

Page 69: Identifying conserved spatial patterns  in genomes

Existing Statistical Approaches

x2

x23x123

x3x13x1

x12

1. Only consider x123, genes shared between all regions disregards pairwise overlaps2. Conduct multiple pairwise comparisons does not consider the greater impact of genes

shared among all three regions requires that at least two pairwise clusters are

independently significant

W1W3

W2

Durand and Sankoff, J Comp Biology, 2003Methods reviewed in Simillion et al, Bioessays, 2004

No existing statistical approach considers all these quantities in a single test.

Page 70: Identifying conserved spatial patterns  in genomes

r-windows for pairwise comparison

Two windows of size r that share at least m homologous gene pairs

r = 6

Page 71: Identifying conserved spatial patterns  in genomes

With two regions there is a natural test statistic

An r-window cluster spanning two regions can be characterized by the number of shared

genes, m.

mr-m r-m

Durand and Sankoff, J Comp Biology, 2003

The probability that two random windows share m

genes:

r = 6

Page 72: Identifying conserved spatial patterns  in genomes

With three regions there are many more quantities to consider

Three-way overlap: x123

Pairwise overlaps: x12, x23, x13

x3

x23x123

x2x12x1

x13

We want to consider both types of overlaps

Page 73: Identifying conserved spatial patterns  in genomes

Our r-window statistics for three windows

Given three randomly selected windows of length r,

we derived equations for the probability of observing such a cluster

under a null hypothesis of random gene order.

),,,( 232313131212123123 xXxXxXxXP

Raghupathy, Hoberman, Durand. APBC 2007.

x3

x23x123

x2x12x1

x13

First tests that uses both pairwise and three-way overlaps

Page 74: Identifying conserved spatial patterns  in genomes

How serious are these limitations?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

Page 75: Identifying conserved spatial patterns  in genomes

What is the impact of ignoring three-way genes?

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

030

0 303

3

Each pair of regions shares exactly

three genes

Page 76: Identifying conserved spatial patterns  in genomes

Which cluster is least likely to occur by chance?

a)

0h0

0

h0h

h

b)

n=5000, r=100

h

h

hxxij 123,00, 123 xhxij

Page 77: Identifying conserved spatial patterns  in genomes

(a) is less likely to occur than (b)

a) Three genes are shared by all three windows

b) Three distinct genes are shared by each pair of windows

Pairwise tests cannot distinguish these two clusters

but pairwise tests score them identically

Page 78: Identifying conserved spatial patterns  in genomes

How much of a limitation is this?

Pairwise tests have a number of limitations:

1. They do not consider the greater impact of genes shared among all three regions

2. They require that at least two pairwise comparisons are independently significant

Our derived expressions can be used to compare pairwise and three-way cluster probabilities for typical genome parameters and window sizes.

Page 79: Identifying conserved spatial patterns  in genomes

When no genes appear in all three regions, are pairwise statistical tests sufficient?

n=5000, r =100, x123=0

h

α=0.001

Two Pairwise Tests

h hh

h ?h

Three-window

0

0

Page 80: Identifying conserved spatial patterns  in genomes

When no genes appear in all three regions, are pairwise statistical tests sufficient?

The three-window test is strictly more general than two pairwise tests

The three-window test will always reject the null hypothesis for any cluster in which the pairwise tests reject the null hypothesis

Two Pairwise Tests

5 55

8 ?8

Three-window

0

0

Page 81: Identifying conserved spatial patterns  in genomes

These results suggest that multi-region tests can identify more distantly related homologous regions.

There is an ongoing debate about whether there was 1 or 2 whole genome duplications in early vertebrates

More powerful statistical tests might help resolve this issue

Page 82: Identifying conserved spatial patterns  in genomes

Summary1. First statistical tests for max-gap clusters,

the most commonly used definition determines minimum significant cluster size allows more principled selection of gap

parameter reveals problems with max-gap definition

2. More sensitive statistical tests for comparisons of three genomic regions first test to consider both pairwise and three-

way overlaps can identify more distantly related regions may be able to resolve outstanding questions in

molecular evolution

Page 83: Identifying conserved spatial patterns  in genomes

Acknowledgements

Collaborators Dannie Durand, Biological Sciences and

Computer Science, CMU David Sankoff, Math and Statistics University of

Ottawa Narayanan Raghupathy, Biological Sciences,

CMU

Funders Barbara Lazarus Women@IT Fellowship The Sloan Foundation

Page 84: Identifying conserved spatial patterns  in genomes

Thanks

Page 85: Identifying conserved spatial patterns  in genomes

Questions?

Page 86: Identifying conserved spatial patterns  in genomes
Page 87: Identifying conserved spatial patterns  in genomes
Page 88: Identifying conserved spatial patterns  in genomes
Page 89: Identifying conserved spatial patterns  in genomes
Page 90: Identifying conserved spatial patterns  in genomes
Page 91: Identifying conserved spatial patterns  in genomes
Page 92: Identifying conserved spatial patterns  in genomes
Page 93: Identifying conserved spatial patterns  in genomes
Page 94: Identifying conserved spatial patterns  in genomes

Advantages of an analytical approach

Analyzing incomplete datasets Principled parameter selection Understanding statistical trends Insight into tradeoffs between

definitions Efficiency Accuracy

Page 95: Identifying conserved spatial patterns  in genomes

1 2 3 4 5 …. n-L+1 …. n

Ways to place the leftmost gene in the chain, so there are at least L-1 places left

The maximum length of the chain is: L = (h-1)g + h

The number of ways to create a chain of h genes

Page 96: Identifying conserved spatial patterns  in genomes

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

Choices for the size of each gap

(from 0 to g)

The number of ways to create a chain of h genes

Page 97: Identifying conserved spatial patterns  in genomes

Chains near the end of

the genome

Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

There are h-1 gaps in a chain of h genes

1 2 3 4 5 …. n-L+1 …. n

Choices for the size of each gap

(from 0 to g)

The number of ways to create a chain of h genes

Page 98: Identifying conserved spatial patterns  in genomes

l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

l = m

Page 99: Identifying conserved spatial patterns  in genomes

g1 g2 g3 gm-1

l < w

A known solution:

Page 100: Identifying conserved spatial patterns  in genomes

l = w-1

Gaps are constrained:

And sum of gaps is constrained:

Counting clusters at the end of the genome

l = m

Page 101: Identifying conserved spatial patterns  in genomes

d(m,g,m)

w-2

Cluster Length

d(m,g,m)

+ …

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

Line of Symmetry

l = w-1

l = m

Page 102: Identifying conserved spatial patterns  in genomes

Exploiting Symmetry

g=

1

1

g g-1

g-1

=

wm

m+1 w-1

1

2 g-2

=m+2 w-2 1

2 g-2

g

g

g-1 g-1

g

g

l=

Page 103: Identifying conserved spatial patterns  in genomes

d(m,g,m)

w-2

Cluster Length

d(m,g,m)

+ …

d(m,g,m) + d(m,g,m+1) +

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2)

d(m,g,m) + d(m,g,m+1) + … + d(m,g,w-2) + d(m,g,w-1)

ww-1 … m+1m

d(m,g,m+1)+

d(m,g,m) d(m,g,m+1)+

(g+1)m-1

(g+1)m-1

(g+1)m-1

l = w-1

l = m

Page 104: Identifying conserved spatial patterns  in genomes

Number of ways to position h genes in a genome of n genes so they form a max-gap chain

Ways to placeremaining h-1

genes

Starting positions near end

Starting positions

Page 105: Identifying conserved spatial patterns  in genomes