

Whole Genome Mammalian Clone Sets for High-Resolution BAC Arrays

Krzywinski M¹, Bosdet I¹, Smailus D¹, Chiu R¹, Mathewson C¹, Wye N¹, Asano J¹, Barber S¹, Brown-John M¹, Chan S¹, Chand S¹, Chittaranjan S¹, Cloutier A¹, Fjell C¹, Girn N¹, Gray C¹, Kutsche R¹, Lee D¹, Lee SS¹, Masson A¹, Mayo M¹, McLeavy C¹, Olson T¹, Pandoh P¹, Anna-Liisa Prabhu¹, Shin H¹, Spence L¹, Stott J¹, Taylor S¹, Tsai M¹, Yang G¹, Albertson D², Lam W¹, Erik Shoenmakers³, Choy C⁴, Osoegawa K⁴, Zhao S⁵, de Jong P⁴, Schein J¹, Jones S¹, Marra M¹

The ability to detect and localize chromosomal rearrangements with a high degree of sensitivity and specificity across an entire genome plays a major role in the study and classification of genetic diseases and developmental abnormalities. Genomic alterations have been implicated in the growth and progression of cancer, in mental retardation and in other congenital defects. Effective study of chromosomal anatomy using technologies such as FISH and array CGH requires access to a set of clones representing the genome with sufficient granularity. Identification and construction of a clone set that fulfills these requirements is the first step.

To this end, we have undertaken the selection of BAC clone sets representing the human, mouse and rat genomes, with the objective of achieving sub-100kb resolution. The BAC clone sets are selected using existing BAC library, fingerprint map and sequence resources, with the end goal being a clone set providing comprehensive coverage and high-resolution sampling. Each full-genome clone set contains approximately 30,000 BACs. The clones provide an average of 2X redundant coverage. Clones’ identities are verified using fingerprints.

The human BAC clone set has been selected and arrayed and is available to the public, providing 76kb sampling resolution and coverage of 99.5% of the sequenced portion of the genome. The mouse set, which has coverage statistics equivalent to the human set, has been arrayed and will soon undergo clone verification assessment. We are currently finalizing the clone selections for the rearray of the rat genome.

The density of these clone sets is an order of magnitude greater than that of currently available whole genome CGH arrays, offering the prospect of detecting smaller chromosomal rearrangements. Clone lists and annotations for the sets, as well as this poster, are available at http://mkweb.bcgsc.ca/bacarray.

1. Introduction

2. Methodology of Clone Selection

3. Clone Set Characteristics

4. Coverage and Redundancy

Author Affiliations

¹ Genome Sciences Centre, British Columbia Cancer Research Centre, 600 W 10th Avenue, Vancouver BC V5Z 4E6, Canada, www.bcgsc.ca; ² Cancer Research Institute, Box 0808, University of California at San Francisco, San Francisco CA 94143-0808, USA, cc.ucsf.edu; ³ Human Genetics, 417, University Medical Center Nijmegen, P.O. Box 9101, 6500 HB Nijmegen, The Netherlands; ⁴ BACPAC Resources, Children's Hospital Oakland Research Institute, 747 52nd St., Oakland CA 94609, USA, bacpac.chori.org; ⁵ The Institute for Genomic Research, 9712 Medical Center Drive, Rockville MD 20850, USA, www.tigr.org

Resources

Clone libraries :: RPCI-11/13: www.chori.org/bacpac; CalTechD www.tree.caltech.edu | BAC physical maps :: Washington University Genome Sequencing Centre www.genome.wustl.edu, Genome Sciences Centre www.bcgsc.ca/lab/mapping | Sequence Assemblies :: NCBI www.ncbi.nlm.nih.gov, Baylor Human Genome Sequencing Centre www.hgsc.bcm.tmc.edu, UCSC Genome Project genome.ucsc.edu | BAC end database :: TIGR www.tigr.org

Funding

NHGRI | Genome Canada | Genome BC

Acknowledgements

5. Data and Clone Set Access

Figure 3. Coverage (top) and resolution (bottom) of the clone sets, evaluated in 700kb windows.

Figure 4. Depth of coverage, resolution and gaps in the clone sets. [A] Redundancy in the clone sets is measured by the fraction of the genome represented by a given number of BACs. The human clone set has a 1X:2X depth ratio of 1:1, with approximately 35% of the genome represented by single BACs and the remaining 65% by two or more BACs. The clone libraries used to construct the mouse and rat sets are composed of larger clones and, given that the genomes are of roughly the same size and that the clone sets contain roughly the same number of clones, both mouse and rat have a larger average depth of coverage than human. [B] The effective clone set resolution is measured by a weighted average of clone cover sizes, where the weights are the cover sizes themselves. All three sets have a resolution of approximately 80kb. This means that if one randomly selects a point on the genome, 50% of the time it will be represented by a clone cover of 75kb or smaller. [C,D] Gaps in the clone sets were estimated by locating regions of the sequence assembly that were not represented by BACs in the sets. BAC assembly coordinates were derived from BAC end, Golden Path and in silico mapping methods. Neither the Golden Path nor the in silico mapping method necessarily reflects the full size of the clone’s insert. The uncertainty in the coordinates obtained by in silico mapping is approximately 10kb at each end. The difference between the actual location of the clone and the Golden Path coordinates is a function of the fraction of the clone’s insert that has been sequenced. The Genbank sequence records for sequenced BACs do not always contain the information required to derive the coordinates of the full insert.
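The size-weighted average described in panel [B] can be sketched as follows. This is a minimal illustration under our reading of the caption (each cover contributes in proportion to its own size); the function names and the cover sizes are hypothetical and are not taken from the clone sets.

```python
from typing import List


def effective_resolution(cover_sizes: List[float]) -> float:
    """Size-weighted mean cover size: sum(s * s) / sum(s).

    Weighting by size answers the question "how large is the cover
    that a randomly chosen base falls into, on average?".
    """
    total = sum(cover_sizes)
    return sum(s * s for s in cover_sizes) / total


def cover_size_at_fraction(cover_sizes: List[float], fraction: float = 0.5) -> float:
    """Smallest size s such that covers of size <= s hold at least
    `fraction` of all bases (0.5 gives a per-base median)."""
    total = sum(cover_sizes)
    running = 0.0
    for s in sorted(cover_sizes):
        running += s
        if running >= fraction * total:
            return s
    return max(cover_sizes)


# Hypothetical cover sizes in kb (illustration only).
example_sizes = [60, 40, 50, 50, 60, 90]
example_resolution = effective_resolution(example_sizes)
example_median = cover_size_at_fraction(example_sizes, 0.5)
```

The two statistics answer slightly different questions, which is why a mean of about 80kb and a per-base median of 75kb can both describe the same set.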


property | human | mouse | rat

rearray size

number of clones | 32,855 (a) | 28,103 | 27,312

with paired-end BES | 12,598 (38%) | 16,972 (60%) | 19,175 (62%)

sequenced | 7,345 (22%) | 15,561 (55%) | 15,989 (59%)

with sequence coordinates | 31,686 (96%) | 27,668 (98%) | 26,547 (97%)

clone properties

clone libraries | RPCI-11, RPCI-13, Caltech-D | RPCI-23, RPCI-24 | CHORI-230, RPCI-31, RPCI-32

avg clone size | 147 kb | 172 kb | 204 kb

avg depth of coverage | 1.9x | 2.0x | 2.2x

avg clone overlap | 73 kb | 91 kb | 118 kb

coverage and resolution

coverage of sequence assembly | 99.5% | 99.7% | 98.7% (99.1% (b))

coverage of fingerprint map | 98% | 98.9% | 99.3%

average resolution | 76 kb | 80 kb | 88 kb

rearray status

rearrayed | yes | yes | in progress

validated | yes | in progress | -

available | yes | - | -

Table 1. Summary of the three mammalian BAC arrays. (a) 32,432 clones in the human set have been validated by fingerprinting; the remaining clones were selected during a QA/QC replacement round and will be validated in the near future. (b) Excluding chrUn.

target genome | quantity | human source rearray | mouse source rearray | rat source rearray

human | source BACs (%) (a) | 31,686 (c) | 21,498 (78) | 18,808 (71)

human | target coverage (%) (b) | 2.79 Gb | 1.61 Gb (58) | 1.52 Gb (54)

mouse | source BACs (%) | 23,887 (75) | 27,668 | 22,068 (83)

mouse | target coverage (%) | 1.37 Gb (55) | 2.49 Gb | 2.08 Gb (84)

rat | source BACs (%) | 27,716 (87) | 23,891 (86) | 26,547

rat | target coverage (%) | 1.19 Gb (45) | 2.04 Gb (77) | 2.66 Gb

Table 2. Number of BACs and coverage in orthologous relationships. In the example of mouse BACs aligned to the human genome, 21,498 BACs from the mouse rearray (78% of the 27,668 mouse BACs localized to the mouse genome) have alignments to the human genome and the alignments provide 1.61 Gb of coverage (58% of the human genome). (a) Relative to the number of BACs in the cognate source rearray; (b) relative to the size of the target genome; (c) diagonal cells contain the number of BACs in the source rearray that have localizations to the source genome, along with the total detected coverage by these BACs.
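As a worked check of the percentages in the caption's example, the short sketch below recomputes both figures from the values quoted in Table 2. The helper name `percent` is ours, introduced only for illustration.

```python
def percent(part: float, whole: float) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Mouse rearray BACs with alignments to the human genome, relative to
# the mouse BACs localized to the mouse genome (diagonal cell).
mouse_bacs_on_human_pct = percent(21_498, 27_668)

# Human sequence covered by those alignments, relative to the coverage
# detected for the human rearray on its own genome (diagonal cell), in Gb.
human_coverage_pct = percent(1.61, 2.79)
```

These evaluate to 78 and 58, matching the values reported in the table.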

Figure 1. Rat fingerprint map contig 1012 (top). Clones selected for the rat rearray are highlighted in green. Statistics relating similarity of the fingerprints of adjacent selections are shown to the right of the selected clones. UCSC track for the region is shown below.

The aim of creating the BAC rearrays was to generate a laboratory resource designed for high-resolution BAC array CGH studies and other whole-genome investigative approaches to relating chromosomal changes with phenotypes. These applications are particularly important to us as we seek to develop genomic reagents of utility in cancer research.

The sets (a) contain on the order of 30,000 clones, a number which can be practically printed onto array slides, (b) faithfully represent both the fingerprint map and genomic assembly and (c) incorporate redundancy in coverage by controlling the amount of overlap between genome-adjacent rearray selections.

For each of the sets, the selection was driven by the fingerprint map (Fig 1) and ancillary clone annotations in the form of BAC end sequence (BES) records, BES-based coordinates on the assembly, and Genbank accession status of the clone. Clones were selected that (a) had fingerprints that were typical of the observed population (thus unusually small or large clones were avoided), (b) met overlap criteria between adjacent selections, and (c) were derived from selected clone libraries (Table 1). The libraries from which clones were selected are readily available to investigators and are already found in many labs.

In order to precisely position the rearray clone selections on their cognate genome, we localized the clones using a combination of BAC end coordinates, in silico fingerprint mapping and sequence assembly coordinates, where available. During the clone selection process, we prioritized the selection of clones that were sequenced and that had BES-based coordinates. This was done to provide a dense coordinate scaffold which could be used to localize the remaining clones. Over 96% of clones in each set are positioned on the genome (Table 1) and we expect this value to increase with new versions of the genome assemblies, in particular for mouse and rat.

In order to ensure that the correct clone was selected and correctly rearrayed, the identity of each rearrayed clone was validated by fingerprinting. The validation fingerprints were compared to those stored in the fingerprint map. Validation fingerprinting has been completed for the human set and will commence shortly for the mouse set (Table 1).

Resolution

Depth of assembled sequence coverage by clones in the sets was calculated using BAC sequence coordinates obtained from BES alignments, in silico fingerprint anchoring and Golden Path assembly information.

The resolution of the set was determined by using the concept of clone covers. The set of covers is found by intersecting the cover of every clone with those of all its neighbours. Any base pair location will be covered by a group of clones. The cover is the largest contiguous sequence region covered by the same group of clones.

[Diagram: four overlapping BACs labelled A, B, C and D]

Consider the example above with four BACs (A,B,C,D) overlapping in the manner shown. There are 6 intersections of clones. Thus, the sequence region can be resolved into 6 regions. For example, if BACs B and C show positive hybridization in an experiment, the probe can be localized to the fourth cover. The smaller the average size of the cover, the higher the effective resolution of a clone set.

[Diagram: the region resolved into 6 clone covers, numbered 1 to 6]
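The clone cover concept can be sketched in a few lines. This is a minimal illustration, not the procedure used to analyse the clone sets: the function name and the coordinates for BACs A to D are hypothetical, chosen only so that the layout yields 6 covers with B and C sharing the fourth, as in the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Interval = Tuple[int, int]                 # half-open [start, end), in kb
Cover = Tuple[int, int, FrozenSet[str]]    # (start, end, clones spanning it)


def clone_covers(clones: Dict[str, Interval]) -> List[Cover]:
    """Split the region spanned by the clones into covers: maximal
    contiguous stretches covered by the same group of clones."""
    # Every clone start or end is a boundary between two covers.
    points = sorted({p for interval in clones.values() for p in interval})
    covers: List[Cover] = []
    for left, right in zip(points, points[1:]):
        group = frozenset(
            name for name, (start, end) in clones.items()
            if start <= left and right <= end
        )
        if group:  # an empty group is a gap, not a cover
            covers.append((left, right, group))
    return covers


# Hypothetical coordinates (kb) for the four BACs in the example.
example_clones = {"A": (0, 100), "B": (60, 200), "C": (150, 260), "D": (200, 350)}
example_covers = clone_covers(example_clones)

for index, (start, end, group) in enumerate(example_covers, start=1):
    print(index, start, end, "+".join(sorted(group)))
```

With these coordinates the fourth cover spans 150 to 200 kb and is shared by B and C only, so a probe hybridizing to exactly those two BACs would be placed within a 50 kb interval, smaller than either clone.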

Requirements

The clone sets were generated with the following aims:

1. to fully represent the underlying genome, as determined by representation of the fingerprint map and genomic assembly

2. to contain about 30,000 clones

3. to provide about 2X coverage of the genome

4. to be sampled from readily available libraries

5. to contain clones whose fingerprints fall within 3σ of the population distributions of size and number of fragments

6. to validate the identity of each clone by obtaining high resolution fingerprints and comparing them to those stored in the fingerprint map
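Requirement 5 can be illustrated with a small filter. This sketch assumes that the "3" in that requirement refers to three standard deviations of the population distribution, which is our interpretation; the function name, the clone names and the sizes and fragment counts are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def typical_clones(clones: Dict[str, Tuple[float, int]], k: float = 3.0) -> List[str]:
    """Return names of clones whose fingerprint size and fragment count
    both lie within k population standard deviations of the mean.

    `clones` maps a clone name to (size in kb, number of fragments).
    """
    sizes = [size for size, _ in clones.values()]
    fragments = [count for _, count in clones.values()]
    mean_size, sd_size = mean(sizes), pstdev(sizes)
    mean_frag, sd_frag = mean(fragments), pstdev(fragments)
    return [
        name for name, (size, count) in clones.items()
        if abs(size - mean_size) <= k * sd_size
        and abs(count - mean_frag) <= k * sd_frag
    ]


# Hypothetical population: 20 ordinary clones plus one unusually large one.
example_population = {f"c{i}": (170.0 + (i % 2) * 5, 30 + (i % 3)) for i in range(20)}
example_population["big"] = (400.0, 31)
example_kept = typical_clones(example_population)
```

In this toy population the 400 kb clone lies more than three standard deviations from the mean size and is excluded, while the 20 ordinary clones are kept.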

Libraries

Human | RPCI-11 (91%), RPCI-13 (2%), Caltech-D (7%)

Mouse | RPCI-23 (69%), RPCI-24 (31%)

Rat | CHORI-230 (99%), RPCI-31 (<1%), RPCI-32 (<1%)

Annotation

Human | selection: fpmap Nov 2001, assembly hg11 July 2002 | current analysis: fpmap Nov 2003, assembly hg16 July 2003

Mouse | selection: fpmap Jun 2003, assembly mm3 Feb 2003 | current analysis: fpmap Jan 2004, assembly mm4 Oct 2003

Rat | selection and analysis: fpmap Jan 2004, assembly rn3 Jun 2003

Orthologous Relationships

We have related orthologous members of all three rearray sets using whole-genome alignments (Fig 2, Table 2). In this process, we have identified the orthologous locations, where possible, of each BAC from each set on the other two genomes.

In Figure 2, for example, clone A from the human rearray is aligned to three regions of the mouse genome. Conversely, clones a, b, c from the mouse genome are aligned to a region of the human genome.

Alignments were grouped if the distance difference between adjacent alignment positions on the source and target genomes was less than 10kb or if both distances were less than 20kb.
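One possible reading of this grouping rule is sketched below: for two adjacent alignments, the distance between them is measured on the source genome and on the target genome, and they are grouped if the two distances differ by less than 10kb, or if both are less than 20kb. The function name, the tuple representation and the example positions are hypothetical.

```python
from typing import List, Tuple

Alignment = Tuple[int, int]  # (position on source genome, position on target genome), in bp


def group_alignments(
    alignments: List[Alignment],
    max_difference: int = 10_000,
    max_distance: int = 20_000,
) -> List[List[Alignment]]:
    """Group adjacent alignments, ordered by source position, using the
    two distance criteria described in the text."""
    groups: List[List[Alignment]] = []
    for aln in sorted(alignments):
        if groups:
            prev = groups[-1][-1]
            d_source = abs(aln[0] - prev[0])
            d_target = abs(aln[1] - prev[1])
            similar = abs(d_source - d_target) < max_difference
            both_close = d_source < max_distance and d_target < max_distance
            if similar or both_close:
                groups[-1].append(aln)
                continue
        groups.append([aln])
    return groups


# Hypothetical alignments (bp), for illustration only.
example_alignments = [
    (0, 1_000_000),
    (50_000, 1_052_000),   # distances 50kb and 52kb differ by 2kb: grouped
    (60_000, 5_000_000),   # target jumps by about 3.9Mb: starts a new group
    (75_000, 5_012_000),   # distances 15kb and 12kb differ by 3kb: grouped
    (77_000, 5_027_000),   # distances 2kb and 15kb, both under 20kb: grouped
]
example_groups = group_alignments(example_alignments)
```

Here the five alignments fall into two groups of sizes 2 and 3, with the last alignment joined by the second criterion alone.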

Table 2 shows the number of BACs from each array that align to the other genomes and the total coverage provided by the alignments.

Figure 2. Orthologous relationships are constructed when a clone set from one genome is projected onto another genome using whole-genome alignments.

We have attempted to use as much information as possible to determine the precise location of every clone in the set on the genome assembly. Some clones, about 3-4% in each set, remain unlocalized. We expect that this is due to sequence assembly gaps and regions in the fingerprint map that are not represented by the assembly.

The coverage statistics of the clone sets shown in Table 1 are calculated using our sequence position annotations and fingerprint data. Because the positions calculated using in silico anchoring and assembly coordinates tend to underestimate the full extent of the clone on the assembly, we expect that the value of coverage in Table 1 represents a lower limit. Moreover, because fingerprints measure overlap with lower sensitivity than sequence information does, the fingerprint map coverage is also a lower limit. The coverage and resolution of the three sets across the genomes are summarized in Figure 3.

Clone set data and annotations based on the latest releases of the assemblies and fingerprint maps are publicly available for download. Visualization of clone layout is provided using tracks in the UCSC Genome Browser (Fig 1). The human rearray has been available from BACPAC Resources for about one year, both as a whole-genome set and as chromosome-specific sets. We anticipate that the mouse set will be available shortly.

Figure 5. Rearray data portal.

Clone coverage and redundancy are shown in Figure 4 below. The difference in resolution and depth of coverage between the sets is due to differences in the sizes of clones in the libraries from which the clones were sampled. The human BACs are on average 25% smaller than those from rat. The human set has more gaps than the other two sets, although its gaps are smaller on average, primarily because portions of the human assembly are uniquely represented by clones from a variety of exotic libraries.

http://mkweb.bcgsc.ca/bacarray