A 4-Mb High-Density Single Nucleotide Polymorphism-Based Map around Human APOE


Eric Lai,*,1 John Riley,† Ian Purvis,† and Allen Roses*,†

*US Discovery Genetics, GlaxoWellcome, Inc., 5 Moore Drive, Research Triangle Park, North Carolina 27709; and †UK Molecular Genetics, GlaxoWellcome Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, United Kingdom

Received May 18, 1998; accepted September 17, 1998

Whole-genome association studies using single-nucleotide polymorphisms (SNPs) are the proposed method of choice for the identification of loci associated with complex diseases. In this report, we address the feasibility of generating high-density SNP maps (with <100-kb spacing). As a pilot study, we concentrated on a 4-Mb region around the human APOE locus on chromosome 19. We compared the efficiency of SNP detection using YAC-based versus BAC/PAC-based maps, sequencing individual DNAs versus a pooled DNA sample, and we evaluated three different software applications for polymorphism detection. A total of 121 SNPs (25 in coding regions) were identified. The frequency of SNP detection was 1 SNP/1.1 kb of genomic sequence. From APOE to CALM3 (~2 Mb), the average marker spacing was approximately 30 kb. Fifty-one SNPs were genotyped in five populations, and 10 SNPs showed an allele frequency differential greater than 0.5 between populations. Our results demonstrated that high-density SNP maps can be efficiently generated using existing technologies and that a genome-wide map with 60,000–100,000 SNPs is achievable in a reasonable time frame. © 1998 Academic Press

INTRODUCTION

Recently there has been considerable interest in using single-nucleotide polymorphisms (SNPs) for genome-wide association studies in the identification of complex disease genes and for pharmacogenetic applications (Schafer and Hawkins, 1998). Quantitative analysis and mathematical modeling have suggested that whole-genome association is more effective than linkage analysis for the identification of complex disease genes (Lander and Schork, 1994; Risch and Merikangas, 1996; Lander, 1996; Kruglyak, 1997). Several groups in the private sector and at least one publicly funded initiative are targeting 60,000 or more SNPs to develop a whole-genome SNP-based map (http://www.abbott.com/news/1997news/pr072897.htm; NIH Request for Application (RFA) HG-98-001, 1998; http://www.nhgri.nih.gov/Grant_info/Funding/rfa-hg-98-001.html). An abundance of recent articles and initiatives have generated public debate on scientific policies and the patent issues regarding SNPs (Marshall, 1997a,b; Collins et al., 1997). However, other than an RFA from the National Human Genome Research Institute, there has been little public discussion on the feasibility of the project. The specific aim of this study was to investigate the potential of generating a high-density SNP map (<100-kb spacing). As a model system, we concentrated on a 4-Mb region (from XRCC1 to CALM3) around the apolipoprotein E (APOE) gene on the q arm of human chromosome 19. The project was made possible because extensive Alzheimer disease (AD) and myotonic dystrophy (DM) patient and family DNA resources are available to us for genetic mapping studies. Since both the AD and the DM loci are located in this 4-Mb region, the patient resources, together with a high-density SNP map, would be a powerful tool for future studies. In this communication, we report the efficient generation of a high-density SNP-based map using existing technologies.

MATERIALS AND METHODS

In silico SNP detection. For each unique entry, all available sequences for the same gene were retrieved from the UniGene database (NCBI; Schuler et al., 1996). The sequences were assembled using either Sequencher or ESTBlast (Gill et al., 1997). The assembled sequences were compared, and single-nucleotide polymorphisms found in 20% of the sequences were flagged as putative polymorphisms. PCR primers spanning the SNPs and amplifying products of 150 to 300 bp were designed using MacVector (Oxford Molecular Group) to confirm DNA variations.
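The flagging rule can be illustrated with a minimal Python sketch. It is not the Sequencher or ESTBlast procedure: it assumes the sequences are already aligned to equal length, reads "found in 20% of the sequences" as "the second most common base reaches at least 20% of the aligned reads", and uses a hypothetical function name and toy reads.

```python
from collections import Counter


def flag_putative_snps(aligned, min_fraction=0.20):
    """Return 0-based alignment columns where a second base reaches min_fraction.

    Assumption: `aligned` holds equal-length, pre-aligned sequences; any
    character other than A, C, G or T (gaps, N) is ignored in that column.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    flagged = []
    for col in range(length):
        bases = [seq[col] for seq in aligned if seq[col] in "ACGT"]
        if len(bases) < 2:
            continue
        counts = Counter(bases).most_common()
        # counts[1] is the second most common base, if one exists
        if len(counts) > 1 and counts[1][1] / len(bases) >= min_fraction:
            flagged.append(col)
    return flagged


# Toy data (hypothetical): column 2 varies in 3 of 10 reads,
# column 5 varies in only 1 of 10 reads.
reads = ["ACGTAC"] * 6 + ["ACATAC"] * 3 + ["ACGTAT"]
putative = flag_putative_snps(reads)
```

With these toy reads only column 2 passes the 20% threshold; the single discordant base in column 5 is treated as a likely sequencing error.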

Yeast artificial chromosome (YAC) procedures. YACs were obtained from Research Genetics using the coordinates in the LLNL database (http://www-bio.llnl.gov/bbrp/genome/genome.html). Five YACs were initially characterized by pulsed-field gel electrophoresis, FISH analysis, and STS content mapping to ensure they were stable, were nonchimeric, and contained the region of interest. All media were as in Dracopoli et al. (1995). Only one of the YACs, CEPH 821g2, passed these quality control criteria and was utilized further for deletion mapping. A further YAC, CEPH 830h9, when obtained directly from CEPH, passed all the QC criteria and was used to map polymorphisms.

1 To whom correspondence should be addressed at the Department of Molecular Genetics, US Discovery Genetics, GlaxoWellcome, Inc., 5 Moore Drive, Research Triangle Park, NC 27709. Telephone: (919) 483-3272. Fax: (919) 483-0315. E-mail: ehl21107@glaxowellcome.com.

GENOMICS 54, 31–38 (1998), ARTICLE NO. GE985581

0888-7543/98 $25.00

Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.

The YACs were deleted using the fragmentation vector pBCL8. The vector was digested with SalI according to the enzyme manufacturer's instructions, ensuring the creation of recombinogenic ends (Lewis et al., 1992). The yeast was transformed using 1 µg of linearized pBCL8 and a lithium acetate transformation protocol (Dracopoli et al., 1995), rather than a yeast spheroplast protocol. The deletants were selected for on SD media lacking lysine. After 5 days colonies were picked into a 96-well plate containing 150 µl of sterile water and analyzed by replicating on duplicate selective plates, Lys−, Trp− and Lys−, Ura−. DNA from 60 colonies that grew only on the Lys−, Trp− plates was prepared as described in Anand et al. (1989). Following analysis by pulsed-field gel electrophoresis and transfer of the DNA onto a membrane, deletants were visualized by autoradiography after hybridization to 32P-labeled genomic DNA. YACs containing different-sized inserts were characterized further by STS analysis. The deleted ends of the YACs were rescued by circularization and transformation into Escherichia coli (Hermanson et al., 1991).

Inter-Alu PCR on YACs. YAC plugs were prepared using a lithium dodecyl sulfate procedure as described in Anand et al. (1989). Following two washes at room temperature for 30 min each in 1 ml of Tris (10 mM) EDTA (1 mM), pH 8.0, the YAC plug was washed in 1 ml of sterile water for 30 min and finally melted in 1 ml of water. One microliter of this solution was utilized in PCR. Each primer (HG306 or HG1) was used individually at 50% of the normal concentration, i.e., 12.5 pmol/25-µl reaction volume, using 1× Promega buffer under the following cycling conditions: 95°C for 2 min, (92°C for 45 s, 55°C for 45 s, 72°C for 2 min, 35 cycles), and a final extension time of 10 min at 72°C. The products were cloned into the TA cloning system of Invitrogen, and following PCR from the colonies using M13F and PCRII Ext, they were sequenced using dye terminator technology with M13R and T7. Primers were HG1, 5′-CCACTGCACTCCAGCCTGGG; HG306, 5′-GCCTCCCAAAGTGCTGGGATTACAG; and PCRII Ext, GAGCGGATAACAATTTCAC.

Genomic library screening, vectorette, and direct end sequencing of clones. PCR primer pairs were used to screen two clone libraries, a P1-based artificial chromosome (PAC) library ("Down to The Well" Human PAC DNA Pools, Genome Systems, St. Louis, MO) corresponding to three human genome equivalents and a human bacterial artificial chromosome (BAC) library (Release II; Research Genetics, Huntsville, AL). We then generated chromosomal walking primers from the BAC/PAC clones using direct sequencing and vectorette procedures (Dracopoli et al., 1995). The ABI Big Dye terminator chemistry and BAC protocol were used for end sequencing with ~1 µg BAC/PAC DNA and 15 pmol primer. Primers were obtained from Promega for Sp6 (5′-TATTTAGGTGACACTATAG-3′), T7 (5′-TAATACGACTCACTATAGGG-3′), M13/−21 (5′-GTTTTCCCAGTCACGAC-3′), and M13/rev (5′-CAGGAAACAGCTATGAC-3′). We sequenced the ends of 74 PACs and 14 BACs. The average read length was 315 bp with a success rate of 70%.

Construction of small-insert libraries. Small-insert 3-kb plasmid libraries were generated in the minimal sequencing vector pOT2a (O. Hubbard, C. Martin, and M. Palazzolo, unpublished). pOT2a vector was prepared by BstXI digestion of the parent plasmid pOT2a-sacB followed by preparative agarose gel electrophoresis to separate the vector fragment from a Bacillus subtilis sacB gene fragment. The 1.6-kb pOT2a was purified by Glassmax DNA Isolation Matrix system (Gibco BRL, Gaithersburg, MD). To prepare inserts for library construction, BAC and PAC DNA was sonicated for 15 s using an ultrasonic processor (Heat Systems, Farmingdale, NY) set at 10% output power. The resulting random fragments were end-repaired using T4 polymerase, Klenow, and T4 kinase. BstXI linkers (5′-CTCTAAAG-3′, 5′-CTTTAGAGCACA-3′) were ligated to the blunt-ended fragments. The configuration of the BstXI sites in pOT2a and the sequence of the linkers allowed a ligation strategy that minimized the recovery of clones without inserts (Seed, 1987). DNA samples were electrophoresed on a low-melting-temperature agarose gel, and fragments of 3000–4000 bp were isolated and purified using Glassmax DNA Isolation Matrix System. Twenty nanograms of linearized vector and 200 ng of random DNA fragments were ligated overnight using T4 DNA ligase at 16°C and transformed into DH10B competent cells (Life Technologies, Inc., Gaithersburg, MD) by electroporation using Gene Pulser II (Bio-Rad, Hercules, CA) with settings of 1.8 kV, 200 ohms, and 25 µF. Transformed bacteria were selected on LB agar plates containing 5% sucrose and 12.5 µg/ml chloramphenicol.
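As a worked check of the ligation quantities, the sketch below estimates the insert:vector molar ratio implied by 20 ng of 1.6-kb vector and 200 ng of 3000–4000-bp fragments. The 660 g/mol per base pair figure is the usual textbook approximation, and the 3500-bp midpoint is an assumption made only for this illustration; neither value is stated in the text.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of double-stranded DNA (approximation)


def pmol_dsdna(mass_ng, length_bp):
    """Convert a mass of double-stranded DNA (ng) to picomoles of molecules."""
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS)


vector_pmol = pmol_dsdna(20, 1600)    # 20 ng of 1.6-kb pOT2a
insert_pmol = pmol_dsdna(200, 3500)   # 200 ng of inserts, assumed mean 3500 bp
insert_to_vector = insert_pmol / vector_pmol

print(f"vector: {vector_pmol:.4f} pmol, insert: {insert_pmol:.4f} pmol")
print(f"insert:vector molar ratio is about {insert_to_vector:.1f}:1")
```

Under these assumptions the ligation contains roughly a 4.6-fold molar excess of insert over vector.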

DNA sequencing. Sequencing templates were isolated from single colonies and purified using REAL Prep 96 Plasmid Kit (Qiagen, Chatsworth, CA). Ten-microliter reactions containing 500 ng DNA and 1.6 pmol primer Seq01 (5′-CACTATAGAACTCGAGCAGCTG-3′) or Seq02 (5′-CGACTCACTATAGGGAGACCG-3′) were used to generate sequence using ABI Prism Big Dye terminators (PE Applied Biosystems, Foster City, CA).

Radiation hybrid mapping. The radiation hybrid panels (G3, Cox et al., 1990; GB4, Gyapay et al., 1996; and TNG3, Stewart et al., 1997) were purchased from Research Genetics, and the amplifications were performed according to the manufacturer's specification. We calibrated the TNG3 panel by using two sets of primers with known distances (16 and 68 kb apart). The radiation dose to distance conversion for TNG3 in this region was 2 kb/cR.

Diversity DNA pool. The DNA pool (Coriell Cat. No. NA13406) contains equal quantities of DNA from 13 males and 15 females from the following populations (2 each): Mayan Indian, Karitiana Indian, Suri Indian, Quechua Indian, Auca Indian, Jemez Pueblo Indian, Cheyenne Indian, Khmer Cambodian, Southern Chinese, Japanese, Druze Arab, Nasioi Melanesian, Biaka Pygmy, and Mbuti Pygmy.
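The composition of the pool can be tallied in a few lines of Python. This is only an arithmetic sketch: it assumes an autosomal locus (two copies per individual, as on chromosome 19) and exactly equal DNA contributions, and the variable names are hypothetical.

```python
populations = [
    "Mayan Indian", "Karitiana Indian", "Suri Indian", "Quechua Indian",
    "Auca Indian", "Jemez Pueblo Indian", "Cheyenne Indian", "Khmer Cambodian",
    "Southern Chinese", "Japanese", "Druze Arab", "Nasioi Melanesian",
    "Biaka Pygmy", "Mbuti Pygmy",
]

individuals = 2 * len(populations)   # two individuals per population
chromosomes = 2 * individuals        # autosomal locus: two copies each
single_copy_fraction = 1 / chromosomes

print(individuals, chromosomes, round(single_copy_fraction, 3))
```

The 14 populations give 28 individuals (matching 13 males plus 15 females) and 56 chromosomes, so an allele carried on a single chromosome would contribute about 1.8% of the pooled template.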

RESULTS

In silico identification of coding single-nucleotide polymorphisms. A number of publications have stressed the importance of identifying polymorphisms within protein-coding regions or near them in the 5′ and 3′ untranslated regions, i.e., cSNPs (Lander and Schork, 1994; Lander, 1996; Collins et al., 1997). Thus, we were interested in determining whether the abundance of cDNA sequences in the public databases can be used to facilitate the identification of cSNPs. The 4-Mb region was divided into two subregions: the UK group concentrated on the XRCC1–APOE region and the U.S. group focused on the APOE–CALM3 region. One hundred forty-one ESTs (107 unique entries) were identified in the region from XRCC1 to APOE (Table 1). To identify putative cSNPs, sequences from the genes/ESTs were blasted against GenBank and dbEST using ESTBlast (Gill et al., 1997) to identify 40 putative cSNPs in 35 genes/ESTs. A total of 7 cSNPs were mapped to the XRCC1–APOE region based on the YAC physical map and confirmed by DNA sequencing. Forty-five genes and 69 ESTs were assigned to the region from APOE to CALM3. One hundred twenty-one putative cSNPs were identified by ESTBlast, 38 (31%) of which were confirmed by sequencing. This demonstrated that cSNPs can be identified efficiently using publicly available data. However, only 25 cSNPs were mapped to the region from XRCC1 to CALM3 as determined by physical clone contigs.


Comparison of the YAC-based direct approach and the BAC/PAC-based random approach for SNP detection. Two physical mapping approaches were taken to compare the efficiency of SNP detection: YAC-based and BAC/PAC-based physical maps. A 2-Mb YAC-based physical map was created from XRCC1 to APOE, while a 2-Mb large-insert bacterial clone BAC/PAC-based physical map was derived from APOE to CALM3. For the YAC-based physical map, a single YAC was initially characterized by pulsed-field gel electrophoresis, FISH analysis, and STS content mapping to ensure it was stable, was nonchimeric, and contained the region of interest. The YAC was deleted with the vector pBCL8 via Alu repeats (Lewis et al., 1992), which resulted in a series of deleted YACs with an average size interval of 50–100 kb deleted from one end. The ends of the YAC subclones were rescued to give sequences at defined intervals (Hermanson et al., 1991). Additional sequences were provided from inter-Alu products generated from YACs. Primers were designed from the genomic sequences, and the amplified regions were resequenced in 7 different CEPH DNAs to scan for polymorphisms. In summary, we identified 36 SNPs from XRCC1 to APOE using the YAC-based approach, giving a total of 43 SNPs including the 7 cSNPs. The average spacing of the markers on this side of the map was 75 kb, with the largest gap between SNPs at 200 kb and the smallest gap at 3 bp.

To assess the efficiency of SNP production using a random approach, 46 BAC/PAC clones covering the 2-Mb region were isolated by STS library screening, pooled, and randomly cloned into nine small-insert (3 kb) libraries. Approximately 500 small-insert clones were picked per 100 kb of input DNA, and both ends of the clones were sequenced. The sequences were processed automatically to remove vector and bacterial contamination, followed by a homology search against GenBank and dbEST to identify coding regions. High-quality human sequences greater than 400 bp were then passed into Primer3.0 (Rozen and Skaletsky, 1997), following masking of repetitive elements, to develop PCR primer pairs for SNP scanning (see next section). A total of 60 SNPs were isolated using sequences from the 3-kb clones. The frequency of polymorphisms detected was 1 SNP/1.1 kb of genomic sequences (a total of 96 SNPs from XRCC1 to CALM3 with 106 kb of sequences). Eighteen cSNPs and all 60 SNPs were mapped back to the large-insert bacterial clones in the region and in the TNG3 radiation hybrid panel. The average spacing between SNPs from APOE to CALM3 was 30 kb, with the largest gap equal to 200 kb and the smallest gap at 20 bp (Fig. 1). The physical contig map of the APOE to CALM3 region was constructed by STS-content mapping using all available STSs and SNPs. A total of 121 SNPs were isolated from XRCC1 to CALM3 (43 SNPs from YAC-based approach and 78 SNPs from BAC/PAC-based approach).
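The counts reported in this section can be cross-checked with a short Python tally. The variable names are hypothetical. The reading that the 96 SNPs found by genomic resequencing are the 36 YAC-derived plus the 60 random-clone SNPs is an inference from the numbers, not something the text states explicitly.

```python
# Counts as reported in the text
yac_genomic_snps = 36      # XRCC1-APOE, YAC-based resequencing
yac_side_csnps = 7         # cSNPs mapped to XRCC1-APOE
random_clone_snps = 60     # APOE-CALM3, 3-kb clone end sequences
bacpac_side_csnps = 18     # cSNPs mapped to APOE-CALM3

yac_total = yac_genomic_snps + yac_side_csnps            # 43
bacpac_total = random_clone_snps + bacpac_side_csnps     # 78
map_total = yac_total + bacpac_total                     # 121
total_csnps = yac_side_csnps + bacpac_side_csnps         # 25

# Detection frequency: 96 SNPs in 106 kb of genomic sequence
resequenced_snps = yac_genomic_snps + random_clone_snps  # 96 (inferred split)
kb_per_snp = 106 / resequenced_snps

print(yac_total, bacpac_total, map_total, total_csnps, round(kb_per_snp, 1))
```

The totals reproduce the 43, 78, 121 and 25 given in the text, and 106 kb divided by 96 SNPs gives about 1.1 kb per SNP.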

TABLE 1

In Silico Identification of cSNPs from XRCC1 to CALM3

| Region | Genes mapped to the region by radiation hybrid mapping | ESTs mapped to the region by radiation hybrid mapping | Genes containing putative cSNPs identified electronically (Gill et al., 1997) (cSNPs) | ESTs containing putative cSNPs identified electronically (Gill et al., 1997) (cSNPs) | Putative cSNPs confirmed by DNA sequencing, from genes | Putative cSNPs confirmed by DNA sequencing, from ESTs | cSNPs mapped to the genomic clones in the region, from genes | cSNPs mapped to the genomic clones in the region, from ESTs |
|---|---|---|---|---|---|---|---|---|
| XRCC1–APOE | 6ᵃ | 101ᵃ | 5 (6) | 30 (34) | 3ᶜ | 4ᶜ | 3ᶜ | 4ᶜ |
| APOE–CALM3 | 45ᵇ | 69ᵇ | 37 (88) | 19 (33) | 27ᵈ | 11ᵈ | 15ᵈ | 3ᵈ |

ᵃ Mapped by using the radiation hybrid G3 panel (Cox et al., 1990) and the Human Transcript map (Schuler et al., 1996).
ᵇ Mapped by radiation hybrid GB4 panel using a two-point lod score of >15 (Gyapay et al., 1996).
ᶜ Confirmation by DNA sequencing was performed only with genes and ESTs that were mapped to the YAC clones in the region.
ᵈ All 121 putative cSNPs were tested by DNA sequencing regardless of their map location.

FIG. 1. SNP marker spacing distribution from APOE to CALM3.

SNP detection by resequencing individual versus pooled DNA samples. There are many published procedures for polymorphism detection, but the most robust method for identifying unknown sequences continues to be direct DNA sequencing. The main barrier to high-throughput mutation scanning is the large number of individuals that must be scanned for polymorphism detection. Seven separate DNA samples (i.e., 14 chromosomes) are required for detecting SNPs with an allele frequency >20%, and 29 separate DNA samples (i.e., 58 chromosomes) are required for detection of low-frequency (5%) SNPs with a 95% confidence level ($N = \log(1 - \mathrm{CI})/\log(1 - P)$, where $N$ is the number of chromosomes required, CI is the confidence level, and $P$ is the allele frequency of the marker). Fortunately, advances in sequencing dye chemistry allow the detection of polymorphisms in pooled DNA samples (Kwok et al., 1994). Using 230 pairs of primers designed from the end sequence data from the small-insert (3 kb) libraries, we sequenced 7 different (not pooled) CEPH DNA samples and one DNA diversity pool to scan for polymorphisms (Fig. 2). Twenty-eight SNPs were detected using this set of primers. Four SNPs were detected only in the diversity DNA pool, and 14 SNPs were detected only in the CEPH samples. The majority (7 of 10) of the SNPs that were detected in the CEPH but not in the diversity pool were attributed to low allele frequencies (<20%) and would be difficult to detect using pooled DNA because of low signal from the minor allele. The other discordance is most likely due to the differences in the allele frequencies of the SNPs in different populations (see RFLP data below). These data clearly demonstrate that pooled DNA samples can be used effectively (>50% success rate) to detect all SNPs with an allele frequency greater than 20%. The use of pooled DNA samples provides both time and expense savings essential to SNP detection on the genomic scale.
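The sample-size relation quoted above, N = log(1 − CI)/log(1 − P), can be evaluated directly. The sketch below is a plain transcription of that formula; the function name is hypothetical, and how the fractional result is rounded is left to the reader because the text does not say.

```python
import math


def chromosomes_needed(allele_freq, confidence=0.95):
    """Number of chromosomes N such that an allele of frequency P is seen
    at least once with probability CI: N = log(1 - CI) / log(1 - P)."""
    return math.log(1.0 - confidence) / math.log(1.0 - allele_freq)


n_common = chromosomes_needed(0.20)   # allele frequency 20%
n_rare = chromosomes_needed(0.05)     # allele frequency 5%

print(f"P = 0.20: N = {n_common:.1f} chromosomes")
print(f"P = 0.05: N = {n_rare:.1f} chromosomes")
```

The formula gives about 13.4 chromosomes for a 20% allele, consistent with the 7 diploid samples (14 chromosomes) quoted in the text, and about 58.4 chromosomes for a 5% allele, close to the quoted 29 samples (58 chromosomes).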

Comparison of SNP detection programs. As previously seen in other genome projects, the ability to process large amounts of data is paramount to the success of the project. We compared three programs (Lasergene, DNASTAR, Madison, WI; Sequencher, Gene Codes Co.; and PolyPhred (Nickerson et al., 1997)) for processing speed and the ability to identify SNPs. Sequences from 245 primer pairs were processed independently by these programs. Forty SNPs were identified using Lasergene. Sequencher confirmed all 40 SNPs and identified 3 additional SNPs that were missed by Lasergene. Finally, PolyPhred detected all the SNPs found by both Lasergene and Sequencher and found 3 additional SNPs. Of the three programs tested in this study, PolyPhred proved to be the most effective and also the least labor-intensive program for polymorphism detection. It takes sequences directly from our sequence server, processes the sequences, identifies putative polymorphisms, and presents the data (with sequence traces) for manual confirmation. PolyPhred, using the default settings, identified over 90% of the polymorphisms automatically and correctly. With future improvements, we are confident that the false positive rate can be reduced to less than 10%.
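The nested detection results can be expressed as sets to make the comparison explicit. In this sketch the integer SNP identifiers are placeholders, not real marker names, and each program's recovery is scored against the union of all SNPs found by any program, which is an assumption about how one might compare them rather than a metric used in the text.

```python
# Placeholder SNP identifiers (hypothetical); only the counts come from the text.
lasergene = set(range(40))                    # 40 SNPs
sequencher = lasergene | set(range(40, 43))   # the same 40 plus 3 more
polyphred = sequencher | set(range(43, 46))   # all 43 plus 3 more

all_found = lasergene | sequencher | polyphred

for name, calls in [("Lasergene", lasergene),
                    ("Sequencher", sequencher),
                    ("PolyPhred", polyphred)]:
    share = len(calls) / len(all_found)
    print(f"{name}: {len(calls)} of {len(all_found)} SNPs ({share:.0%})")
```

On this reading the three programs recover 40, 43 and 46 of the 46 SNPs found in total.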

Determining the allele frequency of the SNPs in different ethnic groups. Sixty SNPs were converted into PCR-RFLP assays to confirm the accuracy of polymorphism detection with DNA sequencing. Fifty-one SNPs were successfully converted into direct RFLP assays (85%) and were genotyped with DNAs from 10 African Pygmy, 5 Druze, 3 Japanese, 5 Chinese, and the same 7 CEPH samples used in DNA sequencing (Table 2). Even though the sample size is small, it is important to note that the allele frequencies of many SNPs vary dramatically among different ethnic backgrounds. Ten SNPs had an allele frequency differential greater than 0.5 between two populations (e.g., 1032/TaqI in African Pygmy (1/0) and CEPH (0.14/0.86)).
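One way to apply the 0.5 criterion is to take, for each SNP, the largest difference in allele 1 frequency between any two populations. The sketch below does this for four rows copied from Table 2. The function name is hypothetical, and treating the "differential" as the maximum minus the minimum allele 1 frequency is an assumption about how the comparison was made.

```python
# Allele 1 frequencies from four rows of Table 2, in the order:
# African Pygmy, Druze, Japanese, Chinese, CEPH.
allele1_freq = {
    "7093/MnlI": [1.00, 1.00, 1.00, 1.00, 0.81],
    "ApoC1/HpaI": [1.00, 0.90, 1.00, 0.00, 0.75],
    "86/AciI": [0.38, 0.30, 0.25, 0.30, 0.29],
    "1032/TaqI": [1.00, 0.30, 0.33, 0.70, 0.14],
}


def max_differential(freqs):
    """Largest pairwise difference in allele 1 frequency across populations."""
    return max(freqs) - min(freqs)


flagged = [snp for snp, freqs in allele1_freq.items()
           if max_differential(freqs) > 0.5]

for snp, freqs in allele1_freq.items():
    print(f"{snp}: differential {max_differential(freqs):.2f}")
```

Of these four markers, ApoC1/HpaI and 1032/TaqI exceed the threshold; the 0.86 differential for 1032/TaqI corresponds to the African Pygmy versus CEPH example given in the text.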

DISCUSSION

We have demonstrated that high-density (30-kb spacing) SNP-based maps can be generated efficiently and rapidly using existing methodologies. It is therefore realistic to propose the identification of 60,000–100,000 SNPs to construct a high-density genome map with an average distance of 30 kb. Genetic epidemiological analyses of the SNP map of a 4-Mb region containing the APOE and DM loci will be a practical test to determine the distance over which linkage disequilibrium could detect the presence of a susceptibility polymorphism for a complex disease [APOE and AD] and an autosomal dominant trait [DM]. This application of focused SNP technology may be appropriate for any linkage region and should have great utility in localizing/identifying regions for complex disease genes with known linkage to a chromosomal region.

FIG. 2. Polymorphism detection by sequencing of individual DNA samples (A) or pooled DNA sample (B).

One of the major goals of this paper is to assess the efficiency of current SNP discovery approaches and technologies. We compared the efficiency of SNP detection using YAC-based versus BAC/PAC-based maps, sequencing individual DNAs versus a pooled DNA sample, and evaluated three different software applications for polymorphism detection. The major advantage of the YAC-based approach is that it provides a directed effort to locus-specific SNP discovery. The other advantage is that the orientation and the distances of the ends of the YAC deletants are known, and thus the spacing and the location of the SNPs can be determined. The disadvantages of the YAC approach include insert instability and chimeric clones, which necessitate additional steps for stringent quality control. Currently, using existing vectors, it is difficult to increase the SNP density by generating YAC deletants with small insert size differences. Finally, the YAC-based approach might not be suitable for some regions because of low recombination frequency. We used a random BAC/PAC-based approach for SNP discovery in the region from APOE to CALM3. The BAC/PAC clones were randomly grouped into nine small-insert (3 kb) libraries, and sequences were generated from both ends of the 3-kb clones. SNPs identified by this approach were mapped back to the BAC/PAC clones and with the high-density radiation hybrid panel TNG3. The major advantage of the random approach is the ability to obtain SNPs at a higher density than the YAC-based directed approach. The major disadvantages of the random approach are the large amount of sequencing and radiation hybrid mapping required and the potential of having large gaps in the map. Based on these results, a random approach using a small-insert (1–3 kb) total human genome library as the starting material would be the preferred initial method for high-density genome-wide SNP discovery, followed by a direct approach using YAC/BAC/PAC to target specific regions of low SNP coverage. However, for laboratories without high-throughput sequencing capability, the YAC deletion methodology provides a useful alternative for identifying SNPs for specific disease loci.

The amounts of time, effort, and cost required toidentify the SNPs from the two approaches are very

similar. The project was completed in 4 months, withcost ranging from $3000 (cost structure based on per-forming the work at large pharmaceutical companies,which should be considerably lower in academic ornonprofit organizations) per gene-based SNP (includ-ing failures) to $1500 for random SNPs. The cost in-

TABLE 2

Allele Frequencies of SNPs in Five Different Ethnic Populations

SNPs

African Pygmy (20) Druze (10) Japanese (6) Chinese (10) CEPH (14)

Allele 1 Allele 2 Allele 1 Allele 2 Allele 1 Allele 2 Allele 1 Allele 2 Allele 1 Allele 2

7093/MnlI 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.81 0.19XRCC1/StuI 0.75 0.25 1.00 0.00 failed failed 0.20 0.80 0.50 0.5040901/BsmAI 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.5040867/DraI 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.5040885/MvaI 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.5033130/Sau96I 1.00 0.00 1.00 0.00 1.00 0.00 0.80 0.20 0.94 0.0640919/BslI 0.60 0.40 0.80 0.20 0.83 0.17 0.50 0.50 0.57 0.43582/NruI 0.60 0.40 0.20 0.80 0.17 0.83 0.50 0.50 0.50 0.50811/AluI 0.70 0.30 1.00 0.00 1.00 0.00 1.00 0.00 0.83 0.1715386/NlsIV 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50PVR/BslI 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.79 0.2142079/FnuHI 0.63 0.38 0.30 0.70 0.50 0.50 0.50 0.50 0.56 0.44PRR2/Sau96I 0.30 0.70 0.70 0.30 0.50 0.50 0.70 0.30 0.43 0.57ApoC1/HpaI 1.00 0.00 0.90 0.10 1.00 0.00 0.00 1.00 0.75 0.252067/MaeIII 0.55 0.45 0.20 0.80 0.50 0.50 0.10 0.90 0.64 0.362050/AvaII 0.90 0.10 0.80 0.20 0.67 0.33 0.60 0.40 0.57 0.431046/NspI 0.80 0.20 0.80 0.20 0.17 0.83 0.60 0.40 0.57 0.4386/AciI 0.38 0.63 0.30 0.70 0.25 0.75 0.30 0.70 0.29 0.71111(APOC4)/AvaI 0.65 0.35 0.80 0.20 0.33 0.67 0.60 0.40 0.64 0.36112/ThaI 0.65 0.35 0.50 0.50 0.25 0.75 0.38 0.63 0.42 0.582590/BsaJI 0.30 0.70 0.10 0.90 0.50 0.50 0.10 0.90 0.07 0.932144/BstNI 0.55 0.45 0.80 0.20 0.83 0.17 0.60 0.40 0.50 0.502551/BsaJI 0.15 0.85 0.30 0.70 0.83 0.17 0.50 0.50 0.42 0.582536/DraI 0.30 0.70 0.67 0.33 0.50 0.50 0.50 0.50 0.50 0.502526/EarI 0.50 0.50 0.40 0.60 0.17 0.83 0.50 0.50 0.58 0.422151/HphI 0.30 0.70 0.13 0.88 0.83 0.17 0.80 0.20 0.57 0.439(CKM)/TaqI 0.78 0.22 0.70 0.30 0.00 1.00 0.10 0.90 0.79 0.2171(ERCC2)/DrdI 0.94 0.06 0.60 0.40 1.00 0.00 0.90 0.10 0.79 0.212266/BstEII 0.75 0.25 0.80 0.20 0.83 0.17 0.60 0.40 0.79 0.21114(FOSB)/HinfI 0.95 0.05 0.80 0.20 1.00 0.00 0.90 0.10 0.93 0.0794(TRSP)/HinfI 0.80 0.20 0.60 0.40 0.50 0.50 0.60 0.40 0.50 0.50104/MaeIII 1.00 0.00 0.80 0.20 0.50 0.50 1.00 0.00 0.57 0.43106(DM)/TfiI 0.90 0.10 0.50 0.50 0.17 0.83 0.40 0.60 0.43 
0.572282/BsaJI 0.05 0.95 0.50 0.50 0.75 0.25 0.60 0.40 0.57 0.432311/AvaII 0.00 1.00 0.00 1.00 0.33 0.67 0.00 1.00 0.29 0.712584/TspRI 0.13 0.88 0.00 1.00 0.00 1.00 0.10 0.90 0.07 0.932582/NcoI 0.95 0.05 1.00 0.00 0.83 0.17 0.90 0.10 0.93 0.072535/MspI 0.20 0.80 0.50 0.50 0.50 0.50 0.20 0.80 0.40 0.601040/BsmAI 0.20 0.80 0.40 0.60 0.33 0.67 0.30 0.70 0.21 0.7926(CALM3)/PstI 1.00 0.00 0.90 0.10 1.00 0.00 1.00 0.00 0.86 0.14123/NcoI 0.00 1.00 0.25 0.75 0.33 0.67 0.50 0.50 0.07 0.931027/BstEII 0.30 0.70 0.20 0.80 0.67 0.33 0.70 0.30 0.00 1.001032/TaqI 1.00 0.00 0.30 0.70 0.33 0.67 0.70 0.30 0.14 0.861027/Bsu36I 0.20 0.80 0.30 0.70 0.33 0.67 0.50 0.50 0.00 1.002105/BstEII 0.00 1.00 0.40 0.60 0.00 1.00 0.20 0.80 0.50 0.502322/NlaIII 0.83 0.17 0.90 0.10 1.00 0.00 1.00 0.00 0.93 0.071296/AvaI 0.60 0.40 0.10 0.90 0.00 1.00 0.20 0.80 0.07 0.931090/BstUI 0.05 0.95 0.20 0.80 0.67 0.33 0.10 0.90 0.21 0.791087/BsmFI 0.10 0.90 0.60 0.40 0.17 0.83 0.50 0.50 0.71 0.292195/EcoNI 0.33 0.67 0.40 0.60 0.33 0.67 0.30 0.70 0.64 0.361003/AvaII 0.67 0.33 0.80 0.20 0.50 0.50 0.90 0.10 0.21 0.79

Note. The number of chromosomes examined is shown in parentheses after each population. The SNPs are shown as marker numbers followed by the enzyme used for detection. SNPs with an allele frequency differential greater than 0.5 between two populations are shown in bold.
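
The bolding criterion in the note can be read as a pairwise comparison of allele frequencies across populations. The following is a minimal sketch of one such reading, not the authors' procedure: it takes the first-allele frequency of a few markers from Table 2, treats failed assays as missing, and flags a marker when any two populations differ by more than 0.5. The function names and the use of column positions in place of population labels are assumptions made for illustration, and the flagged set need not coincide with the entries shown in bold.

```python
from itertools import combinations

# First-allele frequencies for a few markers, copied from Table 2.
# Columns follow the table's population order; the population labels
# are not reproduced here (illustrative assumption: index by position).
# None marks a failed assay.
ALLELE1_FREQ = {
    "7093/MnlI":   [1.00, 1.00, 1.00, 1.00, 0.81],
    "XRCC1/StuI":  [0.75, 1.00, None, 0.20, 0.50],
    "ApoC1/HpaI":  [1.00, 0.90, 1.00, 0.00, 0.75],
    "9(CKM)/TaqI": [0.78, 0.70, 0.00, 0.10, 0.79],
    "104/MaeIII":  [1.00, 0.80, 0.50, 1.00, 0.57],
}


def max_differential(freqs):
    """Largest absolute frequency difference between any two populations."""
    observed = [f for f in freqs if f is not None]
    if len(observed) < 2:
        return 0.0
    return max(abs(a - b) for a, b in combinations(observed, 2))


def flag_markers(table, threshold=0.5):
    """Markers whose differential is strictly greater than the threshold."""
    return [m for m, freqs in table.items() if max_differential(freqs) > threshold]


print(flag_markers(ALLELE1_FREQ))
```

Under this reading, 104/MaeIII is not flagged because its largest differential is exactly 0.5, whereas the note specifies "greater than 0.5".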


cludes sequencing, mapping, labor, reagents, hardware, and overheads (i.e., inclusive cost). These actual costs neither incorporate large-scale automation nor take into account the advancement of the human genome project. With automation and increased throughput of the genome sequencing centers, we estimate that an average inclusive cost of $1500 per SNP would be sufficient for the construction of a high-density genome-wide SNP map in the next 3 years. This translates into $150 million for a 100,000-SNP map.
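
The projected total follows directly from the per-SNP estimate. A minimal sketch of the arithmetic, assuming a flat inclusive cost per SNP regardless of map size (the function name `map_cost_usd` is introduced here for illustration only); the 60,000-SNP figure is the lower bound of the map size given in the abstract:

```python
def map_cost_usd(n_snps, cost_per_snp_usd=1500):
    """Total inclusive cost of a map with n_snps markers at a flat per-SNP cost."""
    return n_snps * cost_per_snp_usd


# Lower and upper map sizes discussed in the paper.
for n in (60_000, 100_000):
    print(f"{n:>7,} SNPs -> ${map_cost_usd(n) / 1e6:.0f} million")
```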

Table 2 illustrates the degree of heterogeneity of SNPs in various human racial or ethnic groups. The allele frequency of each SNP in the particular racial and ethnic group from which the individual being tested descends must therefore be known and factored into any analyses. While the data in Table 2 are preliminary, it is clear that an SNP in Caucasians (CEPH Utah) may not be a useful polymorphism in the Japanese or the Middle Eastern populations. Reducing an ordered map of SNPs applicable to a single patient to practical use will require detailed data and sophisticated informatics. In the physician’s office, disclosure of racial and ethnic information will become necessary for appropriate medical care. In the public arena, the racial and ethnic phenotypes of control populations forming the allele-frequency databases will need to be accurately assessed.
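
One way to make "useful polymorphism" concrete is the expected heterozygosity of a biallelic marker, H = 2p(1 - p), which is zero when a population is fixed for one allele. The sketch below applies this standard measure to two rows of Table 2. It is an illustration under stated assumptions, not an analysis performed in this study: the 0.1 cutoff is arbitrary, populations are referred to by column position rather than by name, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity of a biallelic marker with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def informative_populations(freqs, min_het=0.1):
    """Column indices of populations in which the marker is polymorphic enough.

    min_het is an arbitrary illustrative cutoff; None marks a failed assay.
    """
    return [
        i for i, p in enumerate(freqs)
        if p is not None and expected_heterozygosity(p) >= min_het
    ]


# First-allele frequencies copied from Table 2 (columns in table order).
print(informative_populations([1.00, 1.00, 1.00, 1.00, 0.81]))  # 7093/MnlI
print(informative_populations([1.00, 0.90, 1.00, 0.00, 0.75]))  # ApoC1/HpaI
```

By this measure 7093/MnlI would be informative in only one of the five populations sampled, which is the kind of population dependence the paragraph above describes.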

Pharmacogenomic applications propose using specific patterns of SNPs, organized linearly as haplotypes, to detect similar loci of genetic susceptibility. While expense is an issue, the technical hurdle that needs to be solved is a rapid read-out mechanism: the speculative favorite in industry is the use of nucleic acid-based hybridization technologies (Fodor et al., 1991). Assuming that these expense and technical hurdles will be solved, the most critical element is the application of a very complicated technology to the development of medicines and its subsequent use for an individual diagnosis. It is important to note that there are formidable problems in applying an SNP map to pharmacogenomic practice. Concerns involving SNP patent issues, ethnicity evaluations, diagnostic formats, interpretation of data, changing medical practice, and patient care need to be considered. Pharmacogenomics offers great promise for providing better medical care to everyone and deserves appropriate attention from the ethical and legal sectors.

ACKNOWLEDGMENTS

We thank Arash Afshari, Jennifer Aponte, Harrison Baines, Shahid Bhatti, Linda Briley, Jon Charnecki, Christine Donnelly, Mary Fling, Dana Fraser, Rachel Gibson, Nicola Goddard, Nathalie Godinot, Steve Haneline, Louise Hosking, Indira John, Sheena Lumsden, Anita Nelsen, Jennifer O’Sullivan, Tracey Saunders, Philip Rivers, Mark Rubino, Stephanie Shouse, Jacqueline Smallwood, Sandy Stinnett, Zheng Yu Xue, David Yarnall, and Astrid Yeo for excellent experimental assistance; Chee Gee See for FISH experiments; George Altshuller, Charles Ashanti, Doreen Burt, Christopher Christodoulou, Nicky Crocker, Michele Godlevski, Bill Kimmerly, Lisa Mamo, Raymond Thomas, and Imani Torruella-Miller for DNA sequencing; Terri Fleming and Greg Miller for development of the PolyPhred user interface; Don Holt, Lei Du, Ray Grimaila, Linda McCarthy, and Tai-He Xia for database support and radiation hybrid map calculations; Paul Furdon, Suzanne Orton, Marcus Oxer, and Philippe Sanseau for sequence processing; Rob Gill for ESTBlast; Curtis Alexander, Sujoy Ghosh, and Gary Kucera for assistance in database searches; Meg Ehm, Doug Montgomery, and Quan Nguyen for calculation of the allele frequency; and Lee Babiss, Dan Burns, Nancy Shepherd, and Ann Saunders for critically reading the manuscript. SNP information has been submitted to NCBI’s dbSNP, and the radiation hybrid data vectors have been submitted to EBI’s RHDB. In addition, a graphic display of the SNP map is located at http://science.glaxowellcome.com.

REFERENCES

Anand, R., Villasante, A., and Tyler-Smith, C. (1989). Construction of yeast artificial chromosome libraries with large inserts using fractionation by pulsed field gel electrophoresis. Nucleic Acids Res. 17: 3425–3433.

Collins, F. S., Guyer, M. S., and Chakravarti, A. (1997). Variations on a theme: Cataloging human DNA sequence variation. Science 278: 1580–1581.

Cox, D. R., Burmeister, M., Price, E. R., Kim, S., and Myers, R. M. (1990). Radiation hybrid mapping: A somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes. Science 250: 245–250.

Dracopoli et al., Eds. (1995). “Current Protocols in Human Genetics,” Vol. 1, p. 5.9.15, Wiley, New York.

Fodor, S. P. A., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., and Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251: 767–773.

Gill, R. W., Hodgeman, T. C., Littler, C. B., Oxer, M. D., Montgomery, D. S., Taylor, S., and Sanseau, P. (1997). A new dynamic tool to perform assembly of expressed sequence tags (ESTs). CABIOS 13: 453–457.

Gyapay, G., Schmitt, K., Fizames, C., Jones, H., Vega-Czarny, N., Spillett, D., Muselet, D., Prud’Homme, J. F., Dib, C., Auffray, C., Morissette, J., Weissenbach, J., and Goodfellow, P. N. (1996). A radiation hybrid map of the human genome. Hum. Mol. Genet. 5: 339–346.

Hermanson, G. G., Hoekstra, M. F., McElligott, D. L., and Evans, G. A. (1991). Rescue of end fragments of yeast artificial chromosomes by homologous recombination in yeast. Nucleic Acids Res. 19: 4943–4948.

Kruglyak, L. (1997). The use of a genetic map of biallelic markers in linkage studies. Nat. Genet. 17: 21–24.

Kwok, P. Y., Carlson, C., Yager, T. D., Ankener, W., and Nickerson, D. A. (1994). Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23: 138–144. [ABI Big Dye terminators (http://www.perkin-elmer.com)]

Lander, E. S. (1996). The new genomics: Global views of biology. Science 274: 536–539.

Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265: 2037–2048.

Lewis, B. C., Shah, N. P., Braun, B. S., and Denny, C. T. (1992). Creation of a yeast artificial chromosome fragmentation vector based on lysine-2. GATA 9: 86–90.

Marshall, E. (1997a). Snipping away at genome patenting. Science 277: 1752–1754.

Marshall, E. (1997b). “Playing chicken” over gene markers. Science 278: 2046–2048.

Nickerson, D. A., Tobe, V. O., and Taylor, S. L. (1997). PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25: 2745–2751.


Nurminsky, D., and Hartl, D. (1993). Amplification of the ends of DNA fragments cloned in bacteriophage P1. Biotechniques 15: 201–208.

Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273: 1516–1517.

Rozen, S., and Skaletsky, H. J. (1997). Primer3: Source code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html.

Schafer, A. J., and Hawkins, J. R. (1998). DNA variation and the future of human genetics. Nat. Biotech. 16: 33–39.

Schuler, G. D., Boguski, M. S., Stewart, E. A., Stein, L. D., Gyapay, G., Rice, K., White, R. E., Rodriguez-Tome, P., Aggarwal, A., Bajorek, E., Bentolila, S., Birren, B. B., Butler, A., Castle, A. B., Chiannilkulchai, N., Chu, A., Clee, C., Cowles, S., Day, P. J., Dibling, T., Drouot, N., Dunham, I., Duprat, S., East, C., and Hudson, T. J. (1996). A gene map of the human genome. Science 274: 540–546. [http://www.ncbi.nlm.nih.gov/SCIENCE96/]

Seed, B. (1987). An LFA-3 cDNA encodes a phospholipid-linked membrane protein homologous to its receptor CD2. Nature 329: 840–842.

Stewart, E. A., McKusick, K. B., Aggarwal, A., Bajorek, E., Brady, S., Chu, A., Fang, N., Hadley, D., Harris, M., Hussain, S., Lee, R., Maratukulam, A., O’Connor, K., Perkins, S., Piercy, M., Qin, F., Reif, T., Sanders, C., She, X., Sun, W. L., Tabar, P., Voyticky, S., Cowles, S., Fan, J. B., and Cox, D. R. (1997). An STS-based radiation hybrid map of the human genome. Genome Res. 7: 422–433.
