Genome Annotate


    REVIEWS

    For thousands of years, rabbis have laboured over the text of the Torah, seeking to make this cryptic, uneven and internally contradictory text into a coherent system of law, and storing this commentary in an annotated version of the text, known as the Talmud. Over time, the amount of annotation in the Talmud has greatly exceeded the original text: each line of the Torah is now surrounded by layers of commentary in an onionskin fashion.

    So it is with the genome. The past decade has seen the completion of numerous whole-genome-sequencing projects, beginning with microbial genomes [1–3] and continuing with the eukaryotic species Saccharomyces cerevisiae (yeast) [4], Caenorhabditis elegans (worm) [5], Drosophila melanogaster (fruitfly) [6] and Arabidopsis thaliana (mustard weed) [7], and culminating most recently with the announcement by public and private groups of WORKING DRAFT versions of the human genome [8,9]. Other genomes are either on the way or contemplated, including mouse, rat, zebrafish, pufferfish and non-human primates.

    Like the Torah, the genome represents a mixture of evolutionary intent (insofar as natural selection can be said to intend anything) and historical accident. Although there is assuredly an underlying rhyme and reason to the genome, there is also much that is haphazard. Fragments of viral and prokaryotic genomes that infected ancestral individuals, mobile elements, pseudogenes and repetitive elements are all traps (or opportunities) waiting to surprise the genome scientist. More importantly, principal aspects of the basic organization of the genome are still quite murky: the regulation of alternative splicing, the control of transcription, the role of intergenic material, and the function of many non-coding RNAs, to name just a prominent few.

    The attention of the sequencing community is now focused on genome annotation: the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes. Genome annotation itself is a multi-step process, falling more or less neatly into three categories: nucleotide-level, protein-level and process-level annotation (FIG. 1).

    This review surveys the various ways that genome annotation is carried out, the techniques used and the diverse sociological models that have been adopted to organize the annotators.

    Nucleotide-level annotation

    Yet another eukaryotic genome is complete, and someone hands you a big unannotated FASTA FILE (three billion base pairs in the case of a mammalian species, or the content of about five CD-ROMs). What now?

    GENOME ANNOTATION: FROM SEQUENCE TO BIOLOGY

    Lincoln Stein

    The genome sequence of an organism is an information resource unlike any that biologists have previously had access to. But the value of the genome is only as good as its annotation. It is the annotation that bridges the gap from the sequence to the biology of the organism. The aim of high-quality annotation is to identify the key features of the genome, in particular the genes and their products. The tools and resources for annotation are developing rapidly, and the scientific community is becoming increasingly reliant on this information for all aspects of biological research.

    WORKING DRAFT

    A working draft sequence has come to mean a genomic sequence before it is finished. Working draft sequences contain multiple gaps, unrepresented areas and misassemblies. In addition, the error rate of working draft sequence is higher than the 1 in 10,000 error rate that is standard for finished sequence.

    FASTA FILE

    A common file format used for the storage and transfer of sequence data. It contains raw DNA or protein sequence, but no annotation information.
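To make the FASTA format concrete, here is a minimal reading sketch. It assumes the common convention of a header line starting with ">" followed by one or more sequence lines; the record names and sequences in the example are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record name: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the record name.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over many lines, so accumulate.
            chunks[name].append(line.upper())
    return {key: "".join(parts) for key, parts in chunks.items()}


# Hypothetical two-record file, for illustration only.
example = """>chrExample hypothetical record
ACGTACGT
acgtnn
>second
TTTT
""".splitlines()

genome = parse_fasta(example)
print(genome)
```

Note that nothing in the file says where genes, markers or repeats lie; that is the information the annotation process has to add.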

    NATURE REVIEWS | GENETICS VOLUME 2 | JULY 2001 | 493

    Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA.

    e-mail: [email protected]


    Finding genomic landmarks. Finding landmarks is a relatively straightforward task. Short sequences, such as PCR-based genetic markers, can be identified rapidly using the e-PCR program [10]. Longer sequences, such as restriction-fragment length polymorphism markers, can be found using BLASTN [11], SSAHA [12] or another rapid sequence-similarity searching algorithm. (See BOXES 1 and 2 for information about the resources and software tools discussed in this article.)
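The idea behind a primer-pair search can be sketched in a few lines. This is not the e-PCR program itself, only an illustration under simplifying assumptions: exact matches only, forward orientation only, and a hypothetical `margin` parameter for the tolerated deviation from the expected product size.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown symbols become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def find_sts(genome, fwd_primer, rev_primer, expected_size, margin=50):
    """Return (start, end) pairs where the primer pair brackets a product
    whose length is within `margin` of `expected_size`."""
    genome = genome.upper()
    fwd = fwd_primer.upper()
    # The reverse primer anneals to the opposite strand, so on the forward
    # strand we look for its reverse complement.
    rev_rc = reverse_complement(rev_primer)
    hits = []
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_rc, start + len(fwd))
        while end != -1:
            product_end = end + len(rev_rc)
            if abs((product_end - start) - expected_size) <= margin:
                hits.append((start, product_end))
            end = genome.find(rev_rc, end + 1)
        start = genome.find(fwd, start + 1)
    return hits


# Toy sequence: forward primer site, 20 bases of filler, reverse primer site.
toy_genome = "TTTT" + "ACGTAC" + "G" * 20 + "GATTACA" + "TTTT"
sts_hits = find_sts(toy_genome, "ACGTAC", "TGTAATC", expected_size=33, margin=5)
print(sts_hits)
```

Using the known separation of the two primers as a constraint is what keeps short, individually ambiguous primer matches from producing spurious hits.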

    In the case of the human working draft, an important bridging activity was the integration of the sequence with the cytogenetic map, much beloved by clinical geneticists and hunters of disease genes. In this case, the human cytogenetic map was integrated with the genome by systematic fluorescence in situ hybridization mapping of bacterial artificial chromosome (BAC) clones against metaphase chromosomes [13]. The assembly of the fly genome was aided immeasurably by in situ hybridization of each BAC in the physical map to polytene chromosomes, providing bridges between the cytogenetic and physical maps with an accuracy of a few hundred kilobases [6].

    Gene finding. Gene finding is the most visible part of this phase. In small prokaryotic genomes, gene finding is largely a matter of identifying long open reading frames (ORFs). Even here, however, ambiguities arise if long ORFs overlap on opposite strands, and the true coding region must be sorted out. As genomes get larger, gene finding becomes increasingly tricky. The main issue is the signal-to-noise ratio. In a prokaryotic genome, such as Haemophilus influenzae, 85% of its 1.8-Mb genome is in coding regions. The corresponding number in yeast is not much lower, at 70%. For these genomes, calling genes is an exercise in running a computer program that carries out a six-frame translation and identifies all ORFs that are longer than a chosen threshold. But even in these small genomes, finding genes has not been entirely effortless. The number of predicted yeast genes, for example, took several years to settle down, and there are still several short ORFs that have an uncertain status as bona fide genes.
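The six-frame ORF scan described above can be sketched as follows. This is a minimal illustration, assuming the standard stop codons (TAA, TAG, TGA) and ATG as the only start codon; the function name and the `min_codons` threshold are choices made for this sketch, not taken from any particular gene-calling program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown symbols become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames and return (strand, frame, start, end)
    for every ATG...stop ORF with at least `min_codons` codons before the
    stop. Coordinates are 0-based on the strand that was scanned."""
    orfs = []
    for strand, dna in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i  # first start codon opens the ORF
                elif orf_start is not None and codon in STOP_CODONS:
                    if (i - orf_start) // 3 >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3))
                    orf_start = None  # look for the next ORF in this frame
    return orfs


# Toy example: ATG + four alanine codons + stop, offset by two bases.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "CC"
toy_orfs = find_orfs(toy, min_codons=5)
print(toy_orfs)
```

The threshold is the crux: set it too low and random ORFs flood the output, too high and genuine short genes are missed, which is one reason short yeast ORFs remained of uncertain status.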

    In the fly and the worm, however, less than 25% of the genome is in coding regions, and the number falls to just a few per cent in humans. The process of finding genes is further complicated by the presence of splicing and alternative splicing. In the human genome, a typical exon is 150 bp and a typical intron is several kilobases, and there is no clear delineation between the intergenic regions that separate adjacent genes and the intragenic regions that separate exons. Defining the precise start and stop position of a gene and the splicing pattern of its exons among all the non-coding sequence is like finding a very small and indistinct needle in a very large and distracting haystack.

    Several sophisticated software algorithms have been devised to handle gene prediction in eukaryotic genomes, including GENSCAN [14], Genie [15], GeneMark.hmm [16], Grail [17], HEXON [18], MZEF [19], Fgenes [20], GeneFinder (P. Green, unpublished data)

    Mapping. The first step in genome annotation is to identify the punctuation marks. Where are the known genes, genetic markers and other landmarks previously identified by genetic, cytogenetic or RADIATION HYBRID MAPPING? Where are the tRNAs, rRNAs and other non-translated RNAs? Where are repetitive elements? Is there evidence for ancient duplications in the genome and, if so, where are the end points of the putative duplicated regions? All these questions are really an extended form of physical mapping, attempting to convert the terra incognita of raw DNA sequence into a set of easily recognized landmarks and reference points.

    Along with gene finding, the principal activity of this phase of annotation is identifying and placing all known landmarks into the genome. For example, during nucleotide-level annotation, annotators will search for known genetic markers, radiation hybrid markers and CLONE ENDS in the sequence, and place them; they thereby form bridges between the genomic sequence and pre-existing genetic, radiation hybrid and physical maps. This provides a path to connect the pre-genomic literature, which is often based on such landmarks, with post-genomic research.

    RADIATION HYBRID MAPPING

    An experimental technique that uses radiation-induced chromosomal breakpoints in somatic-cell hybrids to map the positions of sequence tagged sites (STSs).

    CLONE ENDS

    Genomic sequencing projects typically sequence the ends of bacterial artificial chromosome and plasmid clones, in addition to shotgun sequencing entire clones. During assembly, the clone end sequences are used to create a scaffold on which the genome sequence is pieced together.

    Figure 1 | The three layers of genome annotation: where, what and how? Where: nucleotide-level annotation. What: protein-level annotation. How: process-level annotation.


    RNAs. rRNAs can be found easily by similarity searching, but the rest are tricky, because of both their short length and their nucleotide diversity. tRNAs are amenable to de novo prediction through algorithms that search for characteristic structural signatures, such as hairpin formation [30–32]. The most widely used tRNA prediction program is tRNAscan-SE [32], which combines several algorithms to identify tRNAs with high accuracy in good running times. It can also distinguish active tRNAs from tRNA pseudogenes. This program was used during the recent annotation of the public human sequence to identify 497 tRNAs and 324 putative pseudogenes [9].

    Other non-coding RNAs, such as telomerase RNA and the U1–12 series of spliceosome RNAs, can be identified by sequence similarity, but there are likely to be many non-coding RNAs that have not yet been identified [33]. Just coming online now are new algorithms based on identifying characteristic patterns of mismatched base pairs in cross-species alignments, for example mouse and human (S. Eddy, personal communication). Preliminary results indicate that there might be hundreds of previously unrecognized non-coding RNAs in the genome. It will be fascinating to learn the role and function of non-coding RNAs discovered in this way.

    The situation is similar with regulatory regions (reviewed in REF. 34). A relatively small number of transcription-factor-binding sites have been identified by classical experimental methods. The sequences for these sites are available in curated databases such as TRANSFAC and PROSITE, and can be found in genomic sequence by applying similarity search methods that are specialized for short motifs. However, as with the non-coding RNAs, the number of known regulatory regions is almost certainly a small fraction of what is out there.

    The development of algorithms to search for regulatory regions is a hot research topic in bioinformatics. One popular class of algorithms, exemplified by the MEME program [35], searches for motifs in nucleotide and protein sequences that occur more often than would be predicted by chance. However, the predictions from such algorithms are strongly dependent on the care with which the input sequences are chosen, and are typically used strictly as the starting point for experimental confirmation.

    However, it is widely anticipated that nucleotide-level similarity comparisons among two or more related species will allow us to identify control and regulatory regions with notably greater power and accuracy. When orthologous genomic regions of two closely related species, such as C. elegans and Caenorhabditis briggsae [36,37], mouse and human [38,39], and human and chimpanzee [40], are examined, conserved blocks of nucleotide similarity upstream from the transcriptional start site are commonly found, indicating possible evolutionarily conserved regulatory regions. With the mouse genomic sequence now becoming available, there should be a great explosion in such studies over the next few years.
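One simple way to look for such conserved blocks, given two sequences that have already been aligned, is a sliding-window identity scan. The sketch below is an illustration only, not the method used in the cited studies; it assumes gapped alignment strings of equal length with "-" for gaps, and the `window` and `min_identity` values are arbitrary.

```python
def conserved_blocks(aln_a, aln_b, window=20, min_identity=0.8):
    """Return merged (start, end) alignment-column ranges in which the
    windowed fraction of identical, non-gap columns is >= min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatch or gap.
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    blocks = []
    for start in range(len(match) - window + 1):
        if sum(match[start:start + window]) / window >= min_identity:
            end = start + window
            if blocks and start <= blocks[-1][1]:
                blocks[-1] = (blocks[-1][0], end)  # extend previous block
            else:
                blocks.append((start, end))
    return blocks


# Toy alignment: ten identical columns followed by ten mismatches.
species_a = "ACGTACGTAC" + "TTTTTTTTTT"
species_b = "ACGTACGTAC" + "GGGGGGGGGG"
blocks = conserved_blocks(species_a, species_b, window=5, min_identity=1.0)
print(blocks)
```

In practice the hard part is upstream of this step: producing a reliable alignment of orthologous regions, and choosing species at a divergence where functional sequence stands out from neutral background.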

    The Human Sequencing Consortium, based on the Ensembl gene annotation system [9], took almost the reverse approach, beginning with ab initio gene predictions from GENSCAN and then strengthening the predictions using nucleotide and protein similarities. These predicted gene models were merged and reconciled with the output of Genie EST, and finally merged with the contents of the RefSeq library.

    Although the two groups approached the gene finding problem from different directions, both gave greater weight to cDNA and EST alignments than to ab initio gene prediction. So, it is not too surprising that the estimates from both groups of the number of genes were very close, ~30,000.

    Non-coding RNAs and regulatory regions. There is much more to the genome than coding regions. On the cutting edge of nucleotide-level annotation is the search for non-coding RNAs and transcriptional regulatory regions. Non-coding RNAs include tRNAs, rRNAs, small nucleolar RNAs and small nuclear

    Box 1 | Resources and tools I

    The following list provides brief descriptions of some of the software tools and resources mentioned in the article. Online, links are provided to these resources from this box, and from the text.

    BLASTN, BLASTX, BLASTP, PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
    This family of sequence-similarity search tools allows you to rapidly search a query protein or nucleotide sequence against a large database of sequences, to identify sequences that are similar to the search sequence.

    Ensembl: http://www.ensembl.org
    This web site, a joint project of the European Bioinformatics Institute and the Sanger Centre, seeks to make available a high-quality, consistent set of annotations on the human and mouse genomes.

    e-PCR: http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
    BLAST does not work well with short sequences, such as PCR primers. The e-PCR program was developed to search rapidly for a primer pair (using their sequences and known physical separation) in a large sequence, such as a genome.

    FlyBase: http://www.flybase.org/
    A fully realized model organism system database, presenting curated annotations on the Drosophila melanogaster genome, as well as rich information on the genetics of the organism, mutant strains and molecular resources.

    GeneMark.hmm: http://genemark.biology.gatech.edu/GeneMark/
    Another gene-prediction program that uses hidden Markov models (HMMs). The online version supports numerous eukaryotic and prokaryotic genomes.

    Genie: http://www.fruitfly.org/seq_tools/genie.html
    The gene-prediction program used to annotate genes in Drosophila melanogaster. The online version has been trained for human and Drosophila sequence.

    GENSCAN: http://genes.mit.edu/GENSCAN.html
    This is probably the most widely used gene-prediction program. It uses HMMs to predict the presence of a gene given the raw DNA sequence. The online version provides prediction services for vertebrates, Arabidopsis thaliana and maize.

    Grail: http://compbio.ornl.gov/Grail-1.3/
    One of the oldest gene-prediction programs still in use, this software uses a neural network to predict genes. The online version provides gene-prediction services for human, mouse, Arabidopsis, Drosophila and Escherichia coli sequences.


    However, repetitive elements are more than a nuisance for genome assembly. The study of repeats is itself a fascinating discipline, which has revealed a virtual bestiary of repeat families, each with unique distinguishing features and survival strategies. There were several surprises involving repetitive elements during the annotation of the human genome. One of the more interesting came from the analysis of the apparent age of representatives of each family, in which age was determined by counting the mutations that had accumulated in the representative relative to the canonical sequence. This analysis [9] detected a marked and unexplained decline in transposable element activity over the past 35–50 Myr of human evolution. Another unexpected outcome of this annotation activity was the finding of a shift of older ALU-type transposable elements from (A+T)-rich regions, which are favoured by the transposition mechanism, to gene-rich (G+C)-rich regions, a finding that raises the possibility that Alu elements are subject to positive selection. Clearly, much is to be gained from a closer examination of this junk DNA.
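The age estimate described above rests on counting differences between a repeat copy and its family consensus. A rough sketch of that calculation follows. It assumes the copy is already aligned to the consensus, applies the standard Jukes-Cantor correction for multiple substitutions at the same site, and leaves the substitution rate as a caller-supplied parameter, since no rate is given in the text; the sequences are invented.

```python
import math


def divergence(copy, consensus):
    """Fraction of aligned, non-gap columns at which the copy differs
    from the consensus."""
    pairs = [(a, b) for a, b in zip(copy.upper(), consensus.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance: expected substitutions per site given the
    observed mismatch fraction p (valid for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def estimated_age_myr(distance, subs_per_site_per_myr):
    """Convert a distance into an age; the rate is an assumption that the
    caller must supply."""
    return distance / subs_per_site_per_myr


# Invented 10-base example with one mismatch against the consensus.
p_obs = divergence("ACGTACGTAA", "ACGTACGTAC")
d_jc = jukes_cantor(p_obs)
print(p_obs, round(d_jc, 4))
```

Binning many copies of a family by such distances gives the age distribution from which a decline in transposition activity can be read off.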

    Mapping segmental duplications. Distinct from identifying and mapping repetitive elements is the identification of large segmental duplications in the genome.

    One of the big surprises that emerged during the sequencing of the mustard weed genome was evidence for several distinct large-scale duplication events in the organism's past [7,41]. Over 60% of the predicted ORFs in Arabidopsis match a paralogue somewhere else in the genome, and these duplicated genes are organized into large syntenic blocks that might be several hundred ORFs long (FIG. 3).

    An analysis of the apparent age of these duplications, on the basis of sequence divergence, indicates four distinct segmental-duplication events in the genome, occurring between 100 and 200 Myr ago [41], a period of time that spans the diversification of the ANGIOSPERMS. This helps to explain the paucity of conserved synteny between the genetic maps of rice (a MONOCOTYLEDON) and Arabidopsis (a DICOTYLEDON). So, the study of segmental duplications can reveal important information about the evolutionary history of a species.

    The human genome itself has blocks of segmental duplication, although none so marked as mustard weed. The Celera and sequencing consortium papers indicate that 5% of the genome might be involved in ancient segmental duplications, and there is preliminary evidence that the true number might be significantly higher (E. Eichler, personal communication).

    Mapping variations. The last principal activity of nucleotide-level annotation is identifying and mapping polymorphisms. Single-nucleotide polymorphisms (SNPs) have emerged as a valuable tool for genetic mapping, population genetics studies and clinical diagnosis [42].

    SNPs had a prominent role in the recent annotations of the human genome [8,43]. In principle, it is easy to identify SNPs by simply aligning the genomic

    Identifying repetitive elements. Repetitive, or interspersed, elements are an important feature of eukaryotic genomes, and indeed account for a large proportion of the variation in genome size. In the human genome, the largest genome sequenced so far, 44% of nucleotides are contributed by repetitive elements [9]. The repetitive elements are derived from active transposable elements and are frequently dismissed as parasitic or junk DNA.

    Identifying and mapping repetitive elements actually starts well before any other annotation activities begin, because of the need to identify and exclude repetitive regions during the genome assembly process. Known repetitive elements are masked from the sequence using a program such as RepeatMasker (A. F. A. Smit and P. Green, unpublished data), or are identified and excluded by measuring the apparent coverage of the region being assembled. Once a family of repetitive elements is identified, it is relatively straightforward to find other members of the family by nucleotide-based sequence-similarity searching.

    ALU SEQUENCE

    A dispersed, intermediately repetitive DNA sequence found in the human genome in about 300,000 copies. The sequence is about 300 base pairs long. The name Alu comes from the restriction endonuclease (AluI) that cleaves it.

    ANGIOSPERM

    Flowering seed plant.

    MONOCOTYLEDON

    One of the two principal classes of flowering plant, monocots are characterized by a single cotyledon (primitive leaf) in the embryonic plant. Maize, rice, wheat and other grasses are common monocots.

    Box 2 | Resources and tools II

    HMMER: http://hmmer.wustl.edu/
    HMMER is a protein-similarity search engine. It uses HMMs to provide greater sensitivity when searching for evolutionarily distant proteins. It is more accurate than BLASTP, but notably slower.

    HMMGene: http://www.cbs.dtu.dk/services/HMMgene/
    Another HMM-based gene-prediction program. This one provides options for vertebrate and Caenorhabditis elegans sequences.

    InterPro: http://www.ebi.ac.uk/proteome/
    The InterPro database is on its way to becoming the prime resource for protein-level annotation.

    Proteome, Inc.: http://www.proteome.com/
    An example of the cottage industry approach to annotation. The curated databases at this site include protein databases for C. elegans, yeast, human and mouse.

    RefSeq and LocusLink, National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/
    In addition to the GenBank database of nucleotide and protein sequences, and the PubMed bibliographic search service, NCBI hosts RefSeq and LocusLink, two curated databases of nucleotide- and protein-level annotation.

    RepeatMasker: http://www.genome.washington.edu/uwgc/analysistools/repeatmask.htm
    RepeatMasker is the most widely used tool for identifying and masking repeats in genomes. It operates using a large database of repeat consensus sequences previously identified in various eukaryotic species.

    SSAHA: http://www.sanger.ac.uk/Software/analysis/SSAHA/
    SSAHA is a new, fast algorithm for searching for nearly identical nucleotide sequences. It makes it possible to quickly match a partial DNA sequence to a large genome, such as the human genome.

    SWISS-PROT and SWISS-PROT TrEMBL: http://www.ebi.ac.uk/swissprot/
    The SWISS-PROT database is a high-quality, curated database of protein sequences from all species. SWISS-PROT TrEMBL is an automated merge of SWISS-PROT with predicted proteins from genomic sequencing and other sequencing projects. A rule-based system minimizes the amount of redundancy and errors in the merged database.

    UCSC Genome Browser: http://genome.ucsc.edu/
    This web site, run by graduate student Jim Kent, presents an integrated view of annotations on the human genome contributed by numerous groups.


    Sequence data for SNP calling can come from various sources. In the case of C. elegans, SNPs have been identified by shotgun sequencing DNA from various wild-type isolates (C. elegans SNP data [45]) and then aligning it to the reference genomic sequence. In the case of the human sequence, SNPs have been called by aligning ESTs to genomic sequence and by pooling and shotgun sequencing the genomic DNA of a panel of unrelated individuals, followed by alignment of the sequence data to the human working draft [43]. Numerous SNPs came from within the human working draft itself by identifying variations in regions of clone overlap. In the case of the Celera draft, a total of five unrelated donors were used to create the libraries used for shotgunning and assembly. In the case of the public consortium sequence, 75% of the sequence was contributed by a single individual, and the polymorphisms found were presumably because of heterozygosity.

    Annotation of human SNPs has yielded some unexpected results. The most intriguing finding is that the distribution of SNPs departs significantly from what would be predicted under the standard population genetics models. Some regions of the genome contain SNP hot spots and others are SNP poor. What combination of historical, structural or selective pressures is responsible for this uneven distribution, and how is it related to other factors, such as the distribution of genes and repetitive elements?

    Protein-level annotation

    After asking 'where', genome annotators ask 'what'. This stage of genome annotation seeks to compile a definitive catalogue of the proteins of the organism, to name them and to assign them putative functions.

    H. influenzae has 1,709 genes, yeast has ~5,600, and fly, worm, mustard weed and human have ~13,000, ~19,000, ~24,000 and >30,000, respectively. Of these gene sets, only a small fraction corresponds to known, well-characterized proteins. Faced with numerous proteins of unknown function, annotators generally begin by classifying them into more manageable groups or protein families, and by using similarities to better-characterized proteins of other species.

    This process sounds easier than it is in practice. The intrinsic problem comes from the nature of the evolutionary process. During the evolution of a protein family, an ancestral protein is duplicated one or more times, and the copies diverge, forming a family of related proteins known as paralogues. However, similarity in function does not inevitably follow from sharing a common ancestor, and there are many cases of two protein family members that have strikingly divergent functions. For example, the lens crystallins are derived from a family of proteins that ordinarily function as enzymes and CHAPERONINS [46,47]. Further complicating the issue is the proclivity of genes to pick up or lose functional domains during the course of evolution, creating chimeric proteins that share two or more unrelated ancestors [48].

    sequences of two or more individuals and finding places where the sequence of one diverges from the other. In practice, SNP-calling algorithms must distinguish biological variations from those due to sequencing errors. Several algorithms have been developed to do this [12,43,44]; each uses sequence quality data to make a probability estimate.
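How quality data can feed a probability estimate is easiest to see in a toy calculation. The sketch below is not any of the cited algorithms. It assumes Phred-style quality scores, where a score Q corresponds to an error probability of 10^(-Q/10), treats read errors as independent, and uses an arbitrary cut-off named `max_error_prob`.

```python
def phred_to_error(q):
    """Convert a Phred-style quality score into an error probability."""
    return 10 ** (-q / 10)


def call_snp(ref_base, pileup, max_error_prob=1e-4):
    """Given the reference base and a list of (base, quality) observations
    at one position, return (alt_base, p_all_errors) if the best-supported
    alternative base is unlikely to be explained by sequencing errors
    alone, else None."""
    ref = ref_base.upper()
    support = {}
    for base, quality in pileup:
        base = base.upper()
        if base != ref and base in {"A", "C", "G", "T"}:
            # Probability that every read showing this base is an error
            # (a simplification: errors are assumed independent).
            support[base] = support.get(base, 1.0) * phred_to_error(quality)
    if not support:
        return None
    alt = min(support, key=support.get)
    p_all_errors = support[alt]
    return (alt, p_all_errors) if p_all_errors <= max_error_prob else None


# Two good-quality reads disagree with the reference: plausible variant.
confident = call_snp("A", [("A", 30), ("G", 30), ("G", 25), ("A", 40)])
# One low-quality mismatch: more likely a sequencing error.
doubtful = call_snp("A", [("A", 30), ("T", 10)])
print(confident, doubtful)
```

A single low-quality mismatch is therefore discounted, whereas the same mismatch seen in several high-quality reads is reported as a candidate SNP.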

    DICOTYLEDON

    One of the two principal classes of flowering plant, dicots are characterized by two cotyledons (primitive leaves) in the embryonic plant. Tomatoes, maple trees and mustard are common dicots.

    CHAPERONINS

    A class of ring-shaped, heat-shock proteins that have a key role in protein folding and protection from stress.

    Figure 2 | A hidden Markov model explicitly models the probabilities for the transition from one part of a gene to another. In this model, used by the GENSCAN algorithm, each circle or diamond represents a functional unit in the gene. For example, Einit is the initial exon and Eterm is the last. The arrows represent the probability of a transition from one part of a gene to another. The algorithm is trained by running a set of known genes through the model and adjusting the weights of each transition to reflect realistic transition probabilities. Thereafter, test sequence data can be run through the model one base position at a time, and the model will read out the probability of a gene being present at that position. The states that occur below the dashed line correspond to a gene in the reversed strand, and thus are symmetric with those above the line. States shown on the forward (+) and reverse (−) strands: E0, E1, E2, Einit, Eterm, Esngl (single-exon gene), I0, I1, I2, F (5′ UTR), T (3′ UTR), P (pro), A (poly-A signal); N, intergenic region. E, exon; I, intron; UTR, untranslated region; pro, promoter. (From REF. 14 © Academic Press Ltd (1997).)
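The decoding step of a hidden Markov model can be illustrated with a deliberately tiny example. The sketch below is not GENSCAN: it assumes only two states, "N" (non-coding) and "C" (coding), with invented start, transition and emission probabilities, and uses the standard Viterbi algorithm to find the most probable state path for a sequence.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `seq` as a string of
    one-letter state names, computed in log space."""
    if not seq:
        return ""
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
             for s in states}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Invented toy parameters: the coding state prefers G/C, the non-coding
# state prefers A/T, and both states tend to persist.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

decoded = viterbi("ATATGCGCGCATAT", STATES, START, TRANS, EMIT)
print(decoded)  # NNNNCCCCCCNNNN
```

A real gene-finding model has many more states (exon phases, introns, UTRs, promoters, both strands) and parameters learned from known genes, but decoding works on the same principle.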


    bibliographic references, descriptions of the function and biological role of the protein, protein family assignments and pointers to structural data, if available. Because of its curated nature, the protein sequences contained in SWISS-PROT are likely to be of high reliability.

    However, the accelerated rate of genome sequencing has outstripped the speed of the curators of SWISS-PROT. This gap is filled by SWISS-PROT TrEMBL, an automated translation of coding DNA sequence (CDS) entries submitted to the nucleotide databases. It acts as the pre-annotation source for SWISS-PROT. TrEMBL is filtered to remove redundancy within itself and between TrEMBL and SWISS-PROT, and is then subject to a first-pass search for functional domains using the tools described below. The current version of TrEMBL (release 16) contains 489,620 sequences, an order of magnitude larger than SWISS-PROT.

    A complementary approach is to search against databases of functional domains. Among the commonly used databases is PFAM [51], a collection of HMM profiles and alignments for common protein families. The current release (6.1, March 2001) contains 2,727 protein family entries, and is searchable using the HMMER software tool [32]. A typical PFAM entry is TAZ zinc finger, the motif that allows the cAMP-response-element-binding protein (CREB) to bind to other transcription factors, such as p53, and to regulate the initiation of transcription. Other databases commonly used during automated protein annotation include the following: PRINTS [52], a compendium of short protein motifs that captures common protein folds and domains; PROSITE [53], a database of longer protein signatures known as profiles; ProDom [54], a collection of protein domains derived from the PSI-BLAST procedure [50]; the SMART curated collection of protein domains [55]; and BLOCKS [56], a database of conserved protein regions and their multiple alignments.

    The various protein family, domain and motif databases are highly overlapping, but differ in their nomenclature, their search methods and their suitability for diverse tasks. This makes it difficult to interpret the results when a predicted protein hits entries in several of the databases. Have they hit the same thing?

    An important development in recent years has been an intensive effort to integrate the protein signature databases into the unified InterPro resource [57]. InterPro is a cross-referencing system for equivalent entries in the PFAM, PRINTS, PROSITE, ProDom, BLOCKS and SMART databases. This resource allows genome annotators to run a predicted protein against each of the member databases, collect the matching domains and families, and then translate each into its corresponding unique InterPro entry. The most recent InterPro release (3.0, March 2001) contains 3,591 entries that correspond to 2,628 protein families, 888 domains, and 75 repeats and post-translational modification sites. Each InterPro entry contains a brief description of the family or domain, a list of SWISS-PROT and TrEMBL proteins that contain the entry, literature references and outgoing links to the corresponding entries of the member databases.

    The comparison of proteins between species is a rich source of functional annotation. For example, if a well-characterized yeast protein is known to be involved in the initiation of DNA replication, then it is likely that a protein predicted from the human genomic sequence that is similar to the yeast protein will have the same function. However, the nature of protein evolution again lurks to ambush the unwary annotator. The human gene might be directly descended from a common ancestor of the yeast gene, in which case it is called an orthologue, or it might be descended from a duplicated and diverged copy of the gene, in which case it is a paralogue. In the latter case, it would be a mistake to conflate the yeast with the human protein and assume that they have the same functional role.

    Although various techniques have been developed to identify and cluster groups of orthologous proteins in an automated fashion, for example the COG system49, many predicted proteins evade unambiguous automated classification into an orthologous group. In practice, what is typically done now is to classify predicted proteins on the basis of functional domains, folds and motifs, as well as by broad similarity to better-characterized proteins. The sorting out of protein phylogeny then follows on as a slower research activity.

    A typical protein annotation pipeline will search for similarities using the BLASTP or PSI-BLAST tools50 against several different databases of protein sequences. Among the most valuable of the whole-protein sequence collections are SWISS-PROT and SWISS-PROT TrEMBL28. The former is a curated collection of confirmed protein sequences (86,593 as of release 39), which have been extensively annotated and cross-referenced with other sequence and structure databases. SWISS-PROT annotations include

    Figure 3 | Segmental duplications. The Arabidopsis thaliana genome is dominated by large regions of segmental duplication. The horizontal bars represent the five Arabidopsis chromosomes (I–V). Coloured bands connect similar regions. In some cases, the duplications are on the same chromosome, and in others, the duplications have been inverted (twisted bands). Light grey regions are the nucleolus organizer regions, black represents the centromeres. Scale is in megabases (Mb). (From Dirk Haase, MIPS Institute for Bioinformatics, GSF National Research Centre for Environment and Health, Munich, Germany. Kindly provided by the MIPS Arabidopsis thaliana database.)


    terms are added to the existing hierarchy. To give a specific example of how this works, the enzyme phosphatidate cytidylyltransferase (EC 2.7.7.41) is described by the GO function 'phosphatidate cytidylyltransferase' (accession GO:0004605), the process 'phospholipid biosynthesis' (accession GO:0008654) and the component 'membrane fraction' (GO:0005624).
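As a rough illustration of what such an annotation looks like as data, the sketch below records the three terms quoted above against a single gene product. The accessions and term names are those given in the text; the class name `GOTerm`, the aspect labels and the helper `terms_by_aspect` are assumptions made for this example and do not reflect any particular database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    accession: str
    name: str
    aspect: str  # label for one of the three GO subparts (naming is an assumption)


# Terms for phosphatidate cytidylyltransferase (EC 2.7.7.41), as quoted in the text.
TERMS = [
    GOTerm("GO:0004605", "phosphatidate cytidylyltransferase", "molecular_function"),
    GOTerm("GO:0008654", "phospholipid biosynthesis", "biological_process"),
    GOTerm("GO:0005624", "membrane fraction", "cellular_component"),
]


def terms_by_aspect(terms):
    """Group term accessions by the GO subpart they belong to."""
    grouped = {}
    for term in terms:
        grouped.setdefault(term.aspect, []).append(term.accession)
    return grouped


if __name__ == "__main__":
    for aspect, accessions in terms_by_aspect(TERMS).items():
        print(aspect, accessions)
```

Because the same accessions are used by every participating database, two gene products from different species can be compared simply by comparing their sets of accessions.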

    Recently, other model organism databases have joined the GO Consortium, including the Arabidopsis Information Resource62, the C. elegans database WormBase63 and the fission yeast database PomBase. Each of these genome databases is annotating the confirmed and predicted genes in its corresponding species using GO terms, allowing quick comparison and cross-referencing between them. GO has also been adopted by several groups for use in annotating human genes. GO-term annotations are a feature of the InterPro database mentioned earlier, and are used in the Proteome databases and in the RefSeq database26.

    Process-level annotation extends well beyond purely computational work. High-throughput techniques, such as transposon-mediated insertional mutagenesis (reviewed in REF. 64), microarray expression analysis65, RNA INTERFERENCE66, direct assay of protein-expression levels by mass spectroscopy67, screening for protein–nucleotide interaction sites using systematic evolution of ligands by exponential enrichment (SELEX)68 and other techniques, green-fluorescent-protein-based assays for determining the anatomic and temporal patterns of gene expression69 and yeast two-hybrid studies70, all provide vital clues to the role that genes and proteins have in biological processes, and provide a rich layer of annotation. Indeed, taken a step further, conventional bench research and genome annotation begin to merge. Each experiment adds an item of information to our knowledge of biology, and this in turn enhances our understanding of the genome through the genes and proteins it touches.

    The sociology of genome annotation

    Annotation at the nucleotide, protein and process levels forms the basis of genome annotation. All that remains is marshalling the technology, manpower and computational resources to undertake the mammoth task of annotating a new genome. Inevitably, this leads our discussion away from the science of annotation to its sociology.

    Organizing genome annotation efforts. Genome annotation generally follows one of several organizational models: the factory, the museum, the cottage industry and the party. Each model is better suited to a different phase of the annotation task.

    During the first phase of annotation, when the primary concern is to find genes, to map variations and to identify other structural landmarks, the factory model works best. This model, best exemplified by the Ensembl project, relies on a high degree of automation. The Ensembl annotation system is built around a cluster of 500 computers and a pipeline of annotation

    InterPro has been used as the basis for several protein annotation projects for the yeast, worm, fruit fly, Arabidopsis and human genomes. In these organisms, between 40 and 50% of predicted proteins have a match to at least one InterPro entry. More than half of the predicted proteins in eukaryotes belong to novel protein families, highlighting how much still needs to be learned.

    Process-level annotation

    The last and, in many ways, most challenging part of genome annotation is relating the genome to biological processes. How do the building blocks of genes and proteins relate to the cell cycle, cell death, embryogenesis, metabolism, and the maintenance of health and disease? Functional annotation, as it is sometimes called, has long been a feature of genome-sequence analysis. The publication of each new genome is inevitably accompanied by a table or a pie-chart showing the distribution of proteins classified by function, for example metabolism and cytoskeleton. Until recently, what had been lacking in these analyses was a commonly accepted classification scheme that combined the breadth required to describe biological functions among diverse species with the specificity and depth needed to distinguish a particular protein from other members of its family. The lack of such a standard hampered the ability to relate genes that were annotated by different research groups, particularly when crossing species borders.

    A breakthrough of sorts came two years ago, when three model organism databases, the Saccharomyces Genome Database58, FlyBase59 and the Mouse Genome Database60, formed a consortium to create a Gene Ontology (GO)61. The GO is a standard vocabulary for describing the function of eukaryotic genes. It consists of three subparts: molecular function, biological process and cellular component. Molecular function terms describe the tasks carried out by individual gene products, such as their enzymatic activity. Biological process terms are used for broader biological goals, such as meiosis. Cellular component terms describe genes in terms of the subcellular structures they are localized to, such as organelles, as well as the macromolecular complexes they belong to, such as the ribosome.

    The elegance of the GO is that it is organized as a hierarchy of terms (or more accurately, as a DIRECTED ACYCLIC GRAPH that allows a term to appear in several places in the hierarchy). More general terms, such as enzyme, lead to more specific terms, such as lyase, carbon–oxygen lyase, hydro-lyase and threonine dehydratase. This flexibility allows genes to be annotated to whatever level of specificity the current biological understanding allows. A protein that is clearly an orthologue of a threonine dehydratase in another species can be annotated with the most specific term, whereas another protein that is clearly in the lyase family, but the enzymatic activity of which has not been confirmed, can be labelled with one of the more generic terms. This design also allows the GO to become increasingly bushy as more specialized

    DIRECTED ACYCLIC GRAPH

    (DAG). A type of hierarchy similar to the outline of a paper in that it has headers, subheaders and sub-subheaders. The main difference from a strict hierarchy is that each topic in a DAG is allowed to have more than one parent topic.
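A minimal sketch of such a graph, using the enzyme-to-threonine-dehydratase chain from the main text, might look as follows. The dictionary layout and the function names `ancestors` and `annotated_under` are assumptions made for illustration, and the term labelled "hypothetical second parent" is invented solely to show a topic with two parents; it is not a real GO term.

```python
# Child term -> list of parent terms.
# The lyase chain follows the example in the main text; the
# "hypothetical second parent" term is invented to show multiple parents.
PARENTS = {
    "lyase": ["enzyme"],
    "carbon-oxygen lyase": ["lyase"],
    "hydro-lyase": ["carbon-oxygen lyase"],
    "threonine dehydratase": ["hydro-lyase", "hypothetical second parent"],
    "hypothetical second parent": ["enzyme"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_under(gene_terms, query, parents=PARENTS):
    """True if any annotated term is the query term or one of its descendants."""
    return any(t == query or query in ancestors(t, parents) for t in gene_terms)


if __name__ == "__main__":
    print(sorted(ancestors("threonine dehydratase")))
    print(annotated_under(["threonine dehydratase"], "lyase"))
```

A gene annotated only with the generic term "lyase" is found by a query for "enzyme" but not by a query for "hydro-lyase", which mirrors the idea of annotating to whatever level of specificity current knowledge allows.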

    RNA INTERFERENCE

    A phenomenon in which the expression of a gene is temporarily inhibited when a double-stranded complementary RNA is introduced into the organism.

    SELEX

    This is an in vitro selection method in which very large collections of oligonucleotides can be screened for specific functions.


    from the paper and integrate them with the database. The result is a curated best guess at the functional significance of each gene in the genome, with ties back to the literature whenever possible.

    In addition to genome annotations, model organism databases curate information about special aspects of the biology of their model. For example, WormBase maintains the cell lineage and neuronal wiring diagram of the nematode. FlyBase tracks the many strains and mutants that make Drosophila a rich experimental animal.

    A variant on the museum theme is the cottage industry, an organizational model adopted with success by Proteome, Inc. (J. Gerrels, personal communication). In this model, the bulk of the curators work part-time out of their homes or labs, and are recruited from the ranks of postdoctoral fellows, graduate students and faculty.

    Then there is the party model, made famous by the Drosophila annotation jamboree hosted by Celera and the Berkeley Drosophila Genome Sequencing group71. The party model puts leading biologists from the community into the same room together with an equal number of bioinformaticians and has them spend a solid block of time (typically a week) annotating the genome. The biologists mine the genome for their favourite gene families, and the bioinformaticians provide the tools and technical expertise to facilitate this. The jamboree is, in effect, a frontal charge on the genome.

    The jamboree model was more recently used at a meeting of the FANTOM (Functional Annotation of Mouse) Group at RIKEN in September 2000, to annotate ~21,000 full-length mouse cDNA clones72. However, other groups have not rushed to embrace this organizational model, and whether annotation jamborees are to become a permanent feature of the genomics landscape, once the novelty of genomic sequences wears off, remains to be seen.

    Publishing and sharing annotations. Genomes are typically annotated by several groups. For example, the human genome is undergoing active annotation by Celera, Ensembl, the National Center for Biotechnology Information, the computational biology group at Oak Ridge National Laboratory and others.

    The manifest strength of having several groups involved is that researchers benefit from their diverse approaches. The weakness is that a diversity of sources for the annotations tends to fragment the information. A researcher seeking to compare the annotations made by one group to those of another must visit several different web sites and surmount various obstacles, including incompatible user interfaces, coordinate systems, file formats and naming conventions. Without extensive bioinformatics support, this task can be nearly impossible to carry out on more than a few genes at a time.

    Fortunately, several positive trends indicate that there is light at the end of the tunnel. One is in the realm of file formats, in which a small number of

    programs. A sequence entering the pipeline is run through a suite of gene-prediction programs, various nucleotide- and protein-based similarity searching algorithms, and the protein-domain search programs described above. The results are then collated by a rule-based system into the gene predictions that are published on the Ensembl Web site. This design leads to a deliberately broad but shallow baseline annotation of the genome (FIG. 4). As a consequence, the bulk of the Ensembl development team are engineers and computer scientists.
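To make the idea of rule-based collation concrete, the toy sketch below keeps an ab initio prediction only when at least one piece of supporting evidence overlaps it. This is an assumption-laden illustration, not a description of the actual Ensembl rules: the classes `Evidence` and `Candidate`, the function `collate`, the source labels and the single overlap rule are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str  # e.g. "ab_initio", "protein_similarity", "domain_search" (labels are assumptions)
    start: int
    end: int


@dataclass
class Candidate:
    start: int
    end: int
    evidence: list = field(default_factory=list)


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def collate(ab_initio, supporting, min_support=1):
    """Toy rule: accept a prediction only if enough supporting hits overlap it."""
    accepted = []
    for pred in ab_initio:
        hits = [e for e in supporting
                if overlaps(pred.start, pred.end, e.start, e.end)]
        if len(hits) >= min_support:
            accepted.append(Candidate(pred.start, pred.end, [pred] + hits))
    return accepted


if __name__ == "__main__":
    predictions = [Evidence("ab_initio", 100, 500),
                   Evidence("ab_initio", 1000, 1500)]
    support = [Evidence("protein_similarity", 450, 600)]
    for gene in collate(predictions, support):
        print(gene.start, gene.end, [e.source for e in gene.evidence])
```

Raising `min_support` trades sensitivity for specificity, which is one way to picture why a fully automated pass yields a broad but shallow baseline that curators later refine.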

    In contrast to the factory is the museum model, typically seen in later phases of genome annotation, when the emphasis shifts from finding the genes to interpreting their functional roles. In the museum model, a set of curators catalogue and classify the genome in a systematic fashion, finding and correcting mistakes made by the gene-prediction and functional annotation algorithms. In contrast to the earlier phase, much of this work is done by hand, and much of it is orientated towards capturing the current and past scientific literature on the organism and integrating it with the genomic sequence.

    The museum model is favoured by the model organism databases, and also applies to some extent to the groups who study the relationship and phylogeny of protein sequences. In a typical model organism database, curators work through a certain number of research papers each week, abstract the conclusions

    Figure 4 | An example of genome annotation. The Ensembl web site is a rich source of annotations on the human genome. This view shows the positions of predicted genes, homologous regions of the mouse, single nucleotide polymorphisms, repetitive elements, and the locations of clone ends and other structural landmarks. (From the Ensembl Web site, a collaboration between EMBL-EBI and the Sanger Centre.)


    The differences between conventional biological research and genome annotation have tended to marginalize the latter. Annotation is seen as an activity that is set apart from biology, done by a special group of people who are not exactly biologists. This is unfortunate, because after the initial high-throughput stage, genome annotation becomes crucially dependent on conventional hypothesis-driven research. Tying the experimental literature to the genome is a crucial step in making genomic data useful to the general research community.

    The strength of curated databases is that human curators can identify contradictions between papers, and in some cases can catch and correct mistakes by the original authors. However, the literature-driven curation model is also inefficient because it requires curators to transcribe conclusions from papers into the database. Significant time is lost trying to fill in crucial gaps in the paper, such as missing sequence accession numbers and ambiguously identified genes, and a curator is often called on to make a judgement call on the basis of incomplete information. Most curators dream about an 'annotation-ready' paper in which the relevant genes, proteins and processes are identified in some sort of structured abstract.

    As a community, we should consider ways to bring genome annotation into the mainstream. Biologists must become directly involved with annotation; database curators must be afforded the same respect as their counterparts on the bench. One step might be to make the curation of a gene, gene family, or other set of database records, something akin to writing an invited review article, and provide its author with a citation. Another step would be to incorporate principles of peer review into curated annotations. One also could imagine using scientific meetings as an occasion for the leading scientists in a field to review and comment on the current status of an annotation database.

    As the scientific literature becomes increasingly electronic, and genome annotation becomes increasingly like the literature, it is time to think about marrying the two. This will accelerate the process of annotation and help to clear the path from sequence to biology. Indeed, although genome sequencing itself is a highly specialized activity, genomic annotation is something to which the entire life science community can contribute.

    standard genome annotation mark-up languages have begun to gain general acceptance. One such language is GAME, originally developed for use in the annotation of the Drosophila sequence and now the underlying data transfer language of the OPEN SOURCE Apollo annotation editor, which is under development at the Sanger Centre. GAME is notable for its rich syntax for describing the experimental evidence that underlies an annotation. Another is the distributed annotation system (DAS), a lightweight annotation language intended primarily for indexing and visualization. FlyBase, Ensembl and WormBase all now make their genome annotations available in this format.

    A second positive trend is the development of a rich set of open-source libraries of software tools for storing, manipulating and visualizing genome annotations (BioPerl 2001, BioPython 2001, BioJava 2001 and BioCORBA 2001). If, as seems increasingly likely, new annotation projects adopt pre-existing tools rather than creating their own, the current confusion of user interface conventions, search tools and data formats might some day become a distant memory.

    Bringing annotation into the mainstream. Genome annotation is no different in many respects from other aspects of molecular biology. It, too, involves hypothesis creation, testing, refinement and publication. There are also obvious differences. Whereas the experimental results from conventional research fit easily in the size and format of a printed publication, genome annotation studies are most suited for publication in electronically accessible databases. Annotation sets are also commonly updated regularly, giving them a fluidity that is uncharacteristic of conventional publications. Annotation sets do not receive citations in MedLine.

    OPEN SOURCE

    A type of software distribution in which the source code (the human-readable instructions) is made freely available. The Linux operating system is open source. The Microsoft Windows and Macintosh operating systems are not.

    1. Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995).

    2. Fraser, C. et al. The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403 (1995).

    3. Cole, S. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998).

    4. Goffeau, A. et al. Life with 6000 genes. Science 274, 546 (1996).
    The description of how the first eukaryotic genome was sequenced and annotated.

    5. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
    The sequencing and annotation of the Caenorhabditis elegans genome.

    6. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000).
    The sequencing and annotation of the Drosophila melanogaster genome.

    7. Arabidopsis Genomics Initiative (AGI). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).

    8. Venter, J. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
    A landmark paper describing the human rough draft (private version) and its annotation.

    9. International Human Genome Sequencing Consortium (IHGSC). Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
    A landmark paper describing the human rough draft (public version) and its annotation.

    10. Schuler, G. Sequence mapping by electronic PCR. Genome Res. 7, 541–550 (1997).

    11. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    12. Ning, Z., Cox, A. & Mullikin, C. SSAHA: A fast search method for large DNA databases. Genome Res. (submitted).

    13. The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409, 953–958 (2001).

    14. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
    The first description of GENSCAN and one of the best introductions to hidden Markov model-based gene-prediction programs.

    Links

    FURTHER INFORMATION e-PCR | BLASTN | SSAHA | GENSCAN | Genie | GeneMark.hmm | Grail | HEXON | MZEF | Fgenes | HMMGene | BLASTX | RefSeq | Unigene | SWISS-PROT | Ensembl | RepeatMasker | C. elegans SNP data | COG | BLASTP | TrEMBL | PFAM | HMMER | PRINTS | TRANSFAC | PROSITE | ProDom | PSI-BLAST | SMART | BLOCKS | InterPro | Saccharomyces Genome Database | FlyBase | Mouse Genome Database | Gene Ontology | The Arabidopsis Information Resource | WormBase | Pombase | Proteome | FANTOM | Oak Ridge National Laboratory | DAS | BioPerl 2001 | BioPython 2001 | BioJava 2001 | BioCORBA 2001 | MIPS Arabidopsis thaliana database

    15. Reese, M. G., Kulp, D., Tammana, H. & Haussler, D. Genie – gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).

    16. Besemer, J. & Borodovsky, M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 27, 3911–3920 (1999).

    17. Uberacher, E. & Mural, R. Locating protein-coding regions in human DNA sequences by a multiple sensor–neural network approach. Proc. Natl Acad. Sci. USA 88, 11261–11265 (1991).

    18. Solovyev, V., Salamov, A. & Lawrence, C. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22, 5156–5163 (1994).

    19. Zhang, M. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl Acad. Sci. USA 94, 565–568 (1997).

    20. Solovyev, V., Salamov, A. & Lawrence, C. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367–375 (1995).

    21. Krogh, A. Two methods for improving performance of an HMM and their application for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 179–186 (1997).

    22. Reese, M. et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501 (2000).
    The most comprehensive comparison of nucleotide-level annotation tools so far.

    23. Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 10, 1631–1642 (2001).
    A crucial comparison of ab initio gene-prediction algorithms versus those based on similarity searches.

    24. Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).

    25. Kent, W. J. & Zahler, A. M. The intronerator: exploring introns and alternative splicing in Caenorhabditis elegans. Nucleic Acids Res. 28, 91–93 (2000).

    26. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332–336 (2001).

    27. Pruitt, K., Katz, K., Sicotte, H. & Maglott, D. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 16, 44–47 (2000).

    28. Schuler, G. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 75, 694–698 (1997).

    29. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).

    30. Le, S., Chen, J. & Maizel, J. in Structure and Methods: Human Genome Initiative and DNA Recombination Vol 1 (eds Sarma, R. H. & Sarma, M. H.) 127–136 (Adenine, New York, 1990).

    31. Pavesi, A., Conterio, F., Bolchi, A., Dieci, G. & Ottonello, S. Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions. Nucleic Acids Res. 22, 1247–1256 (1994).

    32. Lowe, T. & Eddy, S. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

    33. Eddy, S. Noncoding RNA genes. Curr. Opin. Genet. Dev. 9, 695–699 (1999).
    An easily accessible introduction to the fascinating world of non-coding RNA prediction.

    34. Pennacchio, L. & Rubin, E. Genomic strategies to identify mammalian regulatory sequences. Nature Rev. Genet. 2, 100–109 (2001).

    35. Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).

    36. Thacker, C., Marra, M. A., Jones, A., Baillie, D. L. & Rose, A. M. Functional genomics in Caenorhabditis elegans: an approach involving comparisons of sequences from related nematodes. Genome Res. 9, 348–359 (1999).

    37. Aamodt, E. et al. Conservation of sequence and function of the pag-3 genes from C. elegans and C. briggsae. Gene 8, 67–74 (2000).

    38. Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W. & Lawrence, C. E. Human–mouse genome comparisons to locate regulatory sites. Nature Genet. 26, 225–228 (2000).

    39. Qiu, Y. et al. Human and mouse abca1 comparative sequencing and transgenesis studies revealing novel regulatory sequences. Genomics 73, 66–76 (2001).

    40. Margarit, E. et al. Identification of conserved potentially regulatory sequences of the SRY gene from 10 different species of mammals. Biochem. Biophys. Res. Commun. 17, 370–377 (1998).
    A good example of how comparative genomics can be used to identify putative regulatory sequences.

    41. Ku, H. M., Vision, T., Liu, J. & Tanksley, S. D. Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl Acad. Sci. USA 97, 9121–9126 (2000).
    An elegant illustration of the power of comparative genomics.

    42. Brookes, A. The essence of SNPs. Gene 234, 177–186 (1999).
    A comprehensive introduction to the potential contribution of single nucleotide polymorphisms to the understanding of human biology.

    43. The SNP Consortium. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).

    44. Marth, G. et al. A general approach to single-nucleotide polymorphism discovery. Nature Genet. 23, 452–456 (1999).

    45. Koch, R. et al. Single nucleotide polymorphisms in wild isolates of Caenorhabditis elegans. Genome Res. 10, 1690–1696 (2000).

    46. Piatigorsky, J., Kantorow, M., Gopal-Srivastava, R. & Tomarev, S. I. Recruitment of enzymes and stress proteins as lens crystallins. EXS 71, 241–250 (1994).

    47. Wistow, G. Lens crystallins: gene recruitment and evolutionary dynamism. Trends Biochem. Sci. 18, 301–306 (1993).

    48. Henikoff, S. et al. Gene families: the taxonomy of protein paralogs and chimeras. Science 278, 609–614 (1997).

    49. Tatusov, R., Galperin, M., Natale, D. & Koonin, E. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).

    50. Altschul, S. F. & K oonin, E. V. Iterated profile searches with

    PSI-BLAST a tool for discovery in protein databases.

    Trends Biochem. Sci. 23, 444447 (1998).

    51. Bateman, A. et al. The P fam protein families database.

    Nucleic Acids Res. 28, 263266 (2000).

    52. Attwood, T. et al. PR IN TS -S: the database formerly known

    as PRINTS. Nucleic Acids Res. 28, 225227 (2000).

    53. Hoffman, K ., B ucher, P., Falquet, L. & Bairoch, A. The

    PR O SIT E database, its status in 1999. Nucleic Acids Res.

    27, 215219 (1999).

    54. Corpet, F., G ouzy, J. & K ahn, D. Recent improvements of

    the ProDom database of protein domain families.Nucleic

    Acids Res. 27, 263267 (1999).

    55. Ponting, C . P., Schultz, J., M ilpetz, F. & Bork, P. SM AR T:

    identification and annotation of domains from signalling and

    extracellular protein sequences. Nucleic Acids Res. 27,229232 (1999).

    56. Henikoff, J. G., Greene, E. A., Pietrokovski, S. & H enikoff, S.

    Increased coverage of protein families with the BL O CK S

    database servers. Nucleic Acids Res. 8, 228230 (2000).

    57. Apweiler, R. et al. The InterPro database, an integrated

    documentation resource for protein families, domains and

    functional sites. Nucleic Acids Res. 29, 3740 (2001).

    58. Cherry, J. et al. SGD: SaccharomycesG enome Database.

    Nucleic Acids Res. 26, 7379 (1998).

    59. The FlyBase Consortium. The FlyBase database of the

    DrosophilaG enome Projects and community literature.

    Nucleic Acids Res. 27, 8588 (1999).

    60. Blake, J., Eppig, J., Richardson, J., Bult, C. & K adin, J. The

    M ouse G enome Database (MG D): integration nexus for the

    laboratory mouse. Nucleic Acids Res. 29, 9194 (2001).A great example of a classic model organism

    database.

    61. Ashburner, M . et al. G ene ontology: tool for the unification

    of biology. Nature Genet. 25, 2529 (2000).

    62. Huala, E. et al. The Arabidopsis Information Resource

    (TA IR ): a comprehensive database and web-based

    information retrieval, analysis, and visualization system for amodel plant. Nucleic Acids Res. 29, 102105 (2001).

    63. Stein, L. et al. WormBase: network access to the genome

    and biology of Caenorhabd itis elegans. Nucleic Acids Res.

    29, 8286 (2001).This model organism database combines nucleotide-

    level annotation with the results of high-throughput

    analyses, such as RNA interference.

    64. K umar, A . & Snyder, M . Emerging technologies in yeast

    genomics.Nature Rev. Genet. 2, 302312 (2001).

    65. Schena, M. , Shalon, D., Davis, R. W. & Brown, P. O.

    Q uantitative monitoring of gene expression patterns with a

    complementary DN A microarray. Science270, 467470

    (1995).

    66. Fire, A. et al. Potent and specific genetic interference by

    double-stranded RNA inCaenorhabdit is elegans. Nature

    391, 806811 (1998).

    67. Griffin, T. J. et al. Q uantitative proteomic analysis using

    a M ALDI quadruple time-of-flight mass spectrometer.

    Anal. Chem. 73, 978986 (2001).

    68. Bouvet, P. Determination of nucleic acid recognition
    sequences by SELEX. Methods Mol. Biol. 148, 603–610
    (2001).

    69. Gonzalez, C. & Bejarano, L. A. Protein traps: using
    intracellular localization for cloning. Trends Cell Biol. 10,
    162–165 (2000).

    70. Cagney, G., Uetz, P. & Fields, S. High-throughput screening
    for protein–protein interactions using two-hybrid assay.
    Methods Enzymol. 328, 3–14 (2000).

    71. Pennisi, E. Ideas fly at gene-finding jamboree. Science 24,
    2182–2184 (2000).

    72. Kawai, J. et al. Functional annotation of a full-length mouse
    cDNA collection. Nature 409, 685–690 (2001).

    Acknowledgements
    I acknowledge R. Durbin, S. Eddy, E. Birney and A. Neuwald for
    helpful discussions during the preparation of this review. A portion
    of this work was supported by the National Human Genome
    Research Institute at the US National Institutes of Health.


    ONLINE ONLY

    MZEF: http://sciclio.cshl.org/genefinder/
    Fgenes: http://searchlauncher.bcm.tmc.edu:9331/gene-finder/Help/fgenes.html
    HMMGene: http://www.cbs.dtu.dk/services/HMMgene/
    BLASTX: ftp://ftp.ncbi.nlm.nih.gov/blast
    RefSeq: http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html
    UniGene: http://www.ncbi.nlm.nih.gov/UniGene/
    SWISS-PROT: http://www.ebi.ac.uk/swissprot/
    Ensembl: http://www.ensembl.org/
    RepeatMasker: http://www.genome.washington.edu/uwgc/analysistools/repeatmask.htm
    C. elegans SNP Data: http://www.genome.wustl.edu/gsc/CEpolymorph/snp.shtml
    COG: http://www.ncbi.nlm.nih.gov/COG/
    BLASTP: ftp://ftp.ncbi.nlm.nih.gov/blast
    TrEMBL: http://www.ebi.ac.uk/swissprot/
    PFAM: http://pfam.wustl.edu/
    HMMER: http://hmmer.wustl.edu/
    PRINTS: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
    TRANSFAC: http://transfac.gbf.de/TRANSFAC/
    PROSITE: http://www.expasy.ch/prosite/
    ProDom: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/jj/prodomsrchjj.html
    PSI-BLAST: ftp://ftp.ncbi.nlm.nih.gov/blast
    SMART: http://smart.embl-heidelberg.de/
    BLOCKS: http://www.blocks.fhcrc.org/
    InterPro: http://www.ebi.ac.uk/proteome/
    Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
    FlyBase: http://flybase.bio.indiana.edu/
    Mouse Genome Database: http://www.informatics.jax.org/
    Gene Ontology: http://www.geneontology.org/
    The Arabidopsis Information Resource: http://www.arabidopsis.org/
    WormBase: http://www.wormbase.org/
    PomBase: http://www.sanger.ac.uk/Projects/S_pombe/Pombase.shtml
    Proteome, Inc.

    Now that many genome sequences are available, attention
    is shifting towards developing and improving approaches
    for genome annotation.

    Genome annotation can be classified into three levels: the
    nucleotide, protein and process levels.

    Gene finding is a chief aspect of nucleotide-level annotation.
    For complex genomes, the most successful methods
    use a combination of ab initio gene prediction and
    sequence comparison with expressed sequence databases
    and other organisms. Nucleotide-level annotation also
    allows the integration of genome sequence with other
    genetic and physical maps of the genome.

    The principal aim of protein-level annotation is to assign
    function to the products of the genome. Databases of protein
    sequences and functional domains and motifs are
    powerful resources for this type of annotation.
    Nevertheless, half of the predicted proteins in a new
    genome sequence tend to have no obvious function.

    Understanding the function of genes and their products in
    the context of cellular and organismal physiology is the
    goal of process-level annotation. One of the obstacles to
    this level of annotation has been the inconsistency of terms
    used by different model systems. The Gene Ontology
    Consortium is helping to solve this problem.

    There are several approaches to genome annotation: the
    factory (reliance on automation), museum (manual curation),
    cottage industry (exemplified by Proteome, Inc.) and
    party (the Celera Drosophila annotation jamboree).

    As more scientists come to rely on genome annotation, it
    will become more important for the scientific community
    as a whole to contribute to this continuing process.

    Lincoln Stein received a dual M.D./Ph.D. degree from
    Harvard Medical School in 1989, and did a residency in
    anatomical pathology at the Brigham & Women's Hospital
    in Boston, Massachusetts. In 1993, he joined the Whitehead
    Institute Center for Genome Research where he helped
    develop genome maps of the human and mouse, eventually
    becoming the Director of Informatics there. Since 1998, he
    has been on the faculty of Cold Spring Harbor Laboratory,
    where he works on various model organism databases and
    software systems for genome representation and analysis.

    Links
    ePCR: http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
    BLASTN: ftp://ftp.ncbi.nlm.nih.gov/blast
    SSAHA: http://www.sanger.ac.uk/Software/analysis/SSAHA/
    GENSCAN: http://genes.mit.edu/GENSCAN.html
    Genie: http://www.fruitfly.org/seq_tools/genie.html
    GeneMarkHMM: http://genemark.biology.gatech.edu/GeneMark/
    Grail: http://compbio.ornl.gov/Grail-1.3/
    HEXON: http://searchlauncher.bcm.tmc.edu:9331/gene-finder/Help/hexon.html


    http://www.proteome.com/
    FANTOM: http://www.gsc.riken.go.jp/e/FANTOM/
    Oak Ridge National Laboratory: http://www.ornl.gov/
    BioXML: http://www.bioxml.org
    DAS: http://biodas.org
    BioPerl 2001: http://www.bioperl.org
    BioPython 2001: http://www.biopython.org/
    BioJava 2001: http://www.biojava.org
    BioCorba 2001: http://biocorba.org/

    Fig. 3 legend
    MIPS Arabidopsis thaliana database: http://mips.gsf.de/proj/thal/db/gv/rv/rv_frame.html

    Box 1
    BLASTN, BLASTX, BLASTP, PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
    Ensembl: http://www.ensembl.org
    ePCR: http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
    FlyBase: http://www.flybase.org/
    GeneMarkHMM: http://genemark.biology.gatech.edu/GeneMark/
    Genie: http://www.fruitfly.org/seq_tools/genie.html
    GENSCAN: http://genes.mit.edu/GENSCAN.html
    Grail: http://compbio.ornl.gov/Grail-1.3/
    HMMER: http://hmmer.wustl.edu/
    HMMGene: http://www.cbs.dtu.dk/services/HMMgene/
    InterPro: http://www.ebi.ac.uk/proteome/
    Proteome, Inc.: http://www.proteome.com/
    RefSeq and LocusLink, NCBI: http://www.ncbi.nlm.nih.gov/
    RepeatMasker: http://www.genome.washington.edu/uwgc/analysistools/repeatmask.htm
    SSAHA: http://www.sanger.ac.uk/Software/analysis/SSAHA/
    SWISSPROT and SWISSPROT-TrEMBL: http://www.ebi.ac.uk/swissprot/
    UCSC Genome Browser: http://genome.ucsc.edu/