NCBI News · 2005-05-25 · Sequin, Entrez2, and Spidey. These programs are well-known to NCBI...



Page 1

Would you like email alerts when new sequences or literature references for your favorite organism appear in Entrez? Would you like to archive your Entrez search strategies? Would you like to automatically partition your nucleotide search results by source database, species of origin, or sequence type? Would you like Entrez links displayed right on the web page rather than as a menu? My NCBI and a few simple configuration panels are all it takes to have it your way.

continued on page 4

Have it your way with My NCBI!

continued on page 7

May 2005
Volume 14, Issue 1

In this issue

1 GENSAT

1 My NCBI

2 Influenza Virus Resource

3 NCBI ToolKit Utility Programs

3 New Microbial Genomes

3 Iceman Preserved in GenBank

6 RefSeq Updates

6 RefSeq Release 11

6 New Organisms in UniGene

6 GenBank Release 147

9 New Genome Build

9 CCDS Database

9 NCBI Courses

10 PubMed Corrects Spelling

11 BLAST Lab

12 LocusLink Retired

NCBI News
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Department of Health and Human Services

GENSAT Project Data Now in Entrez

The Gene Expression Nervous Systems ATlas (GENSAT) provides the anatomical location of gene expression in the mouse brain. GENSAT is based on BAC transgenic vectors in which endogenous protein coding sequences have been replaced by sequences encoding a reporter gene (EGFP). The EGFP is visualized in tissue sections by staining with an anti-EGFP antibody, or by confocal microscopy of unstained sections. The images show the relative rates of transcription for each target gene. Images are available for mice at embryonic day 15.5 (E15.5), postnatal day 7 (P7) and adult developmental stages. In addition, GENSAT contains images taken at various developmental stages of tissue sections from non-transgenic mouse lines in which the expression of a given gene is visualized using radiolabelled riboprobes with in situ hybridization.

GENSAT can be queried by gene name, alias or symbol; region of the brain; imaging method, such as in situ hybridization or confocal microscopy; section type, such as sagittal, horizontal, or coronal; and the age and sex of the mice. As with all other Entrez databases, search terms can be field-limited and combined using Boolean logic. For example, we can use the following query to get a

Figure 1. Entrez summary of GENSAT records matching the query given in the text. Clicking on the thumbnail image for the first record in "A" generates the view in "B" showing details of the set of images for serotonin receptor 4 and links to other Entrez databases in the "Links" menu.

Page 2

Each year in the USA, more than 200,000 patients are admitted to hospitals with influenza infections, and influenza-related deaths approach 36,000. Effective vaccines can reduce the numbers of hospitalizations and deaths, and the swift identification of new flu virus variants is an essential component in the development of such vaccines. The Influenza Genome Sequencing Project, funded by The National Institute of Allergy and Infectious Diseases (NIAID), aims to rapidly sequence flu viruses from samples collected throughout the world from humans and a variety of animals.

The viral nucleotide sequences resulting from this project, available in GenBank, along with the sequences of their encoded proteins, form the supporting database for NCBI's new Influenza Virus Resource (IVR). The IVR will enable scientists to quickly compare influenza virus strains so that emergent variants can be rapidly identified. As the library of viral sequences grows, this database will act as a reference to help further our understanding of the spread of animal viruses to humans, and the spread of influenza worldwide. Access the IVR home page at:

www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

The Flu sequence database link offers an interface to the viral sequences that allows searches by virus type, subtype, host organism, genome segment and country of origin. Restrictions on year of virus isolation and sequence length may also be applied, and the results of the search can be sorted, sequentially, by up to three of the search fields.

Individual nucleotide or protein sequences can also be retrieved by GenBank accession number.
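The sequential sort over up to three search fields can be pictured with a short Python sketch. The records and field names below are made-up illustrations, not the resource's actual schema or data; the point is only that the first chosen field takes priority and later fields break ties.

```python
from operator import itemgetter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"subtype": "H3N2", "host": "Human", "country": "USA", "year": 2003},
    {"subtype": "H5N1", "host": "Avian", "country": "Viet Nam", "year": 2004},
    {"subtype": "H3N2", "host": "Human", "country": "New Zealand", "year": 2004},
    {"subtype": "H1N1", "host": "Swine", "country": "USA", "year": 2001},
]


def sort_sequentially(rows, fields):
    """Sort rows by one to three fields; earlier fields take priority."""
    if not 1 <= len(fields) <= 3:
        raise ValueError("choose between one and three sort fields")
    # itemgetter builds the sort key (a tuple when several fields are given).
    return sorted(rows, key=itemgetter(*fields))


for row in sort_sequentially(records, ["subtype", "country", "year"]):
    print(row["subtype"], row["country"], row["year"])
```

Sorting on a tuple key is equivalent to sorting by the last field first and then re-sorting stably by each earlier field.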

Similarly, the Flu genome viewer tool can display viral nucleotide or protein sequences ordered by genome segments for each virus. All segments of the same virus are grouped together in the same background color, alternating in light blue and white, providing a convenient way to check the completeness of genome segments for viruses of interest. Database searches can be performed as described above, and nucleotide or protein sequences can also be searched by adding a complete or partial virus name (e.g. Influenza A virus (A/New York/19/2003(H3N2)) or New York) in the box after “Name” and selecting “Find Nucleotides” or “Find Proteins”.

Sequences of interest can be selected within the search results from either tool by checking the boxes to the left of their GenBank accession numbers. Selected sequences can be downloaded or “Saved” to be combined with results from a subsequent search. Multiple alignments of selected nucleotide or protein sequences generated by the MUSCLE [1] alignment program can also be viewed. On the multiple sequence alignment display page, any two sequences can be selected for a pairwise comparison using BLAST 2 Sequences.

For more information on the resources above, access the Help document linked from each resource’s home page:

www.ncbi.nlm.nih.gov/genomes/FLU/SiteAbout.html

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence to NCBI News at the address below. To subscribe to NCBI News, send your name and address to either the street or E-mail address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 3S-308
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: [email protected]

Editors
Dennis Benson
David Wheeler

Contributors
Peter Cooper
Susan Dombrowski
Monica Romiti
Tao Tao
Steve Pechous

Writers
Vyvy Pham
David Wheeler

Editing and Production
Robert Yates

Graphic Design
Robert Yates

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 05-3272

ISSN 1060-8788
ISSN 1098-8408 (Online Version)

Influenza Virus Resource Debuts


continued on page 6

Page 3


Eight ASN.1 Utility Programs for Five Computer Platforms

The NCBI ToolKit provides source code and configuration scripts that make it easy to compile NCBI software to run on a variety of computing platforms. Some of the most familiar applications that can be built using the ToolKit are blastall, blastcl3, blastpgp, the BLAST web server, Sequin, Entrez2, and Spidey. These programs are well-known to NCBI users because they are provided in executable format for many common computing platforms. However, there are many useful but less well-known ASN.1 utility programs in the ToolKit that have previously been offered only as source code. NCBI now distributes a number of these command-line utilities for converting, validating, indexing, and creating NCBI ASN.1 records in executable form for 5 major computing platforms: Alpha, Linux, Macintosh, Solaris and MS Windows. They are run in Terminal or Command Prompt windows.

Each utility program accepts a number of command line arguments, specified using a dash and a single letter option code followed by an option value. Some values are boolean and are given as either ‘T’, true, or ‘F’, false. Others are specified using one-letter codes, such as format specifiers, or strings, such as file names or GenBank accession numbers. To see a complete list of command line parameters for any of the programs, run the program with a trailing dash and no parameter. A list of the eight programs with brief descriptions is given in Box 1, while a detailed description of one of the most versatile programs, “asn2all”, follows. In many situations, the multifunctional program asn2all can be run instead of asn2fsa, asn2gb or asn2xml.

The program “asn2all” is primarily intended to generate reports from

continued on page 5

New Microbial Genomes in GenBank®

Each organism is followed by its GenBank | RefSeq Accession Numbers.

Gluconobacter oxydans 621H
    chromosome: CP000009 | NC_006677
    plasmid: CP000004-8 | NC_006672-6

Staphylococcus epidermidis RP62A
    chromosome: CP000029 | NC_002976
    plasmid: CP000028 | NC_006663

Staphylococcus aureus subsp. aureus COL
    chromosome: CP000046 | NC_002951
    plasmid: CP000045 | NC_006629

Thermococcus kodakaraensis
    AP006878 | NC_006624

Dehalococcoides ethenogenes 195
    CP000027 | NC_002936

Campylobacter jejuni RM1221
    CP000025 | NC_003912

Ehrlichia ruminantium str. Welgevonden
    CR767821 | NC_005295

Bacillus clausii sp. KSM-K16
    AP006627 | NC_006582

Synechococcus elongatus PCC 6301
    AP008231 | NC_006576

Francisella tularensis subsp. tularensis Schu 4
    AJ749949 | NC_006570

Silicibacter pomeroyi DSS-3
    chromosome: CP000031 | NC_003911
    plasmid: CP000032 | NC_006569

Zymomonas mobilis subsp. mobilis ZM4
    AE008629 | NC_006526

Anaplasma marginale str. St. Maries
    CP000030 | NC_004842

Geobacillus kaustophilus HTA426
    chromosome: AP006520 | NC_006509
    plasmid: BA000043 | NC_006510

Idiomarina loihiensis L2TR
    AE017340 | NC_006512

Azoarcus sp. EbN1
    CR555306 | NC_006513

Lactobacillus acidophilus
    CP000033 | NC_006814

For more detailed information, see the online version of the May 2005 NCBI News, or use the GenBank or RefSeq Accession Number to search the Entrez “Genome” database using the query box on the NCBI Home Page.

The Iceman Preserved in GenBank

In the fall of 1991, hikers in the Alps found the frozen body of a man at the edge of a glacier which they took to be that of a recent victim of a climbing accident. In fact, the hikers had found the frozen, mummified body of a man who lived and died in the late Neolithic. Popularly known as the “iceman”, this well-preserved mummy and the artifacts found with him have provided an unprecedented and detailed view of human history, culture and biology of late stone-age Europe.

The well-preserved state of the mummy has allowed the collection of molecular sequence data that has shown the iceman’s relationship to modern Europeans and has provided insight into his diet and culture. Available sequences in GenBank include the iceman’s mitochondrial D-loop region as well as amplified sequences obtained in analyzing his

continued on page 8

Page 4

Rockefeller University. The data from this study will shed light on the genetics of disorders that affect the central nervous system and also provide insight into the brain’s response to both natural and foreign chemicals.

A list of the genes examined in this study can be obtained by searching Entrez Gene with the query:

"gene_gensat"[Filter]

The modified mouse lines are being deposited in the Mutant Mouse Regional Resource Center (MMRRC).

The image data and supplemental information from the GENSAT project can be downloaded from

ftp.ncbi.nih.gov/pub/gensat

Details on the experimental methods and results of the GENSAT project are published in Nature. 2003 Oct 30;425(6961):917-25.

The data in the GENSAT database can be accessed by searching from the GENSAT home page at:

www.ncbi.nlm.nih.gov/projects/gensat

or from within Entrez at:

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gensat

Questions regarding GENSAT should be sent to:

[email protected]

—SD


GENSAT continued from page 1

list of images for genes with localized expression patterns:

"region specific"[Expression Pattern] AND "strong"[Expression Level] AND "neuron"[Cell Type]

Here, we are using the standard Entrez syntax to search GENSAT for the phrase “region specific” within the field “Expression Pattern”, indicated within the square brackets, combined using Boolean “AND”s with two other field-limited phrases. Such queries can be built easily with the tools available in the Entrez “Preview/Index” tab visible in Figure 1A. The search returns summaries of image sets for over 5,000 genes as shown in Figure 1.
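The way such a field-limited query is put together can be sketched in a few lines of Python. This is only an illustration of the syntax described above (a quoted phrase, a field name in square brackets, terms joined with AND); the helper names are hypothetical, and the field names are written with spaces as they appear in the prose.

```python
def field_term(phrase, field):
    """Quote a phrase and limit it to one Entrez field, e.g. "strong"[Expression Level]."""
    return f'"{phrase}"[{field}]'


def and_query(terms):
    """Join (phrase, field) pairs into one query with the Boolean operator AND."""
    return " AND ".join(field_term(phrase, field) for phrase, field in terms)


query = and_query([
    ("region specific", "Expression Pattern"),
    ("strong", "Expression Level"),
    ("neuron", "Cell Type"),
])
print(query)
```

The resulting string can be pasted into the Entrez query box; the same pattern extends to any number of field-limited phrases.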

The summaries include a zoomable, thumbnail size image of the brain section, seen in Figure 1A, the total number of images that are available for each image type, and the stage of development. Double-clicking on the thumbnail image for the first image set, that for 5-hydroxytryptamine receptor 4 (serotonin receptor 4), generates a more detailed report shown in Figure 1B. Links to other Entrez databases, including Entrez Gene, Nucleotide and PubMed, are provided within the “Links” menu. To open the image browser, click on the image of Figure 1B or the cropping icon, visible under the Links menu. The image browser, Figure 2A, can be used to select, magnify and simultaneously display up to 30 subsections of the original image.

Figure 2. GENSAT image browsing tool. The rectangular area in "A" is used to zoom to the detailed view in "B" showing the expression of the dopamine 4 receptor as dark EGFP density in the elongated axons of neurons. Controls for viewing successive images in the set are visible in the lower right of the figure.

The GENSAT project is an ongoing, collaborative study between the Rockefeller University and the St. Jude Children’s Research Hospital that aims to map the expression of genes in the central nervous system of the mouse during normal development. GENSAT is supported by a grant from the National Institute of Neurological Disorders and Stroke (NINDS) to the

Page 5

the binary ASN.1 Bioseq-set GenBank release files that are available at:

ftp.ncbi.nih.gov/ncbi-asn1

Depending on the “-f” argument, the program can produce GenBank and GenPept flatfiles, FASTA sequence files, INSDSet structured XML, TinySeq XML, and 5-column feature table formats. Prior to running asn2all, the GenBank release files, which have an “.aso.gz” suffix, should be uncompressed using a program such as “gunzip”, resulting in files with suffix “.aso”. For example, gbpri1.aso is the first file in the primate division, and the command:

gunzip gbpri1.aso.gz

will produce “gbpri1.aso”.

Using asn2all, the name of the file to process is specified with the “-i” command line argument. Use “-a t” to indicate batch processing of a GenBank release file and “-b T” to indicate that it is binary ASN.1. A text ASN.1 record, such as one obtained on the web from Entrez, can be processed by using “-a a -b F” instead of “-a t -b T”.

Nucleotide and protein records within ASN.1 records can be processed simultaneously. Use the “-o” argument to indicate the nucleotide output file and the “-v” argument for the protein output file.

The “-f” argument determines the format to be generated. Legal values of “-f” and the resulting formats are:

g  GenBank (nucleotide) or GenPept (protein)
f  FASTA
t  5-column feature table
y  TinySeq XML
s  INSDSet XML
a  ASN.1 of entire record
x  XML version of entire record

The command:

asn2all -i gbpri1.aso -a t -b T -f g -o gbpri1.nuc -v gbpri1.prt

will generate both GenBank reports for nucleotide sequences and GenPept reports for protein sequences from gbpri1.aso in the files “gbpri1.nuc” and “gbpri1.prt”, respectively.

A remote fetching option, “-r T”, allows the download of an ASN.1 record from NCBI over a network connection using an accession number or NCBI gi number as an identifier. For instance, to download the feature table within the Reference Sequence record, or RefSeq, for the Escherichia coli genome via remote fetch, use:

asn2all -r T -A NC_000913 -f t

The output of this command for the first NC_000913 feature is given below. The 5-column feature table format used is identical to that required as input to generate an ASN.1 sequence file using tbl2asn, described in Box 1.

>Feature ref|NC_000913.2|
190	255	gene
			gene	thrL
			gene_syn	EG11277
			locus_tag	b0001
			db_xref	GeneID:944742

The eight ASN.1 utility programs may be downloaded at:

ftp.ncbi.nih.gov/asn1-converters


Box 2. Converting the Entrez Gene FTP files with gene2xml

The gene2xml program is the most recent to join the group of NCBI conversion tools. It reads the binary ASN.1 Entrezgene-Set files offered on the Entrez Gene ftp site and converts them into an easily parsable XML format. The program can accept the name of a single file as input or the path to a group of files to be converted. An option to filter the output by NCBI Taxon Id allows organism-specific XML files to be created from a single multi-species ASN.1 file. The Entrez Gene FTP ASN.1 files are found at:

ftp.ncbi.nih.gov/gene/DATA/

Box 1. Eight NCBI ToolKit utility programs now available in executable form for five computer platforms

asn2all: converts GenBank release files in ASN.1 format to a variety of other formats

asn2fsa: converts binary or text ASN.1 sequence files to FASTA format

asn2gb: converts binary or text ASN.1 sequence files to GenBank or GenPept flatfile formats

asn2idx: Generates accession/file offset indices for Bioseq-set release files

asn2xml: converts binary or text ASN.1 sequence files to XML format

asnval: validates ASN.1 release files

tbl2asn: automates the creation of sequence records for submission to GenBank

gene2xml: converts text or binary ASN.1 files of Entrez Gene records into XML
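For scripted use, the asn2all options described in this article can be assembled programmatically. The Python sketch below only builds the argument list; it does not run the program. The function name and keyword parameters are hypothetical, and only the flags quoted in the article (-i, -a, -b, -f, -o, -v, -r, -A) are used.

```python
# Format codes for "-f" as listed in the article.
FORMATS = {"g", "f", "t", "y", "s", "a", "x"}


def asn2all_args(input_file=None, accession=None, fmt="g",
                 batch_release=True, binary=True,
                 nuc_out=None, prot_out=None):
    """Return an asn2all argument list for a local file or a remote fetch."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown -f value: {fmt!r}")
    args = ["asn2all"]
    if accession is not None:
        # Remote fetch by accession number or gi number.
        args += ["-r", "T", "-A", accession]
    elif input_file is not None:
        args += ["-i", input_file,
                 "-a", "t" if batch_release else "a",   # release file vs. single record
                 "-b", "T" if binary else "F"]          # binary vs. text ASN.1
    else:
        raise ValueError("give either input_file or accession")
    args += ["-f", fmt]
    if nuc_out:
        args += ["-o", nuc_out]    # nucleotide output file
    if prot_out:
        args += ["-v", prot_out]   # protein output file
    return args


print(" ".join(asn2all_args("gbpri1.aso", nuc_out="gbpri1.nuc", prot_out="gbpri1.prt")))
print(" ".join(asn2all_args(accession="NC_000913", fmt="t")))
```

If asn2all is installed and on the search path, a list built this way could be passed to `subprocess.run`; that step is left out here because it depends on the local installation.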

NCBI ToolKit Utility Programs continued from page 3

Page 6

RefSeq Release 11

RefSeq Release 11 is now available by anonymous FTP at:

ftp.ncbi.nih.gov/refseq/release

Release 11 includes genomic, transcript, and protein sequences available as of May 8, 2005, from 2,928 organisms. The number of RefSeq accessions in Release 11 and their combined lengths are given in the table below.

            # of Accessions    # of Basepairs/Residues
Genomic             651,418             39,048,660,749
RNA                 400,504                683,041,613
Protein           1,425,971                507,980,644

RefSeq releases are posted bimonthly, and the next release is scheduled for July. Release notes documenting the scope and content of the release are provided at:

ftp.ncbi.nih.gov/refseq/release/release-notes

For more information, visit the NCBI RefSeq Web Site at:

www.ncbi.nih.gov/RefSeq

RefSeq Cumulative Updates Discontinued in March 2005

Now that a reliable RefSeq release cycle has been established, the RefSeq cumulative update was discontinued on March 1, 2005. The affected files are located in the directory:

ftp.ncbi.nih.gov/refseq/cumulative

The best method of staying current with the RefSeq database is to download the RefSeq Incremental Update (RIU) products that are available daily:

ftp.ncbi.nih.gov/refseq/daily

Each RIU contains all new or updated RefSeq records processed in the preceding 24 hours or, in the event of a failure, since the most recently generated RIU. Users who wish to maintain a local copy of the RefSeq database typically start by processing a complete RefSeq release, and then downloading and processing the RIUs on a daily basis. RIU processing starts at 2:00am Eastern Time and is usually complete by 3:30am. The recommended time to check and transfer new RIU updates from the RefSeq FTP site is 4:30am. Note that completion times for the RIU can sometimes be several hours later, particularly if a substantial number of complete-chromosome RefSeq records for multiple eukaryotic genomes have been updated on a single day.

Influenza Virus continued from page 2

Additional resources are under development, including a multiple protein alignment tool that will allow users to statistically analyze and compare sets of protein sequences of the influenza virus. Protein datasets can be visually represented and analyzed using hierarchical clustering with user-selected similarity criteria and clustering algorithms.

Links to Reference Sequences (RefSeqs) of other influenza genomes, protein structures, other protein and nucleotide sequences, and the most recent influenza virus literature in PubMed, are also available from the IVR home page.

[1] Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 32(5), 1792-97.

New Organism in UniGene

UniGene now covers 50 animals and plants and can be searched using the Entrez search system, where it is linked to nucleotide records. A recent addition to UniGene is the sweet orange, Citrus sinensis, with 49,999 transcript sequences in 5,830 clusters.

GenBank® Release 147

GenBank Release 147 (April 2005) contains over 44 million sequence entries totaling more than 48 billion base pairs. Release 148 is expected in June. GenBank is accessible via the Entrez search and retrieval system. The flatfile and ASN.1 versions of the Release are found in the “genbank” and “ncbi-asn1” directories, respectively, at:

ftp.ncbi.nih.gov

Uncompressed, the Release 147 flatfiles consume about 185 gigabytes while the ASN.1 version consumes about 145 gigabytes. The data can also be downloaded at the mirror site:

bio-mirror.net/biomirror/genbank

Page 7

My NCBI continued from page 1

Getting Started

To try My NCBI, click on the “My NCBI” link under “Hot Spots” on the NCBI home page. Cubby users can log into their My NCBI accounts immediately, using their Cubby user names and passwords. New users can click on the “register for an account” link to open an account. In order for My NCBI to remember your preferences and apply them whenever you log on, your web browser must be configured to accept cookies.

Archiving Search Strategies

Searches are archived by performing an Entrez search and clicking on the “Save search” link that appears to the right of the query box when the results are returned. To see a table of your saved searches, click on the “Saved searches” link in the My NCBI sidebar. Searches that are no longer needed can be selected using checkboxes in the table and deleted. Saved search strategies remain within My NCBI indefinitely, until changed or deleted by you.

Requesting Email Alerts

Search results are emailed on a regular schedule to users who specify ‘Schedule’ options in the “Saved searches” table. A flexible scheduler accommodates email intervals of days to months and allows the specification of particular days of the week so that results always arrive when you expect them. Report styles include full text, summary, and several other Entrez formats and may be emailed in HTML or plain text. A Document Delivery service to which PubMed will send your orders when you use the “Send to Order” option may be specified by clicking on the “Document delivery” link in the My NCBI sidebar and choosing a service from the list. By default, PubMed sends document delivery orders to Loansome Doc, NLM’s document delivery service.

Configuring Filter Tabs

Each of the Entrez databases in My NCBI offers a rich selection of filters from which you can select up to five that automatically partition your search results. To configure filters, click on the “Filters” link on the My NCBI page and select the desired Entrez database. A default set of filters is used for each database in new My NCBI accounts and these can be seen under the “My Selections” tab. To delete a filter, simply uncheck it. The full selection of filters can be examined using the “Browse” tab where they are organized as three hierarchical groups, “Linkouts,” “Links,” and “Properties”. The “Linkouts” group offers tabs for records with links to resources provided by outside organizations. The “Links” group provides filters for records that link to other Entrez databases. Use the “Properties” group to create tabs for records that fall into various categories, such as PubMed publication type, Nucleotide molecule type, or database source.

Figure 1 shows filter tabs in use during a search of the Entrez Protein database for “superoxide dismutase.” Of the 4410 records returned, accessible via the “All” tab, 626 are RefSeqs, 3075 have links to the Entrez Nucleotide database, 229 have links to the Map Viewer, 456 have links to the Protein Data Bank, and 113 have linkouts to an external organism-specific database, Saccharomyces Genome Database. The latter set is shown in the active tab; the inset shows the result of clicking on the “linkout” link for the first record. To add a filter to your original Entrez query, click on the push-pin shown on the active tab.

Configuring Entrez Link Formats

To configure the Entrez link display format, click on “User Preferences”


Figure 1. Document summaries returned by a query of "superoxide dismutase" in the Entrez Protein database. The records returned are partitioned into 5 tabbed sets: those with linkouts to organism-specific databases, those derived from the Protein Data Bank, those linked to the Map Viewer, those linked to records in Entrez Nucleotide, and those that are RefSeqs. The linkout target of the first record is Saccharomyces Genome Database. Note that the Entrez links are displayed as "Plain links" rather than as components of a pulldown "links" menu. This configuration option is available under "User preferences" on the My NCBI page.

continued on page 10

Page 8


artifacts, skin surface and intestinal content. Molecular data from the iceman are available in GenBank and integrated into the NCBI’s Entrez system.

The simplest way to find the sequences associated with the iceman is by linking to them from a PubMed search. Searching with ‘neolithic mummy’ finds a dozen articles at the time of this writing. Four of these articles link to sequence records in the nucleotide database. A more precise set of results can be obtained by using the filter that shows only articles with links to nucleotide sequences. The following query finds four articles reporting iceman sequences:

neolithic mummy AND "pubmed nucleotide"[Filter]
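The same query can also be sent from a script. The sketch below assumes NCBI's E-utilities ESearch service, which is not mentioned in this article, and only builds the request URL; it makes no network call. The helper name is hypothetical, and the filter name is written as "pubmed nucleotide", the spaced form of the filter in the query above.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint, which is not part of this article.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an ESearch URL, percent-encoding quotes, brackets and spaces in the term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


url = esearch_url("pubmed", 'neolithic mummy AND "pubmed nucleotide"[Filter]')
print(url)
```

Encoding matters here because the double quotes and square brackets of Entrez syntax are not safe to place in a URL as-is.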

In PubMed reference (4), Handt et al. report that the iceman’s mitochondrial D-loop (hypervariable control region) is consistent with the genetic variation of modern Europeans and is most similar to central and northern European populations. The nucleotide link from the PubMed summary leads to the iceman's D-loop sequence in GenBank record S69989. The Web BLAST service can be used to compare the iceman sequence to selected D-loop sequences from GenBank as shown in Figure 2, in which the iceman sequence is aligned to sequences from a modern European, AY041019, and from a Neanderthal human, AF254446.

The remaining three articles report analyses and identification of non-human DNA associated with the iceman’s body and clothing. A diverse taxonomic assemblage of sequences is linked to the PubMed reference (1) article, in which Rollo et al. reported the analysis of the iceman’s intestinal contents. These sequences provide a partial menu for the iceman’s last two meals and show the types of pollen present in his environment.

Following the nucleotide link from the PubMed summary of this article displays the 15 sequence records reported. These are all conserved sequences of mitochondrial and chloroplast genes amplified by PCR. The biological origin of these sequences was inferred by sequence similarity. In most cases, the authors were able to assign organism sources only to higher taxonomic categories. The red deer (Cervus elaphus) and ibex (Capra ibex) sequences were confirmed by amplifying and sequencing them both from the iceman and from modern samples of the animals themselves. An interesting view of the taxonomic range of the organisms associated with the iceman with sequence data can be obtained by displaying all taxonomy links and creating a common tree within the Entrez taxonomy page.

Figure 1. Display of four PubMed abstracts resulting from a query using the phrase ‘neolithic mummy AND "pubmed nucleotide"[Filter]’. Each record contains links to corresponding data in other Entrez datasets, as shown from the Links menu.

ICEMAN continued from page 3

Figure 2. BLAST alignment between the iceman's mitochondrial D-loop sequence, the Query sequence, and those of a modern human, AY041019, and a Neanderthal human, AF254446, respectively. The output shown was created by using web BLAST with an Entrez query to limit the BLAST database to the two target sequences. The results are presented in flat query-anchored format. The iceman sequence is identical to the modern and distinctly different from the Neanderthal sequence.

The authors show that some of the sequence signatures found, such as those for the deer and ibex meat and cereal grain (Triticeae), came from food items, while others, such as the pine (Pinus) and fern signatures, were likely from unintentionally ingested pollen and spores.

The iceman has been touted by some as one of the most important discoveries of the century. Even though the iceman died over 5,000 years ago, his molecular legacy lives on at the NCBI.

—PC

Figure 3. A dendrogram showing the taxonomic classification of the DNA source organisms associated with the iceman.


Dog Reference Genome, Build 1 Version 1

The Map Viewer now shows dog genome Build 1 Version 1, a 7.6-fold whole-genome shotgun (WGS) assembly of a female boxer produced by the Broad Institute.1 The dog genome consists of 38 pairs of autosomes and a pair of sex chromosomes, designated X and Y. The mitochondrial genome presented in

New Genome Build and Map Viewer Display


NCBI Technical Workshop: Programming with NCBI BLAST®, May 25-27, 2005 at the National Library of Medicine, Bethesda, MD

Participants will learn how to create local BLAST databases and keep them current, script BLAST searches using the URL-API, and set up a BLAST Web Server. A working knowledge of NCBI resources and the Perl scripting language is required.

For more information and to apply for the course, see the course Web page at:

www.ncbi.nlm.nih.gov/Class/PowerTools/

NCBI FGPlus: Enhanced Field Guide, June 6-7, 2005 at the National Library of Medicine, Bethesda, MD

This expanded course provides detailed coverage of NCBI molecular databases and tools, especially the Entrez system and NCBI BLAST

NCBI Courses

Towards a Uniform Human Genome Annotation: the Consensus CDS Database

Annotations of genes on the human genome are displayed within several public resources. These annotations are made using different methods, resulting in gene coordinates and sequences that are similar but not always identical. The human genome sequence is now sufficiently stable to begin to compile a standard set of gene annotations on the human genome by identifying those gene placements that are identical. The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and are of high quality.

The CCDS set is built by consensus among the collaborating members, including the European Bioinformatics Institute (EBI), National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), and the University of California, Santa Cruz (UCSC).

Annotated genes that are included in the CCDS set are given a unique identifier and version number (e.g., CCDS1.1, CCDS234.1) akin to the GenBank "accession.version" system. If the CDS structure changes or if the underlying genome sequence changes, then the version number will be incremented. With annotation and sequence based genome browser update cycles, the CCDS set will be mapped forward, maintaining identifiers. All changes to existing CCDS genes are made by collaboration agreement.
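The identifier scheme described above can be illustrated with a short Python sketch. It assumes only the "CCDS<number>.<version>" shape of the two examples given (CCDS1.1, CCDS234.1); the class and function names are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class CcdsId(NamedTuple):
    """A CCDS identifier split into its stable number and its version."""
    number: int
    version: int


# Assumed shape, taken from the examples CCDS1.1 and CCDS234.1.
_CCDS_PATTERN = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(text: str) -> CcdsId:
    """Split an identifier such as 'CCDS234.1' into (234, 1)."""
    match = _CCDS_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a CCDS identifier: {text!r}")
    return CcdsId(int(match.group(1)), int(match.group(2)))


def next_version(ccds_id: CcdsId) -> CcdsId:
    """Return the identifier after a CDS or genome sequence change."""
    return CcdsId(ccds_id.number, ccds_id.version + 1)
```

Keeping the number fixed and incrementing only the version mirrors the "accession.version" behavior the article describes.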

The CCDS set is calculated on the basis of coordinated whole genome annotation updates carried out by the NCBI and Ensembl. To be included in the CCDS set, coding regions must be annotated as full-length, with an initiating ATG and valid stop codon; must be translated from the genome without frameshifts; and must use consensus splice-sites.
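Some of the inclusion criteria above can be checked on a spliced coding sequence alone. The Python sketch below is only an illustration, not the CCDS pipeline: it assumes the standard genetic code stop codons, treats a length that is not a multiple of three as a stand-in for a frameshift, and does not examine splice sites, which would require the genomic coordinates. The function name is hypothetical.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> list:
    """Return a list of reasons a spliced CDS fails the basic checks."""
    seq = cds.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("no initiating ATG")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of three")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

An empty list means the sequence passes these basic checks; it does not mean the region would qualify for the CCDS set.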

continued on page 10

Build 1.1, with RefSeq accession number NC_002008, is not derived from the boxer used for the WGS assembly but was obtained from a dog of the Sapsaree breed.

Graphical displays of features on the dog genome assembly, as well as radiation hybrid and physical maps, are shown. Displayable map features include NCBI contigs on the 'Contig' track, the WGS contigs on the 'Component' track, and genes, STSs, ESTs, Gnomon predicted gene models, and a radiation hybrid map.2 Each of the markers on the radiation hybrid map has been integrated into NCBI’s UniSTS resource, which provides regular updates of positioning of these markers on sequences available in GenBank.

1 Broad Institute: www.broad.mit.edu/media/2003/pr_03_tasha.html

2 Guyon et al. A 1-Mb resolution radiation hybrid map of the canine genome. Proc Nat Acad Sci USA. 2003 Apr 29;100(9):5296-301. PMID: 12700351

(Web, standalone and client versions). Special emphases of the course are genomic information and molecular structures. The hands-on practical portion is more extensive and includes the “Exploring 3D Molecular Structures” and “Identification of Disease Genes” NCBI course materials.

To register, and for more information:

www.ncbi.nlm.nih.gov/Class/FieldGuide/FGPlus


Annotations are made via a mixture of manual curation and automated computational processing. Genome annotations resulting from the NCBI and Ensembl pipelines are first compared to identify annotated coding regions that have identical locations on the genome. Then, lower quality CDSs from this core set are removed pending additional review among the collaboration groups. Quality tests include analysis to identify putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology.
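The first step described above, finding coding regions with identical genomic locations in two annotation sets, can be sketched as a set intersection. This is a simplified illustration only: the record layout, the coordinates, and the function name are all hypothetical, and the real comparison is certainly more involved.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Cds(NamedTuple):
    """A coding region located by chromosome, strand, and exon intervals."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]


def identical_cds(ncbi: Iterable[Cds], ensembl: Iterable[Cds]) -> Set[Cds]:
    """Return the coding regions placed identically in both sets."""
    return set(ncbi) & set(ensembl)


# Made-up coordinates: the second pair differs by five bases at one end.
ncbi_set = [
    Cds("1", "+", ((100, 200), (300, 450))),
    Cds("2", "-", ((50, 120),)),
]
ensembl_set = [
    Cds("1", "+", ((100, 200), (300, 450))),
    Cds("2", "-", ((50, 125),)),
]
core = identical_cds(ncbi_set, ensembl_set)
```

Here only the chromosome 1 region survives, because the chromosome 2 placements are similar but not identical. The quality tests mentioned in the article would then be applied to this core set.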

As of March 2005, the initial CCDS dataset contains 14,795 coding sequences and 13,142 genes, representing more than half of the human genes, according to the current gene number.

Visit the CCDS Project Web site at:

www.ncbi.nlm.nih.gov/CCDS/index.html

Entrez’s PubMed database now includes a spell-check feature that offers spelling corrections for words within PubMed queries that appear to be misspelled. For instance, a misspelled query such as “celular” generates a little over 4,600 hits. However, PubMed now returns a link to the results of a correctly spelled query:

Did you mean: cellular (341,678 items)

Access to the PubMed spell-checker via scripts is provided by a new Entrez Utility called “espell”. Espell takes four parameters: “db”, “term”, “email” and “tool”. For example, an Eutility call such as:

eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=cardiiac+thromboosis+ischemia

returns the following XML formatted result:

<?xml version="1.0"?>
<!DOCTYPE eSpellResult PUBLIC "-//NLM//DTD eSpellResult, 23 November 2004//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSpell.dtd">
<eSpellResult>
<Database>pubmed</Database>
<Query>cardiiac thromboosis ischemia</Query>
<CorrectedQuery>cardiac thrombosis ischemia</CorrectedQuery>
<SpelledQuery><Replaced>cardiac</Replaced><Original> </Original><Replaced>thrombosis</Replaced><Original> ischemia</Original></SpelledQuery>
<ERROR/>
</eSpellResult>

In the call above, the “db” parameter specifies the “PubMed” database, the only database supported at present, while the “term” parameter gives the search phrase. As with all Eutility calls, spaces within search phrases are represented with “+” signs. The parameters “email” and “tool” are optional; however, you may use them to provide an email address that NCBI can use to contact you and to identify your script, respectively. The use of the “email” and “tool” parameters is helpful in cases of a script malfunction.

The result includes the original query, the complete corrected query, and a breakdown of the terms in the complete corrected query flagged as either “replaced” or “original”.

Terms that are field-restricted, such as “heart[title]”, are not checked, since each field implies the use of a separate vocabulary that may include standard abbreviations that would be flagged as misspelled words in the context of a more general usage.

To read more about the Entrez Utilities, or to subscribe to the “utilities-announce” mailing list, see:
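For readers scripting in Python, the espell call and its result might be handled along the following lines. This is a minimal sketch using only the standard library. It assumes the host and path shown in the example call, with an "https://" scheme added as an assumption; the function names are hypothetical. No network request is made here, so fetching the URL is left to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Host and path as printed in the article; the scheme is an assumption.
ESPELL_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi"


def build_espell_url(term, db="pubmed", email=None, tool=None):
    """Build an espell URL; urlencode turns spaces into '+' signs."""
    params = {"db": db, "term": term}
    if email:
        params["email"] = email  # optional contact address
    if tool:
        params["tool"] = tool    # optional name identifying the script
    return ESPELL_URL + "?" + urlencode(params)


def parse_espell_result(xml_text):
    """Extract the queries and the replaced terms from an eSpellResult."""
    root = ET.fromstring(xml_text)
    spelled = root.find("SpelledQuery")
    replaced = []
    if spelled is not None:
        replaced = [el.text for el in spelled.findall("Replaced")]
    return {
        "query": root.findtext("Query"),
        "corrected": root.findtext("CorrectedQuery"),
        "replaced": replaced,
    }
```

Applied to the XML result shown in this article, parse_espell_result reports "cardiac thrombosis ischemia" as the corrected query and "cardiac" and "thrombosis" as the replaced terms.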

eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

PubMed® Corrects Spellling


CCDS Database, continued from page 9

in the blue side bar and choose the format you prefer from the “Links Display” menu. Choices are “javascript menu,” the default, “plain links,” “standard pull-down,” and “pop-up menu.” Figure 1 shows an Entrez display in which links have been configured as “Plain Links” using My NCBI, and are therefore visible for each record without the need to first click on a “Links” link.

—MR

My NCBI, continued from page 7


Upon completion of the search, the identified matches and their coordinates are saved in “pat_out”:

seedtop -d proteases -k active_site -p patternp -o pat_out

As shown by a partial result below, the target sequence with a match is marked by the “seqno=” line. This is followed by the reiterated pattern and matching positions marked by the first and last numbers in the “HI” field. The actual coordinates should be incremented by 1 because the reported positions are zero-based. For the match below, the actual coordinates are 198 and 287.

seqno=254 gi|5031829|ref|NP_005542.1|
ID Serine Protease Motif, cd00190
PA C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]
HI (197 268) (276 276) (278 278) (280 286)

seqno=257 gi|4504875|ref|NP_002248.1|
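The coordinate adjustment described above can be written as a small Python helper. This sketch assumes the “HI” line layout seen in the partial result, with zero-based positions in parentheses, and the function name is hypothetical.

```python
import re


def match_coordinates(hi_line):
    """Return 1-based (start, end) of a match from a seedtop 'HI' line.

    The first and last numbers on the line bound the match. Both are
    zero-based, so each is incremented by 1.
    """
    numbers = [int(n) for n in re.findall(r"\d+", hi_line)]
    if len(numbers) < 2:
        raise ValueError(f"no positions found in: {hi_line!r}")
    return numbers[0] + 1, numbers[-1] + 1
```

For the “HI” line of seqno=254, this gives 198 and 287, the coordinates stated in the text.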

Since the database has been formatted using the -oT option, we can retrieve the sequence of the matched region using fastacmd from the standalone BLAST package:

fastacmd -d proteases -s 45580723 -L 282,301

which returns the subsequence containing the pattern:

>gi|45580723:282-301 haptoglobin-related protein ...
CVGMSKYQEDTCYGDAGSAF

In addition to protein pattern matches, seedtop can search for patterns in a nucleotide sequence or database. A pattern which is specified using IUPAC codes needs to be converted to ProSite syntax. To search a single nucleotide sequence for given patterns, use:

seedtop -i input_seq -p patmatch -k pat_file

To search a nucleotide database for given patterns, use:

seedtop -d database -p pattern -k pat_file
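The IUPAC-to-ProSite conversion mentioned above could be scripted roughly as follows. This is a sketch under stated assumptions: each ambiguity code is expanded to a bracketed set of bases (for example, N becomes [ACGT]), which is one reasonable reading of ProSite syntax, and it has not been checked against seedtop itself. The function names and the example motif are hypothetical; the “ID” and “PA” line layout follows the pattern file description in this article.

```python
# Standard IUPAC nucleotide ambiguity codes, expanded to ProSite elements.
IUPAC_TO_PROSITE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_prosite(motif):
    """Convert an IUPAC motif such as 'TATAWAW' to dash-separated form."""
    elements = []
    for base in motif.upper():
        if base not in IUPAC_TO_PROSITE:
            raise ValueError(f"not an IUPAC nucleotide code: {base!r}")
        elements.append(IUPAC_TO_PROSITE[base])
    return "-".join(elements)


def pattern_file_text(identifier, prosite_pattern):
    """Lay out the two-line pattern file: an ID line, then a PA line."""
    return f"ID {identifier}\nPA {prosite_pattern}\n"
```

For example, iupac_to_prosite("TATAWAW") yields T-A-T-A-[AT]-A-[AT], which could then be written to the file passed with “-k”.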

For questions regarding the BLAST services, please contact the BLAST help desk at:

[email protected]

—TT

Using seedtop to locate a pattern in protein and nucleotide sequences

Seedtop, like blastall and formatdb, is a command-line program with parameters specified with a leading dash, followed by a one-letter parameter code. To find a pattern in a protein sequence, we may use:

seedtop -i input -k pat -p patmatchp -o pat_out

The file “pat” contains the pattern for a serine proteasemotif:

ID Serine Protease Motif, cd00190
PA C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]

The file “input” contains the sequence for human kallikrein 1 pre-proprotein, NP_002248.1.

The parameter “-p”, for “program mode”, specifies here that the search will be of a protein sequence; for nucleotide sequences, use the program names without the trailing “p” (“-p patmatch” for a single sequence, “-p pattern” for a database).

The result of the search will be placed in file “pat_out”:

Name Serine Protease Active Site, cd00190
Pattern C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]
At position 219 of query sequence

The format of the pattern file is rigid. There must be an ID line starting with the letters “ID”, followed by a space and some identifying free-form text. The second line must begin with the letters “PA”, followed by a space and the pattern, in ProSite format.

Finding pattern matches in a set of protein sequences requires an extra step, in which a file with multiple FASTA sequences is formatted with formatdb, another program from the standalone BLAST package. If the sequences to be searched are in the file “proteases”, we begin by formatting the file using:
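To get a feel for what a ProSite-style pattern matches, it can be translated into a regular expression. The Python sketch below is not how seedtop works internally; it is an independent illustration that handles only single residues, bracketed sets, braced exclusions, “X” wildcards, and “(n)” or “(n,m)” repeat counts. The function name is hypothetical.

```python
import re

# One pattern element: a residue, [set], or {excluded set}, optionally
# followed by a repeat count such as (3) or (3,9).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated ProSite pattern into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").upper().split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "X":
            parts.append(".")                          # any residue
        elif residue.startswith("{"):
            parts.append("[^" + residue[1:-1] + "]")   # anything but these
        else:
            parts.append(residue)                      # literal or [set]
        if low is not None:
            parts.append("{" + low + ("," + high if high else "") + "}")
    return "".join(parts)


SERINE_PROTEASE = ("C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-"
                   "[SAPG]-G-[GS]-[PAG]-[LFMV]")
hit = re.search(prosite_to_regex(SERINE_PROTEASE), "CVGMSKYQEDTCYGDAGSAF")
```

The test sequence is the haptoglobin-related protein subsequence quoted in this article; the serine protease motif matches all 20 residues of it.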

formatdb -i proteases -pT -oT

To perform a batch pattern match, we will call the database using “-d database” in lieu of the “-i filename” option. The following command line searches the database “proteases” and identifies a pattern specified in the file “active_site”,


Using seedtop to find patterns in protein and nucleotide sequences

Seedtop is one of the programs included within the NCBI standalone BLAST package and is used to find matches to a pattern in a protein or nucleotide sequence database.

BLAST® Lab

Records in Entrez Gene that have expression data viewable from GENSAT can be retrieved from Gene by the query:

gene_gensat[filter]
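The same filter query could also be sent from a script. The sketch below builds a URL for ESearch, a sibling Entrez Utility that is not discussed in this article; the endpoint, the "https://" scheme, and the "retmax" value are assumptions for illustration, and the function name is hypothetical. No request is sent.

```python
from urllib.parse import urlencode

# Assumed ESearch endpoint; only espell is described in this newsletter.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="gene", retmax=20):
    """Build an ESearch URL; brackets in the term are percent-encoded."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


gensat_url = build_esearch_url("gene_gensat[filter]")
```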

Entrez Gene maintains several files, described in the shaded box, to help support the transition from use of LocusLink.

If you are interested in being notified when changes are made to Entrez Gene, subscribe to gene-announce:

www.ncbi.nlm.nih.gov/mailman/listinfo/gene-announce

Department of Health and Human Services
Public Health Service, National Institutes of Health
National Library of Medicine
National Center for Biotechnology Information
Bldg. 38A, Room 3S308
8600 Rockville Pike
Bethesda, Maryland 20894


Volume 14, Issue 1
NATIONAL INSTITUTES OF HEALTH ■ National Library of Medicine

NCBI News

LocusLink Retired

1. The Gene README file ftp.ncbi.nih.gov/gene/README

2. Gene Help www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#gene_ftp

3. The LocusLink->Gene transition file: www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html

4. gene2xml, described in this README: ftp.ncbi.nih.gov/asn1-converters/documentation/gene2xml.txt, which facilitates converting the ASN.1 extraction of Entrez Gene to XML.

New Files Added to Entrez Gene

NCBI has stopped providing the LocusLink interface as of March 1, 2005. Only one file on the LocusLink FTP site is still being updated (LL_tmpl.gz), and the update of that file will end on June 1, 2005.

During the transition period, subsets of data were removed from LL_tmpl to be reported only from Entrez Gene. These include GeneRIF data and genes from Drosophila melanogaster and Caenorhabditis elegans. Also, new gene-specific data are being processed that are reported only in Entrez Gene. This includes (1) links to pathway information from KEGG for genomes other than human, (2) links to Reactome, and (3) access to GENSAT (Gene Expression Nervous System ATlas) (mouse only):

www.ncbi.nlm.nih.gov/projects/gensat