NCBI News


Page 1

NCBI News
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Department of Health and Human Services

Volume 15, Issue 2
Fall/Winter 2006/07

In this issue

1 PubMed Abstract Plus

1 CD Tree and Cn3D Release

2 Whole Genome Shotgun Growth

3 New BLAST View Options

7 New Genome Builds: Map Viewer

8 New Organisms in UniGene

10 RefSeq Release 22

10 GenBank Release 158

10 NCBI Courses

11 Submissions Corner

12 PubChem Grows to 15 Million

New PubMed AbstractPlus Display: Instant Access to Related Links

PubMed's new AbstractPlus display shows the titles of the top five Related Articles and is now the default display for single records. The new format provides seamless access to the powerful pre-computed similarities available as Related Articles in the PubMed database.

These pre-computed related items, sometimes called Neighbors, are records in the same Entrez database that are similar based on automated computational comparisons. Traditionally the Neighbors have been available through an item on the "Links" menu on displayed records. The display in Entrez ranks the Neighbors in descending order by the similarity metric as produced by the relevant comparison algorithm. In the molecular sequence and structure databases the comparison or Neighboring algorithms are familiar to most biologists: NCBI BLAST for sequences and VAST for Structure records. In PubMed, an algorithm compares meaningful words and phrases from the title, abstract, and Medical Subject Headings (MeSH) to calculate a set of PubMed citations that share these characteristics. In PubMed, this provides a rapid and powerful means of expanding a search and can find relevant articles that may be missed by the initial search. The utility of the new AbstractPlus display is easy to see when viewing the abstract of a 2006 article on improvements to the NCBI BLAST services by Jian Ye, Scott McGinnis, and Tom Madden (PMID: 16845079). This article along with another closely related article is

continued on page 6

Figure 1. AbstractPlus display of a recent article on improvements to the NCBI BLAST services found using the query of: mcginnis[auth] AND BLAST. The top five articles are highly relevant to the topic but only one is found by the original query.

CDTree and Cn3D 4.2: New Tools for Analyzing Protein Domains

NCBI has released two software packages, CDTree 3.0, a completely new program for manipulating and analyzing Conserved Domain (CD) records, and Cn3D 4.2, an updated version of the NCBI 3D structure viewer that is now closely integrated with CDTree. CDTree can be used either as a helper application for a web browser or as a stand-alone application for creating and modifying CDs and CD hierarchies. Together these two programs form a powerful suite for defining and investigating protein domains based on both sequence and structural alignments. Instructions for downloading and installing the software package containing CDTree and

continued on page 4

Page 2

The Growth in Number and Diversity of Whole Genome Shotgun Sequencing Projects at the NCBI

The number of Whole Genome Shotgun (WGS) sequencing projects with data in GenBank continues to grow at a rapid pace. Projects include not only single genomes from individual organisms but also metagenomes, which are whole genome shotgun sequences from biological communities. As of January 24, 2007, more than 400 projects are listed in the WGS directory on the GenBank FTP site.

ftp.ncbi.nlm.nih.gov/genbank/wgs

This article highlights some of the recent genome sequences and metagenomes and shows how to access these important data using the Entrez system and the NCBI BLAST services.

WGS sequences in both GenBank and the Entrez system are organized by project, and each project is assigned a master accession that begins with a unique four-letter prefix. All sequences belonging to the same project have accessions that share the same four-letter prefix. Examples of WGS master accessions can be seen in Table 1 below and Table 2 on page 8. As described in the Summer/Fall 2004 issue of the NCBI News, the Whole Genome Shotgun Projects page that provides a list of WGS projects and accessions is available on the NCBI Website:

www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi

The Entrez Genome Projects Database

The Entrez Genome Projects Database provides convenient access to all WGS genomes.

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

The following query in the Genome Projects page retrieves more than 400 summaries of projects with WGS data.

has wgs[Properties]

Each of these Project Summaries links to a Project Overview page that provides links to the sequencing center involved and explains the motivation for the project. Figure 1 shows the Project Overview page for the European rabbit. The data are linked through the "Project data" menu.

continued on page 6

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence to NCBI News at the address below. To subscribe to NCBI News, send your name and address to either the street or E-mail address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 3S-308
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: [email protected]

Editors: Dennis Benson, David Wheeler

Contributors: Monica Romiti

Writers: Peter Cooper, Rana Morris, Eric Sayers, Robert Yates

Editing and Production: Robert Yates

Print & Web Design: Robert Yates

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 07-3272

ISSN 1060-8788
ISSN 1098-8408 (Online Version)

Table 1. Whole genome shotgun assemblies of single organism genomes with selected recent examples.

Taxonomic Group | Number | Example | Master Accession
Bacteria | 269 | Desulfovibrio vulgaris subsp. vulgaris DP4 | AATW01000000
Archaea | 4 | Thermofilum pendens Hrk 5 | AASJ00000000
Fungi | 49 | Amphibian chytrid | AATT01000000
Green Plants | 4 | Black cottonwood | AARH01000000
Animals | 65 | |
  Nematodes | 2 | C. remanei strain PB4641 | AAGD00000000
  Mollusks | 1 | California sea hare | AASC00000000
  Insects | 25 | Yellow fever mosquito | AAGE00000000
  Echinoderms | 1 | Purple sea urchin | AAGJ02000000
  Sea Squirts | 2 | Ciona savignyi | AACT01000000
  Fishes | 5 | Three-spined stickleback | AANH01000000
  Birds | 1 | Chicken | AADN02000000
  Mammals | 30 | African elephant | AAGU01000000
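The shared four-letter prefix makes it easy to group WGS accessions by project in a script. The following is a minimal sketch, not an NCBI tool. It assumes that an accession consists of the four-letter prefix, a two-digit assembly version, and a run of at least six digits that is all zeros for a master record, as in the Table 1 examples. The function names are hypothetical.

```python
import re

# Assumed layout, inferred from the Table 1 examples:
# four letters, two-digit version, then six or more digits.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,})$")


def parse_wgs_accession(accession):
    """Split a WGS-style accession into (prefix, version, serial)."""
    match = WGS_ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    prefix, version, serial = match.groups()
    return prefix, int(version), int(serial)


def is_master(accession):
    """A master record is assumed to have an all-zero serial part."""
    return parse_wgs_accession(accession)[2] == 0


def same_project(first, second):
    """Two accessions belong to the same project if their prefixes match."""
    return parse_wgs_accession(first)[0] == parse_wgs_accession(second)[0]


print(parse_wgs_accession("AATW01000000"))
print(is_master("AAGU01000000"))
```

For example, the African elephant master accession AAGU01000000 parses to the prefix AAGU and version 1, and any accession starting with AAGU would be grouped with it.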

Page 3

Figure 2. New BLAST output display. The results of a search with the crab-eating macaque (Macaca fascicularis) CDC20 mRNA (AB168636) against the default human genomic plus transcript database.

RID: 1168282777-19032-214373696615.BLASTQ3

A. The descriptions section of the output is a table with separate sections for Transcripts and Genomic Sequences. The search finds the corresponding human RefSeq transcript. It also finds matches to chromosomes 9 and 1 in both the reference and alternate assemblies of the human genome. The local metrics, E value and Max Score, identify the best match as the retrocopy pseudogene on chromosome 9. The intronless structure of the pseudogene gives a single high-scoring match. The total score metric makes it easy to find the functional eleven-exon gene on chromosome 1 even though the small individual exon matches give lower scores than the pseudogene.

B. The alignment of the macaque mRNA to a contig on chromosome 1 from the reference human genome. Segments are sorted by "Query start position" to show the matches to the exons in order of position rather than E-value.

C. Alignments displayed in the human Map Viewer are available by following the linked genomic sequence identifiers.


New Database and View Options for Nucleotide BLAST Services

The nucleotide-nucleotide BLAST pages linked to the BLAST homepage now offer the human and mouse transcript and genomic reference sequences as database options (Figure 1). A new output display for all searches provides a more organized presentation of the search results and has powerful new sorting options (Figure 2). In addition to the traditional sorting by Expect Value, results can now be sorted by Maximum Score, Total Score, Percent Query Coverage, and Maximum Percent Identity. The Box below provides definitions for these metrics. Sorting options are also available for multiple hits within the alignments for each subject (database) sequence, with the additional option of sorting by query or subject position, an option especially useful for ordering potential exons on genomic matches (Figure 2 B). Matches to genomic sequences in the new genome and transcript databases can now be displayed in the mouse and human Map Viewers as with the specialized genomic BLAST services (Figure 2 C).

These new database and display options provide rapid access to the most popular annotated genomes at the NCBI and expand the power of BLAST as a genome annotation tool.
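The effect of sorting the hits within one subject sequence by position rather than by significance can be illustrated with a short sketch. The segment values and the `sort_segments` helper below are hypothetical and are not part of any BLAST software. They only show why ordering by query start lists potential exons in transcript order.

```python
from operator import itemgetter

# Hypothetical aligned segments of one mRNA query against one genomic subject.
segments = [
    {"evalue": 1e-65, "q_start": 700, "s_start": 51200},
    {"evalue": 1e-40, "q_start": 1, "s_start": 48000},
    {"evalue": 1e-52, "q_start": 350, "s_start": 49900},
]


def sort_segments(segments, key="evalue"):
    """Return the segments sorted in ascending order of the chosen field."""
    if key not in ("evalue", "q_start", "s_start"):
        raise ValueError(f"unsupported sort key: {key}")
    return sorted(segments, key=itemgetter(key))


# Sorted by E value, the most significant segment comes first.
by_significance = [s["q_start"] for s in sort_segments(segments, "evalue")]
# Sorted by query start, the segments follow the order of the transcript.
by_query = [s["q_start"] for s in sort_segments(segments, "q_start")]

print(by_significance)
print(by_query)
```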

Local Extreme Metrics

These measures treat each aligned segment independently. Where there are multiple matches to the same subject (database) sequence, only the metric for the best match is considered. The E(xpect) Value is the traditional BLAST statistic used to sort output by significance.

E(xpect) Value: the number of alignments expected by chance with a particular score or better. The expect value is the default sorting metric and normally gives the same sorting order as Max Score.

Max(imum) Score: the highest alignment score of a set of aligned segments from the same subject (database) sequence. The score is calculated from the sum of the match rewards and the mismatch, gap open and extend penalties independently for each segment. This normally gives the same sorting order as the E Value.

Max(imum) Identity: the highest percent identity for a set of aligned segments to the same subject sequence.

Total Metrics

These metrics are summed over or include all aligned segments for the same subject sequence. These are most useful for analyzing BLAST matches to genomic sequences.

Tot(al) Score: the sum of alignment scores of all segments from the same subject sequence. This sorting order may help promote the position of mRNA matches to genomic sequences where there are multiple exons. The Total Score is useful for distinguishing hits to functional multi-exon genes from those to the corresponding intronless retrotransposed pseudogenes (Figure 2).

Query Coverage: the percent of the query length that is included in the aligned segments. This is calculated over all segments as with the Tot Score.
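The relationship between the local and total metrics can be made concrete with a small calculation. The sketch below is illustrative only: the `Segment` class, the `subject_metrics` function and all scores are hypothetical, and Query Coverage is assumed to count each query position once even where segments overlap. It reproduces the situation in Figure 2, where an intronless pseudogene wins on Max Score while the multi-exon gene wins on Total Score.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    score: float     # alignment score of this segment
    evalue: float    # expect value of this segment
    identity: float  # percent identity of this segment
    q_start: int     # first aligned query position (1-based, inclusive)
    q_end: int       # last aligned query position (inclusive)


def subject_metrics(segments, query_length):
    """Summarize all aligned segments that hit one subject sequence."""
    covered, last_end = 0, 0
    # Walk the segments in query order and count each position only once.
    for start, end in sorted((s.q_start, s.q_end) for s in segments):
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return {
        "e_value": min(s.evalue for s in segments),
        "max_score": max(s.score for s in segments),
        "max_identity": max(s.identity for s in segments),
        "total_score": sum(s.score for s in segments),
        "query_coverage": 100.0 * covered / query_length,
    }


QUERY_LENGTH = 1000
hits = {
    # One long match to an intronless pseudogene (made-up numbers).
    "pseudogene": [Segment(900.0, 0.0, 92.0, 1, 1000)],
    # Three shorter exon matches to the functional gene (made-up numbers).
    "gene": [
        Segment(350.0, 1e-95, 97.0, 1, 330),
        Segment(350.0, 1e-95, 96.0, 331, 660),
        Segment(350.0, 1e-95, 95.0, 661, 1000),
    ],
}
metrics = {name: subject_metrics(segs, QUERY_LENGTH) for name, segs in hits.items()}
by_max = sorted(metrics, key=lambda n: metrics[n]["max_score"], reverse=True)
by_total = sorted(metrics, key=lambda n: metrics[n]["total_score"], reverse=True)

print(by_max)
print(by_total)
```

With these made-up numbers, sorting by Max Score lists the pseudogene first, while sorting by Total Score promotes the three-exon gene.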

Figure 1. New database and display options for NCBI Web BLAST services. A. The human and mouse genomic and transcript databases can now be selected using the radio buttons on the BLAST form. The traditional BLAST databases are available through the pull-down list once the "Others (nr etc.)" radio button is selected. B. The improved display with the sorting capabilities is invoked through the "New View" checkbox that is now selected by default on the "Format" section of the nucleotide BLAST forms.


Page 4

CD Tree
continued from page 1

Cn3D 4.2 are available on the CDTree homepage.

www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml

CDTree Displays CDs

The most basic use of CDTree is to view one of the pre-defined CD hierarchies accessible on the summary page for a curated CD record that, for example, can be found at

www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=cd02156

Above the Sub-family Hierarchy display is now a button labeled "Interactive Display with CDTree" that can launch CDTree with either the current CD only or the entire hierarchy (Figure 1). When launched from this page, CDTree by default opens four windows: the Indent Tree, the main window showing the CD hierarchy; the Sequence Tree, showing a phylogenetic tree of the proteins within the CD hierarchy; and the Taxonomy Tree, showing the taxonomy nodes of the same proteins.

Figure 1 shows several CDTree and Cn3D 4.2 displays for the CD hierarchy (Figure 1 a) of the nucleotidyl transferase superfamily (cd02156). Sequence and Taxonomy Tree views of one child CD of this hierarchy, the catalytic core of the arginyl-tRNA synthetase (ArgRS_core, cd00671), are shown in Figure 1 b & c.

CDTree analyzes CD records

When launched from a CD web page, the initial CDTree views provide straightforward taxonomic and phylogenetic analysis through a variety of selection commands that highlight subsets of sequences. Highlighted sequences become red, and this highlight automatically propagates to the same sequences in all open viewers. In Figure 1, one branch of the phylogenetic tree was selected by clicking on the appropriate tree node, highlighting the corresponding sequences in the Indent Tree and Taxonomy Trees. The transferred highlights show that all of these sequences are from eubacteria and represent 17 of the 43 total sequences in the alignment.

Figure 1. Hierarchy display for the CD family containing cd00671, the catalytic core domain of the arginyl-tRNA synthetase. Clicking the "Interactive Display with CDTree" button provides the option of loading either the single domain (cd00671) or the entire hierarchy into CDTree. Selected displays in CDTree and Cn3D 4.2 for the CD family containing cd00671. a) the Indent Tree display, showing the CD hierarchy with cd00671 selected; b) the Sequence Tree display, showing a neighbor joining tree for cd00671 with one branch highlighted; c) the Taxonomy Tree display, showing the taxonomic nodes represented by the sequences in cd00671; d) the Cross Hits viewer, showing the distribution of best BLAST hits for the six immediate child domains of the overall parent, cd00802; e) Alignment Viewer in Cn3D 4.2, showing the cd00671 alignment with the active site residues highlighted in yellow; f) Structure Viewer in Cn3D 4.2 showing the two aligned structures in cd00671 with the active site residues shown in yellow and the bound arginine in green. The red highlighting of the 17 sequences selected in panel b is automatically propagated to panels a, c and d.

Page 5

Other analysis tools include the Cross Hits viewer, the CDART viewer, and the CDTree Validator. The Cross Hits viewer shows at a glance the quality of the hierarchy itself. Each CD is represented by a pie chart where the colored wedges indicate the proportion of sequences in that CD that have their best BLAST match to the CD that corresponds to that color. Thus, a well defined hierarchy will look much like the one in Figure 1 d, where each pie for a child CD contains only its own color, while the parent contains representatives for each of its children. The CDART viewer functions much like the existing CDART web service, showing the domain architectures of sequences in selected CDs. It allows, for example, the selection of sequences based on a shared architecture. The CDTree Validator performs several consistency checks on the alignments in a CD hierarchy, requiring that child CDs maintain the alignment of the parent and that no sequence violates the block alignment model of the CD.

CDTree updates CD records

Once a CD has been loaded, CDTree can automatically update the alignment with new sequences using Position Specific Iterative BLAST (PSI-BLAST) or standard protein-protein BLAST. After retrieving the BLAST results from NCBI, CDTree distributes the sequences to their best matching CDs, and these new sequences become "pending" rows, in contrast to the "aligned" rows already part of the block model.

CDTree uses Cn3D 4.2 to View and Edit CD Alignments

A mouse click on the double-helix icon next to any CD in the Indent Tree window launches Cn3D 4.2, where the alignment can be viewed and edited (Figure 1 e). This function is available even for alignments that have no 3D structures. If the alignment does contain structural data, then these will be shown as a template to help guide the alignment process (Figure 1 f). Functional features annotated by CDD curators can also be highlighted on both the alignment and structures, as is the active site in Figure 1 e & f. If the CD contains pending sequences, these will be placed in the Imports window in Cn3D, where they can be aligned to the CD block model using any of the alignment algorithms in Cn3D. After this is complete, the Cn3D Save function exports this new alignment back into CDTree. By continuing this cycle, an entire hierarchy can be updated, one CD at a time. The box "Selected New Features in Cn3D 4.2" provides a list and description of several new features of Cn3D 4.2 in addition to those related to the CDTree functions mentioned above. More information on Cn3D is available on the homepage for the program.

www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml

CDTree creates CD records and CD hierarchies

One of the most powerful features of CDTree is that it can create de novo CD records, either as the beginning of a new hierarchy or a new child in an existing hierarchy. Quite simply, any group of sequences can be selected and assigned to a new CD, which then can be assigned a parent if appropriate. Alternatively, CDTree can start with a single protein sequence, run a BLAST search, and use the resulting alignment as the nucleus of a new CD record. A utility named "fa2cd" that comes bundled with CDTree can create CDs from legacy alignment data in FASTA format, which can then be imported into CDTree as well.

The combination of CDTree and Cn3D 4.2 provides a powerful software platform that integrates literature, sequence, taxonomic, phylogenetic, functional, and structural data for a protein domain in an interlinked set of views, allowing researchers both to visualize these data more fully and to identify new domains and domain features in poorly annotated protein families.

Selected New Features in Cn3D 4.2

Stereoview - Cn3D 4.2 can display customizable stereo images.

PSSM Export - Cn3D 4.2 can export the Position Specific Score Matrix (PSSM) calculated for the current alignment in NCBI scoremat format, which can be added to an RPS-BLAST database or used to begin a PSI-BLAST search.

New Alignment Algorithms - Cn3D 4.2 now can apply the Block Aligner to the entire set of pending sequences at once, and can also align a pending sequence using BLAST to the most similar sequence in the alignment, rather than the master sequence. Cn3D 4.2 also contains a new alignment refiner that sequentially realigns each row in the existing alignment.

New Sequence Conservation Coloring Options - Cn3D 4.2 offers three new coloring options to assist in analyzing the quality of a block alignment model:

- Block Fit - colors all blocks across the entire alignment by how well they fit to the PSSM

- Normalized Block Fit - like Block Fit except that the colors are computed separately for each block, thereby showing which sequences have the best/worst scores for each block

- Block Row Fit - like Block Fit except that colors are computed separately for each row, thereby showing which block has the best/worst score for each sequence

Improved Highlighting - Cn3D 4.2 can highlight by blocks, can extend existing highlights to aligned columns, and can restrict existing highlights to a single row. Highlights can also be cached, or saved, and then can be recalled later. Users may appreciate that clicking in the white space of the sequence window no longer removes the highlighting, thereby saving some frustration! Finally, when launched from within CDTree, Cn3D 4.2 can exchange highlight messages with CDTree. For example, clicking on the double-helix icon next to a CD in the Indent Tree window will send highlights from CDTree to Cn3D, if that CD has been opened with the Cn3D viewer already.
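The difference between the three block-fit coloring options listed in the box is only the set of scores over which the colors are scaled. The sketch below illustrates that idea and is not the Cn3D implementation: the score matrix is invented, the function names are hypothetical, and simple min-max scaling to the range 0 to 1 is an assumption standing in for whatever color mapping Cn3D actually uses. Rows are sequences and columns are blocks.

```python
def _scale(values):
    """Min-max scale a list of scores to the range 0..1 (assumed mapping)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def block_fit(scores):
    """Scale every block score against the whole alignment."""
    width = len(scores[0])
    flat = _scale([v for row in scores for v in row])
    return [flat[i * width:(i + 1) * width] for i in range(len(scores))]


def normalized_block_fit(scores):
    """Scale separately within each block (column)."""
    columns = [_scale(list(column)) for column in zip(*scores)]
    return [list(row) for row in zip(*columns)]


def block_row_fit(scores):
    """Scale separately within each sequence (row)."""
    return [_scale(row) for row in scores]


# Invented PSSM fit scores: two sequences (rows) by two blocks (columns).
scores = [[10, 2], [6, 4]]
print(block_fit(scores))
print(normalized_block_fit(scores))
print(block_row_fit(scores))
```

Scaling per column shows which sequence fits each block best, and scaling per row shows which block fits best within each sequence, matching the descriptions of Normalized Block Fit and Block Row Fit.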

Page 6


PubMed AbstractPlus
continued from page 1

easily found by the following search in PubMed.

McGinnis S [Author] AND BLAST

Figure 1 shows the AbstractPlus display of this 2006 Nucleic Acids Research article. The first four of the top five "Related Links" are recent articles on NCBI databases and services, including BLAST services. The first of these is also found directly in this sample search and is cited in the current article. But the other four highly relevant articles would be missed unless other searches are performed. Moreover, the links to the AbstractPlus displays of interesting related articles quickly expand the search into other areas. For example, the top link quickly leads to articles about software packages for managing BLAST output or software interfaces for managing local installations of BLAST and other bioinformatics software.

Having the top five related articles on the same page in the AbstractPlus display saves valuable time and enhances the power of PubMed as a discovery system.

Whole Genome Shotgun
continued from page 2

Genome-specific resources including, in some cases, a genome-specific BLAST service are linked under "Resource Links".

Individual Genome Sequences

WGS projects include over 250 bacteria and more than 120 eukaryotes from a wide range of taxonomic groups as shown in Table 1. In addition to the Rhesus macaque, reported in the last issue of the NCBI News, notable additions in the animals include the African elephant, the rabbit, the guinea pig, the shrew, the hedgehog, and the domestic cat. There are also a number of first genomes from an interesting array of taxonomic groups: a beetle, Tribolium castaneum; a marsupial, the short-tailed opossum; a monotreme, the duck-billed platypus; and the first tree species, the black cottonwood. Low coverage sequences of large genomes like that of the elephant may contain nearly a million individual sequences in GenBank. Master GenBank records that collect all of the WGS sequences for a particular project are useful for retrieving these projects using the Entrez system. The genome project overview pages, described above, provide access to the WGS data through these Master records. Master records can be used in the Entrez nucleotide database to retrieve the sequences for WGS projects. For example, the following query retrieves the master records for all mammalian WGS projects from the nucleotide database.

wgs_master[Properties] AND mammals[Organism]
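The same query can in principle be submitted from a script rather than the Entrez web page. The sketch below only builds a request URL for the NCBI E-utilities ESearch service, which is not described in this article. The database name "nucleotide", the `retmax` value and the `esearch_url` helper are assumptions for illustration, and the code does not contact NCBI or guarantee that the query term is still supported.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; brackets and spaces in the term are URL-encoded."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# "nucleotide" is an assumed database name for the Entrez nucleotide database.
url = esearch_url("nucleotide", "wgs_master[Properties] AND mammals[Organism]")
print(url)

# Decoding the query string recovers the original Entrez term unchanged.
print(parse_qs(urlsplit(url).query)["term"][0])
```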

Searching biological features in WGS genomes

The prokaryotic genomes and several WGS projects for higher eukaryotes are available as annotated genomes with reference sequences, gene records and, for the eukaryotes, sequence maps in the Map Viewer. The Rhesus macaque, Tribolium castaneum, and Populus trichocarpa are three recent examples of organisms with fully annotated WGS-based genomes. Features of these annotated genomes may be searched effectively using the Entrez Gene, Protein and Nucleotide databases. The assemblies, Reference Sequence mRNAs and proteins, and other sequence collections for annotated genomes are available for BLAST searches through the genomic BLAST pages linked to the Map Viewer homepage.

www.ncbi.nlm.nih.gov/mapview

Many of the other WGS projects for eukaryotes are not currently available with annotated genes or proteins. These can be searched for these fea-

continued on page 8

Figure 1. The Genome Project summary page for the European rabbit. The page provides access to the WGS data through the "Project Data" menu and links to genome BLAST pages under "Resource Links".

Page 7


economically important forestryresource. The poplar genome is wellsuited to serve as basis for compara-tive genomic studies of other broadleaf forest trees. Analysis of thisgenome has already provided impor-tant insights into the evolution ofthis group of plants into trees.

The genome assemblies and annota-tions for Trypanosoma brucei, thecausative agent of African sleepingsickness, and Theileria parva, thecausative agent of east coast fever--an African cattle disease, can now beviewed and searched in the mapviewer.

The Theileria genome is based on achromosome-by-chromosome wholegenome assembly produced by TheInstitute for Genomic Research(TIGR) and the InternationalLivestock Research Institute. Thesequence is anchored to the fourTheileria chromosomes. The currentannotation contains 4,089 genes andtheir predicted transcripts. Theileriabelongs to the Apicomplexa animportant group of human and ani-mal pathogens that includes themalaria parasites (Plasmodium). TheTheileria genome now joins two otherApicomplexan genomes, Plasmodiumfalciparum and Cryptosporidium parvum,in the Map Viewer. Comparativeanalysis of these genomes will con-tinue to provide insights into the par-asite life cycles, host-parasite interac-tions and potential drug and vaccinetargets.

The Trypanosoma genome is theassembly and annotation produced byan international consortium thatincludes the Sanger Center andTIGR. The sequence is of the elevenlarge chromosomes and does notinclude smaller mini and intermediatechromosomes. The mapped sequenceand annotation represent 10,252

genes and their transcripts. The T.brucei genome is the first try-panosome genome in the MapViewer. The sequencing and analysisof the genomes of T. cruzi, thecausative agent of chagas disease andLeishmania major, responsible for leish-maniasis, together with the T. bruceigenome should provide insights intothe biology of these parasites thatwill help combat these diseases.

Other access to data

In addition to access through theMap Viewer, access to sequences ofthe genome assemblies, transcripts,proteins and gene models for blackcottonwood, Theleiria parva andTrypanosoma brucei is provided throughthe RefSeq and Gene databases.GenBank and RefSeq records areavailable for searching in Entrez orvia NCBI's Web BLAST serviceswhere they are extensively integratedwith other resources and databases.

New Genome Builds andMap Viewer Displays

New to Map Viewer

Genetic maps for twenty four newplants, the first assembled treegenome for black cottonwood(Populus trichocarpa), and two assem-bled protozoan parasite genomes arenow available in the Map Viewer.

Plant Genetic Maps

The new plant genetic maps includeimportant food and economic cropssuch as the common bean, Brassicaoleracea—the source of cabbage,broccoli, cauliflower and Brusselssprouts—rye, rapeseed, cocoa,almond, and alfalfa. Also added aremaps of close relatives and probablewild ancestors of some of these andother important crops. A completelist of the new plants with geneticmaps can be found in Table 1. Thesegenetic maps are important plat-forms for the future development ofphysical maps and complete genomesequences for several of these plants.

Genome Sequences

The Map Viewer now hosts theblack cottonwood (Populus trichocarpa)genome assembly. This is the firstwoody plant genome assembly. Theblack cottonwood genome build 2.1is the Joint Genome Institute'sassembly of the 7.5 X whole genomeshotgun sequence. Availablesequence maps include NCBI contigs(the "Contig" map), the WGSsequences (the "Component" map),and the alignment of black cotton-wood RNA sequences (the"PthRNA" map), UniGene clusters(the "PthUniGene" map), andGnomon predicted gene models (the"ab-intio" map). The black cotton-wood is not only an importantexperimental model species but is an

Table 1. New Plants with genetic maps inMap Viewer.

New Plant Species in Map Viewer

Aegilops tauschii
Aegilops umbellulata
Allium cepa (onion)
Brassica juncea (Indian mustard)
Brassica napus (oilseed rape)
Brassica nigra (black mustard)
Brassica oleracea (cabbage and others)
Brassica rapa (field mustard)
Capsicum annuum (cayenne pepper)
Eragrostis tef (tef)
Medicago sativa (alfalfa)
Phaseolus vulgaris (French bean)
Populus trichocarpa (black cottonwood)
Prunus dulcis (almond)
Secale cereale (rye)
Setaria italica (foxtail millet)
Solanum lycopersicoides
Solanum melongena (eggplant)
Solanum peruvianum (Peruvian tomato)
Sorghum bicolor (broomcorn)
Theobroma cacao (cacao)
Triticum turgidum (English wheat)
Vigna radiata (mung bean)


tures through sequence similarity using the NCBI BLAST services. In some cases genome-specific BLAST pages are linked to WGS projects (Figure 1). In all cases, WGS data are available as the "wgs" nucleotide database in the pull-down list on nucleotide-nucleotide BLAST forms linked to the BLAST homepage.

www.ncbi.nlm.nih.gov/BLAST

Figure 2 shows the result of a BLAST search with the platypus beta-2-microglobulin mRNA (AY125948) against the platypus wgs sequence. The search identifies two WGS records containing the four exons of the platypus microglobulin gene. This genomic sequence is not available in any other form at NCBI.

Environmental Sequences: Metagenomics

An interesting application of WGS techniques involves obtaining sequence information from entire biological communities rather than individual species. Acquiring and analyzing sequences obtained from biological communities without isolation of individual clones has been termed 'metagenomics'. Whole Genome Shotgun metagenomic studies, or metagenomes, are important for assessing microbial diversity in all ecosystems because the majority of microbial species are unknown and probably unculturable. Communities from unusual or extreme environments seem particularly likely to be rich sources of unknown organisms that may have evolved interesting or useful adaptations. Metagenomes may provide clues to the genetic and biochemical adaptations of these organisms. Perhaps more importantly, in the same way that single organism genomes can provide insights into the specific metabolic pathways in the organism, metagenomes may provide important insights into community metabolism.

Metagenome projects have added sequences to GenBank from unusual environments and communities, including acid mine drainage-impacted streams, open water and deep sea ocean communities, communities associated with whale falls in the deep ocean, and the chemoautotrophic symbiotic community associated with an annelid worm lacking a digestive or excretory system. There are also data from less exotic though still largely unknown communities such as those in the human gut and farm soil. In addition, metagenomic techniques have been applied to obtain sequences of extinct organisms, including woolly mammoth and Neanderthal man, from well-preserved remains.

Environmental Genome Projects: Metagenomes

The simplest access to the metagenomes is through the Environmental Projects link on the right-hand side of the Entrez Genome Projects homepage. More than 30 projects are available at the time of this writing. The Environmental Projects can also be retrieved with the following query from the Genome Projects homepage or from the search box on the NCBI homepage:

type_environmental[Properties]
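The same query could in principle be issued from a script through the Entrez Programming Utilities. The sketch below only builds an `esearch` request URL and sends nothing over the network; the database name `genomeprj` is an assumption about how the Genome Projects database is addressed and may differ or no longer be available, and the function name is illustrative.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="genomeprj", retmax=50):
    """Return an esearch URL for an Entrez query (no request is made)."""
    # db="genomeprj" is an assumed name for the Genome Projects database.
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url("type_environmental[Properties]")
print(url)
```

The square brackets of the field qualifier are percent-encoded by `urlencode`, so the query text can be passed exactly as it would be typed into the search box.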

These metagenome projects have varying amounts and types of data; some projects listed are in progress and have no data yet at NCBI, some have only Trace Archive sequences available, and several have WGS data available. Table 2 shows the environmental projects with WGS sequence at NCBI. As described above, all Project Overview pages in the Genome Projects database provide access to the data and other linked resources. In addition, each of the Environmental Projects has links to two specialized BLAST services: one that can search the nucleotide sequences and, in some cases,

Source                              Master Accessions
Neanderthal Man                     CAAN01000000
Woolly Mammoth                      CAAM01000000
Gutless Worm Symbionts              AASZ00000000
Bioreactor Sludge                   AATO, AATN00000000
Human Gut Biome                     AAQL, AAQK00000000
Grey Whale Carcasses                AAGA, AAFZ, AAFY00000000
Farm Soil                           AAFX01000000
Acid Mine Drainage Biofilm          AADL00000000, AAWO00000000
Global Ocean Sampling Expedition    AACY00000000
Mouse Gut Metagenomes               AATA, AATB, AATC, AATD, AATE, AATF00000000

Table 2. Environmental sequencing projects (metagenomes) with data in GenBank.
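The accessions in Table 2 appear to follow the usual WGS pattern: a four-letter project prefix, a two-digit assembly version, and a run of digits that is all zeros for a master record and non-zero for an individual contig. As a minimal sketch under that assumption (the pattern and the function names are illustrative, not part of any NCBI tool), a contig accession such as AASZ01000485 can be traced back to its project:

```python
import re

# Assumed pattern: four letters, two-digit version, six or more digits.
WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,})$")


def parse_wgs_accession(accession):
    """Split a WGS-style accession into (prefix, version, serial number)."""
    match = WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    prefix, version, serial = match.groups()
    return prefix, int(version), int(serial)


def master_accession(accession):
    """Return the version-specific master record for a WGS accession."""
    prefix, version, _ = parse_wgs_accession(accession)
    # Digits that follow the four-letter prefix and two-digit version.
    width = len(accession.strip()) - 6
    return f"{prefix}{version:02d}{'0' * width}"


print(parse_wgs_accession("AASZ01000485"))  # ('AASZ', 1, 485)
print(master_accession("AASZ01000485"))     # AASZ01000000
```

Grouping BLAST hits by the four-letter prefix in this way is one simple route from an individual WGS hit back to the project it belongs to.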

New Organisms in UniGene

The Entrez UniGene database now offers over 1,844,162 transcript clusters, linked to nucleotide records, for over 70 animals and plants. Recent additions to UniGene include: Aedes aegypti (yellow fever and dengue virus mosquito) with 241,102 transcript sequences in 15,182 clusters, Aquilegia formosa x Aquilegia pubescens (hybrid columbine) with 72,522 transcript sequences in 7,675 clusters, Gossypium hirsutum (upland cotton) with 83,321 transcript sequences in 10,845 clusters, Macaca fascicularis (crab-eating monkey) with 62,745 transcript sequences in 7,488 clusters, Oryctolagus cuniculus (rabbit) with 10,827 transcript sequences in 3,766 clusters, Pimephales promelas (fathead minnow) with 237,026 transcript sequences in 18,541 clusters, and Tribolium castaneum (red flour beetle) with 27,233 transcript sequences in 6,328 clusters.


proteins for the specific metagenome, and one that allows searches against all or subsets of the metagenomes at once.

Example: The Gutless Worm Metagenome

The sediment-dwelling marine annelid, Olavius, entirely lacks a digestive system and has a highly reduced excretory system. This worm depends on a consortium of at least four bacterial symbionts to provide its nutritional and excretory needs. The metagenome for this consortium has provided important insights into relationships among the organisms in this symbiotic community. The consortium contains two sulfur-oxidizing gamma proteobacteria (γ1 and γ3) and two sulfate-reducing delta proteobacteria (δ1 and δ4). The complementary metabolisms of the two sets of bacteria provide each other with appropriate electron donors and acceptors and the worm with organic carbon and other nutrients while processing the worm's nitrogenous waste (1). The genome-specific BLAST page allows searches against the metagenome of the chemoautotrophic bacterial symbionts. Translating BLAST searches quickly confirm the presence of the two sets of bacterial symbionts. Figure 3 shows the results of a translating BLAST search with a sulfite reductase (YP_387022) from Desulfovibrio, a delta proteobacterium. The sulfite reductase finds matches to all four of the symbionts. The best match, AASZ01000485, is a section of the partial assembly of the delta 1 symbiont (DS021230).

Continued growth

The rapid influx of WGS data in the form of organism and community genomes will continue for the foreseeable future. This new kind of data provides challenges for analyses and completely new perspectives as the scope increases beyond individual organisms to community genomes and pathways, even of extinct communities. This growing and vast amount of largely unannotated sequence will continue to be most effectively searched through the NCBI BLAST services. The Entrez Genome Projects database will continue to provide the most convenient mechanism to access the WGS and other genomes at NCBI.

1. Woyke T, et al. 2006. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature, 443(7114):950-5. PMID: 16980956

Figure 2. A nucleotide-nucleotide BLAST search against the wgs database using the platypus beta-2-microglobulin mRNA as a query. The database was limited with the Entrez query platypus[Organism]. The four hits to the two WGS sequences identify the four exons of the platypus microglobulin gene. Use the following RID to retrieve live results: 1163023234-11232-134418568822.BLASTQ4.
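A search like the one in Figure 2 could also be expressed as requests to the NCBI BLAST URL API. The sketch below only assembles request URLs and sends nothing. The database name `wgs` is taken from the text, and whether the service accepts it in this form is an assumption. `EXAMPLE-RID` is a placeholder, since RIDs are retained only for a limited time, and the function names are illustrative.

```python
from urllib.parse import urlencode

BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_url(program, database, query, entrez_query=None):
    """URL that would submit a search (CMD=Put); nothing is sent here."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    if entrez_query:
        # Limit the database with an Entrez query, as described for Figure 2.
        params["ENTREZ_QUERY"] = entrez_query
    return BLAST_CGI + "?" + urlencode(params)


def build_results_url(rid, format_type="Text"):
    """URL that would retrieve results for a request ID (CMD=Get)."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_CGI + "?" + urlencode(params)


# "wgs" as a database name is an assumption based on the text.
platypus = build_submit_url("blastn", "wgs", "AY125948", "platypus[Organism]")
print(platypus)
print(build_results_url("EXAMPLE-RID"))
```

For a translating search, the same helper would take `tblastn` as the program and a protein accession as the query.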

Figure 3. Results of a translating BLAST search (tblastn) against the Olavius metagenome using a Desulfovibrio sulfite reductase protein sequence (YP_387022) as a query. The results contain hits from all four of the symbionts, the two delta and two gamma proteobacteria. The most similar sequence is from the delta 1 symbiont. Use the following RID to retrieve live results: 1162499811-7887-105366644997.BLASTQ2.

Whole Genome Shotgun
continued from page 8


is accessible via the Entrez search and retrieval system. The flatfile and ASN.1 versions of the Release are found in the "genbank" and "ncbi-asn1" directories, respectively, at:

ftp.ncbi.nih.gov

Uncompressed, the Release 158 flatfiles are 252 Gigabytes and the ASN.1 version is about 217 Gigabytes. The data can also be downloaded at a mirror site:

bio-mirror.net/biomirror/genbank
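As a minimal sketch of listing the Release files, Python's standard `ftplib` could be used, assuming the server still accepts anonymous FTP at the host named above (not verified here). The helper names are illustrative, and the listing function accepts any object with `cwd` and `nlst` methods so that it can be exercised without a network connection.

```python
from ftplib import FTP

HOST = "ftp.ncbi.nih.gov"  # host named in the text
DIRECTORIES = {"flatfile": "genbank", "asn1": "ncbi-asn1"}


def list_release_files(ftp, fmt):
    """List file names in the release directory for the given format."""
    ftp.cwd("/" + DIRECTORIES[fmt])
    return sorted(ftp.nlst())


def connect(host=HOST):
    """Open an anonymous FTP session (requires network access)."""
    ftp = FTP(host)
    ftp.login()  # no arguments means an anonymous login
    return ftp


# Example, left commented out because it needs a network connection:
# ftp = connect()
# print(list_release_files(ftp, "flatfile")[:10])
# ftp.quit()
```

Given the uncompressed sizes quoted above, listing a directory first and fetching only the divisions of interest is likely to be more practical than mirroring the whole release.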


RefSeq Release 22 is now available by anonymous FTP at:

ftp.ncbi.nih.gov/refseq/release

Release 22 includes genomic, transcript, and protein sequences available as of March 5, 2007, from 4,187 organisms. The number of RefSeq accessions in Release 22 and their combined lengths are given in the shaded box.

RefSeq releases are posted every two months, and the next release is scheduled for May 2007. Release notes documenting the scope and content of the database are provided at:

RefSeq Release 22

            # of Accessions    # of Basepairs/Residues
Genomic     840,657            80,808,752,887
RNA         929,109            1,632,375,659
Protein     3,438,099          1,215,085,694

GenBank Release 158

GenBank Release 158 (February 2007) contains over 67 million sequence entries totaling more than 71 billion base pairs. Release 159 is scheduled for April 2007. GenBank

NCBI Courses

NCBI Courses are a great way for researchers, students, librarians, and teachers to keep up with the ongoing enhancements made to NCBI's molecular biology resources.

The courses are offered free of charge at NCBI and at universities and research institutes throughout the United States. For detailed course descriptions and schedules of upcoming courses, see the NCBI Education Homepage at:

www.ncbi.nlm.nih.gov/Education

Enhanced Field Guide Course

March 29-30, 2007, 9 AM-5 PM

NCBI, Bethesda, MD

This course is intended for biologists who desire a more in-depth look at the NCBI resources than can be provided in the standard Field Guide.

To register, and for more information:

www.ncbi.nlm.nih.gov/Class/FieldGuide/FGPlus

A Field Guide to GenBank and NCBI Molecular Biology Resources

April 25, 2007, at the Rochester Institute of Technology, Rochester, NY

May 3, 2007, at the University of California at San Francisco, San Francisco, CA

May 10, 2007, 9 AM-5 PM, at the National Library of Medicine, Bethesda, MD

The Field Guide is a lecture and hands-on computer workshop on GenBank and related databases covering effective use of the Entrez databases and search service, the BLAST similarity search engine, genome data, and related resources.

To register, and for more information:

www.ncbi.nlm.nih.gov/Class/FieldGuide/

NCBI Mini Course

May 2-3, 2007, at the Wadsworth Center, Albany, NY

May 10-11, 2007, at MIT, Cambridge, MA

May 15-16, 2007, at Virginia Polytechnic Institute, Blacksburg, VA

NCBI bioinformatics mini-courses are either problem based, such as "Identification of Disease Genes", or NCBI resource based, such as "BLAST Quick Start". The courses are 2.5 hours in length, with the first hour and a half devoted to an overview that is followed by a one-hour hands-on session.

To register, and for more information:

www.ncbi.nlm.nih.gov/Class/minicourses/minischedule.html

NCBI Powerscripting Course

April 24-27, 2007, at NCBI, Bethesda, MD

This 4-day course includes both lectures and computer workshops on effectively using the NCBI Entrez Programming Utilities (E-utilities) within scripts to automate search and retrieval operations across the entire suite of Entrez databases.

To register, and for more information:

www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html

ftp.ncbi.nih.gov/refseq/release/release-notes

For more information, visit the NCBI RefSeq Web Site at:

www.ncbi.nih.gov/RefSeq


The Third Party Annotation (TPA) database allows researchers to submit records that provide additional biological annotation but are based on existing records in GenBank. As pointed out in the recent NCBI News (Vol. 15, issue 1), these TPA records can now include a wider range of submission types, including inferential findings. This article explains how to submit these TPA records to NCBI through the two standard GenBank direct submission tools, BankIt and Sequin. Before beginning either submission route, one or more sequences from one of the collaborating nucleotide sequence databases, DDBJ, EMBL, or GenBank, must be identified as the source data for the TPA record. TPA records cannot be based on Reference Sequences or other non-primary sequence data. All TPA annotations must have some direct or indirect experimental support. The acceptable kinds of experimental evidence are described on the TPA submissions page.

www.ncbi.nlm.nih.gov/Genbank/TPA.html

Submitting TPA Records Using BankIt

The popular Web submission tool BankIt provides a simple interface for direct submission to GenBank and also allows submission to the TPA database.

www.ncbi.nlm.nih.gov/BankIt

As with GenBank submissions, the BankIt TPA submission route is best for simple submissions of small numbers of records. The Sequin submission route, described below, is a better choice for large or complicated


Submitting Third Party Annotation Records Using Standard Submission Tools

Submission Corner

TPA submissions. To submit a TPA record using BankIt, begin as usual by entering the length of the sequence and pressing the "New" button under the "GenBank Submissions by WWW" section of the BankIt page. This launches the BankIt Web submissions form that is normally used to prepare GenBank submissions. To specify a TPA submission, set the radio button to "No" to answer the questions on the form: "Is This Primary Sequence Data? Have you determined and annotated the sequence you are submitting?". Follow the instructions on the form and enter the DDBJ/EMBL/GenBank accession numbers of source data and a concise description of the supporting experimental evidence for the biological annotation in the specified text areas. The rest of the submission process is the same as that for standard GenBank data. Once the submission process is complete, the preliminary flatfile view reports, in a temporary COMMENT field, that this is a record from the third party annotation database and lists the primary database accession numbers of the source data. The COMMENT will be removed as the record is processed by the submission staff at NCBI. The BankIt acknowledgement message returned by e-mail contains the interim BankIt number, designates the submission as a TPA, and includes the accessions and concise evidence information.

Submitting TPA Records Using Sequin

The stand-alone submission and annotation tool, Sequin, also allows submission of TPA records. Sequin provides a powerful platform for propagating features across large and complicated sets of data and saves time and effort for batches or large records. The Sequin Homepage provides more information on downloading and using the Sequin program.

www.ncbi.nlm.nih.gov/Sequin

Begin a GenBank or TPA submission with Sequin by clicking the "Start New Submission" button on the main window of the program. Enter contact and author information and navigate through the dialog windows using the "Next" and "Previous" buttons as usual. Select the radio button for "Third Party Annotation" on the "Sequence Format" dialog window. Enter a description of the biological experiments used as evidence for the TPA in the "TPA Evidence" dialog window. After importing the sequence(s) and adding remaining features, enter the DDBJ/EMBL/GenBank accession number(s) used as data sources in the columns on the "Assembly Tracking" dialog. Like the BankIt submission, the preliminary flatfile view in Sequin will show the temporary COMMENT field that indicates that the submission is to the TPA database and lists the primary accession number. Send the completed Sequin TPA submission in an e-mail to

[email protected]

to complete the submission process.

TPA records that have been submitted to NCBI using BankIt or Sequin are released into the NCBI public databases once the TPA data is published in a peer-reviewed journal. At NCBI, TPA records are a part of the sequence databases available for searching through the Entrez and BLAST services.

—MR


Department of Health and Human Services
Public Health Service, National Institutes of Health
National Library of Medicine
National Center for Biotechnology Information
Bldg. 38A, Room 3S308
8600 Rockville Pike
Bethesda, Maryland 20894


PubChem Grows to 15 Million Substances

Top Ten Sources for:

PubChem Substance

Source                       No. of Records
ZINC                         3,813,892
ChemDB                       3,564,938
DiscoveryGate                2,739,765
Thomson Pharma               2,230,474
ChemBridge                   433,971
ChemBank                     413,586
ChemIDplus                   383,789
Asinex                       362,469
National Cancer Institute    268,696
Specs                        200,744

PubChem BioAssay

Source                                              No. of BioAssays
National Cancer Institute                           173
Structural Genomics Consortium - Oxford             43
NIH Chemical Genomics Center                        30
BindingDB                                           19
Scripps Research Institute                          15
Diabetic Complications Screening                    14
San Diego Center for Chemical Genomics              11
New Mexico Molecular Libraries Screening Center     8
Penn Center for Molecular Discovery                 7
Columbia University Molecular Screening Center      6

With recent deposits from Specs and Thomson Pharma, the PubChem Substance database has now passed the 15 million mark, with deposits from 53 commercial, government, and academic organizations. PubChem Substance currently contains 15,450,812 records representing 10,138,100 unique chemicals in PubChem Compound. The PubChem BioAssay database has expanded to 362 assays from 21 depositors, with more than 10 of these depositors new in 2006.

Detailed listings of the sources of PubChem data are available at

pubchem.ncbi.nlm.nih.gov/sources