NCBI Molecular Biology Resources
NCBI FieldGuide
September 29, 2004 ICGEB
NCBI Molecular Biology Resources
A Field Guide, part 1
• About NCBI
• The NCBI Entrez System
• NCBI Sequence Databases
• NCBI Genomic Resources
** Intermission **
• NCBI Precomputed Resources
– Behind the scenes
NCBI Resources
The National Institutes of Health
Bethesda, MD
The National Center for Biotechnology Information
• Created as a part of NLM in 1988
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Number of Users and Hits Per Day
[Chart: number of users per day by year, 1997 to 2003 (y-axis 0 to 450,000), with Christmas and New Year's Days marked]
Currently averaging 10,000,000 to 35,000,000 hits per day!
Countries of Origin
U.S. (.com, .net, .org, .gov, .us) 40%
Japan 6%
Italy 4%
Canada 3%
Germany 3%
United Kingdom 3%
Netherlands 2%
Spain 2%
Brazil 2%
Sweden 1%
Switzerland 1%
Belgium 1%
Other 14%
Web Access: http://www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/About/index.html
A part of the NCBI Bookshelf
Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support
OMIM
- A catalogue of genes involved with human disease processes
- Detailed clinical and reference information
- Curated and maintained by Johns Hopkins
- Links to PubMed and sequence databases
The Entrez System
Entrez
Nucleotide
PubMed
Protein
Taxonomy
Structure
Domains
3D Domains
Journals
PMC
OMIM
Books
PopSet
SNP
UniGene
UniSTS
Genome
Gene
GEO
GEO Datasets
MeSH
Cancer Chromosomes
HomoloGene
Taxonomy
zebrafish
The Global Entrez search engine
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Database staff review and may organize the data, but we don’t add or modify information
– Records are “owned” and updated by their authors
– Examples: GenBank, SNP, GEO
• Derivative Databases
– Human-curated (compilation and correction of data)
  Examples: Gene (LocusLink), Structure & Literature databases
– Computationally derived
  Example: UniGene
– Combination
  Examples: RefSeq, Genome Assembly, Domain databases
Primary vs. Derivative Sequence Databases
[Diagram: Sequencing Centers and Labs submit sequences to GenBank; Algorithms and Curators produce UniGene, RefSeq, and Genome Assembly from those records]
GenBank: updated ONLY by submitters
UniGene, RefSeq, Genome Assembly: updated continually by NCBI
How to Query a Particular Database
(term1[tag delimiter] op term2[tag delimiter] op …)
tag delimiter = Entrez indexing field
op = AND, OR, NOT
Examples of tag delimiters: Organism, Journal, User compounds, Author
Boolean operators MUST be in ALL CAPS!
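The query pattern above can be assembled mechanically. The following is a minimal sketch in Python, not an NCBI tool: the helper names tagged and combine are hypothetical, and the field tags used are the ones that appear elsewhere in this talk.

```python
# Hypothetical helpers for composing an Entrez-style query string.
BOOLEAN_OPS = {"AND", "OR", "NOT"}


def tagged(term, field=None):
    """Return term[field], or the bare term when no field is given."""
    return f"{term}[{field}]" if field else term


def combine(first, *pairs):
    """Join terms with Boolean operators given as (op, term) pairs."""
    parts = [first]
    for op, term in pairs:
        op = op.upper()  # the slide requires operators in ALL CAPS
        if op not in BOOLEAN_OPS:
            raise ValueError(f"unknown operator: {op}")
        parts.extend([op, term])
    return " ".join(parts)


query = combine(
    tagged("thyroid peroxidase", "title"),
    ("and", tagged("human", "orgn")),
)
print(query)  # thyroid peroxidase[title] AND human[orgn]
```

Upper-casing the operator inside combine is a convenience of this sketch; when typing queries by hand the operators still have to be entered in capitals.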
Sample Query
Brauninger a c-src kinase
Tag delimiters shown: Organism, Journal, User compounds, Author
Using Fields to Find Records
Fields: Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
• Most useful search field [Organism]:
– human[orgn] …or… bacteria[orgn]
• Useful search terms in [Properties] field:
– srcdb: “source database” ( srcdb genbank[prop] )
– gbdiv: “genbank division” ( gbdiv est[prop] )
– biomol: “biomolecular type” ( biomol mrna[prop] )
Using Field Limits
#1: thyroid peroxidase    335
#2: thyroid peroxidase AND human[orgn]    291
#3: thyroid peroxidase[title] AND human[orgn]    166
#4: #3 AND srcdb refseq[prop]    5
#5: #3 AND srcdb ddbj/embl/genbank[prop]    161
#6: #5 AND gbdiv est[prop]    20
#7: #5 AND gbdiv pri[prop]    141
#8: #7 AND biomol genomic[prop]    25
#9: #7 AND biomol mrna[prop]    116
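Each numbered search reuses an earlier one through its #n reference. The sketch below, in Python, spells out what such a chain stands for by expanding the references recursively. The function expand is hypothetical; Entrez resolves history references itself, so this only illustrates the logic.

```python
import re

# The search history from the slide, keyed by history number.
history = {
    1: "thyroid peroxidase",
    2: "thyroid peroxidase AND human[orgn]",
    3: "thyroid peroxidase[title] AND human[orgn]",
    4: "#3 AND srcdb refseq[prop]",
    5: "#3 AND srcdb ddbj/embl/genbank[prop]",
    6: "#5 AND gbdiv est[prop]",
    7: "#5 AND gbdiv pri[prop]",
    8: "#7 AND biomol genomic[prop]",
    9: "#7 AND biomol mrna[prop]",
}


def expand(number):
    """Recursively replace each #n reference with the query it names."""
    def substitute(match):
        return "(" + expand(int(match.group(1))) + ")"
    return re.sub(r"#(\d+)", substitute, history[number])


print(expand(9))
```

The counts on the slide are consistent with this nesting: 5 + 161 = 166, 20 + 141 = 161, and 25 + 116 = 141, so each pair of limits splits its parent result set in two.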
Complex searches you can do with Preview/Index
How many rat UniGene clusters contain at least one mRNA?
Terms used (and indexed) in Entrez fields can be searched to gain useful information!
1) Select the UniGene database.
2) Find all the rat records.
3) Find those that have ≥ 1 mRNAs. (“not 0”)
rat [organism]
NOT
Complex Queries with Preview/Index
NOT 0 [mRNA Count]
1º Sequence Database
GenBank
• Nucleotide only sequence database
• Archival in nature
• Submission of GenBank Data to NCBI
– Direct submissions of individual records via Web (BankIt, Sequin)
– Batch submissions of bulk sequences via Email (EST, GSS, STS)
– FTP accounts for Sequencing Centers
GenBank
[Chart: sequence records (millions) and total base pairs (billions) by year, ’83 to ’04]
Release 143: 37.3 million records, 41.8 billion nucleotides
Average doubling time ≈ 14 months
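A doubling time of about 14 months corresponds to exponential growth of the form N(t) = N0 * 2^(t/14), with t in months. The Python sketch below works that arithmetic from the Release 143 figure. It assumes the doubling time stays constant, which is an extrapolation for illustration and not a claim made on the slide; the function name project is hypothetical.

```python
def project(amount, months, doubling_months=14.0):
    """Exponential growth: amount * 2 ** (months / doubling_months)."""
    return amount * 2 ** (months / doubling_months)


base_pairs_billions = 41.8  # Release 143 figure from the slide

# Projected size after one, two, and three doubling periods.
for months in (14, 28, 42):
    print(months, round(project(base_pairs_billions, months), 1))
```

Under that assumption the database would pass 80 billion nucleotides 14 months after Release 143 and 160 billion after 28 months.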
GenBank
Release 143, August 2004
37,343,937 Records
41,808,045,653 Nucleotides
>170,000 Species
160 Gigabytes
657 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
The International Sequence Database Collaboration
[Diagram: three partner databases, each taking submissions and updates and sharing records with the other two]
GenBank: NCBI (NIH); searched with Entrez; submissions via Sequin, BankIt, ftp
EMBL: EBI; searched with SRS
DDBJ: NIG (CIB); searched with getentry
Organization of GenBank: GenBank Divisions (gbdiv)
Records are divided into 17 Divisions.
- 1 Patent (11 files)
- 5 High Throughput
- 11 Traditional

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
PRI (28) Primate
PLN (12) Plant and Fungal
BCT (10) Bacterial and Archaeal
INV (6) Invertebrate
ROD (13) Rodent
VRL (3) Viral
VRT (7) Other Vertebrate
MAM (1) Mammalian (ex. ROD and PRI)
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated

BULK Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized
EST (335) Expressed Sequence Tag
GSS (116) Genome Survey Sequence
HTG (61) High Throughput Genomic
STS (5) Sequence Tagged Site
HTC (6) High Throughput cDNA
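The division codes give a quick indication of how much to trust a record. The Python sketch below groups the codes exactly as the slide does and builds the matching [prop] search term. The names division_class and gbdiv_filter are hypothetical, and the patent division falls through to "unknown" because the slide gives no code for it.

```python
# Division codes as listed on the slide, grouped by submission route.
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC"}


def division_class(code):
    """Return 'traditional', 'bulk', or 'unknown' for a gbdiv code."""
    code = code.upper()
    if code in TRADITIONAL:
        return "traditional"
    if code in BULK:
        return "bulk"
    return "unknown"


def gbdiv_filter(code):
    """Build a [prop] search term in the style used earlier in the talk."""
    return f"gbdiv {code.lower()}[prop]"


print(division_class("INV"), gbdiv_filter("EST"))
```

The horseshoe crab record used as the example GenBank entry in this talk sits in INV, a traditional division.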
File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
GenBank/GenPept (useful for scientists)
FASTA (the simplest format)
ASN.1 & XML (useful for programmers)
A Traditional “GenBank” Record

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Callouts:
LOCUS line: length; mRNA = cDNA, DNA = genomic; division; date of most recent modification
DEFINITION = title
ACCESSION: accession number
VERSION: Accession.Version and GI number
ORGANISM: NCBI’s Taxonomy
REFERENCE: references
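The VERSION line packs three identifiers into one field: the accession, its version number, and the GI number. The Python sketch below separates them, assuming the line has the three whitespace-separated fields shown in this record. The function name parse_version is hypothetical.

```python
def parse_version(line):
    """Split a VERSION line into accession, version number, and GI."""
    keyword, accession_version, gi_field = line.split()
    if keyword != "VERSION":
        raise ValueError("not a VERSION line")
    accession, version = accession_version.split(".")
    gi = gi_field.split(":", 1)[1]
    return {"accession": accession, "version": int(version), "gi": int(gi)}


record = parse_version("VERSION     AF062069.2  GI:7144484")
print(record)  # {'accession': 'AF062069', 'version': 2, 'gi': 7144484}
```

The COMMENT line in the record shows why both numbers matter: when the submitter updated the sequence, the version became .2 and the GI changed from 3132700 to 7144484, while the accession stayed the same.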
Lower down in the GenBank Record

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT      201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Callouts:
FEATURES: Feature Table
GenPept Protein ID: /protein_id="AAC16332.2" /db_xref="GI:7144485"
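Qualifiers in the feature table follow a regular /key=value pattern, so the protein ID and coding-region coordinates can be extracted with a few lines of code. The Python sketch below uses regular expressions on a fragment of the CDS feature. It is not a full GenBank parser: the names qualifiers and location are hypothetical, only simple start..end locations are handled, and a repeated qualifier key keeps only its last value.

```python
import re

# A fragment of the CDS feature shown on the slide.
cds = '''CDS 258..3302
/codon_start=1
/product="myosin III"
/protein_id="AAC16332.2"
/db_xref="GI:7144485"'''


def qualifiers(feature_text):
    """Collect /key=value qualifiers; quoted values lose their quotes."""
    found = {}
    for key, value in re.findall(r'/(\w+)=("[^"]*"|\S+)', feature_text):
        found[key] = value.strip('"')
    return found


def location(feature_text):
    """Return (start, end) from a simple start..end location."""
    start, end = re.search(r"(\d+)\.\.(\d+)", feature_text).groups()
    return int(start), int(end)


quals = qualifiers(cds)
start, end = location(cds)
codons = (end - start + 1) // 3  # coding region length in codons
print(quals["protein_id"], codons)  # AAC16332.2 1015
```

The coding region 258..3302 spans 3045 bases, which divides evenly into 1015 codons.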
FASTA format
>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSGAQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPPKCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTRHVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARPAAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLLRVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAALKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAFRFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEELTERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPDNIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTGLTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRHGYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD...
>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAGAAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTTCTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTCTTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAGGAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCGAGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCACAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTTACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAACAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAGGACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCCGGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCTCCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC...
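The definition lines above pack several identifiers between vertical bars. The Python sketch below splits a header of this gi|number|database|accession| style into its parts. It assumes exactly that layout, and the function name parse_header is hypothetical.

```python
def parse_header(line):
    """Split a '>gi|number|db|accession|rest' FASTA definition line."""
    if not line.startswith(">"):
        raise ValueError("FASTA headers start with '>'")
    gi_tag, gi, db, accession, rest = line[1:].split("|", 4)
    if gi_tag != "gi":
        raise ValueError("expected a gi| style identifier")
    return {
        "gi": int(gi),
        "db": db,
        "accession": accession,
        "description": rest.strip(),
    }


header = ">gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]"
info = parse_header(header)
print(info["accession"], info["description"])
```

For the nucleotide record, the text after the last bar begins with the locus name (HUMTPOC), so the "description" field returned here would still include it.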
Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Human thyroid peroxidase mRNA, partial cds., and translated products" ,
    source {
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag id 9606 } } ,
        orgname {
          name binomial {
            genus "Homo" ,
            species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,

[Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein]
Bulk Divisions
• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
• Batch Submission and htg (email and ftp)
• Inaccurate
• Poorly Characterized
EST Division: Expressed Sequence Tags
[Diagram: nucleus (30,000 genes) → RNA gene products → make cDNA library (80-100,000 unique cDNA clones in library) → isolate unique clones → sequence once from each end (5' and 3')]
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
gbdiv_est[Properties]
Genome Sequencing - HTG, GSS, (WGS)
[Diagram labels, in order of the workflow:]
Whole BAC insert (or genome)
shredding
cloning, isolating
sequencing
assembly
Draft Sequence (HTG division)
GSS division or trace archive
whole genome shotgun assemblies (traditional division)
HTG Division: Honeybee Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences move to traditional GenBank division
Other Primary Databases
• GEO (Gene Expression Omnibus)
– Searchable microarray data repository
• SNP (Single Nucleotide Polymorphism)
– Allelic variations (including minisatellites/simple sequence repeats and insertions/deletions)
Submit and update data
Query the database:
• gene identifiers
• field information
• sequence
Browse datasets
Download data
Redesigned with new features
GPL: Platform descriptions (submitted by manufacturer*)
GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)
GDS: Grouping of experiments (curated by NCBI)
Searchable through: Entrez GEO, Entrez GEO Datasets
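The three-letter prefix of a GEO accession identifies the kind of record. The Python sketch below maps each prefix to the description given on the slide. The function name geo_type is hypothetical, and the check only confirms that the accession is a known prefix followed by digits.

```python
# Record types as described on the slide, keyed by accession prefix.
GEO_TYPES = {
    "GPL": "platform description",
    "GSM": "raw/processed spot intensities from a single slide/chip",
    "GSE": "grouping of slide/chip data (a single experiment)",
    "GDS": "grouping of experiments, curated by NCBI",
}


def geo_type(accession):
    """Return the record type for an accession such as GSM827."""
    prefix, number = accession[:3].upper(), accession[3:]
    if prefix not in GEO_TYPES or not number.isdigit():
        raise ValueError(f"not a recognized GEO accession: {accession}")
    return GEO_TYPES[prefix]


for accession in ("GDS177", "GSM827"):
    print(accession, "->", geo_type(accession))
```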
GDS177: CMV infection of HFF cells
Comparison of gene expression profiles of HFF cells infected with CMV strains
FHCRC non-commercial human 18K array
src1: CMV infected fibroblasts
src2: uninfected fibroblasts
GSM827 : FHCMV-T-1
GSM825 : FHCMV-T-2
GSM828 : FHCMV-T-3
GSM829 : FHCMV-H-1
GSM830 : FHCMV-H-2
GSM831 : FHCMV-H-3
GSM832 : CMV_AD169-2
GSM833 : CMV_AD169-3
Expression
PRNP
SNP - GeneView