Introduction to Bioinformatics Databases
Uploaded by clarke-chan
DNA → RNA → protein → phenotype
Central dogma of molecular biology
A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.
After Pace NR (1997) Science 276:734
Page 6
Using bioinformatics, we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth.
Growth of GenBank (figure: base pairs of DNA in billions and sequences in millions, by year, December 1982 to June 2006)
Growth of the International Nucleotide Sequence Database Collaboration (figure: base pairs of DNA in billions, contributed by GenBank, EMBL, and DDBJ)
http://www.ncbi.nlm.nih.gov/Genbank/
DNA → RNA → protein
Central dogma of molecular biology
genome → transcriptome → proteome
Central dogma of bioinformatics and genomics
DNA → RNA → protein → phenotype
DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
Fig. 2.2, Page 20
There are three major public DNA databases: GenBank, EMBL, and DDBJ
The underlying raw DNA sequences are identical
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
Page 16
>300,000 species are represented in GenBank
Table 2-1
Taxonomy nodes at NCBI
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi (8/06)
The most sequenced organisms in GenBank
Homo sapiens 10.7 billion bases
Mus musculus 6.5b
Rattus norvegicus 5.6b
Danio rerio 1.7b
Zea mays 1.4b
Oryza sativa 0.8b
Drosophila melanogaster 0.7b
Gallus gallus 0.5b
Arabidopsis thaliana 0.5b
Updated 8-12-04, GenBank release 142.0
Table 2-2, Page 18
The most sequenced organisms in GenBank
Homo sapiens 11.2 billion bases
Mus musculus 7.5b
Rattus norvegicus 5.7b
Danio rerio 2.1b
Bos taurus 1.9b
Zea mays 1.4b
Oryza sativa (japonica) 1.2b
Xenopus tropicalis 0.9b
Canis familiaris 0.8b
Drosophila melanogaster 0.7b
Updated 8-29-05, GenBank release 149.0
Table 2-2, Page 18
The most sequenced organisms in GenBank
Homo sapiens 12.3 billion bases
Mus musculus 8.0b
Rattus norvegicus 5.7b
Bos taurus 3.5b
Danio rerio 2.5b
Zea mays 1.8b
Oryza sativa (japonica) 1.5b
Strongylocentrotus purpuratus 1.2b
Sus scrofa 1.0b
Xenopus tropicalis 1.0b
Updated 7-19-06, GenBank release 154.0
Table 2-2, Page 18
National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov
Page 24
Types of Data in GenBank
DNA level; RNA level (cDNA); protein sequences…
www.ncbi.nlm.nih.gov
Fig. 2.5, Page 25
PubMed is…
• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Page 24
Entrez integrates…
• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes
Page 24
Entrez is a search and retrieval system that integrates NCBI databases
Page 24
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day
Page 25
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU
Page 25
Books is…
• searchable resource of on-line books
Page 26
TaxBrowser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Page 26
Structure site includes…
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Page 26
Accessing information on molecular sequences
Page 26
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)
RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)
protein
NP_006735 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
Page 27
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Page 27
4 ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).
Page 27
From the NCBI homepage, type “rbp4” and hit “Go”
Revised Fig. 2.7, Page 29
By applying limits, there are now just two entries
Revised Fig. 2.8, Page 30
Entrez Gene (top of page)
Note that links to many other RBP4 database entries are available
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Fig. 2.9, Page 32
FASTA format
Fig. 2.10, Page 32
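As a rough illustration of the FASTA format named on this slide: each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below reads such text in plain Python. The function name `parse_fasta` and the sample records are hypothetical and are not taken from the slides or from any database.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only (not real database entries).
example = """>example_1 hypothetical DNA fragment
ACGTACGTAC
GGTTAACC
>example_2 another hypothetical fragment
TTTTAAAA
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

For real work, an established parser (for example the one in Biopython) is usually a better choice than a hand-written one.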
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
Complete genome NC_######
Complete chromosome NC_######
Genomic contig NT_######
mRNA (DNA format) NM_###### e.g. NM_006744
Protein NP_###### e.g. NP_006735
Page 29-30
NCBI’s RefSeq project: accessions for genomic, mRNA, protein sequences
Accession         Molecule  Note
AP_123456         Protein   Protein products; alternate
NC_123456         Genomic   Complete genomic molecules
NG_123456         Genomic   Incomplete genomic regions
NM_123456         mRNA      Transcript products; mRNA
NM_123456789      mRNA      Transcript products; 9-digit
NP_123456         Protein   Protein products;
NP_123456789      Protein   Protein products; 9-digit
NR_123456         RNA       Non-coding transcripts
NT_123456         Genomic   Genomic assemblies
NW_123456         Genomic   Genomic assemblies
NZ_ABCD12345678   Genomic   Whole genome shotgun data
XM_123456         mRNA      Transcript products
XP_123456         Protein   Protein products
XR_123456         RNA       Transcript products
YP_123456         Protein   Protein products
ZP_12345678       Protein   Protein products
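The RefSeq prefix table lends itself to a simple lookup. The following is a minimal sketch, assuming the prefix-to-molecule pairs listed on this slide. The names `REFSEQ_MOLECULE` and `refseq_molecule` are hypothetical, and the mapping reflects only what the slide shows, not the full current RefSeq specification.

```python
# Prefix -> molecule type, copied from the slide's table.
REFSEQ_MOLECULE = {
    "AP": "Protein",
    "NC": "Genomic",
    "NG": "Genomic",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "Genomic",
    "NW": "Genomic",
    "NZ": "Genomic",
    "XM": "mRNA",
    "XP": "Protein",
    "XR": "RNA",
    "YP": "Protein",
    "ZP": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None.

    RefSeq accessions have a two-letter prefix followed by an underscore;
    anything without an underscore is treated as non-RefSeq here.
    """
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None
    return REFSEQ_MOLECULE.get(prefix.upper())


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, refseq_molecule(acc))
```

The underscore is what distinguishes RefSeq accessions from GenBank accessions such as X02775, so the lookup returns None for the latter.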
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier human genome web browser.
Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates.
We will encounter Ensembl as we study the human genome, BLAST, and other topics.
Click “human”
Species in Ensembl (figure): mammals (placentals, marsupials, monotremes); birds (passerines, paleognaths, other birds); reptiles (crocodiles, turtles, lizards); amphibians; fishes (teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans); non-vertebrates
Enter “RBP4”
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Page 33
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Fig. 2.11, Page 33
Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
Searching for HIV-1 pol: following the “genome” link yields a manageable three results
Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:
• specify the organism, e.g. hiv-1[organism]
• limit the output to RefSeq
Page 34
Only 1 RefSeq entry, versus over 100,000 nucleotide entries for HIV-1
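The two narrowing steps can also be written as a query string. Here is a minimal sketch that only builds a search URL for NCBI's E-utilities `esearch` endpoint and does not contact the network. The helper names `refine_query` and `build_esearch_url` are hypothetical. The use of `refseq[filter]` as the RefSeq limit is an assumption to check against the current Entrez help pages; the slides only say to "limit the output to RefSeq" through the web interface.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refine_query(term="", organism=None, refseq_only=False):
    """Combine a free-text term with optional organism and RefSeq limits."""
    parts = []
    if term:
        parts.append(term)
    if organism:
        parts.append(f"{organism}[organism]")  # field tag as shown on the slide
    if refseq_only:
        parts.append("refseq[filter]")  # assumed filter name, see lead-in
    return " AND ".join(parts)


def build_esearch_url(query, db="nucleotide"):
    """Return an esearch URL for the given query; no request is sent."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


broad = refine_query("hiv-1 pol")
narrow = refine_query(organism="hiv-1", refseq_only=True)
print(broad)
print(narrow)
print(build_esearch_url(narrow))
```

Keeping query construction separate from URL construction makes it easy to inspect the query text before any search is run.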
Examples of how to access sequence data: histone
query for “histone”         # results
protein records             21847
RefSeq entries              7544
RefSeq (limit to human)     1108
NOT deacetylase             697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
8-12-06
Access to Biomedical Literature
Page 35
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.
It has >14 million records dating back to 1966.
Page 35
PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using “limits”
Try “Links” to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files):
http://www.welch.jhu.edu/
Page 35
lipocalin AND disease (60 results)
lipocalin OR disease (1,650,000 results)
lipocalin NOT disease (530 results)
Venn diagrams (figure): 1 AND 2; 1 OR 2; 1 NOT 2, where 1 = lipocalin and 2 = disease
Fig. 2.12, Page 34 (8/04)
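The three boolean operators correspond to set operations on the matching records: AND is intersection, OR is union, and NOT is difference. The sketch below uses Python sets with made-up record IDs (not real PubMed identifiers), then checks the arithmetic implied by the result counts quoted on the slide.

```python
# Hypothetical record IDs matching each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease            # lipocalin AND disease
either = lipocalin | disease          # lipocalin OR disease
lipocalin_only = lipocalin - disease  # lipocalin NOT disease

print(sorted(both))
print(sorted(either))
print(sorted(lipocalin_only))

# Counts quoted on the slide.
and_count = 60
not_count = 530
or_count = 1_650_000

# Every lipocalin record either mentions disease or does not,
# so the total number of lipocalin records is AND + NOT.
lipocalin_total = and_count + not_count
print(lipocalin_total)
```

By the same reasoning, the OR count minus the NOT count approximates the number of records matching "disease" alone, although the slide's OR figure appears to be rounded.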