Introduction to Bioinformatics Databases

DNA -> RNA -> protein -> phenotype

Central dogma of molecular biology

A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

After Pace NR (1997) Science 276:734

Page 6

Using bioinformatics, we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth.

Growth of GenBank

[Figure: base pairs of DNA (billions) and sequences (millions) by year, December 1982 to June 2006]

Growth of the International Nucleotide Sequence Database Collaboration

[Figure: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ]

http://www.ncbi.nlm.nih.gov/Genbank/

DNA -> RNA -> protein

Central dogma of molecular biology

genome -> transcriptome -> proteome

Central dogma of bioinformatics and genomics
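As a minimal sketch of the DNA -> RNA -> protein flow, the Python below transcribes a coding-strand DNA string and translates it with the standard genetic code. The function names and the toy sequence are illustrative assumptions, not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
# Assumptions of this sketch: the input is coding-strand DNA and the
# reading frame starts at position 0.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA equivalent of a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    dna = dna.upper()
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


toy_dna = "ATGGCCTAA"  # hypothetical toy sequence
print(transcribe(toy_dna))
print(translate(toy_dna))
```

Real genomic data adds complications this sketch ignores, such as introns, alternative reading frames, and ambiguous bases.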

[Figure: DNA -> RNA -> protein -> phenotype, with the corresponding resources: genomic DNA databases (DNA); cDNA, ESTs, UniGene (RNA); protein sequence databases (protein)]

Fig. 2.2, Page 20

There are three major public DNA databases

GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan

The underlying raw DNA sequences are identical

Page 16

>300,000 species are represented in GenBank

Table 2-1

Taxonomy nodes at NCBI

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi (8/06)

The most sequenced organisms in GenBank

Homo sapiens               10.7 billion bases
Mus musculus               6.5b
Rattus norvegicus          5.6b
Danio rerio                1.7b
Zea mays                   1.4b
Oryza sativa               0.8b
Drosophila melanogaster    0.7b
Gallus gallus              0.5b
Arabidopsis thaliana       0.5b

Updated 8-12-04, GenBank release 142.0

Table 2-2, Page 18


Homo sapiens               11.2 billion bases
Mus musculus               7.5b
Rattus norvegicus          5.7b
Danio rerio                2.1b
Bos taurus                 1.9b
Zea mays                   1.4b
Oryza sativa (japonica)    1.2b
Xenopus tropicalis         0.9b
Canis familiaris           0.8b
Drosophila melanogaster    0.7b

Table 2-2, Page 18


Homo sapiens                     12.3 billion bases
Mus musculus                     8.0b
Rattus norvegicus                5.7b
Bos taurus                       3.5b
Danio rerio                      2.5b
Zea mays                         1.8b
Oryza sativa (japonica)          1.5b
Strongylocentrotus purpuratus    1.2b
Sus scrofa                       1.0b
Xenopus tropicalis               1.0b

Table 2-2, Page 18

National Center for Biotechnology Information (NCBI)

www.ncbi.nlm.nih.gov

Page 24

Types of Data in GenBank

• DNA level
• RNA level (cDNA)
• Protein sequences
• …

www.ncbi.nlm.nih.gov

Fig. 2.5, Page 25

PubMed is…

• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)

Page 24

Entrez integrates…

• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes

Page 24

Entrez is a search and retrieval system that integrates NCBI databases

Page 24

BLAST is…

• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day

Page 25

OMIM is…

• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU

Page 25

Books is…

• searchable resource of on-line books

Page 26

TaxBrowser is…

• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes…

• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)

Page 26

Accessing information on molecular sequences

Page 26

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

Page 26

What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)

RNA
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)

Protein
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record

Page 27

Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 27

4 ways to access protein and DNA sequences

[1] Entrez Gene with RefSeq

Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).

Page 27
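Once a RefSeq accession is known, the record can also be requested programmatically rather than through the web page. The sketch below only builds a request URL for NCBI's E-utilities efetch service and does not send it. E-utilities is not mentioned in the original slides, and the database name (nuccore) and parameter values are assumptions to check against current NCBI documentation.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities efetch endpoint (an assumption of this
# sketch; the slides describe only the web interface).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build (but do not send) a request for one record in FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


print(build_fasta_url("NM_006744"))
```

For a protein accession, the same helper would presumably be called with db="protein".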

From the NCBI homepage, type “rbp4” and hit “Go”

Revised Fig. 2.7, Page 29

By applying limits, there are now just two entries


Entrez Gene (top of page)

Note that links to many other RBP4 database entries are available

Entrez Gene (middle of page)

Entrez Gene (bottom of page)

Fig. 2.9, Page 32

FASTA format

Fig. 2.10, Page 32

FASTA format
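A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a rough illustration, the parser below reads such text into (header, sequence) pairs. The records shown are made-up placeholders, not the RBP4 entry from the figure, and parse_fasta is a hypothetical helper name.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>example_1 hypothetical placeholder record
ATGGCCAAG
GTTTGA
>example_2 another placeholder
MKWV
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```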


NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_006744
Protein               NP_######   e.g. NP_006735

Page 29-30

Accession          Molecule   Note
AP_123456          Protein    Protein products; alternate
NC_123456          Genomic    Complete genomic molecules
NG_123456          Genomic    Incomplete genomic regions
NM_123456          mRNA       Transcript products; mRNA
NM_123456789       mRNA       Transcript products; 9-digit
NP_123456          Protein    Protein products
NP_123456789       Protein    Protein products; 9-digit
NR_123456          RNA        Non-coding transcripts
NT_123456          Genomic    Genomic assemblies
NW_123456          Genomic    Genomic assemblies
NZ_ABCD12345678    Genomic    Whole genome shotgun data
XM_123456          mRNA       Transcript products
XP_123456          Protein    Protein products
XR_123456          RNA        Transcript products
YP_123456          Protein    Protein products
ZP_12345678        Protein    Protein products

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
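Because the two-letter prefix encodes the molecule type, a RefSeq-style accession can be classified with a simple lookup. The sketch below copies the prefixes from the table above. The regular expression and the function name are assumptions made for illustration, not an official NCBI validation rule.

```python
import re
from typing import Optional

# Prefix -> molecule type, copied from the RefSeq accession table above.
REFSEQ_PREFIXES = {
    "AP": "Protein", "NC": "Genomic", "NG": "Genomic", "NM": "mRNA",
    "NP": "Protein", "NR": "RNA", "NT": "Genomic", "NW": "Genomic",
    "NZ": "Genomic", "XM": "mRNA", "XP": "Protein", "XR": "RNA",
    "YP": "Protein", "ZP": "Protein",
}

# Two capital letters, an underscore, an alphanumeric body, and an optional
# ".version" suffix. This pattern is an assumption of the sketch.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z0-9]+(\.\d+)?$")


def refseq_molecule_type(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_006744", "NT_030059", "X02775"]:
    print(acc, refseq_molecule_type(acc))
```

Accessions without the underscore pattern, such as the GenBank record X02775, fall through to None, which is one quick way to tell RefSeq entries from other records.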

Ensembl to access protein and DNA sequences

Try Ensembl at www.ensembl.org for a premier human genome web browser.

Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates.

We will encounter Ensembl as we study the human genome, BLAST, and other topics.


Click “human”

Species in Ensembl

[Figure: tree of species groups in Ensembl. Mammals: placentals, monotremes, marsupials. Birds and reptiles: passerines, paleognaths, other birds, crocodiles, turtles, lizards. Amphibians. Fishes: teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes. Agnathans. Non-vertebrates.]

Enter “RBP4”

Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 33

ExPASy to access protein and DNA sequences

ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)

Visit http://www.expasy.ch/

Page 33

Fig. 2.11, Page 33

Example of how to access sequence data: HIV-1 pol

There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Page 34

Searching for HIV-1 pol: following the “genome” link yields a manageable three results

Example of how to access sequence data: HIV-1 pol

For the Entrez query “hiv-1 pol” there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:

-- specify the organism, e.g. hiv-1[organism]
-- limit the output to RefSeq

Page 34
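The two narrowing steps above can be sketched as query-string construction. The code below only assembles the query and a request URL for NCBI's E-utilities esearch service, without sending anything. E-utilities is not part of the original slides, and both the srcdb_refseq[PROP] filter and the helper names are assumptions to verify against current NCBI help pages.

```python
from urllib.parse import urlencode

# E-utilities esearch endpoint (an assumption of this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def narrow_query(term: str, organism: str = "", refseq_only: bool = False) -> str:
    """Combine a search term with optional organism and RefSeq restrictions."""
    parts = [term]
    if organism:
        parts.append(f"{organism}[organism]")
    if refseq_only:
        # Property filter assumed here to mean "limit the output to RefSeq".
        parts.append("srcdb_refseq[PROP]")
    return " AND ".join(parts)


def build_search_url(query: str, db: str = "nucleotide") -> str:
    """Build (but do not send) an esearch request for the query."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': query})}"


query = narrow_query("pol", organism="hiv-1", refseq_only=True)
print(query)
print(build_search_url(query))
```

Keeping the query builder separate from the URL builder makes it easy to paste the same query string into the Entrez search box by hand.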

only 1 RefSeq

over 100,000 nucleotide entries for HIV-1

Examples of how to access sequence data: histone

Query for “histone”          # results
Protein records              21847
RefSeq entries               7544
RefSeq (limit to human)      1108
NOT deacetylase              697

At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

8-12-06

Access to Biomedical Literature

Page 35

PubMed at NCBI to find literature information

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.

It has >14 million records dating back to 1966.

Page 35

PubMed search strategies

Try the tutorial (“education” on the left sidebar)

Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease

Try using “limits”

Try “Links” to find Entrez information and external resources

Obtain articles on-line via Welch Medical Library (and download pdf files):

http://www.welch.jhu.edu/

Page 35

lipocalin AND disease (60 results)

lipocalin OR disease (1,650,000 results)

lipocalin NOT disease (530 results)

[Figure: Venn diagrams of two sets, 1 and 2, illustrating 1 AND 2, 1 OR 2, and 1 NOT 2]

Fig. 2.12, Page 34 (8/04)
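The three boolean operators behave like set operations on the records each term matches. A small sketch with Python sets, using made-up article IDs rather than real PubMed results:

```python
# Hypothetical article IDs matching each search term (made-up numbers).
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease
either = lipocalin | disease      # lipocalin OR disease
only_first = lipocalin - disease  # lipocalin NOT disease

print(sorted(both))
print(sorted(either))
print(sorted(only_first))
```

AND and NOT split the first set into two disjoint parts, so the counts above (60 and 530) would imply roughly 590 records for lipocalin alone, assuming both searches were run at the same time.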
