Introduction to Bioinformatics Databases


Page 1: Introduction to Bioinformatics Databases

Introduction to Bioinformatics Databases

Page 2: Introduction to Bioinformatics Databases

DNA -> RNA -> protein -> phenotype

Central dogma of molecular biology

A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

Page 3: Introduction to Bioinformatics Databases

After Pace NR (1997) Science 276:734

Page 6

With bioinformatics we can study the variation that occurs between species and deduce the evolutionary history of life on Earth.

Page 4: Introduction to Bioinformatics Databases

[Figure: Growth of GenBank, December 1982 to June 2006. Axes: base pairs of DNA (billions) and sequences (millions), by year.]

Page 5: Introduction to Bioinformatics Databases

Growth of the International Nucleotide Sequence Database Collaboration

[Figure: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ]

http://www.ncbi.nlm.nih.gov/Genbank/

Page 6: Introduction to Bioinformatics Databases

DNA -> RNA -> protein

Central dogma of molecular biology

genome -> transcriptome -> proteome

Central dogma of bioinformatics and genomics

Page 7: Introduction to Bioinformatics Databases

DNA: genomic DNA databases

RNA: cDNA, ESTs, UniGene

protein: protein sequence databases

phenotype

Fig. 2.2, Page 20

Page 8: Introduction to Bioinformatics Databases

GenBank, EMBL, DDBJ

There are three major public DNA databases

The underlying raw DNA sequences are identical

Page 16

Page 9: Introduction to Bioinformatics Databases

There are three major public DNA databases

GenBank: housed at NCBI (National Center for Biotechnology Information)

EMBL: housed at EBI (European Bioinformatics Institute)

DDBJ: housed in Japan

Page 16

Page 10: Introduction to Bioinformatics Databases

>300,000 species are represented in GenBank

Table 2-1

Page 11: Introduction to Bioinformatics Databases

Taxonomy nodes at NCBI

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi (8/06)

Page 12: Introduction to Bioinformatics Databases
Page 13: Introduction to Bioinformatics Databases

The most sequenced organisms in GenBank

Homo sapiens 10.7 billion bases
Mus musculus 6.5b
Rattus norvegicus 5.6b
Danio rerio 1.7b
Zea mays 1.4b
Oryza sativa 0.8b
Drosophila melanogaster 0.7b
Gallus gallus 0.5b
Arabidopsis thaliana 0.5b

Updated 8-12-04, GenBank release 142.0

Table 2-2, Page 18

Page 14: Introduction to Bioinformatics Databases

The most sequenced organisms in GenBank

Homo sapiens 11.2 billion bases
Mus musculus 7.5b
Rattus norvegicus 5.7b
Danio rerio 2.1b
Bos taurus 1.9b
Zea mays 1.4b
Oryza sativa (japonica) 1.2b
Xenopus tropicalis 0.9b
Canis familiaris 0.8b
Drosophila melanogaster 0.7b

Updated 8-29-05, GenBank release 149.0

Table 2-2, Page 18

Page 15: Introduction to Bioinformatics Databases

The most sequenced organisms in GenBank

Homo sapiens 12.3 billion bases
Mus musculus 8.0b
Rattus norvegicus 5.7b
Bos taurus 3.5b
Danio rerio 2.5b
Zea mays 1.8b
Oryza sativa (japonica) 1.5b
Strongylocentrotus purpuratus 1.2b
Sus scrofa 1.0b
Xenopus tropicalis 1.0b

Updated 7-19-06, GenBank release 154.0

Table 2-2, Page 18

Page 16: Introduction to Bioinformatics Databases

National Center for Biotechnology Information (NCBI)

www.ncbi.nlm.nih.gov

Page 24

Page 17: Introduction to Bioinformatics Databases

Types of Data in GenBank

• DNA level
• RNA level (cDNA)
• Protein sequences
…

Page 18: Introduction to Bioinformatics Databases

www.ncbi.nlm.nih.gov

Fig. 2.5, Page 25

Page 19: Introduction to Bioinformatics Databases

Fig. 2.5, Page 25

Page 20: Introduction to Bioinformatics Databases

PubMed is…

• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)

Page 24

Page 21: Introduction to Bioinformatics Databases

Entrez integrates…

• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes

Page 24

Page 22: Introduction to Bioinformatics Databases

Entrez is a search and retrieval system that integrates NCBI databases

Page 24

Page 23: Introduction to Bioinformatics Databases

BLAST is…

• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day

Page 25

Page 24: Introduction to Bioinformatics Databases

OMIM is…

• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU

Page 25

Page 25: Introduction to Bioinformatics Databases

Books is…

• searchable resource of on-line books

Page 26

Page 26: Introduction to Bioinformatics Databases

TaxBrowser is…

• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Page 26

Page 27: Introduction to Bioinformatics Databases

Structure site includes…

• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)

Page 26

Page 28: Introduction to Bioinformatics Databases

Accessing information on molecular sequences

Page 26

Page 29: Introduction to Bioinformatics Databases

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

Page 26

Page 30: Introduction to Bioinformatics Databases

What is an accession number?

An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)

RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)

protein
NP_006735 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record

Page 27

Page 31: Introduction to Bioinformatics Databases

Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 27

Page 32: Introduction to Bioinformatics Databases

4 ways to access protein and DNA sequences

[1] Entrez Gene with RefSeq

Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735)

Page 27

Page 33: Introduction to Bioinformatics Databases

From the NCBI homepage, type “rbp4” and hit “Go”

Revised Fig. 2.7, Page 29

Page 34: Introduction to Bioinformatics Databases

Revised Fig. 2.7, Page 29

Page 35: Introduction to Bioinformatics Databases
Page 36: Introduction to Bioinformatics Databases
Page 37: Introduction to Bioinformatics Databases

By applying limits, there are now just two entries

Page 38: Introduction to Bioinformatics Databases

Revised Fig. 2.8, Page 30

Entrez Gene (top of page)

Note that links to many other RBP4 database entries are available

Page 39: Introduction to Bioinformatics Databases

Entrez Gene (middle of page)

Page 40: Introduction to Bioinformatics Databases

Entrez Gene (bottom of page)

Page 41: Introduction to Bioinformatics Databases

Fig. 2.9, Page 32

Page 42: Introduction to Bioinformatics Databases

Fig. 2.9, Page 32

Page 43: Introduction to Bioinformatics Databases

Fig. 2.9, Page 32

Page 44: Introduction to Bioinformatics Databases

FASTA format

Fig. 2.10, Page 32

Page 45: Introduction to Bioinformatics Databases

FASTA format
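The slide shows FASTA format only as a figure. As a minimal sketch, assuming the usual convention that a record starts with a ">" header line followed by one or more lines of sequence, a FASTA file can be read like this. The function name parse_fasta and the example record are hypothetical and not taken from the slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a dict mapping header text to sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; keep the header without the ">" marker.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


# Made-up record for illustration only (not a real database entry).
example = """>example_id hypothetical protein
MKWVWALLLL
AAWAAAERD
"""
records = parse_fasta(example.splitlines())
```

The sequence is wrapped over several lines in the file but stored as one string, which is what most downstream tools expect.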

Page 46: Introduction to Bioinformatics Databases


Page 47: Introduction to Bioinformatics Databases
Page 48: Introduction to Bioinformatics Databases

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######   e.g. NM_006744
Protein              NP_######   e.g. NP_006735

Page 29-30

Page 49: Introduction to Bioinformatics Databases

Accession        Molecule  Note
AP_123456        Protein   Protein products; alternate
NC_123456        Genomic   Complete genomic molecules
NG_123456        Genomic   Incomplete genomic regions
NM_123456        mRNA      Transcript products; mRNA
NM_123456789     mRNA      Transcript products; 9-digit
NP_123456        Protein   Protein products
NP_123456789     Protein   Protein products; 9-digit
NR_123456        RNA       Non-coding transcripts
NT_123456        Genomic   Genomic assemblies
NW_123456        Genomic   Genomic assemblies
NZ_ABCD12345678  Genomic   Whole genome shotgun data
XM_123456        mRNA      Transcript products
XP_123456        Protein   Protein products
XR_123456        RNA       Transcript products
YP_123456        Protein   Protein products
ZP_12345678      Protein   Protein products

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
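Because the RefSeq prefix encodes the molecule type, an accession can be classified by looking only at the letters before the underscore. The sketch below copies the prefixes from the slide's table into a lookup; the names REFSEQ_PREFIXES and classify_refseq are hypothetical, and the mapping reflects this slide rather than any current NCBI list.

```python
from typing import Optional

# Prefix -> molecule type, transcribed from the table on this slide.
REFSEQ_PREFIXES = {
    "AP": "Protein",
    "NC": "Genomic",
    "NG": "Genomic",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "Genomic",
    "NW": "Genomic",
    "NZ": "Genomic",
    "XM": "mRNA",
    "XP": "Protein",
    "XR": "RNA",
    "YP": "Protein",
    "ZP": "Protein",
}


def classify_refseq(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        # No underscore: not in the RefSeq format (e.g. GenBank X02775).
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


rbp4_mrna = classify_refseq("NM_006744")
rbp4_protein = classify_refseq("NP_006735")
genbank_dna = classify_refseq("X02775")
```

Here the two RBP4 RefSeq accessions resolve to mRNA and Protein, while the GenBank accession X02775 has no underscore and is reported as not RefSeq.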

Page 50: Introduction to Bioinformatics Databases

Ensembl to access protein and DNA sequences

Try Ensembl at www.ensembl.org for a premier human genome web browser.

Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates.

We will encounter Ensembl as we study the human genome, BLAST, and other topics.

Page 51: Introduction to Bioinformatics Databases

click human

Page 52: Introduction to Bioinformatics Databases

Species in Ensembl

[Figure: tree diagram of groups represented in Ensembl. Labels: mammals (placentals, monotremes, marsupials); birds (passerines, paleognaths, other birds); reptiles (crocodiles, turtles, lizards); amphibians; fishes (teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes); agnathans; non-vertebrates.]

Page 53: Introduction to Bioinformatics Databases

enter RBP4

Page 54: Introduction to Bioinformatics Databases
Page 55: Introduction to Bioinformatics Databases

Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 33

Page 56: Introduction to Bioinformatics Databases

ExPASy to access protein and DNA sequences

ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)

Visit http://www.expasy.ch/

Page 33

Page 57: Introduction to Bioinformatics Databases

Fig. 2.11, Page 33

Page 58: Introduction to Bioinformatics Databases
Page 59: Introduction to Bioinformatics Databases

Example of how to access sequence data: HIV-1 pol

There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Page 34

Page 60: Introduction to Bioinformatics Databases
Page 61: Introduction to Bioinformatics Databases

Page 34

Searching for HIV-1 pol: following the “genome” link yields a manageable three results

Page 62: Introduction to Bioinformatics Databases

Example of how to access sequence data: HIV-1 pol

For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two easy steps:

-- specify the organism, e.g. hiv-1[organism]
-- limit the output to RefSeq!
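The slides perform these two steps through the Entrez web page. As a rough sketch, the same narrowing can be written as a single query string and sent to NCBI's E-utilities esearch endpoint instead. Several things here are assumptions not found in the slides: the helper name build_esearch_url, the choice of the nuccore database, and the use of the srcdb_refseq[prop] term to express "limit to RefSeq". The code only builds the URL and does not contact NCBI.

```python
from typing import Optional
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism: str,
                      keyword: Optional[str] = None,
                      refseq_only: bool = False,
                      db: str = "nuccore") -> str:
    """Combine organism, keyword and RefSeq restriction into one query URL."""
    parts = [f"{organism}[organism]"]  # step 1: specify the organism
    if keyword:
        parts.append(keyword)
    if refseq_only:
        # step 2: restrict to RefSeq records (assumed query spelling)
        parts.append("srcdb_refseq[prop]")
    term = " AND ".join(parts)
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


url = build_esearch_url("hiv-1", keyword="pol", refseq_only=True)
# Decode the query string again to check what would be sent.
sent_term = parse_qs(urlparse(url).query)["term"][0]
```

Joining the restrictions with AND mirrors what the web interface's limits do: each added clause can only shrink the result set.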

Page 34

Page 63: Introduction to Bioinformatics Databases

only 1 RefSeq

over 100,000 nucleotide entries for HIV-1

Page 64: Introduction to Bioinformatics Databases

Examples of how to access sequence data: histone

query for “histone”        # results

protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697

At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

8-12-06

Page 65: Introduction to Bioinformatics Databases
Page 66: Introduction to Bioinformatics Databases

Access to Biomedical Literature

Page 35

Page 67: Introduction to Bioinformatics Databases

PubMed at NCBI to find literature information

Page 68: Introduction to Bioinformatics Databases

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.

It has >14 million records dating back to 1966.

Page 35

Page 69: Introduction to Bioinformatics Databases
Page 70: Introduction to Bioinformatics Databases
Page 71: Introduction to Bioinformatics Databases

PubMed search strategies

Try the tutorial (“education” on the left sidebar)

Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease

Try using “limits”

Try “Links” to find Entrez information and external resources

Obtain articles on-line via Welch Medical Library (and download pdf files):

http://www.welch.jhu.edu/

Page 35

Page 72: Introduction to Bioinformatics Databases

lipocalin AND disease (60 results)

lipocalin OR disease (1,650,000 results)

lipocalin NOT disease (530 results)

[Figure: Venn diagrams of two sets, 1 and 2, shaded for 1 AND 2, 1 OR 2, and 1 NOT 2]

Fig. 2.12, Page 34 (8/04)
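The three Boolean operators correspond to set intersection, union, and difference. The sketch below illustrates this with two small sets of made-up record numbers standing in for the hits of each search term; the identifiers are invented and are not real PubMed results.

```python
# Invented record numbers standing in for the hits of each term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease            # lipocalin AND disease
either = lipocalin | disease          # lipocalin OR disease
lipocalin_only = lipocalin - disease  # lipocalin NOT disease

# Every record in the OR result is counted once, so the sizes satisfy
# |A OR B| = |A| + |B| - |A AND B|.
sizes_consistent = len(either) == len(lipocalin) + len(disease) - len(both)
```

AND narrows a search, OR broadens it, and NOT removes the overlap, which matches the ordering of the result counts on the slide.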