Biodatabases 101220022654-phpapp02


Page 1: Biodatabases 101220022654-phpapp02

COT 6930 HPC and Bioinformatics

Bioinformatics Resources and Databases

sree

Page 2: Biodatabases 101220022654-phpapp02

[Diagram: DNA -> (transcription) -> RNA -> (translation) -> protein -> phenotype, annotated with the corresponding resources: genomic DNA databases; cDNA, ESTs, UniGene; gene expression databases; protein sequence databases; protein structure databases]

Page 3: Biodatabases 101220022654-phpapp02

Gene

Page 4: Biodatabases 101220022654-phpapp02

Different transcripts can be related to the same gene!

Page 5: Biodatabases 101220022654-phpapp02

EST

Expressed Sequence Tags

Partial copies of mRNA found within a particular cell

Can be used to identify genetic regions, splicing patterns of genes, etc.

Page 6: Biodatabases 101220022654-phpapp02

Outline: Bioinformatics Databases

Primary databases; derived databases

Nucleotide databases: GenBank (P), EMBL-Bank (P)

Protein databases: Swiss-Prot (D), PIR-PSD (D), GenPept (D), TrEMBL (D), Protein Data Bank (P)

Other examples: RefSeq, UniGene, PubMed, SNP, OMIM

(P = primary, D = derived)

Page 7: Biodatabases 101220022654-phpapp02

Bioinformatics Databases

Information
  DNA sequences
  Conserved DNA domains
  Genomes
  Gene expression (ESTs, microarrays)
  Protein sequences
  Protein 3D structure
  Protein families
  Mutations / polymorphisms / SNPs
  Metabolic pathways
  Chemical compounds (ligands)
  Biomedical literature (journal papers, online books…)

Page 8: Biodatabases 101220022654-phpapp02

Primary public domain bioinformatics servers

Public-domain bioinformatics facilities (each provides databases and analysis tools)

  European Bioinformatics Institute (EBI), United Kingdom
  National Center for Biotechnology Information (NCBI), United States
  GenomeNet (KEGG & DDBJ), Japan

Page 9: Biodatabases 101220022654-phpapp02

Major Databases

DNA sequences: GenBank, RefSeq, UniGene

Protein sequences: Swiss-Prot, PIR-PSD, GenPept, TrEMBL, RefSeq

Protein structure: Protein Data Bank (PDB)

Gene expression: Gene Expression Omnibus (GEO)

Biomedical publications: PubMed / MEDLINE

Page 10: Biodatabases 101220022654-phpapp02

Bioinformatics Data Sources

Primary databases
  Original submissions by researchers
  Staff organize the information only
  Generally sequence oriented
  Examples: GenBank, PDB (Protein Data Bank)

Page 11: Biodatabases 101220022654-phpapp02

Bioinformatics Data Sources

Derived databases
  Compiled from data in primary databases

  Manually curated (human selection & correction)
    Advantages: high quality
    Disadvantages: high expense, low volume
    Examples: Swiss-Prot, PIR-PSD, RefSeq

  Computationally derived (automatically generated)
    Advantages: inexpensive, up to date
    Disadvantages: lower quality
    Examples: GenPept, TrEMBL, UniGene
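The primary/derived distinction from the last two slides can be written down as a small lookup table. This is only an illustrative sketch: the `Database` class, the `CATALOG` list, the `select` helper, and the curation labels ("submitted", "manual", "computational") are names made up here, not part of any database's API. The classifications themselves are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    name: str
    content: str    # what the database stores
    kind: str       # "primary" or "derived"
    curation: str   # "submitted", "manual", or "computational" (labels assumed here)


# Classification as given on the slides.
CATALOG = [
    Database("GenBank", "nucleotide", "primary", "submitted"),
    Database("EMBL-Bank", "nucleotide", "primary", "submitted"),
    Database("PDB", "protein structure", "primary", "submitted"),
    Database("Swiss-Prot", "protein", "derived", "manual"),
    Database("PIR-PSD", "protein", "derived", "manual"),
    Database("RefSeq", "nucleotide & protein", "derived", "manual"),
    Database("GenPept", "protein", "derived", "computational"),
    Database("TrEMBL", "protein", "derived", "computational"),
    Database("UniGene", "nucleotide", "derived", "computational"),
]


def select(kind=None, curation=None):
    """Return names of databases matching the given kind and/or curation."""
    return [
        db.name
        for db in CATALOG
        if (kind is None or db.kind == kind)
        and (curation is None or db.curation == curation)
    ]


print(select(kind="primary"))
print(select(kind="derived", curation="computational"))
```

A query such as `select(curation="manual")` returns the high-quality, low-volume resources, which is the trade-off the slide describes.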

Page 12: Biodatabases 101220022654-phpapp02

Page 13: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – GenBank

“GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences”

Database type
  Nucleotide sequences
  Primary database

Current size (as of Aug. 2006)
  65,369,091,950 base pairs (bp)
  61,132,599 sequences

Access to GenBank
  Available for searching at NCBI via several methods, such as a BLAST search
  http://www.ncbi.nlm.nih.gov/Genbank/

Page 14: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – GenBank

Types of submissions to the database

Genomic DNA
  High-quality complete DNA sequence

mRNA / cDNA
  Partial or complete mRNA (or cDNA)

Expressed sequence tag (EST)
  A short sub-sequence (500-800 bp) of a transcribed, spliced nucleotide sequence (mRNA)
  May represent portions of expressed genes
  Either protein-coding or not
  About 43 million ESTs are now available

Sequence-tagged sites (STS)
  Short DNA sequences unique in the genome

Genomic survey sequence (GSS)
  Single-pass genomic DNA

Third-party annotations of GenBank sequences

Page 15: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – EMBL-Bank

Europe's primary nucleotide sequence resource

Database type
  Nucleotide sequences
  Primary database

http://www.ebi.ac.uk/embl/

Page 16: Biodatabases 101220022654-phpapp02

Page 17: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – Proteins

Protein sequence databases
  Once derived from laboratory experiments
  Now mostly based on predicted ORFs from DNA

Manual curation: Swiss-Prot, PIR-PSD

Computational derivation: GenPept, TrEMBL

Page 18: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – Swiss-Prot, PIR-PSD

Database type
  Protein sequences
  Derived database
  Manually curated (non-redundant, annotated)

Many annotations
  Functions of the protein
  Domains and sites
  Secondary & quaternary structure
  Similarities to other proteins
  Variants

Swiss-Prot: http://ca.expasy.org/sprot/

PIR-PSD: http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml

Page 19: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – GenPept, TrEMBL

Database type
  Protein sequences
  Computationally derived database
  Predicted coding sequences (CDS) translated from GenBank and EMBL records (i.e., gene products)

GenPept
  Download: ftp://ftp.ncifcrf.gov/pub/genpept/
  Release 163 (as of 12/26/2007): 4,970,178 loci containing 1,517,599,916 residues

TrEMBL
  http://www.ebi.ac.uk/trembl/index.html
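To make "computationally derived" concrete, the sketch below translates a coding sequence into a protein with the standard genetic code, which is the basic step behind databases built from predicted CDS. It is a simplified illustration, not the actual GenPept or TrEMBL pipeline. The function name `translate_cds` is made up here, and the input is the first 30 bases of the HBB coding sequence that appears later in this deck.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First ten codons of the HBB coding sequence (starting at the ATG).
hbb_start = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
print(translate_cds(hbb_start))  # MVHLTPEEKS
```

The output matches the start of the beta globin protein record shown on the "Healthy Individual" slide. Real pipelines also have to locate the CDS, choose the right genetic code for the organism, and handle ambiguous bases, none of which is attempted here.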

Page 20: Biodatabases 101220022654-phpapp02

Structure Databases

3-dimensional structures of proteins, nucleic acids, molecular complexes, etc.

3-D data is available thanks to techniques such as NMR and X-ray crystallography

Protein Data Bank
  Protein 3D structures
  Primary database
  http://www.rcsb.org/pdb/home/home.do

Page 21: Biodatabases 101220022654-phpapp02

Protein Data Bank: PDB

Page 22: Biodatabases 101220022654-phpapp02
Page 23: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – Connections

Page 24: Biodatabases 101220022654-phpapp02

Page 25: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – RefSeq

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products.

Information derived from GenBank records

Database type
  Nucleotide & protein sequences
  Derived database
  Human curated (non-redundant, cross-linked)

Data in RefSeq
  Genomic DNA
  mRNAs & proteins for known genes, gene models
  Entire chromosomes
  Multiple organisms

http://www.ncbi.nlm.nih.gov/projects/RefSeq/

Example: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_015325
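RefSeq accessions carry a prefix that indicates the molecule type, which is why the example record above (NP_015325) can be recognized as a protein before it is even opened. The sketch below covers only a handful of common prefixes; the table and the helper name `describe_refseq` are illustrative, and RefSeq defines more prefixes than are listed here.

```python
# A few common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (e.g. chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (model)",
    "XR_": "predicted non-coding RNA (model)",
    "XP_": "predicted protein (model)",
}


def describe_refseq(accession: str) -> str:
    """Describe a RefSeq accession from its three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or non-RefSeq accession")


print(describe_refseq("NP_015325"))    # protein
print(describe_refseq("NM_000518.4"))  # mRNA
```

The "known genes" versus "gene models" split on the slide corresponds roughly to the N-prefixed and X-prefixed transcript and protein records.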

Page 26: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – UniGene

UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters

Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location

Database type
  Nucleotide sequences
  Computationally derived database
  Partitioned into non-redundant gene-oriented clusters
  Gene-oriented view

Data in UniGene
  Clusters of genomic DNA & ESTs
  Multiple organisms

http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene

Page 27: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – PubMed

Database type
  Biomedical papers
  Manually curated database

Service of the National Library of Medicine

MEDLINE publication database
  Over 17,000 journals
  15 million citations since 1950

http://www.ncbi.nlm.nih.gov/PubMed/

Page 28: Biodatabases 101220022654-phpapp02

Bioinformatic Databases – Others

Gene expression: ArrayExpress, Gene Expression Omnibus (GEO)

Multi-organism genomes: Entrez Genome, HomoloGene, COGs, TIGR

Genetic variation & genetic diseases: dbSNP, OMIM, CGAP

Metabolic pathways: WIT, KEGG

Many more…
  Listed in the journal “Nucleic Acids Research” each January

Page 29: Biodatabases 101220022654-phpapp02

Bioinformatic Databases: SNP Database

Single Nucleotide Polymorphisms (SNPs)

A single-base difference at a single position between two individuals of the same species

Play an important role in differentiation and disease

http://www.ncbi.nlm.nih.gov/projects/SNP/

Page 30: Biodatabases 101220022654-phpapp02

Sickle Cell Anemia

Caused by a single A-to-T substitution in the beta-globin gene (HBB): codon GAG becomes GTG, so valine is inserted instead of glutamic acid in hemoglobin

Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/

Page 31: Biodatabases 101220022654-phpapp02

Healthy Individual

>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA

ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA

GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG

AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN

ALAHKYH
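The header lines on this slide use the older pipe-delimited NCBI FASTA style: a GI number, a source tag such as `ref` for RefSeq, an accession with version, and a free-text description. A minimal parser for that layout might look like the sketch below. The function name `parse_ncbi_header` and the returned dictionary keys are made up here, and headers downloaded from NCBI today may no longer include the GI field.

```python
def parse_ncbi_header(header: str) -> dict:
    """Split a '>gi|<gi>|<source>|<accession>| <description>' FASTA header."""
    fields = header.lstrip(">").split("|", 4)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"not a gi-style header: {header!r}")
    _, gi, source, accession, description = fields
    return {
        "gi": gi,
        "source": source,          # 'ref' marks a RefSeq record
        "accession": accession,
        "description": description.strip(),
    }


record = parse_ncbi_header(
    ">gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA"
)
print(record["accession"])    # NM_000518.4
print(record["description"])  # Homo sapiens hemoglobin, beta (HBB), mRNA
```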

Page 32: Biodatabases 101220022654-phpapp02

Diseased Individual

>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA

ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT

GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]

MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG

AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN

ALAHKYH
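The difference between the two protein records is easy to miss by eye. A position-by-position comparison finds it directly. The sketch below uses only the first 30 residues of each beta globin sequence from the two slides, and `point_differences` is a helper name introduced here for illustration.

```python
def point_differences(a: str, b: str):
    """Return (position, residue_in_a, residue_in_b) for each mismatch.

    Positions are 1-based. Both sequences must have the same length,
    so this handles substitutions only, not insertions or deletions.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [
        (pos, x, y)
        for pos, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


# First 30 residues of the beta globin records on the two slides.
healthy = "MVHLTPEEKSAVTALWGKVNVDEVGGEALG"
sickle = "MVHLTPVEKSAVTALWGKVNVDEVGGEALG"

print(point_differences(healthy, sickle))  # [(7, 'E', 'V')]
```

The single mismatch at residue 7 (counting the initial methionine) is the glutamic acid (E) to valine (V) change responsible for sickle cell anemia.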

Page 33: Biodatabases 101220022654-phpapp02

Disease Databases

Genes are involved in disease

Many diseases are well studied

Description of diseases and what is known about them is stored

A good place to start when you want to know about a certain disease

Linked to PubMed, the OMIM Morbid Map

OMIM - Online Mendelian Inheritance in Man
  “A catalog of human genes and genetic disorders maintained by Johns Hopkins University”
  http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

Page 34: Biodatabases 101220022654-phpapp02
Page 35: Biodatabases 101220022654-phpapp02

Putting it All Together

Each database contains specific information

Like biological systems themselves, these databases are interrelated

Page 36: Biodatabases 101220022654-phpapp02

[Diagram: interrelated database categories]

  GENOMIC DATA: GenBank, DDBJ, EMBL
  ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
  PROTEIN: PIR, SWISS-PROT
  STRUCTURE: PDB, MMDB, SCOP
  LITERATURE: PubMed
  PATHWAY: KEGG, COG
  DISEASE: LocusLink, OMIM, OMIA
  GENES: RefSeq, AllGenes, GDB
  SNPs: dbSNP
  ESTs: dbEST, UniGene
  MOTIFS: BLOCKS, Pfam, Prosite
  GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress

Page 37: Biodatabases 101220022654-phpapp02

Where to get started? NCBI ENTREZ

A search engine that provides access to, and links between, various databases

ENTREZ links: PubMed, GenBank, protein databases, Genomes, SNP, Taxonomy, OMIM

http://www.ncbi.nlm.nih.gov/sites/gquery
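Entrez can also be queried programmatically through NCBI's E-utilities web service, which is not covered on the slide. The sketch below only builds request URLs for the `esearch` and `efetch` endpoints and does not send them. The helper names `esearch_url` and `efetch_url` are made up here; the database names (`pubmed`, `nuccore`) and the `db`, `term`, `id`, `rettype`, and `retmode` parameters are standard E-utilities ones.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL: find record IDs in `db` matching `term`."""
    query = urlencode({"db": db, "term": term})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db: str, record_id: str,
               rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an EFetch URL: download one record from `db`."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Search PubMed for a disease, then fetch the HBB mRNA record as FASTA.
print(esearch_url("pubmed", "sickle cell anemia"))
print(efetch_url("nuccore", "NM_000518"))
```

The resulting URLs can be opened in a browser or passed to `urllib.request.urlopen`. Anyone scripting many requests should check NCBI's usage guidelines on request rates first.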

Page 38: Biodatabases 101220022654-phpapp02
