Page 1:

Overview of current biological databases

Qi Sun

Computational Biology Service Unit

Cornell University

Page 2:
Page 3:

Web Server Database Server

SOAP

HTTP

FTP

SQL

Platforms for Bioinformatics

Page 4:

Open source: Linux, Apache, MySQL, Perl/Python/PHP

Microsoft: Windows, ASP.NET, SQL Server, C#

Platforms for Bioinformatics

Page 5:

Archival database (GenBank, GenPept)

vs

Computer algorithm generated database (Unigene)

vs

Manually curated database (RefSeq)

Public Database - 1

NCBI Sequence Data Model

Page 6:

The NCBI Data Model

GenBank: a DNA-centered database

Page 7:

Identifier:

1. LOCUS (obsolete)
2. Accession (version)
3. GI
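
The accession (version) identifier is commonly written in the form accession.version. The following is a minimal sketch of splitting such a string into its two parts. The function name and the example value AB000001.2 are hypothetical and used only for illustration.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when the string
    carries no numeric version suffix.
    """
    accession, sep, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    return accession, int(version) if sep and version.isdigit() else None


# Hypothetical identifiers, for illustration only.
print(split_accession("AB000001.2"))
print(split_accession("AB000001"))
```

Keeping the version separate makes it possible to tell whether two records refer to the same accession at different revisions.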

Page 8:

Features

Page 9:

GenPept: a protein-centered database

Page 10:

FTP sites:

GenBank: ftp://ftp.ncbi.nih.gov/genbank/

GenPept: ftp://ftp.ncifcrf.gov/pub/genpept/
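
A directory on one of these FTP sites could be listed with Python's standard ftplib, roughly as sketched below. This assumes the server still accepts anonymous logins at the address shown on the slide, which may no longer be the case. The function names are hypothetical, and nothing connects to the network until list_ftp_directory is called.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url):
    """Return (host, path) for an ftp:// URL."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def list_ftp_directory(url, timeout=30):
    """List file names in the directory an ftp:// URL points to.

    Assumes anonymous login is permitted by the server.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)
        return ftp.nlst()


# Splitting the URL needs no network access.
print(split_ftp_url("ftp://ftp.ncbi.nih.gov/genbank/"))
```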

Page 11:

Problems with GenBank and GenPept

• They do not distinguish between sequence categories.

• There is a lot of redundancy.

• The same gene can be deposited into the database many times under different names.

• Different versions of the same gene can be submitted many times with different accession numbers.

• The features of a GenBank record can be chaotic.

Page 12:

Archival database (GenBank, GenPept)

vs

Computer algorithm generated database (Unigene)

vs

Curated database (RefSeq, Locuslink ...)

Public Database - 1

NCBI Sequence Databases

Page 13:

UniGene: a non-redundant set of gene-oriented clusters

GenBank mRNAs

GenBank genomic CDSs

dbEST ESTs

UniGene

Page 14:

Hs for human
Mm for mouse
Rn for rat
Bt for cow
Dr for zebrafish
Dm for fruitfly
Aga for mosquito
Xl for frog
At for cress
Hv for barley
Os for rice
Ta for wheat
Zm for maize

Unigene identifier

Examples:

Mm.213407

Hs.13303

At.138
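
A UniGene identifier of this form can be split into its species prefix and cluster number. The sketch below uses only the prefixes listed on this slide; the function and variable names are hypothetical.

```python
import re

# Species prefixes as listed on the slide.
SPECIES_PREFIXES = {
    "Hs": "human", "Mm": "mouse", "Rn": "rat", "Bt": "cow",
    "Dr": "zebrafish", "Dm": "fruitfly", "Aga": "mosquito",
    "Xl": "frog", "At": "cress", "Hv": "barley", "Os": "rice",
    "Ta": "wheat", "Zm": "maize",
}

# Letters, a dot, then the cluster number.
_UNIGENE_PATTERN = re.compile(r"^([A-Za-z]+)\.(\d+)$")


def parse_unigene_id(identifier):
    """Return (species, cluster_number) for an identifier such as 'Hs.13303'."""
    match = _UNIGENE_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a UniGene identifier: {identifier}")
    prefix, number = match.groups()
    if prefix not in SPECIES_PREFIXES:
        raise ValueError(f"unknown species prefix: {prefix}")
    return SPECIES_PREFIXES[prefix], int(number)


for example in ("Mm.213407", "Hs.13303", "At.138"):
    print(example, parse_unigene_id(example))
```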

Page 15:

Archival database (GenBank, GenPept)

vs

Computer generated database (Unigene)

vs

Curated database (RefSeq, Gene ...)

NCBI Sequence Databases

Public Database - 1

Page 16:

NCBI human genome annotation pipeline

RefSeq incorporates predicted transcript and protein sequences, experimentally identified mRNA sequences, and EST sequences.

Page 17:

Refseq Accession Numbers:

NT_123456 constructed genomic contigs

NM_123456 mRNAs

NP_123456 proteins

NC_123456 chromosomes

XM_123456 predicted mRNA

XP_123456 predicted protein
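
The two-letter prefix is enough to tell what kind of record a RefSeq accession refers to. A minimal sketch follows, limited to the six prefixes listed on this slide; the function name is hypothetical.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession such as 'NM_123456'."""
    prefix, sep, _ = accession.strip().partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognized RefSeq accession: {accession}")
    return REFSEQ_PREFIXES[prefix]


for acc in ("NM_123456", "XP_123456"):
    print(acc, "->", classify_refseq(acc))
```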

Page 18:

Genome sequence available

RefSeq acc: NP_123456, et al.

EST sequence available

UniGene acc: Hs.13303, et al.

GenBank acc: AP33493, et al.

RefSeq? UniGene? GenBank?

Page 19:

Go to the web

Page 20:

Files that you can download from the NCBI gene database

gene_info
gene2refseq
gene2go
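
Files like these can be read as plain tab-delimited tables. The sketch below assumes a header line that starts with "#" followed by one row per record. The column names (tax_id, GeneID, GO_ID) and every value in the sample are made up for illustration and should be checked against the downloaded file.

```python
import csv
import io


def read_tab_table(handle):
    """Yield one dict per data row of a tab-delimited table.

    The first non-empty line is treated as the header; a leading '#'
    on the header is stripped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = None
    for row in reader:
        if not row:
            continue
        if header is None:
            header = [name.lstrip("#").strip() for name in row]
            continue
        yield dict(zip(header, row))


def group_by_gene(rows, key="GeneID", value="GO_ID"):
    """Collect the values of one column under each gene identifier."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row[value])
    return grouped


# Hypothetical sample in the assumed layout (not real annotations).
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\n"
    "0\t1001\tGO:0000000\n"
    "0\t1001\tGO:0000001\n"
    "0\t1002\tGO:0000000\n"
)

annotations = group_by_gene(read_tab_table(io.StringIO(SAMPLE)))
print(annotations)
```

For a real download, the same functions could be applied to an opened file handle instead of the in-memory sample.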

Page 21:

NCBI Search engine

Entrez
• boolean operators “AND” “OR” “NOT”
• Entrez tags
• using limits
• MeSH terms

Batch Entrez

search by accession list
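
An Entrez query combines terms, field tags in square brackets, and the boolean operators listed above. The sketch below only assembles the query string and sends nothing to NCBI. The helper names and the example terms are hypothetical, and the field tags shown (Organism, Gene Name, Title) are assumed to be available in the database being searched.

```python
def tagged(term, field):
    """Attach an Entrez field tag to a term, e.g. human[Organism]."""
    return f"{term}[{field}]"


def combine(operator, *clauses):
    """Join clauses with AND, OR, or NOT, wrapped in parentheses."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


query = combine(
    "AND",
    tagged("human", "Organism"),
    combine("NOT", tagged("BRCA1", "Gene Name"), tagged("partial", "Title")),
)
print(query)
```

The explicit parentheses keep the grouping unambiguous when several operators are mixed in one query.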

Page 22:

Other Sequence Databases:

Genomic DNA: Ensembl genome annotation database (http://www.ensembl.org; HTTP, FTP, and MySQL interfaces)

Protein: UniProt (http://www.pir.uniprot.org/)

Page 23:

KEGG database: go to the web

Page 24:

Public Database - 2

GO: Gene Ontology

1. Molecular Function
2. Biological Process
3. Cellular Component

http://www.geneontology.org

Page 25:

Public Database - 2

Page 26:

Public Database - 2

Molecular Function 3674

Biological Process 8150

Cellular Component 5575

GO 3673
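
The numbers on this slide are the numeric parts of GO term identifiers, which are normally written as "GO:" followed by seven zero-padded digits (for example GO:0003674). A minimal sketch of converting between the two forms follows; the function names are hypothetical.

```python
# Numeric IDs of the three ontologies, as shown on the slide.
GO_ONTOLOGIES = {
    "Molecular Function": 3674,
    "Biological Process": 8150,
    "Cellular Component": 5575,
}


def format_go_id(number):
    """Format a numeric ID as a zero-padded GO identifier."""
    if not 0 <= number <= 9_999_999:
        raise ValueError(f"out of range for a GO identifier: {number}")
    return f"GO:{number:07d}"


def parse_go_id(go_id):
    """Return the numeric part of an identifier such as 'GO:0008150'."""
    prefix, _, digits = go_id.strip().partition(":")
    if prefix != "GO" or len(digits) != 7 or not digits.isdigit():
        raise ValueError(f"not a GO identifier: {go_id}")
    return int(digits)


for name, number in GO_ONTOLOGIES.items():
    print(name, format_go_id(number))
```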

Page 27:

GO Example 1:

Biological Process

Page 28:

GO Example 2:

Molecular Function

Page 29:

Smn: survival motor neuron
Gene ID: 39844

Gene Ontology Annotation

Page 30:

Public Database - 4

Species Specific Databases

• Arabidopsis – TAIR
• Yeast – SGD
• Fly – FLYBASE
• Worm – WORMBASE
• Mouse – MGD