Biodatabases
sreekanth-gali -
COT 6930 HPC and Bioinformatics
Bioinformatics Resources and Databases
sree
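[Figure: flow from gene (DNA) through transcription to RNA and through translation to protein and phenotype, with the matching resources at each stage: genomic DNA databases; cDNA, ESTs, UniGene; gene expression databases; protein sequence databases; protein structure databases.]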
Different transcripts can be related to the same gene!
EST
Expressed Sequence Tags
Partial copies of mRNA found within a particular cell
Can be used to identify genetic regions, splicing patterns of genes, etc.
Outline: Bioinformatics Databases
Primary databases (P) and derived databases (D)
Nucleotide databases: GenBank (P), EMBL-Bank (P)
Protein databases: Swiss-Prot (D), PIR-PSD (D), GenPept (D), TrEMBL (D), Protein Data Bank (P)
Other examples: RefSeq, UniGene, PubMed, SNP, OMIM
Bioinformatics Databases
Information stored:
  DNA sequences
  Conserved DNA domains
  Genomes
  Gene expression (ESTs, microarrays)
  Protein sequences
  Protein 3D structure
  Protein families
  Mutations / polymorphisms / SNPs
  Metabolic pathways
  Chemical compounds (ligands)
  Biomedical literature (journal papers, online books…)
Primary public domain bioinformatics servers
Public domain bioinformatics facilities, each providing databases and analysis tools:
  European Bioinformatics Institute (EBI), United Kingdom
  National Center for Biotechnology Information (NCBI), United States
  GenomeNet (KEGG & DDBJ), Japan
Major Databases
DNA sequences: GenBank, RefSeq, UniGene
Protein sequences: Swiss-Prot, PIR-PSD, GenPept, TrEMBL, RefSeq
Protein structure: Protein Data Bank (PDB)
Gene expression: Gene Expression Omnibus (GEO)
Biomedical publications: PubMed / MedLine
Bioinformatics Data Sources
Primary databases
  Original submissions by researchers
  Staff organizes information only
  Generally sequence oriented
  Examples: GenBank, PDB (Protein Data Bank)
Bioinformatics Data Sources
Derived databases
  Compiled from data in primary databases
  Manually curated (human selection & correction)
    Advantages: high quality
    Disadvantages: high expense, low volume
    Examples: Swiss-Prot, PIR-PSD, RefSeq
  Computational derivation (automatically generated)
    Advantages: inexpensive, up-to-date
    Disadvantages: lower quality
    Examples: GenPept, TrEMBL, UniGene
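The primary/derived split above can be written down as a small lookup structure. This is only an illustrative sketch: the `Database` class, the `CATALOG` list and the curation labels ("submitted", "manual", "computational") are names invented here, and the entries are limited to the examples named on these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    name: str
    kind: str       # "primary" or "derived"
    curation: str   # label invented for this sketch


# Only the examples listed on the slides.
CATALOG = [
    Database("GenBank", "primary", "submitted"),
    Database("PDB", "primary", "submitted"),
    Database("Swiss-Prot", "derived", "manual"),
    Database("PIR-PSD", "derived", "manual"),
    Database("RefSeq", "derived", "manual"),
    Database("GenPept", "derived", "computational"),
    Database("TrEMBL", "derived", "computational"),
    Database("UniGene", "derived", "computational"),
]


def names(kind, curation=None):
    """Return database names of a given kind, optionally filtered by curation."""
    return [
        d.name
        for d in CATALOG
        if d.kind == kind and (curation is None or d.curation == curation)
    ]


print(names("primary"))            # ['GenBank', 'PDB']
print(names("derived", "manual"))  # ['Swiss-Prot', 'PIR-PSD', 'RefSeq']
```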
Bioinformatic Databases – GenBank
“GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences”
Database type: nucleotide sequences, primary database
Current size (as of Aug. 2006): 65,369,091,950 bases (bp) in 61,132,599 sequences
Access to GenBank: available for searching at NCBI via several methods, such as BLAST search
http://www.ncbi.nlm.nih.gov/Genbank/
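As a quick sanity check on the size figures above, the average record length follows from dividing total bases by the number of sequences. The variable names are chosen here for illustration; the two numbers are the Aug. 2006 figures quoted on the slide.

```python
# Figures quoted on the slide (GenBank, Aug. 2006).
total_bases = 65_369_091_950
total_sequences = 61_132_599

average_length = total_bases / total_sequences
print(f"{average_length:.0f} bp per sequence on average")  # 1069 bp per sequence on average
```

An average of roughly one kilobase fits a collection that mixes many short records (such as ESTs) with far fewer long genomic entries.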
Bioinformatic Databases – GenBank
Types of submissions to the database
Genomic DNA: high-quality complete DNA sequence
mRNA / cDNA: partial or complete mRNA (or cDNA)
Expressed sequence tag (EST): a short sub-sequence of a transcribed, spliced nucleotide sequence (mRNA), 500-800 bp
  May represent portions of expressed genes
  Either protein-coding or not
  About 43 million ESTs are now available
Sequence tagged sites (STS): short DNA sequences unique in the genome
Genomic survey sequence (GSS): single-pass genomic DNA
Third-party annotations of GenBank sequences
Bioinformatic Databases – EMBL-Bank
Europe's primary nucleotide sequence resource
Database type: nucleotide sequences, primary database
http://www.ebi.ac.uk/embl/
Bioinformatic Databases – Proteins
Protein sequence databases
  Once derived from laboratory experiments
  Now mostly based on predicted ORFs from DNA
Manual curation: Swiss-Prot, PIR-PSD
Computational derivation: GenPept, TrEMBL
Bioinformatic Databases – Swiss-Prot, PIR-PSD
Database type: protein sequences, derived database
Manually curated (non-redundant, annotated)
Many annotations:
  Functions of the protein
  Domains and sites
  Secondary & quaternary structure
  Similarities to other proteins
  Variants
Swiss-Prot: http://ca.expasy.org/sprot/
PIR-PSD: http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml
Bioinformatic Databases – GenPept, TrEMBL
Database type: protein sequences, computationally derived database
Predicted coding sequences (CDS) translated from GenBank and EMBL (i.e., gene products)
GenPept
  Download: ftp://ftp.ncifcrf.gov/pub/genpept/
  Release 163 (as of 12/26/2007): 4,970,178 loci containing 1,517,599,916 residues
TrEMBL
  http://www.ebi.ac.uk/trembl/index.html
Structure Databases
3-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
3-D data is available due to techniques such as NMR and X-ray crystallography
Protein Data Bank
  Protein 3D structures
  Primary database (http://www.rcsb.org/pdb/home/home.do)
Protein Data Bank: PDB
Bioinformatic Databases – Connections
Bioinformatic Databases – RefSeq
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products.
Information derived from GenBank records
Database type: nucleotide & protein sequences, derived database
Human curated (non-redundant, cross-linked)
Data in RefSeq
  Genomic DNA
  mRNAs & proteins for known genes, gene models
  Entire chromosomes
  Multiple organisms
http://www.ncbi.nlm.nih.gov/projects/RefSeq/
Example: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_015325
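RefSeq accessions such as NP_015325 carry a two-letter prefix that indicates the molecule type. The sketch below maps a few common prefixes to a description. The table is deliberately partial, and the function name `describe_accession` is a hypothetical helper introduced here, not part of any NCBI tool.

```python
# Partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (e.g. chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (model)",
    "XP_": "predicted protein (model)",
}


def describe_accession(accession):
    """Classify an accession by its first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown or non-RefSeq prefix")


print(describe_accession("NP_015325"))    # protein
print(describe_accession("NM_000518.4"))  # mRNA
```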
Bioinformatic Databases – UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location
Database type: nucleotide sequences, computationally derived database
Partitioned into non-redundant gene-oriented clusters (gene-oriented view)
Data in UniGene: clusters of genomic DNA & ESTs, multiple organisms
http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
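To make the idea of partitioning sequences into gene-oriented clusters concrete, here is a toy sketch. It is not UniGene's actual procedure: it simply groups sequences that share any exact 8-base substring, using union-find. The sequence names and the unrelated third sequence are made up for illustration.

```python
from itertools import combinations


def shares_kmer(a, b, k=8):
    """True if sequences a and b have at least one exact k-mer in common."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def cluster(seqs, k=8):
    """Group sequence names into clusters linked by shared k-mers (union-find)."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if shares_kmer(seqs[a], seqs[b], k):
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Toy reads: est1 and est2 overlap, est3 is unrelated (all hypothetical).
reads = {
    "est1": "ATGGTGCATCTGACTCCTGAGGAG",
    "est2": "CTGACTCCTGAGGAGAAGTCTGCC",
    "est3": "TTTTCCCCAAAAGGGGTTTTCCCC",
}
print(cluster(reads))  # [['est1', 'est2'], ['est3']]
```

Real clustering pipelines rely on alignment and tolerate sequencing errors; exact k-mer sharing is used here only to keep the example short.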
Bioinformatic Databases – PubMed
Database type: biomedical papers, manually curated database
Service of the National Library of Medicine
MEDLINE publication database: over 17,000 journals, 15 million citations since 1950
http://www.ncbi.nlm.nih.gov/PubMed/
Bioinformatic Databases – Others
Gene expression: ArrayExpress, Gene Expression Omnibus (GEO)
Multi-organism genomes: Entrez Genome, HomoloGene, COGs, TIGR
Genetic variation & genetic diseases: dbSNP, OMIM, CGAP
Metabolic pathways: WIT, KEGG
Many more… Listed in the journal “Nucleic Acids Research” each January
Bioinformatic Databases: SNP Database
Single Nucleotide Polymorphisms (SNPs)
A single-base difference at one position between two individuals of the same species
Play an important role in differentiation and disease
http://www.ncbi.nlm.nih.gov/projects/SNP/
Sickle Cell Anemia
Caused by a single substitution of T for A (codon GAG → GTG), so that valine is inserted instead of glutamic acid in the hemoglobin beta chain
Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
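The effect of this single-base change can be checked by translating the start of the beta-globin coding sequence. The sketch below uses the first nine codons of the HBB coding region (beginning at the ATG start codon) and a variant written here with the A to T change in the seventh codon. The codon table is deliberately partial and covers only the codons that occur in this fragment.

```python
# Partial codon table: only the codons present in the fragment below.
CODON_TABLE = {
    "ATG": "M", "GTG": "V", "CAT": "H", "CTG": "L",
    "ACT": "T", "CCT": "P", "GAG": "E", "AAG": "K",
}


def translate(cds):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


normal = "ATGGTGCATCTGACTCCTGAGGAGAAG"  # ... CCT GAG GAG AAG
sickle = "ATGGTGCATCTGACTCCTGTGGAGAAG"  # ... CCT GTG GAG AAG (A -> T)

print(translate(normal))  # MVHLTPEEK
print(translate(sickle))  # MVHLTPVEK
```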
Healthy Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
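The records above are in FASTA format: a header line starting with ">" followed by sequence lines. A minimal sketch for splitting the NCBI-style header into its identifiers and description is shown below; `parse_header` is a hypothetical helper that assumes the "tag|value|tag|value|" layout seen in these headers.

```python
def parse_header(line):
    """Split an NCBI-style FASTA header into ({tag: value}, description)."""
    fields, _, description = line.lstrip(">").partition(" ")
    parts = fields.strip("|").split("|")
    ids = dict(zip(parts[0::2], parts[1::2]))  # pair each tag with its value
    return ids, description.strip()


header = ">gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA"
ids, description = parse_header(header)
print(ids)          # {'gi': '28302128', 'ref': 'NM_000518.4'}
print(description)  # Homo sapiens hemoglobin, beta (HBB), mRNA
```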
Diseased Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
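The healthy and diseased protein sequences differ at a single position, which a position-wise comparison makes explicit. The sketch below compares the first 26 residues of the two beta-globin sequences shown above. The function name `differences` is introduced here for illustration, and positions are counted from 1 including the initial methionine.

```python
def differences(reference, variant):
    """Return (position, reference_residue, variant_residue) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length for a position-wise comparison")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(reference, variant), start=1)
        if a != b
    ]


healthy = "MVHLTPEEKSAVTALWGKVNVDEVGG"
diseased = "MVHLTPVEKSAVTALWGKVNVDEVGG"
print(differences(healthy, diseased))  # [(7, 'E', 'V')]
```

The single mismatch is glutamic acid (E) replaced by valine (V). Sequences of different lengths would need an alignment step first, which this sketch does not attempt.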
Disease Databases
Genes are involved in disease
Many diseases are well studied
Descriptions of diseases and what is known about them are stored
A good place to start when you want to know about a certain disease
Linked to PubMed, the OMIM Morbid Map
OMIM - Online Mendelian Inheritance in Man
“A catalog of human genes and genetic disorders maintained by Johns Hopkins University”
http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
Putting it All Together
Each Database contains specific information
Like other biological systems, these databases are also interrelated
GENOMIC DATA: GenBank, DDBJ, EMBL
ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
PROTEIN: PIR, SWISS-PROT
STRUCTURE: PDB, MMDB, SCOP
LITERATURE: PubMed
PATHWAY: KEGG, COG
DISEASE: LocusLink, OMIM, OMIA
GENES: RefSeq, AllGenes, GDB
SNPs: dbSNP
ESTs: dbEST, UniGene
MOTIFS: BLOCKS, Pfam, Prosite
GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress
Where to get started? NCBI ENTREZ
A search engine that provides access and links between various databases
ENTREZ covers: PubMed, GenBank, protein databases, Genomes, SNP, Taxonomy, OMIM
http://www.ncbi.nlm.nih.gov/sites/gquery
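Entrez can also be queried programmatically through NCBI's E-utilities web service, which these slides do not cover. The sketch below only builds a search URL for the `esearch.fcgi` endpoint with its `db`, `term` and `retmax` parameters, and sends no request. The helper name `esearch_url` and the example query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an addition to the slides, see lead-in).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database; no network access here."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


print(esearch_url("pubmed", "sickle cell anemia"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=sickle+cell+anemia&retmax=20
```

Fetching the URL returns a list of matching record identifiers; anyone running queries in bulk should consult NCBI's usage guidelines first.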