2010.09-28 IST Computational Biology 1
Information Retrieval Biological Databases 2
Pedro Fernandes
Instituto Gulbenkian de Ciência, Oeiras PT
2010.09-28 IST Computational Biology 2
Sizing Biological Information
This week (20 Sept. 2010) the EMBL Database contained 298 × 10^9 nucleotides in 195,945,264 entries.
2010.09-28 IST Computational Biology 3
Sizing Biological Information
Release 2010_09 of 10-Aug-10 of UniProtKB/Swiss-Prot contains 519348 sequence entries, comprising 183273162 amino acids abstracted from 191032 references.
998 sequences have been added since release 2010_08, the sequence data of 160 existing entries has been updated and the annotations of 480770 entries have been revised.
Protein existence (PE)          entries       %
Evidence at protein level         70514   13.6%
Evidence at transcript level      67192   12.9%
Inferred from homology           365712   70.4%
Predicted                         14317    2.8%
Uncertain                          1613    0.3%
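The percentages in the protein existence breakdown follow directly from the entry counts. A minimal Python sketch, using only the figures quoted above, recomputes them; the variable names are illustrative and not part of any UniProt tool.

```python
# Entry counts per protein existence (PE) level, as quoted for release 2010_09.
pe_counts = {
    "Evidence at protein level": 70514,
    "Evidence at transcript level": 67192,
    "Inferred from homology": 365712,
    "Predicted": 14317,
    "Uncertain": 1613,
}

# The sum of the five categories should equal the total number of entries.
total = sum(pe_counts.values())

# Share of each PE level in percent, rounded to one decimal place.
pe_percent = {level: round(100 * n / total, 1) for level, n in pe_counts.items()}

for level, n in pe_counts.items():
    print(f"{level:<30}{n:>8}{pe_percent[level]:>7.1f}%")
```

The total computed this way can be compared with the 519348 sequence entries stated for the release, which is a quick consistency check on the table.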
2010.09-28 IST Computational Biology 6
Protein Structures
Method                 Structures
X-RAY                       59072
NMR                          8588
ELECTRON MICROSCOPY           306
HYBRID                         26
other                         147
Total                       68139
Source: RCSB PDB
[Chart: vertical axis 0 to 80000, horizontal axis 1980 to 2010]
2010.09-28 IST Computational Biology 7
Data deluge, where from
Sequencing (NGS, SMS)
Microarray experiments
Parallelized drug screening and testing
Other
2010.09-28 IST Computational Biology 8
Gene Ontology – towards consistent descriptions
The need to produce consistent, effective searches
Uniform terminology
Controlled vocabulary
Hierarchical relations
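Hierarchical relations are what let a search for a general term also return entries annotated with more specific terms. The sketch below illustrates this on a toy vocabulary; the four terms and their links are a hand-made fragment chosen for illustration, not a copy of the actual Gene Ontology, and the function names are hypothetical.

```python
# Toy fragment of a controlled vocabulary: each term maps to its parent terms.
# A term may have several parents, so the structure is a directed acyclic graph.
IS_A = {
    "biological process": [],
    "metabolic process": ["biological process"],
    "cellular process": ["biological process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    found = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a[parent])
    return found

def matches(annotation, query, is_a=IS_A):
    """True if an entry annotated with `annotation` should be returned for `query`."""
    return annotation == query or query in ancestors(annotation, is_a)
```

With this fragment, an entry annotated "cellular metabolic process" is returned by a query for "metabolic process", while an entry annotated "metabolic process" is not returned by a query for "cellular process".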
2010.09-28 IST Computational Biology 9
Gene Ontology
2010.09-28 IST Computational Biology 10
Specialized Search tools
Searching on specific fields is relatively easy
Using keywords allows indexed searching on text fields
Searching sequence data is more complex
Similarity search:
BLAST is a fast way of searching sequence data for similarity
Some databases of nucleotide or protein sequences are formatted for BLAST
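To give a feel for why a sequence database is pre-formatted before searching, the following sketch indexes short words (k-mers) and ranks database sequences by how many words they share with a query. This is only loosely inspired by word-based seeding and is not BLAST: it does no alignment extension and no statistical scoring. The sequences, identifiers and function names are all made up for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every overlapping word of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(database, k):
    """Map each k-mer to the set of sequence identifiers that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for word in kmers(seq, k):
            index[word].add(seq_id)
    return index

def rank_hits(query, index, k):
    """Count distinct shared k-mers per database sequence, best candidates first."""
    scores = defaultdict(int)
    for word in set(kmers(query, k)):
        for seq_id in index.get(word, ()):
            scores[seq_id] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy database; identifiers and sequences are invented.
database = {
    "seq1": "ATGGCGTACGTTAGC",
    "seq2": "TTTTTTTTTTTTTTT",
    "seq3": "GGCGTACGAAAAAAA",
}
index = build_index(database, k=4)      # done once, like formatting a database
hits = rank_hits("GCGTACGT", index, k=4)  # each query only consults the index
```

The index is built once and reused for every query, which is the point of formatting a database in advance.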
2010.09-28 IST Computational Biology 11
Interoperability
Adherence to standards
Minimal experiment descriptions
Ontological concerns
Integration
Warehousing
2010.09-28 IST Computational Biology 12
Bibliography DBs
PubMed (Medline)
“Entrez” searching
Data mining in text
Tagged text to avoid loss (Utopia Documents)
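Entrez searches of PubMed can also be issued programmatically through the NCBI E-utilities. The sketch below only builds an ESearch URL and does not send it; the query string and its field tags are examples, and the helper name `pubmed_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL for a PubMed query without sending it."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

# Example query: a title word combined with a publication year.
url = pubmed_search_url("hemoglobin[Title] AND 2010[pdat]")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect and log before it is submitted.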
2010.09-28 IST Computational Biology 13
Medical Subject Headings
Part of the NLM/PubMed effort.
MeSH is a searchable database.
Controlled vocabulary
Disambiguation
Term relationships
Spelling: Hemoglobin or Haemoglobin?
Context: NMR spectroscopy or imaging?
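A controlled vocabulary resolves such spelling and context variants by mapping every entry term to a single preferred heading. The sketch below shows the idea with a hand-written mini-thesaurus; the table is an illustrative assumption, not an extract of the MeSH database, and `preferred_heading` is a hypothetical helper.

```python
# Hypothetical mini-thesaurus: each entry term points to one preferred heading.
ENTRY_TERMS = {
    "hemoglobin": "Hemoglobins",
    "haemoglobin": "Hemoglobins",
    "nmr spectroscopy": "Magnetic Resonance Spectroscopy",
    "nmr imaging": "Magnetic Resonance Imaging",
}

def preferred_heading(term, thesaurus=ENTRY_TERMS):
    """Map a free-text term to its preferred heading, or None if unknown."""
    key = " ".join(term.lower().split())  # normalise case and whitespace
    return thesaurus.get(key)
```

Both spellings of hemoglobin resolve to the same heading, while the bare term "NMR" resolves to nothing, which signals that the user has to choose between the spectroscopy and imaging contexts.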
2010.09-28 IST Computational Biology 14
More on bibliography
Web of Knowledge
b-on
Institutional repositories
PubCrawler (alerts) http://www.pubcrawler.ie
2010.09-28 IST Computational Biology 15
Structural Protein DBs
Primary
Coordinates from X-ray diffraction, NMR, etc.
Composition from UniProtKB
Properties from annotations
2010.09-28 IST Computational Biology 16
Specialized DBs
Binding sites
SNPs
2010.09-28 IST Computational Biology 17
Classification of Proteins
CATH: Classification, Architecture, Topology, Homology
http://www.biochem.ucl.ac.uk/bsm/cath_new/
SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
2010.09-28 IST Computational Biology 18
Integrated DBs
Built to aggregate other databases
Provide common search
Calculate cross-linking tables
InterPro http://www.ebi.ac.uk/interpro
Integrates results from several derivative databases such as PRINTS, PROSITE, SMART, ProDom, Pfam and TIGRFAMs
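A cross-linking table can be pictured as a regrouping of per-database hits by protein. The following sketch assumes each member database reports (database, signature, protein) triples; all identifiers are invented, and this is not how InterPro itself is implemented.

```python
from collections import defaultdict

# Hypothetical member-database hits: (database, signature id, protein id).
# All identifiers are invented for illustration.
member_hits = [
    ("Pfam", "PF_A", "P1"),
    ("Pfam", "PF_A", "P2"),
    ("PROSITE", "PS_B", "P1"),
    ("SMART", "SM_C", "P3"),
]

def cross_link(hits):
    """Build a protein -> {database: [signatures]} cross-linking table."""
    table = defaultdict(lambda: defaultdict(list))
    for database, signature, protein in hits:
        table[protein][database].append(signature)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {protein: dict(dbs) for protein, dbs in table.items()}

links = cross_link(member_hits)
```

Once the table exists, a single lookup on a protein identifier returns its signatures across all member databases, which is what a common search interface needs.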
2010.09-28 IST Computational Biology 19
Knowledge bases
UniProt (Swiss-Prot/PIR/TrEMBL)
Ensembl (genome centered)
GeneCards (gene centered)
2010.09-28 IST Computational Biology 20
GeneCards
2010.09-28 IST Computational Biology 22
GeneCards – expression data
2010.09-28 IST Computational Biology 23
Clinical
OMIM: Mendelian inheritance, human diseases
HGMD: mutations and associated human diseases
dbSNP: SNPs with >1% incidence
2010.09-28 IST Computational Biology 24
The synchronization issue
Many copies of public databases (version control)
Content update on primary and derived databases influences integration
Inconsistencies are slow to resolve
Indexes need frequent recalculation
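One simple way to detect the synchronization problem is to compare the release tag of each local copy with that of the primary source. The sketch below assumes every site records a release tag per database; the site names, the database "OtherDB" and its tags are hypothetical.

```python
# Release tags at the primary sources ("OtherDB" and its tags are invented).
primary = {"UniProtKB": "2010_09", "OtherDB": "v2"}

# Release tags recorded by two hypothetical local mirrors.
mirrors = {
    "site_a": {"UniProtKB": "2010_09", "OtherDB": "v1"},
    "site_b": {"UniProtKB": "2010_08", "OtherDB": "v2"},
}

def stale_copies(primary, mirrors):
    """List (mirror, database, local release, current release) for outdated copies."""
    stale = []
    for site, releases in sorted(mirrors.items()):
        for database, current in sorted(primary.items()):
            local = releases.get(database)  # None if the mirror lacks the database
            if local != current:
                stale.append((site, database, local, current))
    return stale

outdated = stale_copies(primary, mirrors)
```

Any copy reported here would need updating, and indexes or cross-linking tables derived from it would need recalculation.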
2010.09-28 IST Computational Biology 25
Purifying content
Efforts are in place to enhance contents of derived databases
For example, manual curation of genomic databases in specific sectors, such as eukaryotes, human, plants, etc.
2010.09-28 IST Computational Biology 26
HAVANA
Manual annotation of the human genome, chromosome by chromosome.
2010.09-28 IST Computational Biology 27
ENCODE
Project to review functional parts of the human genome in fine detail