2010.09-28 IST Computational Biology 1

Information Retrieval Biological Databases 2

Pedro Fernandes

Instituto Gulbenkian de Ciência, Oeiras PT


Sizing Biological Information

This week (20 Sept. 2010) the EMBL Database contained 298 × 10^9 nucleotides in 195,945,264 entries.



Release 2010_09 of 10-Aug-10 of UniProtKB/Swiss-Prot contains 519348 sequence entries, comprising 183273162 amino acids abstracted from 191032 references.

998 sequences have been added since release 2010_08, the sequence data of 160 existing entries has been updated and the annotations of 480770 entries have been revised.

Protein existence (PE)          entries       %
Evidence at protein level         70514   13.6%
Evidence at transcript level      67192   12.9%
Inferred from homology           365712   70.4%
Predicted                         14317    2.8%
Uncertain                          1613    0.3%
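As a quick consistency check on the table above, the percentages can be recomputed from the entry counts. This is a minimal sketch: the counts are copied from the release notes quoted on this slide, while the variable and function names are chosen here for illustration.

```python
# Entry counts per protein existence (PE) level, copied from the slide.
pe_counts = {
    "Evidence at protein level": 70514,
    "Evidence at transcript level": 67192,
    "Inferred from homology": 365712,
    "Predicted": 14317,
    "Uncertain": 1613,
}

# The five categories should add up to the 519348 entries of the release.
total = sum(pe_counts.values())


def pe_percentages(counts):
    """Return each category's share of the total, in percent, to one decimal."""
    n = sum(counts.values())
    return {name: round(100 * c / n, 1) for name, c in counts.items()}


for name, pct in pe_percentages(pe_counts).items():
    print(f"{name:30s} {pe_counts[name]:>7d} {pct:5.1f}%")
```

The counts sum to the release total, and only about one entry in seven has evidence at the protein level.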


Protein Structures

Method                 Structures
X-RAY                       59072
NMR                          8588
ELECTRON MICROSCOPY           306
HYBRID                         26
other                         147
Total                       68139

Source: RCSB PDB

[Figure: chart from the RCSB PDB; vertical axis from 0 to 80000, horizontal axis from 1980 to 2010]


Data deluge, where from

Sequencing (NGS, SMS)

Microarray experiments

Parallelized drug screening and testing

Other


Gene Ontology – towards consistent descriptions

The need to produce consistent, effective searches

Uniform terminology

Controlled vocabulary

Hierarchical relations
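To illustrate why hierarchical relations make searches more effective, here is a minimal sketch of a controlled vocabulary with "is_a" links. The term identifiers, term names and gene annotations are hypothetical toy values, not real Gene Ontology terms, and the data layout is an assumption made for this example.

```python
# Toy controlled vocabulary: identifiers and names are hypothetical.
TERMS = {
    "T:1": "biological process",
    "T:2": "metabolic process",
    "T:3": "protein metabolic process",
    "T:4": "transport",
}

# Direct parents of each term ("is_a" relations); a term may have several.
IS_A = {
    "T:1": [],
    "T:2": ["T:1"],
    "T:3": ["T:2"],
    "T:4": ["T:1"],
}


def ancestors(term_id):
    """Return all terms reachable by following is_a links upward."""
    seen = set()
    stack = list(IS_A[term_id])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A[parent])
    return seen


def annotated_with(annotations, query_term):
    """Return genes annotated with query_term or with any more specific term."""
    return sorted(
        gene
        for gene, term in annotations.items()
        if term == query_term or query_term in ancestors(term)
    )


# Hypothetical annotations: one term per gene, to keep the sketch short.
annotations = {"geneA": "T:3", "geneB": "T:4", "geneC": "T:2"}
print(annotated_with(annotations, "T:2"))
```

A search for the general term also returns the gene annotated with the more specific child term, which a plain keyword match would miss.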


Gene Ontology


Specialized Search tools

Searching on specific fields is relatively easy

Using keywords allows indexed searching on text fields

Searching sequence data is more complex

Similarity search:

BLAST is a fast way of searching sequence data for similarity

Some databases of nucleotide or protein sequences are formatted for BLAST
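One reason a sequence database is pre-formatted before searching is that short words can be indexed in advance. The sketch below shows only that word-indexing idea on toy data; it is a simplification for illustration and not the BLAST algorithm itself, which also extends and scores the matches. The sequences, function names and word length are assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def seed_hits(query, index, k):
    """Count how many word matches the query has in each indexed sequence."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id, _offset in index.get(query[i:i + k], []):
            counts[seq_id] += 1
    return dict(counts)


# Hypothetical toy database of two short nucleotide sequences.
database = {"seq1": "ACGTACGTGA", "seq2": "TTTTGGGGCC"}
index = build_kmer_index(database, k=4)
print(seed_hits("CGTACG", index, k=4))
```

The index is built once and reused for every query, so each lookup avoids scanning the whole database.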


Interoperability

Adherence to standards

Minimal experiment descriptions

Ontological concerns

Integration

Warehousing


Bibliography DBs

PubMed (MEDLINE)

“Entrez” searching

Data mining in text

Tagged text to avoid loss (Utopia Documents).
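Entrez searches can also be issued programmatically through the NCBI E-utilities. The sketch below only builds an ESearch request URL and does not contact the server; the helper name and the example query are assumptions made for illustration, and current NCBI usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL that would return up to retmax PubMed identifiers."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical query restricted to the title field.
print(pubmed_search_url("hemoglobin[Title]", retmax=5))
```

Field tags such as `[Title]` are the programmatic counterpart of searching on specific fields in the web interface.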


Medical Subject Headings

Part of the NLM/PubMed effort.

MeSH is a searchable database.

Controlled vocabulary

Disambiguation

Term relationships

Spelling: Hemoglobin or Haemoglobin?

Context: NMR spectroscopy or imaging?
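The two questions above can be sketched with a small lookup table that maps what a user types to preferred headings. The table below is hand-made for illustration and is not extracted from MeSH; the function name is also an assumption.

```python
# Hand-made illustration: entry terms mapped to candidate preferred headings.
ENTRY_TERMS = {
    "hemoglobin": ["Hemoglobins"],
    "haemoglobin": ["Hemoglobins"],
    # An ambiguous entry term lists every candidate; context must decide.
    "nmr": ["Magnetic Resonance Spectroscopy", "Magnetic Resonance Imaging"],
}


def resolve(term):
    """Return candidate preferred headings for a user-typed term."""
    return ENTRY_TERMS.get(term.strip().lower(), [])


print(resolve("Haemoglobin"))
print(resolve("NMR"))
```

Both spellings resolve to the same heading, while the ambiguous term returns two candidates that a search interface would need to ask about.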


More on bibliography

Web of Knowledge

b-on

Institutional repositories

PubCrawler (alerts) http://www.pubcrawler.ie


Structural Protein DBs

Primary

Coordinates from X-ray diffraction, NMR, etc.

Composition from UniProtKB

Properties from annotations


Specialized DBs

Binding sites

SNPs


Classification of Proteins

CATH: Class, Architecture, Topology, Homologous superfamily
http://www.biochem.ucl.ac.uk/bsm/cath_new/

SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/


Integrated DBs

Built to aggregate other databases

Provide common search

Calculate cross-linking tables

InterPro http://www.ebi.ac.uk/interpro

Results from integrating several derivative databases such as PRINTS, PROSITE, SMART, ProDom, Pfam and TIGRFAMs
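A cross-linking table can be pictured as a merge of per-database results under a shared key. The sketch below assumes each member database reports its hits keyed by protein accession; the accessions, signature identifiers and function name are hypothetical and do not reflect how InterPro is actually built.

```python
from collections import defaultdict


def build_cross_links(sources):
    """Merge per-database hits into one table keyed by protein accession.

    sources maps a database name to {protein accession: [signature ids]}.
    """
    table = defaultdict(dict)
    for db_name, hits in sources.items():
        for accession, signatures in hits.items():
            table[accession][db_name] = sorted(signatures)
    return dict(table)


# Hypothetical accessions and signature identifiers, for illustration only.
sources = {
    "Pfam": {"PROT_A": ["PF_x1"], "PROT_B": ["PF_x2"]},
    "PROSITE": {"PROT_A": ["PS_y1", "PS_y0"]},
}

cross_links = build_cross_links(sources)
print(cross_links["PROT_A"])
```

With such a table, one query on a protein accession returns the matching entries from every member database at once.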


Knowledge bases

UniProt (Swiss-Prot/PIR/TrEMBL)

Ensembl (genome centered)

GeneCards (gene centered)


GeneCards


GeneCards – expression data


Clinical

OMIM: Mendelian inheritance, human diseases

HGMD: mutations and associated human diseases

dbSNP: SNPs with >1% incidence


The synchronization issue

Many copies of public databases (version control)

Content updates to primary and derived databases affect integration

Inconsistencies are slow to resolve

Indexes need frequent recalculation
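One simple way to detect that a local copy has drifted from the primary database is to compare content fingerprints. This is a minimal sketch under the assumption that both files can be read in full; the mirror names and file contents are hypothetical, and real mirrors would more likely compare release tags or published checksums.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that changes whenever the content changes."""
    return hashlib.sha256(content).hexdigest()


def stale_copies(primary: bytes, copies: dict) -> list:
    """Return the names of copies whose content differs from the primary."""
    reference = fingerprint(primary)
    return sorted(
        name for name, content in copies.items()
        if fingerprint(content) != reference
    )


# Hypothetical stand-ins for database flat files.
primary_release = b"release 2010_09 flat file contents"
copies = {
    "mirror_a": b"release 2010_09 flat file contents",
    "mirror_b": b"release 2010_08 flat file contents",
}
print(stale_copies(primary_release, copies))
```

Any copy flagged this way would also need its indexes recalculated after the update.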


Purifying content

Efforts are in place to enhance contents of derived databases

For example, manual curation of genomic databases in specific sectors, such as eukaryotes, human, plants, etc.


HAVANA

Manual annotation by chromosome in human genome.


ENCODE

Project to review functional parts of the human genome in fine detail
