2010.09-28 IST Computational Biology 1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT

Information Retrieval Biological Databases 2

Pedro Fernandes

Instituto Gulbenkian de Ciência, Oeiras PT


Sizing Biological Information

This week (20 Sept. 2010) the EMBL Database contained 298 × 10^9 nucleotides in 195,945,264 entries.


Release 2010_09 of 10-Aug-10 of UniProtKB/Swiss-Prot contains 519348 sequence entries, comprising 183273162 amino acids abstracted from 191032 references.

998 sequences have been added since release 2010_08, the sequence data of 160 existing entries has been updated and the annotations of 480770 entries have been revised.

Protein existence (PE)          entries       %
Evidence at protein level         70514   13.6%
Evidence at transcript level      67192   12.9%
Inferred from homology           365712   70.4%
Predicted                         14317    2.8%
Uncertain                          1613    0.3%
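The percentages in the protein existence breakdown follow directly from the entry counts. The short Python check below re-derives them using only the figures quoted for release 2010_09; the variable and function names are ours, chosen for illustration.

```python
# Figures quoted for UniProtKB/Swiss-Prot release 2010_09.
TOTAL_ENTRIES = 519348
TOTAL_AMINO_ACIDS = 183273162

pe_counts = {
    "Evidence at protein level": 70514,
    "Evidence at transcript level": 67192,
    "Inferred from homology": 365712,
    "Predicted": 14317,
    "Uncertain": 1613,
}


def percentages(counts, total):
    """Return each category's share of total, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


pe_percent = percentages(pe_counts, TOTAL_ENTRIES)

# Average entry length, derived from the two totals above.
mean_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES

if __name__ == "__main__":
    for name, pct in pe_percent.items():
        print(f"{name:30s} {pe_counts[name]:>7d} {pct:5.1f}%")
    print(f"Mean sequence length: {mean_length:.1f} amino acids")
```

The five categories add up to the total number of entries, which is a quick way to confirm that no row was lost in transcription.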


Protein Structures

X-RAY                   59072
NMR                      8588
ELECTRON MICROSCOPY       306
HYBRID                     26
other                     147
Total                   68139

RCSB PDB

[Chart: vertical axis from 0 to 80000, horizontal axis from 1980 to 2010]


Data deluge, where from

Sequencing (NGS, SMS)

Microarray experiments

Parallelized drug screening and testing

Other


Gene Ontology – towards consistent descriptions

The need to produce consistent effective searches

Uniform terminology

Controlled vocabulary

Hierarchical relations
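A small sketch of why hierarchical relations matter for searching: if terms are linked by "is a" relations, a search for a general term can also return items annotated with more specific terms. The term identifiers, term names and gene names below are invented for illustration and are not real Gene Ontology records.

```python
# Toy controlled vocabulary. Identifiers, names and annotations are hypothetical.
NAMES = {
    "T:0001": "metabolic process",
    "T:0002": "carbohydrate metabolism",
    "T:0003": "glucose catabolism",
    "T:0004": "lipid metabolism",
}

# child term -> list of parent terms ("is a" relations)
IS_A = {
    "T:0001": [],
    "T:0002": ["T:0001"],
    "T:0003": ["T:0002"],
    "T:0004": ["T:0001"],
}

# gene -> set of terms it is annotated with
ANNOTATIONS = {
    "geneA": {"T:0003"},
    "geneB": {"T:0004"},
    "geneC": {"T:0002"},
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from term by following is_a links upwards."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def genes_for(term, annotations=ANNOTATIONS):
    """Genes annotated with term itself or with any more specific term."""
    return sorted(
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )
```

With this toy data, searching for "carbohydrate metabolism" (T:0002) also finds geneA, which is annotated only with the more specific "glucose catabolism" (T:0003).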


Specialized Search tools

Searching on specific fields is relatively easy

Using keywords allows indexed searching on text fields

Searching sequence data is more complex

Similarity search:

BLAST is a fast way of searching sequence data for similarity

Some databases of nucleotide or protein sequences are formatted for BLAST
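To illustrate why a sequence database is pre-formatted before similarity searching, the sketch below builds an index of short words (k-mers) once and then ranks database sequences by how many words they share with a query. This is only the word-lookup idea; it is not BLAST, which goes on to extend and score the matches. The sequences, names and word size are made up for the example.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def rank_by_shared_kmers(query, index, k):
    """Count distinct query words found in each sequence, best match first."""
    hits = defaultdict(int)
    seen_words = set()
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        if word in seen_words:
            continue  # count each distinct word only once
        seen_words.add(word)
        for name in index.get(word, ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database; built once, queried many times.
DATABASE = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTTTGGGGCC",
    "seq3": "ACGTTTGACA",
}
K = 4
INDEX = build_kmer_index(DATABASE, K)
ranking = rank_by_shared_kmers("ACGTAC", INDEX, K)
```

The index is the expensive part and depends only on the database, which is why it is worth computing in advance and reusing for every query.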

Page 11: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 11

Interoperability

Adherence to standards

Minimal experiment descriptions

Ontological concerns

Integration

Warehousing

Page 12: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 12

Bibliography DBs

PubMed (MEDLINE)

“Entrez” searching

Data mining in text

Tagged text to avoid loss (Utopia Documents).
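Entrez searches can also be run programmatically. The slides do not say how, so the sketch below assumes the NCBI E-utilities ESearch endpoint and only builds the request URL; it does not contact the server. The function name and the example query are ours.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, max_results=20):
    """Return an ESearch URL for a PubMed query written in Entrez syntax."""
    params = {
        "db": "pubmed",         # which Entrez database to search
        "term": term,           # query, may contain field tags such as [Title]
        "retmax": max_results,  # maximum number of identifiers to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Field tags restrict a word to one indexed field.
url = build_pubmed_query("hemoglobin[Title] AND 2010[pdat]", max_results=5)
```

Fetching `url` (for example with `urllib.request.urlopen`) would return the matching PubMed identifiers; check NCBI's usage guidelines before sending requests in bulk.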

Page 13: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 13

Medical Subject Headings

Part of the NLM/PubMed effort.

MeSH is a searchable database.

Controlled vocabulary

Disambiguation

Term relationships

Spelling: Hemoglobin or Haemoglobin?

Context: NMR spectroscopy or imaging?
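The two questions above can be sketched as a lookup from free-text terms to one preferred heading. The table below is a hand-made illustration, not an extract of MeSH; a real application would query MeSH itself.

```python
# Illustrative entry-term table (assumption: not copied from MeSH).
ENTRY_TERMS = {
    "hemoglobin": "Hemoglobins",
    "haemoglobin": "Hemoglobins",
    "nmr spectroscopy": "Magnetic Resonance Spectroscopy",
    "nmr imaging": "Magnetic Resonance Imaging",
}


def preferred_heading(term, entry_terms=ENTRY_TERMS):
    """Map a user's term to its preferred heading, or None if unknown or ambiguous."""
    key = " ".join(term.lower().split())  # normalise case and spacing
    return entry_terms.get(key)
```

Both spellings of hemoglobin resolve to the same heading, so one search covers both. A bare "NMR" returns `None` here, reflecting that the term needs context (spectroscopy or imaging) before it can be mapped.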

Page 14: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 14

More on bibliography

Web of Knowledge

b-on

Institutional repositories

PubCrawler (alerts) http://www.pubcrawler.ie

Page 15: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 15

Structural Protein DBs

Primary

Coordinates from X-ray diffraction, NMR, etc.

Composition from UniProtKB

Properties from annotations
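As an illustration of what "coordinates" means in practice, the sketch below writes and reads one atom record using the fixed-width column layout of the legacy PDB text format. The slides do not mention a file format, so this choice is an assumption, and the column positions should be checked against the wwPDB format description before being relied on. The atom values are invented.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build the first 54 columns of a PDB-style ATOM record."""
    return (
        f"ATOM  {serial:>5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:>4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom(line):
    """Extract atom identity and Cartesian coordinates (in angstroms) from a record."""
    return {
        "name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20],       # columns 18-20
        "chain": line[21],             # column 22
        "res_seq": int(line[22:26]),   # columns 23-26
        "xyz": (
            float(line[30:38]),        # columns 31-38
            float(line[38:46]),        # columns 39-46
            float(line[46:54]),        # columns 47-54
        ),
    }


# Hypothetical alpha-carbon of the first residue of chain A.
record = format_atom(1, " CA ", "MET", "A", 1, 27.34, 24.43, 2.614)
atom = parse_atom(record)
```

Because the format is positional rather than delimited, parsing by column slices is more robust than splitting on whitespace, which fails when adjacent fields run together.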

Page 16: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 16

Specialized DBs

Binding sites

SNPs

Page 17: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 17

Classification of Proteins

CATH: Class, Architecture, Topology, Homologous superfamily
http://www.biochem.ucl.ac.uk/bsm/cath_new/

SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/

Page 18: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 18

Integrated DBs

Built to aggregate other databases

Provide common search

Calculate cross-linking tables

InterPro (http://www.ebi.ac.uk/interpro): results from integrating several derivative databases such as PRINTS, PROSITE, SMART, ProDom, Pfam and TIGRFAMs
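A cross-linking table can be pictured as a regrouping of the member databases' hits around a shared key. The sketch below does this for a handful of invented records; the protein and signature identifiers are placeholders, and this is not how InterPro itself is built.

```python
from collections import defaultdict

# Hypothetical hits: (member database, signature id, protein accession).
MEMBER_HITS = [
    ("Pfam", "SIG_PF_1", "PROT_A"),
    ("PROSITE", "SIG_PS_1", "PROT_A"),
    ("PRINTS", "SIG_PR_1", "PROT_A"),
    ("Pfam", "SIG_PF_2", "PROT_B"),
    ("SMART", "SIG_SM_1", "PROT_B"),
]


def cross_link_table(hits):
    """Group signatures by protein, then by the database they come from."""
    table = defaultdict(lambda: defaultdict(list))
    for database, signature, protein in hits:
        table[protein][database].append(signature)
    # Convert to plain dicts so later lookups of missing keys do not create entries.
    return {protein: dict(dbs) for protein, dbs in table.items()}


def proteins_with(table, database):
    """Common search: which proteins have at least one hit in this database?"""
    return sorted(p for p, dbs in table.items() if database in dbs)


TABLE = cross_link_table(MEMBER_HITS)
```

Once the table exists, a single query such as `proteins_with(TABLE, "Pfam")` answers a question that would otherwise require searching each member database separately.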

Page 19: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 19

Knowledge bases

UniProt (Swiss-Prot/PIR/TrEMBL)

Ensembl (genome-centered)

GeneCards (gene-centered)

Page 20: 2010.09-28 IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.

2010.09-28 IST Computational Biology 20

GeneCards


GeneCards – expression data


Clinical

OMIM: Mendelian inheritance, human diseases

HGMD: mutations and associated human diseases

dbSNP: SNPs with >1% incidence


The synchronization issue

Many copies of public databases (version control)

Content update on primary and derived databases influences integration

Inconsistencies are slow to resolve

Indexes need frequent recalculation
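The bookkeeping behind these points can be sketched as a comparison of release tags followed by a lookup of which derived indexes depend on the outdated copies. The Swiss-Prot tags are the release names quoted earlier; every other tag, database entry and index name below is hypothetical.

```python
# Release currently published by each source (tags other than Swiss-Prot are made up).
UPSTREAM = {
    "UniProtKB/Swiss-Prot": "2010_09",
    "EMBL": "rel-B",
    "PDB": "week-38",
}

# Release held in a local mirror; PDB has not been mirrored at all.
LOCAL_MIRROR = {
    "UniProtKB/Swiss-Prot": "2010_08",
    "EMBL": "rel-B",
}

# Derived index -> source databases it was computed from (hypothetical names).
DERIVED_FROM = {
    "protein_keyword_index": ["UniProtKB/Swiss-Prot"],
    "nucleotide_similarity_db": ["EMBL"],
}


def stale_databases(upstream, local):
    """Databases whose local copy is missing or differs from the upstream release."""
    return sorted(
        name for name, release in upstream.items() if local.get(name) != release
    )


def indexes_to_rebuild(stale, derived_from):
    """Derived indexes that depend on at least one stale database."""
    stale = set(stale)
    return sorted(
        index for index, sources in derived_from.items() if stale.intersection(sources)
    )


STALE = stale_databases(UPSTREAM, LOCAL_MIRROR)
REBUILD = indexes_to_rebuild(STALE, DERIVED_FROM)
```

Here one outdated Swiss-Prot copy is enough to invalidate the keyword index built on it, while the nucleotide index can be left alone.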


Purifying content

Efforts are in place to enhance contents of derived databases

For example, manual curation of genomic databases in specific sectors, such as eukaryotes, human, plants, etc.


HAVANA

Manual annotation of the human genome, chromosome by chromosome.


ENCODE

Project to review functional parts of the human genome in fine detail