Biosemantics group Martijn Schuemie. Overview The biosemantics group Ontology assembly Concept...

Post on 13-Jan-2016


Biosemantics group

Martijn Schuemie

Overview

The biosemantics group

Ontology assembly

Concept tagging

Homonym disambiguation

Concept profile creation

Nucleolus

Biosemantics group

ErasmusMC University Medical Center Rotterdam

Department of Medical Informatics

Biosemantics group

Jan Kors

Barend Mons

Erik van Mulligen

Martijn Schuemie

Rob Jelier

Kristina Hettne

Antoinne van Veldhoven

Biosemantics group

Biosemantics

Molecular Biology

High-throughput experiment data (genomics and proteomics)

Gene and protein databases, MEDLINE, Gene Ontology

Biosemantics

Concept-based text-mining

Interpretation of experiment data

Knowledge discovery

Ontology assembly

Entrez Gene Swiss-Prot HUGO

Combination

Add spelling variations

ABC1 -> ABC-1

DEF3 -> DEF-III

Remove highly ambiguous terms

CO2, membrane-bound, obesity, open reading frame

P=37%, R=76%

P=50%, R=75%
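The spelling-variation step on this slide can be sketched roughly as follows. This is a minimal illustration, not the group's actual code: the function name `spelling_variants`, the restriction to terms made of letters followed by a number, and the Roman-numeral range 1-10 are all assumptions. Note that it generates both the hyphenated and the Roman-numeral form for every matching term.

```python
import re

# Roman numerals for small trailing numbers; the range 1-10 is an assumption.
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def spelling_variants(term):
    """Return the term together with hyphenated and Roman-numeral variants.

    Hypothetical helper: only terms of the form <letters><digits> are expanded.
    """
    variants = {term}
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", term)
    if match:
        stem, number = match.group(1), int(match.group(2))
        variants.add(f"{stem}-{number}")             # ABC1 -> ABC-1
        if number in ROMAN:
            variants.add(f"{stem}-{ROMAN[number]}")  # DEF3 -> DEF-III
    return variants
```

Terms that do not match the pattern (for example "obesity") are returned unchanged.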

Concept tagging

MEDLINE text: Malaria fever is a disease. It is spread by mosquitos.

Sentence splitting: [Malaria fever is a disease.] [It is spread by mosquitos.]

Tokenization: [Malaria] [fever] [is] [a] [disease]

Word normalisation: [malaria] [fever] [be] [a] [disease]

Concept mapping: [malaria fever] C24530 [disease] C12634

Homonym disambiguation: PSA -> Prostate Specific Antigen or Poultry Science Association?

Concept profile of text
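The first four steps of this pipeline (sentence splitting, tokenization, word normalisation, concept mapping) could look roughly like the sketch below. Only the two concept identifiers come from the slide. The lemma table, the regular expressions, the greedy longest-match strategy and all function names are assumptions made for illustration.

```python
import re

# Toy lookup tables. The two concept IDs are taken from the slide;
# the lemma table is an illustrative stand-in for a real normaliser.
LEMMAS = {"is": "be", "are": "be", "was": "be"}
THESAURUS = {("malaria", "fever"): "C24530", ("disease",): "C12634"}
MAX_TERM_LENGTH = max(len(term) for term in THESAURUS)


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Keep alphanumeric runs as tokens and drop punctuation."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def normalise(tokens):
    """Lowercase each token and replace it by its lemma when one is known."""
    lowered = [t.lower() for t in tokens]
    return [LEMMAS.get(t, t) for t in lowered]


def map_concepts(tokens):
    """Greedy longest-match lookup of token n-grams in the thesaurus."""
    concepts, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_LENGTH, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in THESAURUS:
                concepts.append((" ".join(key), THESAURUS[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return concepts


def tag(text):
    """Run all four steps and return the concepts found per sentence."""
    return [map_concepts(normalise(tokenize(s))) for s in split_sentences(text)]
```

Matching the longest term first is what lets "malaria fever" map to a single concept instead of being split into two shorter matches.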

Homonym disambiguation

Some simple rules:

• Is it likely that a term has multiple meanings?

- 3-letter acronym (e.g. PSA): highly likely

- long forms (e.g. Prostate Specific Antigen): highly unlikely

- terms that refer to several concepts by definition

• Is a synonym found? (e.g. “KLK3 (PSA)”)

• Is a keyword found? (e.g. “PSA is secreted by the prostate”)

These simple rules change performance from P=50%, R=75% to P=71%, R=71%.
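One way to read these rules as code is sketched below. The slide does not say how the rules are combined, so the logic here (accept unambiguous terms outright, otherwise require a synonym or keyword in the same text) is an assumption, as are the function names and the three-character threshold. The third sub-rule, terms that refer to several concepts by definition, is not modelled.

```python
def likely_ambiguous(term):
    """Treat short all-caps acronyms as likely homonyms (threshold assumed)."""
    return term.isupper() and len(term) <= 3


def accept_mapping(term, text, synonyms, keywords):
    """Decide whether to keep a term-to-concept mapping.

    Hypothetical combination of the slide's rules: a term that is unlikely
    to be ambiguous is accepted directly; an ambiguous term is accepted only
    if a synonym or keyword of the concept occurs in the same text.
    """
    if not likely_ambiguous(term):
        return True
    lowered = text.lower()
    evidence = list(synonyms) + list(keywords)
    # Plain substring matching keeps the sketch short; it is deliberately crude.
    return any(item.lower() in lowered for item in evidence)
```

With the synonyms "KLK3" and "Prostate Specific Antigen" and the keyword "prostate", both example sentences on the slide would be accepted, while a text about an annual meeting of the PSA would be rejected.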

Homonym disambiguation

Concept profile of text containing PSA

Concept profile of Prostate Specific Antigen

Concept profile of Phosphoserine Aminotransferase

Unknown meaning

Similarity?

Previous tests showed an overall accuracy of 93%
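The profile-based approach can be illustrated with a small sketch: the profile of the text is compared with the profile of each candidate meaning, and the most similar candidate wins. The slide only says "Similarity?", so the use of cosine similarity is an assumption, and the weights in the example are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse profiles (concept -> weight)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(text_profile, candidates):
    """Return the candidate meaning whose profile is most similar to the text."""
    return max(candidates, key=lambda name: cosine(text_profile, candidates[name]))


# Illustrative profiles with invented weights.
text_profile = {"prostate": 0.8, "neoplasm": 0.5}
candidates = {
    "Prostate Specific Antigen": {"prostate": 0.9, "KLK3": 0.4},
    "Phosphoserine Aminotransferase": {"serine": 0.7, "biosynthesis": 0.5},
}
best = disambiguate(text_profile, candidates)  # "Prostate Specific Antigen"
```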

Concept profile creation

[Figure: several texts are linked to a concept; the concept profiles of these texts are combined into the concept profile of the concept]

Texts are linked to a concept:

- From databases

- By concept mapping
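A minimal sketch of how the profiles of several texts might be combined into one concept profile is given below. The slides do not state the aggregation rule, so simple averaging of weights is an assumption, and the function name `build_concept_profile` is hypothetical.

```python
from collections import defaultdict


def build_concept_profile(text_profiles):
    """Average each concept's weight over all texts linked to the concept.

    A concept missing from a text counts as weight 0 for that text.
    """
    if not text_profiles:
        return {}
    totals = defaultdict(float)
    for profile in text_profiles:
        for concept, weight in profile.items():
            totals[concept] += weight
    return {concept: total / len(text_profiles)
            for concept, total in totals.items()}
```

For example, two texts with profiles `{"a": 1.0, "b": 0.5}` and `{"a": 0.5}` would yield `{"a": 0.75, "b": 0.25}`.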

Concept profile creation

Binary

Log likelihood

X IDF

Uncertainty cf.

Concept profile creation

Profile of gene ESR1:

estrogen receptor 1

breast neoplasm 0.5

BRCA1 0.34

PGR 0.30

Estrogen 0.28

BRCA2 0.25

TP53 0.15

gene suppressor tumor 0.12

genetics polymorphism 0.12

genetic predisposition to disease 0.10

female 0.05

Concept profile comparison


Concept Name              Weight  RAB27B  MYRIP  MLPH  RAB27A
RAB27A                    52.17   0.61    0.74   0.73  1
MLPH                      11.16   -       0.44   1     0.29
Myosin Type V             7.22    0.04    0.68   0.4   0.22
Melanosomes               6.7     0.12    0.3    0.47  0.27
RAB27B                    4.06    1       0.14   -     0.11
MYRIP                     2.98    0.07    1      0.09  0.06
Melanocytes               2.73    0.13    0.14   0.28  0.17
Myosins                   2.33    0.04    0.38   0.22  0.12
Myosin Heavy Chains       1.72    -       0.46   0.18  0.09
GTP Phosphohydrolases     1.31    0.17    0.23   0.04  0.08
Actins                    1.17    0.05    0.32   0.12  0.06
Exocytosis                0.87    0.08    0.12   0.08  0.12
Secretory Vesicles        0.68    0.07    0.16   0.06  0.09
Carrier Proteins          0.59    -       0.11   0.17  0.09
Organelles                0.54    0.11    -      0.12  0.09
rab GTP-Binding Proteins  0.52    0.16    -      0.04  0.12

Nucleolus

• main function: ribosome biogenesis

• over 700 proteins identified and classified into 8 main categories

Nucleolus – Concept profiles

[Figure: MEDLINE articles are linked to a protein from databases; the concept profiles of these texts are combined into the concept profile of the protein]

Nucleolus – Concept profiles

BLAST (Basic Local Alignment Search Tool)

Query: nucleolar protein

Results: homologs in

• human

• mouse

• fruitfly

• yeast

Nucleolus – Concept profiles

                 Minimum  Maximum  Mean
Homologs used
  Human          0        9        1.66
  Mouse          0        10       1.37
  Fruitfly       0        5        0.7
  Yeast          0        8        1.21
Articles used
  Articles       1        1046     91.31

Nucleolus – fun with protein profiles

• 2D visualization of high-dimensional space

• Automatic functional annotation of proteins

• Finding similar proteins

Nucleolus - visualisation

[Figure: Multi-Dimensional Scaling plot of the protein concept profiles]

Legend: Function unknown, Chaperones, Chromatin structure, Fibrous proteins, mRNA metabolism, Others, Ribosomal proteins, Ribosome biogenesis, Translation

Labelled points: SRP, PARN, Exosome comp. 10, O43390, P98179, Q8N220

Nucleolus – Assigning GO terms

[Figure: MEDLINE articles are linked to a GO term from GO; the concept profiles of these texts are combined into the concept profile of the GO term]

Nucleolus – Assigning GO terms

AuC : Area under Curve

Category             AuC   p
Chaperones           1.00  <.001
Chromatin Structure  0.98  <.001
Fibrous proteins     0.97  <.001
mRNA metabolism      0.72  <.001
Others               0.81  <.001
Ribosomal proteins   0.97  <.001
Ribosome biogenesis  0.69  <.001
Translation          0.88  <.001
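For readers unfamiliar with the measure, the area under the ROC curve can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition. How scores and labels were obtained for the table above is not stated on the slides, so the inputs here are placeholders.

```python
def area_under_curve(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AuC of 1.00 means every protein in the category scored above every protein outside it, while 0.5 corresponds to a random ranking.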

Nucleolus – Assigning GO terms

1. Manual assignment to one category only

e.g. SFRS protein kinase 1 plays a role in splicing, but is also a kinase

2. Assumptions do not always hold

• Sequence homology ≠ function homology

• Concept co-occurrence ≠ functional relationship

3. Homonyms

‘Mistakes’ in automatic annotation

Nucleolus – Finding new proteins

Concept profile of nucleolar protein

Concept profile of human protein

Concept profile of human protein

Concept profile of human protein

Nucleolus – Finding new proteins

60S ribosomal protein L3-like
Probable ATP-dependent RNA helicase DDX4
ATP-dependent RNA helicase DDX3Y
Guanine nucleotide binding protein-like 3
Importin-11 (importin beta family)
Putative Brix domain containing protein 1P
Probable ATP-dependent RNA helicase DDX20 (Gemin 3)
60S acidic ribosomal protein P0
Helicase SKI2W
ATP-dependent RNA helicase DDX39
40S ribosomal protein S20
Probable ATP-dependent RNA helicase DDX6
Probable ATP-dependent RNA helicase DDX23
Double-stranded RNA-binding protein Staufen homolog 1
ATP-dependent RNA helicase DDX25
Probable nucleolar complex protein 14
Eukaryotic initiation factor 4A-II
ATP-dependent RNA helicase DDX19B
40S ribosomal protein S3

Ribosomal protein
DEAD-box
DEAD-box
Found in nucleolus
Associated with nucleolar p.
DEAD-box
DEAD-box
DEAD-box
Found in nucleolus
DEAD-box
Ribosomal protein
DEAD-box
DEAD-box
Indirect evidence
DEAD-box
Nucleolar
DEAD-box
DEAD-box
Ribosomal protein