Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies

Joyce A. Mitchell, Ph.D.

National Library of Medicine

University of Missouri

2

Research Collaborators

Olivier Bodenreider, M.D., Ph.D.
Alexa T. McCray, Ph.D.
Allen C. Browne

3

Research Goals

Investigating methods of connecting disease and genomic information.

Overall goals are to:
– Overcome difficulties traversing multiple information resources
– Examine coverage of Unified Medical Language System® (UMLS®), Gene Ontology™ (GO), LocusLink-OMIM
– Develop methods to use ontologies more effectively
– Present data in an understandable manner

4

Background – UMLS

NLM developed, maintains
Purpose: facilitate retrieval & integration of information from multiple biomedical sources
Interrelates 60 biomedical terminologies
– MeSH, SNOMED, Read Codes, ICD, etc.
– No vocabulary focused on molecular biology
1.5 million English terms; 800,000 concepts; 134 Semantic Types; 54 Semantic Relationships

5

Background – Gene Ontology

GO Consortium developed, maintains
Purpose:
– Promotes cross-species methodologies for functional comparisons
– Allows annotation of molecular information on genes, gene products
– “an essential start to creating a shared language of biology” **
Focused on
– molecular function (5626 terms)
– biological processes (4677 terms)
– cellular components (1077 terms)
Two semantic relations (is-a and part-of)

**Genome Research 2001; 11:1425-33.
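The two semantic relations can be pictured as typed edges in a graph. The following is a minimal sketch, not GO's actual data: the edges in `EDGES` and the function name `ancestors` are illustrative assumptions chosen only to show how is-a and part-of links might be followed upward from a term.

```python
from collections import deque

# Toy edges (assumed for illustration, not taken from the GO database):
# child term -> list of (relation, parent term)
EDGES = {
    "cytoskeleton": [("part_of", "cytoplasm")],
    "cytoplasm": [("part_of", "intracellular")],
    "intracellular": [("part_of", "cell")],
    "plasma membrane": [("is_a", "membrane")],
    "membrane": [("part_of", "cell")],
    "cell": [("is_a", "cellular component")],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Return all terms reachable upward from `term`, breadth-first,
    following only the relation types listed in `relations`."""
    seen = []
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in EDGES.get(current, []):
            if relation in relations and parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


all_up = ancestors("cytoskeleton")
is_a_only = ancestors("plasma membrane", relations=("is_a",))
```

Restricting `relations` to one type shows how much of a term's context depends on the other relation: in this toy graph, following is-a alone from "plasma membrane" stops at "membrane".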

6

Background - LocusLink

Curated, gene-centered resource of National Center for Biotechnology Information (NLM)

Gene names, gene product names, gene product functions, and reference sequences (DNA, RNA, protein)

Associates phenotype (diseases) to the genotype via Online Mendelian Inheritance in Man (OMIM)

Online links to major bioinformatics knowledge bases and the literature

7

Specific Questions

This study looked at coverage in UMLS of:

1. 1244 genes associated with human diseases

2. 1702 diseases associated with the genes

3. 11,380 Gene Ontology terms

4. 38,832 genes/gene products in GO database (141,071 names)

5. Associations of genes and their functions in UMLS

6. Representation of gene function in GO compared to the UMLS

8

Methods

LocusLink query:
– human genes whose sequence is known and associated with disease (1244 loci)
LocusLink data:
– Genes/gene products (official names, synonyms, symbols)
– Phenotypes (diseases) (1702 diseases)
GO data:
– all concepts (ontology terms), excluding obsolete terms (11,380 terms)
– Gene products from all species (134,646 unique names, 38,832 genes)

9

Methods

LocusLink and GO terms mapped to UMLS concepts
– normalization used
– mappings constrained by semantic type
LocusLink loci studied for relationships in UMLS
– Gene/GP – phenotype
– Gene/GP – molecular function
– Gene/GP – biological process
– Gene/GP – cellular component
For specific genes, compared annotations in GO to representation in UMLS
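The mapping step (normalization plus a semantic type constraint) can be sketched roughly as follows. This is a simplified illustration, not the procedure used in the study: `normalize` is a stand-in that only lowercases, strips punctuation, and sorts words; the concept identifiers (`X0001`, etc.), the tiny `TOY_UMLS` table, and the helper names are all hypothetical.

```python
import re


def normalize(term: str) -> str:
    """Simplified normalization (an assumption): lowercase, replace
    punctuation with spaces, and sort the words alphabetically."""
    words = re.sub(r"[^a-z0-9\s]", " ", term.lower()).split()
    return " ".join(sorted(words))


# Hypothetical miniature concept table:
# normalized string -> list of (concept id, semantic type)
TOY_UMLS = {}


def add_concept(concept_id: str, name: str, semantic_type: str) -> None:
    """Index a concept name under its normalized form."""
    TOY_UMLS.setdefault(normalize(name), []).append((concept_id, semantic_type))


add_concept("X0001", "Cystic Fibrosis", "Disease or Syndrome")
add_concept("X0002", "Merlin", "Amino Acid, Peptide, or Protein")
add_concept("X0003", "Merlin", "Bird")


def map_term(term: str, allowed_types: set) -> list:
    """Return ids of concepts whose normalized name matches `term`
    and whose semantic type is in `allowed_types`."""
    candidates = TOY_UMLS.get(normalize(term), [])
    return [cid for cid, stype in candidates if stype in allowed_types]


disease_hits = map_term("fibrosis, cystic", {"Disease or Syndrome"})
protein_hits = map_term("MERLIN", {"Amino Acid, Peptide, or Protein"})
```

In this sketch the semantic type filter is what keeps a gene product name such as "Merlin" from being mapped to an unrelated concept that happens to share the same string.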

10

Results - 1

For 1244 genes from LocusLink
– 18% found in the UMLS

Name type              Found   Mapped/total
Official gene name     20%     244/1244
Official gene symbol   16%     200/1244
Alias symbol           15%     394/2669
Gene product           18%     266/1460
Preferred product      18%     266/1460
Alias protein          24%     339/1425

11

Results - 2

For 1702 phenotypes (diseases) corresponding to 1244 genes
– 34% found in the UMLS (575/1702)

Most frequent single-gene diseases covered
– Huntington Disease
– Cystic Fibrosis
– Marfan Syndrome
– Phenylketonuria
– Achondroplasia

12

Results - 3

GO terms found in MeSH: 2764 terms
GO terms found in SNOMED: 1366 terms

GO terms found overall: 27% 3062/11,380

Molecular function 44% 2435/5626

Biological process 5% 256/4677

Cellular component 35% 370/1077

13

Results - 4

For 134,646 unique gene names in GO database

Full name 11% 4392/38,832

Symbol 2% 1167/60,381

Synonym 6% 1964/35,433

14

Results - 5

LocusLink – UMLS relationship categories found overall: 72%

Genes & gene products and   % found   Found/total
Phenotype                   64%       754/1182
M. Function                 85%       1192/1409
B. Process                  61%       762/1240
C. component                76%       841/1107
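The slide does not state how the overall 72% is derived. One plausible reading, assumed here, is that the found and total counts are pooled across the four categories. The short sketch below recomputes the per-category percentages and the pooled figure from the counts on the slide; the variable names are illustrative.

```python
# (found, total) counts as given on the slide
counts = {
    "Phenotype": (754, 1182),
    "M. Function": (1192, 1409),
    "B. Process": (762, 1240),
    "C. component": (841, 1107),
}


def pct(found: int, total: int) -> float:
    """Percentage of items found out of the total."""
    return 100.0 * found / total


# Per-category percentages, rounded to whole numbers
per_category = {name: round(pct(f, t)) for name, (f, t) in counts.items()}

# Pooled figure (assumption: overall = sum of found / sum of totals)
found_all = sum(f for f, _ in counts.values())
total_all = sum(t for _, t in counts.values())
overall = round(pct(found_all, total_all))

print(per_category)
print(f"overall: {overall}% ({found_all}/{total_all})")
```

Under this pooling assumption the counts give 3549/4938, which rounds to the 72% reported on the slide.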

15

Results - 5

Type of relationship
Associative     613
Co-occurrence   3353
Hierarchical    1168

G/GP and       Assoc   Co-oc   Hier
Phenotype      275     724     5
M. Function    206     1069    933
B. Process     57      737     147
C. Component   75      823     83
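As a quick consistency check, the per-category counts can be summed by relationship type and compared with the totals on the slide. This is only an arithmetic sketch over the numbers shown; the dictionary layout and names are illustrative.

```python
# Relationship counts per category, as given on the slide
rows = {
    "Phenotype":    {"Assoc": 275, "Co-oc": 724,  "Hier": 5},
    "M. Function":  {"Assoc": 206, "Co-oc": 1069, "Hier": 933},
    "B. Process":   {"Assoc": 57,  "Co-oc": 737,  "Hier": 147},
    "C. Component": {"Assoc": 75,  "Co-oc": 823,  "Hier": 83},
}

columns = ("Assoc", "Co-oc", "Hier")

# Sum each relationship type across the four categories
totals = {col: sum(row[col] for row in rows.values()) for col in columns}

# Relationship type with the largest total
dominant = max(totals, key=totals.get)

print(totals)
print("most common relationship type:", dominant)
```

The column sums reproduce the slide's totals (613 associative, 3353 co-occurrence, 1168 hierarchical), with co-occurrence the most common type and hierarchical links concentrated in molecular function.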

16

Results - 6

Representation of gene function in GO compared to the UMLS

17

Neurofibromin 2 – merlin in GO

18

[Diagram: GO hierarchy for MERL_HUMAN (merlin). Node labels: Gene Ontology; Cellular Component; Biological Process; Molecular Function; Cell; Membrane; Intracellular; Cell growth and/or maintenance; Cytoplasm; Plasma Membrane; Cell Proliferation; Obsolete; Negative control of cell proliferation; Structural Protein; Tumor Suppressor; Cytoskeleton; MERL_HUMAN]

19

[Diagram: protein hierarchy containing Neurofibromin 2. Node labels: Proteins; Neoplasm Proteins; Cell Cycle Proteins; Proteins by Body Part; Tumor Suppressor Proteins; Membrane Proteins; Growth Suppressor Proteins; Neurofibromin 2; Merlin, Drosophila]

20

Discussion

21

Best & Worst Mappings

Best mapping categories
– Molecular function (GO) 44%
– Cellular component (GO) 35%
– Phenotype (LL) 34%

Worst mapping categories
– Gene synonym (GO) 6%
– Biological process (GO) 5%
– Gene symbol (GO) 2%

22

Only 34% of diseases?

In OMIM-LL, diseases are subdivided by genetic cause; in UMLS they are not

E.g. Limb Girdle Muscular Dystrophy (LGMD)
– LGMD is represented in UMLS
– It is a SNOMED term
– In MeSH it is an entry term for muscular dystrophies
– MeSH note for MD: "A general term for a group of inherited disorders which are characterized by progressive degeneration of skeletal muscles" (ed, 2000)

23

Limb Girdle Muscular Dystrophy – genetic types

LGMD type   Gene name
1A          Myotilin
1B          Lamin A/C
1C          Caveolin-3
1D          Unknown
2A          Calpain-3
2B          Dysferlin
2C          Sarcoglycan-gamma
2D          Sarcoglycan-alpha
2E          Sarcoglycan-beta
2F          Sarcoglycan-delta
2G          Telethonin
2H          TRIM32
2I          Fukutin-related protein

24

Only 5% of Biological Processes?

Only 256 of the biological processes mapped to terms in UMLS.
In GO, processes are elaborated & organism specific
Example:
UMLS
– Mitotic spindle
GO
– Mitotic spindle assembly
– Mitotic spindle assembly (sensu Saccharomyces)
– Mitotic spindle assembly (sensu Fungi)
– Mitotic spindle checkpoint
– Mitotic spindle elongation
– Mitotic spindle orientation
– Mitotic spindle positioning
– Mitotic spindle positioning and orientation
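The mitotic spindle example can be turned into a small sketch of why whole-term matching finds so few biological processes. The term lists come from the slide; the matching logic (case-insensitive exact match, a prefix check, and a regular expression that drops "(sensu ...)" qualifiers) is an assumption made for illustration, not the study's method.

```python
import re

# Terms from the slide
umls_terms = {"mitotic spindle"}
go_terms = [
    "Mitotic spindle assembly",
    "Mitotic spindle assembly (sensu Saccharomyces)",
    "Mitotic spindle assembly (sensu Fungi)",
    "Mitotic spindle checkpoint",
    "Mitotic spindle elongation",
    "Mitotic spindle orientation",
    "Mitotic spindle positioning",
    "Mitotic spindle positioning and orientation",
]


def strip_sensu(term: str) -> str:
    """Remove an organism-specific '(sensu ...)' qualifier, if present."""
    return re.sub(r"\s*\(sensu [^)]*\)", "", term).strip()


# Whole-term matches: the GO term must equal a UMLS term
exact = [t for t in go_terms if t.lower() in umls_terms]

# Partial matches: the GO term begins with a UMLS term plus more words
partial = [
    t for t in go_terms
    if any(t.lower().startswith(u + " ") for u in umls_terms)
]

# Distinct processes once organism-specific qualifiers are dropped
distinct = sorted({strip_sensu(t).lower() for t in go_terms})

print(len(exact), len(partial), len(distinct))
```

None of the eight GO terms matches the UMLS term as a whole, although every one of them contains it, and the organism-specific variants collapse to six distinct processes once the qualifiers are removed.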

25

Why so few gene names and synonyms mapped?

Official gene names have metadata and comments.
– dystrophin (muscular dystrophy, Duchenne and Becker types), includes DXS143, DXS164, DXS206, DXS230, DXS239, DXS268, DXS269, DXS270, DXS272
No single source has all names and synonyms
GO synonym field contains IPI number for well-known genes, which does not match UMLS (useful cross-reference but not a synonym)
Symbols are short acronyms and match poorly

26

Summary 1

UMLS needs improvement in molecular biology domain but has considerable content:
– 27% of GO concepts map
– 34% of single-gene diseases
– Existing UMLS terms come primarily from MeSH and SNOMED
Overall, positive mapping for 13,000 terms

27

Summary continued

If the terms are in UMLS, it is possible to find a relationship between genes and phenotypes and gene function much of the time.

UMLS does better with the human genes (20%+) than with genes from all organisms (11%)

UMLS and GO representations complement each other.