1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1,...

1

A text-mining analysis of the human phenome

Marc A van Driel1, Jorn Bruggeman2, Gert Vriend1, Han G Brunner*,3 and Jack AM Leunissen2

European Journal of Human Genetics (2006) 14, 535-542

1Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegenthe Netherlands; 2Department of Bioinformatics, Wageningen University and Research Centre; 3Department of Human Genetics, University Medical Centre Nijmegen

Speaker: Yu-Ching Fang

Advisors: Hsueh-Fen Juan and Hsin-His Chen

2

Outline

• Introduction

• Methods

• Results

• Discussion

3

Introduction

• Functional annotation of genes is an important challenge once the sequence of a genome has been completed.

• Previous studies have correlated various attributes of human genes with the chance of causing a disease.

4

Introduction (cont.)

• But, few attempts have been made to systematically classify relationships between genes and proteins at the phenotype level.

5

Introduction (cont.)

• The Online Mendelian Inheritance in Man (OMIM) database contains human disease phenotype data and record-based textual information, one gene or one genetic disorder per record.

• Goal: Systematic grouping of genes by their associated phenotypes from the OMIM database.

6

Methods – The OMIM database

• Full text (TX) field: 5132 (disease)/16357

7

Methods – The OMIM database (cont.)

• Clinical synopsis (CS) field

8

Creation of ‘feature vectors’

• MeSH terms and their components are concepts.

• MeSH concepts serve as phenotype features characterizing OMIM records.

Ex: OMIM_1->[MeSH_1,MeSH_2,…]

9

Refinement of the feature vectors

• MeSH concepts can be very broad like ‘Eye’ or more specific like ‘Retina’.

• A concepts hierarchy that describes relationships such as ‘Eye’-’Retina’-’Photoreceptors’.

• Retina is a hyponym of Eye.

10

Refinement of the feature vectors (cont.)

• To ensure that the concepts eye and retina are recognized as similar, the MeSH hierarchy was used to encode this similarity in the feature vectors by increasing the value of all hypernyms.

rc: relevance of concept c

rc,counted: count of the concept c in a document

rhypo’s: relevance of the concept c’s hyponym

nhypo,c: the number of the concept c’s hyponyms

11


• Example of concept expansion using the MeSH hierarchical structure.

12


• Not all concepts in the OMIM records are equally informative.

• Ex: ‘retina pigment epithelium’ occurs rarely, and thus provides more specific information than very frequently terms such as ‘Brain’.

• Inverse document frequency measuregwc: inverse document frequency or global weight of concept c

N: 5080

nc: the number of records that contain concept c

13


• Not all OMIM records contain equally extensive descriptions (record size differences).

• These differences will make a comparison between records difficult because the diversity and the frequency of concepts in the larger records will exceed those in the smaller records.

rc: relevance of concept c

rmf: the frequency of the most occurring MeSH concept in that record

14

Comparing OMIM records

• The similarity between OMIM records can be quantified by comparing the feature vectors that are expanded and corrected.

• Similarities between feature vectors were determined by the cosines of their angles.

s(X,Y): the similarity between the feature vectors X and Y

xi, yi: concept frequencies

15

Results – Comparing OMIM records

• 5080/5132 OMIM records could match one or more MeSH terms.

• The 5080x5080 pair-wise feature vector similarities form phenomap (All to all similarities).

Most phenotype-phenotype pairs have a low similarity score.

16

Comparing OMIM records - The best scores for all phenotypes in the disease phenotype data set

• For each OMIM record, the most similar of the other 5079 records was identified.

• Moderately similar phenotype pairs might still yield reasonable hypotheses.

Ex: ‘Fibromuscular Dysplasia of Arteries’ and ‘Cardiomyopathy, Familial Hypertrophic’ have 0.31 similarity score

17

Comparing OMIM records (cont.)

• Conclusion: The more phenotypes resemble each other, the more likely they are to share an interaction.

18

Discussion

• Developed a text-mining approach to map relationships between more than 5000 human genetic disease phenotypes from the OMIM database.

• Phenotype clustering reflects the modular nature of human disease genetics. Thus, the phenomap may be used to predict candidate genes for diseases.

1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1,...

Documents

Transcript of 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1,...