1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1,...

18
1 A text-mining analysis of the human phenome Marc A van Driel 1 , Jorn Bruggeman 2 , Gert Vrie nd 1 , Han G Brunner *,3 and Jack AM Leunissen 2 European Journal of Human Genetics (2006) 14, 535-542 1 Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegenthe Netherlands; 2 Department of Bioinform atics, Wageningen University and Research Centre; 3 Departmen t of Human Genetics, University Medical Centre Nijmegen Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan and Hsin-H is Chen

Transcript of 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1,...

Page 1: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

1

A text-mining analysis of the human phenome

Marc A van Driel1, Jorn Bruggeman2, Gert Vriend1, Han G Brunner*,3 and Jack AM Leunissen2

European Journal of Human Genetics (2006) 14, 535-542

1Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegenthe Netherlands; 2Department of Bioinformatics, Wageningen University and Research Centre; 3Department of Human Genetics, University Medical Centre Nijmegen

Speaker: Yu-Ching Fang

Advisors: Hsueh-Fen Juan and Hsin-His Chen

Page 2: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

2

Outline

• Introduction

• Methods

• Results

• Discussion

Page 3: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

3

Introduction

• Functional annotation of genes is an important challenge once the sequence of a genome has been completed.

• Previous studies have correlated various attributes of human genes with the chance of causing a disease.

Page 4: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

4

Introduction (cont.)

• But, few attempts have been made to systematically classify relationships between genes and proteins at the phenotype level.

Page 5: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

5

Introduction (cont.)

• The Online Mendelian Inheritance in Man (OMIM) database contains human disease phenotype data and record-based textual information, one gene or one genetic disorder per record.

• Goal: Systematic grouping of genes by their associated phenotypes from the OMIM database.

Page 6: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

6

Methods – The OMIM database

• Full text (TX) field: 5132 (disease)/16357

Page 7: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

7

Methods – The OMIM database (cont.)

• Clinical synopsis (CS) field

Page 8: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

8

Creation of ‘feature vectors’

• MeSH terms and their components are concepts.

• MeSH concepts serve as phenotype features characterizing OMIM records.

Ex: OMIM_1->[MeSH_1,MeSH_2,…]

Page 9: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

9

Refinement of the feature vectors

• MeSH concepts can be very broad like ‘Eye’ or more specific like ‘Retina’.

• A concepts hierarchy that describes relationships such as ‘Eye’-’Retina’-’Photoreceptors’.

• Retina is a hyponym of Eye.

Page 10: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

10

Refinement of the feature vectors (cont.)

• To ensure that the concepts eye and retina are recognized as similar, the MeSH hierarchy was used to encode this similarity in the feature vectors by increasing the value of all hypernyms.

rc: relevance of concept c

rc,counted: count of the concept c in a document

rhypo’s: relevance of the concept c’s hyponym

nhypo,c: the number of the concept c’s hyponyms

Page 11: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

11

Refinement of the feature vectors (cont.)

• Example of concept expansion using the MeSH hierarchical structure.

Page 12: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

12

Refinement of the feature vectors (cont.)

• Not all concepts in the OMIM records are equally informative.

• Ex: ‘retina pigment epithelium’ occurs rarely, and thus provides more specific information than very frequently terms such as ‘Brain’.

• Inverse document frequency measuregwc: inverse document frequency or global weight of concept c

N: 5080

nc: the number of records that contain concept c

Page 13: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

13

Refinement of the feature vectors (cont.)

• Not all OMIM records contain equally extensive descriptions (record size differences).

• These differences will make a comparison between records difficult because the diversity and the frequency of concepts in the larger records will exceed those in the smaller records.

rc: relevance of concept c

rmf: the frequency of the most occurring MeSH concept in that record

Page 14: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

14

Comparing OMIM records

• The similarity between OMIM records can be quantified by comparing the feature vectors that are expanded and corrected.

• Similarities between feature vectors were determined by the cosines of their angles.

s(X,Y): the similarity between the feature vectors X and Y

xi, yi: concept frequencies

Page 15: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

15

Results – Comparing OMIM records

• 5080/5132 OMIM records could match one or more MeSH terms.

• The 5080x5080 pair-wise feature vector similarities form phenomap (All to all similarities).

Most phenotype-phenotype pairs have a low similarity score.

Page 16: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

16

Comparing OMIM records - The best scores for all phenotypes in the disease phenotype data set

• For each OMIM record, the most similar of the other 5079 records was identified.

• Moderately similar phenotype pairs might still yield reasonable hypotheses.

Ex: ‘Fibromuscular Dysplasia of Arteries’ and ‘Cardiomyopathy, Familial Hypertrophic’ have 0.31 similarity score

Page 17: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

17

Comparing OMIM records (cont.)

• Conclusion: The more phenotypes resemble each other, the more likely they are to share an interaction.

Page 18: 1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

18

Discussion

• Developed a text-mining approach to map relationships between more than 5000 human genetic disease phenotypes from the OMIM database.

• Phenotype clustering reflects the modular nature of human disease genetics. Thus, the phenomap may be used to predict candidate genes for diseases.