Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment
Kati Iltanen (a), Sami Kiviharju (a), Lida Ao (a), Martti Juhola (a), Ilmari Pyykkö (b)
(a) School of Information Sciences, University of Tampere, Finland
(b) School of Medicine, University of Tampere, Finland
Kati Iltanen, Medinfo 2013 2
Introduction
Aim of the study
• To examine the applicability of association rules for analysing the effects of genetic and environmental factors on age-related hearing impairment (ARHI)
• To possibly generate new hypotheses for medical research
Association analysis
• Data mining approach to discover items (variable-value pairs) frequently co-occurring in data
• Association rules of the form “A → B” generated from frequent item sets
• Capability to do a complete search efficiently
Introduction
Challenge
• High-dimensional data result in a very large number of association rules.
• Rules may be overlapping.
• Postprocessing is needed.
Focus of the study: to develop an approach to cluster, summarise and represent association rules for easier exploration
ARHI data
• Originate from a European multicentre study on ARHI
• Collected in nine medical centres from seven European countries (e.g. Van Laer et al., 2008)
• 2428 cases: females and males aged 53 to 67
• The cases represent the best and the worst hearing thirds of their population at high frequencies (2, 4 and 8 kHz)
• 1241 cases with ARHI
• Cases having pathologies (other than ARHI) possibly influencing hearing ability were excluded
ARHI data
764 variables
• 42 phenotypes and environmental factors
  • Phenotypes: e.g. gender, age, body mass index, blood pressure, diabetes, cardiovascular disease and renal failure
  • Environmental and lifestyle factors: e.g. use of ototoxic medication, exposure to chemicals, exposure to noise, alcohol use and tobacco smoking
• 722 single nucleotide polymorphisms (SNPs) from 70 candidate genes
ARHI rules
Form for rules: LHS → Zhighbest > 0.147
• LHS: genotype, phenotype and environmental variables; from 1 to 3 items
• Zhighbest > 0.147: “Has a hearing impairment”
  • Zhighbest: averaged gender and age independent Z-score of high frequencies (2, 4 and 8 kHz) for the better hearing ear
  • 0.147: a threshold value given by the expert physician
Rules were mined with Magnum Opus from RuleQuest Research.
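ARHI rules
Interestingness measures used for association rules
• Support: $s(A \rightarrow B) = P(A \cap B)$, the proportion of cases in which both A and B hold
• Confidence: $c(A \rightarrow B) = P(B \mid A)$
• Lift: $l(A \rightarrow B) = P(B \mid A) / P(B)$
• Statistical significance: Fisher exact test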
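As a minimal sketch of how support, confidence and lift can be computed for a rule A → B, the following Python example counts matching cases directly. The function name `rule_measures`, the representation of a case as a set of items, and the toy data are assumptions made for illustration; they are not taken from the study or from Magnum Opus.

```python
def rule_measures(cases, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs.

    cases: list of sets of items, one set per case (assumed representation).
    lhs, rhs: sets of items. Assumes at least one case matches lhs and rhs.
    """
    n = len(cases)
    n_lhs = sum(1 for c in cases if lhs <= c)    # cases matching A
    n_rhs = sum(1 for c in cases if rhs <= c)    # cases matching B
    n_both = sum(1 for c in cases if lhs <= c and rhs <= c)
    support = n_both / n                 # P(A and B)
    confidence = n_both / n_lhs          # P(B | A)
    lift = confidence / (n_rhs / n)      # P(B | A) / P(B)
    return support, confidence, lift


# Hypothetical toy data: 10 cases, items "noise" and "ARHI" are illustrative.
cases = (
    [{"noise", "ARHI"}] * 3
    + [{"noise"}] * 1
    + [{"ARHI"}] * 2
    + [set()] * 4
)
s, c, l = rule_measures(cases, {"noise"}, {"ARHI"})
print(s, c, l)
```

A lift above 1 means that B is more frequent among the cases matching A than in the data as a whole.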
Clustering ARHI rules
Measure of similarity or closeness between two association rules: the proportion of cases matched by both rules among cases matched by either one or both rules (a variant of a measure presented by Gupta et al., 1999)
$s(R_i, R_j) = \frac{|R_i \cap R_j|}{|R_i \cup R_j|}$
Example
• Intersection of R22 and R26: 187 cases (both R22 and R26 hold for 187 cases)
• Union of R22 and R26: 190 cases
• The similarity between R22 and R26: 187/190 ≈ 0.98
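The similarity measure can be sketched in Python as a ratio of set sizes, where each rule is represented by the set of case identifiers it matches. The function name `rule_similarity` and the case identifiers are hypothetical; the identifiers are chosen only so that the counts reproduce the R22 and R26 example (187 shared cases, 190 cases in the union).

```python
def rule_similarity(cases_i, cases_j):
    """Proportion of cases matched by both rules among cases matched by either."""
    union = cases_i | cases_j
    if not union:
        return 0.0  # assumption: two rules that match no cases are not similar
    return len(cases_i & cases_j) / len(union)


# Hypothetical case identifiers giving an intersection of 187 and a union of 190.
r22 = set(range(0, 189))   # ids 0..188
r26 = set(range(2, 190))   # ids 2..189

similarity = rule_similarity(r22, r26)
print(len(r22 & r26), len(r22 | r26), round(similarity, 2))
```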
Clustering ARHI rules
Clustering method based on graph-theoretic techniques
• Implemented using Matlab, Java and PostgreSQL
Rule graph
• Rules - nodes
• Similarities between rules - weights of edges between nodes
• Similarities above chosen threshold - connections between nodes
• One connected component is a rule subset or cluster.
• Clustering – searching for connected components
Figure: a connected component (a threshold of 0.3 used for the similarity measure).
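The idea of clustering rules by searching for connected components can be sketched as follows. This is an illustrative Python version, not the authors' Matlab, Java and PostgreSQL implementation; the function name `cluster_rules`, the depth-first search and the toy rules are assumptions. Two rules are connected when their similarity is above the threshold (0.3 here).

```python
from itertools import combinations


def cluster_rules(rule_cases, threshold=0.3):
    """Group rules into connected components of the thresholded rule graph.

    rule_cases: dict mapping a rule name to the set of case ids it matches.
    """
    # Build the graph: an edge joins two rules whose similarity exceeds the threshold.
    neighbours = {rule: set() for rule in rule_cases}
    for a, b in combinations(rule_cases, 2):
        union = rule_cases[a] | rule_cases[b]
        sim = len(rule_cases[a] & rule_cases[b]) / len(union) if union else 0.0
        if sim > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Depth-first search: each connected component is one rule subset (cluster).
    clusters, seen = [], set()
    for start in rule_cases:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            rule = stack.pop()
            if rule in component:
                continue
            component.add(rule)
            stack.extend(neighbours[rule] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical rules: R1-R2 and R2-R3 are similar enough, R4 is isolated.
rules = {"R1": {1, 2, 3, 4}, "R2": {2, 3, 4, 5}, "R3": {4, 5, 6}, "R4": {10, 11}}
clusters = cluster_rules(rules)
print(clusters)
```

Note that R1 and R3 end up in the same cluster although their direct similarity is below the threshold, because both are connected to R2.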
Summarising rule subsets
• Rules represented in HTML documents
  • Program implemented using Matlab
• Rule subset information is given at different levels of detail
• Overall summary listing for rule subsets
  • Number of rules, coverage, main item
Summarising rule subsets
At the next level, rule subset information is extended with information about the other items.
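A two-level summary of one rule subset could be computed along the following lines. This Python sketch rests on assumptions that the slides do not confirm: coverage is taken to be the proportion of all cases matched by at least one rule in the subset, and the main item is taken to be the item occurring in most rules. The function name `summarise_subset` and the item names are hypothetical.

```python
from collections import Counter


def summarise_subset(rules, n_cases):
    """Summarise a non-empty rule subset.

    rules: list of (lhs_items, matched_case_ids) pairs (assumed representation).
    n_cases: total number of cases in the data.
    """
    item_counts = Counter(item for lhs, _ in rules for item in lhs)
    covered = set().union(*(cases for _, cases in rules))
    main_item, _ = item_counts.most_common(1)[0]   # assumed: most frequent item
    return {
        "n_rules": len(rules),
        "coverage": len(covered) / n_cases,        # assumed definition of coverage
        "main_item": main_item,
        # Second level of detail: the remaining items of the subset.
        "other_items": sorted(set(item_counts) - {main_item}),
    }


# Hypothetical rule subset with illustrative item names and case ids.
subset = [
    (("solvent exposure", "male"), {1, 2, 3}),
    (("solvent exposure", "smoker"), {2, 3, 4}),
    (("solvent exposure",), {1, 2, 3, 4, 5}),
]
summary = summarise_subset(subset, n_cases=20)
print(summary)
```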
Representing rule subsets
• Gene colouring
• Marking items of special interest
  • Important SNPs from earlier studies
• Ordering items in rules on the basis of item frequencies
Representing rule subsets
Ordering rules in clusters on the basis of item frequencies
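One possible way to order items and rules by item frequency is sketched below in Python. The exact ordering used in the study is not given on the slides, so the tie-breaking (alphabetical) and the rule ordering (lexicographic on the frequency-ordered items) are assumptions, as are the function name `order_by_item_frequency` and the item names.

```python
from collections import Counter


def order_by_item_frequency(rules):
    """Order items within rules, then rules within a cluster, by item frequency.

    rules: list of tuples of items, all belonging to one cluster.
    """
    freq = Counter(item for rule in rules for item in rule)

    def item_key(item):
        # Most frequent items first; ties broken alphabetically (assumption).
        return (-freq[item], item)

    ordered_rules = [tuple(sorted(rule, key=item_key)) for rule in rules]
    # Rules sharing their most frequent items become adjacent.
    return sorted(ordered_rules, key=lambda rule: [item_key(i) for i in rule])


# Hypothetical cluster with illustrative item names.
cluster = [
    ("smoker", "noisy workplace"),
    ("male", "noisy workplace", "smoker"),
    ("noisy workplace",),
]
ordered = order_by_item_frequency(cluster)
for rule in ordered:
    print(rule)
```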
Representing rule subsets
Similarities between the rules in a similarity matrix
Figure labels: “Solvent exposure” rules; “Noisy workplace” rules; highly overlapping rules
Summary statistics of ARHI rules
                              1-item LHS    2-item LHS      3-item LHS
Size of search space          2231          2.48535·10^6    1.84332·10^9
Minimum support threshold     50 cases      50 cases        1%
Minimum confidence threshold  60%           70%             90%
Number of rules               6             77              518
Total coverage                48.3%         86.6%           96.5%
Support                       3.4 - 13.4%   2.1 - 7.8%      1 - 2%
Confidence                    60 - 67.9%    70 - 80.6%      90 - 100%
Lift                          1.17 - 1.33   1.37 - 1.58     1.76 - 1.96
Common threshold values: lift 1, Fisher exact test: α = 0.01
Conclusions
Developed approach
• simplified the rule exploration by grouping together
  • the rules concerning the same items
  • the rules concerning the same phenomenon
• enabled the recognition of the overlapping rules
  • possibly suggesting more complex interactions
Association analysis
• detected factors found significant in previous studies concerning this ARHI data
• enabled more exhaustive analysis of more complex patterns
  • However, the problem of multiple testing has to be remembered.
• gave new interesting information to the expert physician
  • especially rules concerning osteoporosis
References and acknowledgments
Acknowledgments
The authors are grateful to Baur M, Bille M, Bonaconsa A, Cremers CW, Demeester K, Dhooge I, Diaz-Lacava AN, Espeso A, Fransen E, Hannula S, Hendrickx JJ, Huygen PL, Huyghe J, Huyghe JR, Jensen M, Konings A, Kremer H, Kunst S, Lacava A, Lemkens N, Manninen M, Mazzoli M, Mäki-Torkko E, Orzan E, Parving A, Pawelczyk M, Pfister M, Rajkowska E, Sliwinska-Kowalska M, Sorri M, Steffens M, Stephens D, Topsakal V, Tropitzsch A, Van Camp G, Van de Heyning PH, Van Eyken E, Van Laer L, Verbruggen K, and Wienker TF, for the possibility to use the ARHI data.
References
Gupta et al., Distance based clustering of association rules. In: Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of ANNIE 1999), ASME Press, 1999, pp. 759-764.
Van Laer et al., The grainyhead like 2 gene (GRHL2), alias TFCP2L3, is associated with age-related hearing impairment. Hum Mol Genet 2008; 17(2): 159-69.