MML, inverse learning and medical data-sets
Pritika Sanghi
Supervisors: A./Prof. D. L. Dowe, Dr P. E. Tischer
2
Overview
What is this project about?
Bayesian Networks and their limitations
Some techniques:
  Factor Analysis
  Minimum Message Length (MML)
  Decision Trees & Graphs
  Logistic Regression
Improving Bayesian Networks
What is being done in this project?
3
What is this project about?
The aim of the project is to enhance Bayesian Networks in general and then apply them to certain medical data-sets.
These data-sets have a large number of attributes and small number of cases.
This makes it difficult to model these data-sets using Bayesian Networks.
4
Bayesian Networks
A popular tool for Data Mining.
Model data to infer the probability of a certain outcome.
They represent the frequency distributions for the values that an attribute can take as Conditional Probability Distributions.
P(WS) = 0.75
P(GO) = 0.50

WS  GO  P(S | WS, GO)
T   T   0.01
T   F   0.80
F   T   0.40
F   F   0.99

S   P(A | S)
T   0.95
F   0.00
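As a rough illustration of how such a network is used for inference, the sketch below marginalises over the parents to obtain P(S) and then P(A). It assumes the four values 0.01, 0.80, 0.40 and 0.99 pair with the rows T T, T F, F T and F F in the order they are listed, and that every probability shown is the probability of the value T; the variable names are only illustrative.

```python
from itertools import product

# Priors of the two root nodes, as given on the slide.
p_ws = 0.75
p_go = 0.50

# P(S = T | WS, GO). Pairing rows with values in listed order is an assumption.
p_s_given = {
    (True, True): 0.01,
    (True, False): 0.80,
    (False, True): 0.40,
    (False, False): 0.99,
}

# P(A = T | S), as given on the slide.
p_a_given_s = {True: 0.95, False: 0.00}


def prob(value, p_true):
    """Probability that a binary attribute takes `value`, given P(attribute = T)."""
    return p_true if value else 1.0 - p_true


# Marginalise over both parents to get P(S = T).
p_s = sum(
    prob(ws, p_ws) * prob(go, p_go) * p_s_given[(ws, go)]
    for ws, go in product([True, False], repeat=2)
)

# Marginalise over S to get P(A = T).
p_a = p_a_given_s[True] * p_s + p_a_given_s[False] * (1.0 - p_s)

print(f"P(S) = {p_s:.4f}, P(A) = {p_a:.6f}")
```

Under those assumptions this gives P(S) = 0.4775 and P(A) = 0.453625.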
5
Bayesian Networks - Limitations
When a child node depends on a large number of parent attributes, the conditional probability distribution (CPD) becomes very large: there are 2^n rows in the CPD for n binary parent attributes.
This makes both building the CPD and performing inference with it very time consuming.
A more compact representation for CPDs is required.
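To make the growth concrete, here is a minimal sketch that counts the rows of a full tabular CPD. The function name `cpd_rows` and the `arity` parameter are illustrative only.

```python
def cpd_rows(n_parents, arity=2):
    """Number of rows in a full tabular CPD with n_parents parents of the given arity."""
    return arity ** n_parents


for n in (2, 5, 10, 20):
    print(f"{n:2d} binary parents -> {cpd_rows(n):,} rows")
```

With 20 binary parents the table already needs over a million rows, far more than the number of cases in the data-sets this project targets.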
6
Factor Analysis
Multiple attributes may be defined by a common factor.
The Wallace and Freeman model for Single Factor Analysis will be implemented.
This serves as dimensionality reduction.
The validity of the program built will be checked using the data-sets specified in the Wallace and Freeman paper.
Attributes A and B have a common factor F1.
Attributes C, D and E have a common factor F2.
7
Factor Analysis: Height-Weight of Footy Players
[Figure: three plots. Weight against Height; Predicted Weight against Actual Weight; Predicted Height against Actual Height.]
8
Factor Analysis
The equation for Single Factor Analysis as defined by Wallace and Freeman is:

x_{nk} = \mu_k + a_k \nu_n + \sigma_k r_{nk}

where x_{nk} is the data (attribute k of record n), \mu_k is the mean, a_k is the attribute-related term, \nu_n is the record-related term, \sigma_k is the standard deviation, and r_{nk} are random variates from N(0,1).

Size    Height   Weight
Large   Tall     Average
Large   Short    Heavy
Medium  Average  Average
Small   Short    Light
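The single factor equation can be read as a recipe for generating data. The sketch below simulates records from it using only the standard library. The function name, the parameter values and the choice to draw the record-related term from N(0,1) are assumptions made for illustration; this is not the MML estimator from the Wallace and Freeman paper.

```python
import random


def simulate_single_factor(mu, a, sigma, n_records, seed=0):
    """Generate records x[n][k] = mu[k] + a[k] * v[n] + sigma[k] * r[n][k]."""
    rng = random.Random(seed)
    data, scores = [], []
    for _ in range(n_records):
        v = rng.gauss(0.0, 1.0)  # record-related term (assumed N(0,1) here)
        row = [
            mu[k] + a[k] * v + sigma[k] * rng.gauss(0.0, 1.0)  # r_nk ~ N(0,1)
            for k in range(len(mu))
        ]
        data.append(row)
        scores.append(v)
    return data, scores


# Hypothetical parameters for two attributes (height in cm, weight in kg).
mu = [185.0, 85.0]
a = [6.0, 9.0]
sigma = [3.0, 4.0]

data, scores = simulate_single_factor(mu, a, sigma, n_records=5)
for row, v in zip(data, scores):
    print(f"factor score {v:+.2f} -> height {row[0]:.1f}, weight {row[1]:.1f}")
```

A record with a large positive factor score tends to be both tall and heavy, which is the sense in which a single "size" factor summarises the two attributes.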
9
The Minimum Message Length (MML) Principle
Models the data as a two-part message: the first part states the hypothesis H, and the second part encodes the data D using H.
The best model is the one with the minimum message length.
The message length is -log Pr(H) - log Pr(D|H) = -log Pr(H & D), so minimising it is equivalent to maximising the posterior probability of the hypothesis given the data, Pr(H|D).
Message is represented as:
Hypothesis | Data
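A small worked example may help. The sketch below compares a handful of discrete hypotheses about a coin's bias by their two-part message length in bits, -log2 Pr(H) - log2 Pr(D|H). The hypotheses, the uniform prior and the data are invented for illustration, and the example sidesteps the question of how precisely to state continuous parameters, which a full MML treatment has to address.

```python
import math


def message_length(prior, likelihood):
    """Two-part message length in bits: -log2 Pr(H) - log2 Pr(D|H)."""
    return -math.log2(prior) - math.log2(likelihood)


def bernoulli_likelihood(p, heads, tails):
    """Pr(D|H) for a specific sequence with the given counts and P(head) = p."""
    return (p ** heads) * ((1.0 - p) ** tails)


# Hypothetical hypotheses (coin bias) with a uniform prior over them.
hypotheses = {0.25: 1 / 3, 0.50: 1 / 3, 0.75: 1 / 3}
heads, tails = 8, 2  # hypothetical data

lengths = {
    p: message_length(prior, bernoulli_likelihood(p, heads, tails))
    for p, prior in hypotheses.items()
}
best = min(lengths, key=lengths.get)

for p, bits in lengths.items():
    print(f"H: P(head) = {p:.2f} -> {bits:.2f} bits")
print("shortest message:", best)
```

The hypothesis with the shortest message is also the one with the highest posterior probability, since the two quantities differ only by the constant -log2 Pr(D).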
10
Decision Trees and Graphs
Graphical way of representing the output attribute in terms of the input attributes.
Used to model the Conditional Probability Distribution of the Bayesian Network.
Decision graphs are generalisations of decision trees: they can merge similar sub-trees.
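As a hedged illustration of why a tree can be a more compact CPD than a full table, the sketch below stores P(child = T) in the leaves of a small decision tree over three binary parents. The attribute names X1 to X3 and all probabilities are hypothetical, and the tuple encoding is just one convenient representation.

```python
# Internal node: (attribute_name, subtree_if_true, subtree_if_false).
# Leaf: a float giving P(child = T). All names and values are hypothetical.
tree = ("X1", 0.90, ("X2", 0.60, ("X3", 0.30, 0.05)))


def lookup(node, assignment):
    """Follow the parent values in `assignment` down to a leaf probability."""
    while isinstance(node, tuple):
        attribute, if_true, if_false = node
        node = if_true if assignment[attribute] else if_false
    return node


def count_leaves(node):
    """Number of leaves, i.e. distinct probability entries the tree stores."""
    if not isinstance(node, tuple):
        return 1
    _, if_true, if_false = node
    return count_leaves(if_true) + count_leaves(if_false)


print(lookup(tree, {"X1": False, "X2": False, "X3": True}))
print(count_leaves(tree), "leaves versus", 2 ** 3, "rows in the full table")
```

Here the tree needs four entries where the full table needs eight; a decision graph could go further by letting several branches share one leaf or sub-tree.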
11
Logistic Regression
Mathematical modelling approach used for describing the dependence of a variable on other attributes.
Will be used to define the probability of a discrete target attribute as a function of continuous attributes.
f(z) = 1 / (1 + e^{-z}) + c
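A minimal sketch of how the logistic function turns continuous attributes into the probability of a binary target follows. It uses the standard logistic function 1 / (1 + e^{-z}) and leaves out the additive constant c shown on the slide, since its role is not stated. The linear form of z, the function names and the weights are assumptions for illustration.

```python
import math


def logistic(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weights, bias):
    """P(target = T | x), assuming z is a linear function of the attributes."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)


# Hypothetical weights for two continuous attributes.
weights = [0.8, -0.5]
bias = 0.1
print(predict_probability([1.2, 0.4], weights, bias))
```

The output always lies strictly between 0 and 1, which is what lets it serve as the probability of a discrete target attribute.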
12
Improving Bayesian Networks
Comley and Dowe (2003, 2004), building on ideas from Dowe and Wallace (1998), commenced the work of enhancing Bayesian Networks and introduced Generalised Bayesian Networks.
This project will extend their work by applying some of the techniques described above to Bayesian Networks.
13
What is being done in this project?
Refinement of Generalised Bayesian Networks. Specifically:
First, MML Single Factor Analysis will be added to Bayesian Networks.
Then, Logistic Regression will be looked into.
The Generalised Bayesian Networks will then be used to infer models from some medical data-sets, such as breast cancer data-sets.
If time permits, which it almost definitely won't, other methods of dimensionality reduction and/or decision graphs will be pursued.
14
References
J. W. Comley and D. L. Dowe: General Bayesian Networks and Asymmetric Languages, Proceedings of the 2003 Hawaii International Conference on Statistics and Related Fields (HICS 2003), Honolulu, Hawaii, USA, 5-8 June 2003, ISSN: 1539-7211, pp. 1-18.
J. W. Comley and D. L. Dowe: Minimum Message Length and Generalised Bayesian Nets with Asymmetric Languages, in P. D. Grunwald, I. J. Myung and M. A. Pitt (eds), Advances in Minimum Description Length: Theory and Applications, MIT Press. To be published 2004.
D. L. Dowe and C. S. Wallace: Kolmogorov complexity, minimum message length and inverse learning, in W. Robb (ed), Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), Queensland, Australia, 6-10 July 1998, p. 144.
C. S. Wallace and P. R. Freeman: Single factor analysis by MML estimation, J. Royal Stat. Soc. B, 54(1), 195-209, 1992.
16
Thank You
Any questions?