31 Computing Science & Mathematics University of Stirling
CSC9T6 Information Systems Computing Science & Mathematics University of Stirling
Recap:
SUM RULE
P(A or B) = P(A ∪ B) = P(A) + P(B) with A, B disjoint;
P(A or B) = P(A) + P(B) - P(A ∩ B) otherwise.
CONDITIONAL PROBABILITY OF X GIVEN Y
P(X=A | Y=A)
PRODUCT RULE
P(X=A, Y=A) = P(X=A | Y=A) * P(Y=A)
INDEPENDENT EVENTS
P(X,Y) = P(X) P(Y), i.e. P(X | Y) = P(X)
Note that: P(A or B) = P(B or A) and P(A, B) = P(B, A), but P(A | B) ≠ P(B | A) in general.
Probabilities
Clearly P(X, Y) = P(Y, X), by symmetry of ∩ -- BUT P(X | Y) ≠ P(Y | X).
By definition and the product rule: P(X | Y) P(Y) = P(Y | X) P(X).
Note again that if the knowledge about Y does not change the probability of X, i.e. P(X | Y) = P(X) then the two events are said to be independent, and P(X,Y) = P(X) P(Y), as in the case of picking two cards from different decks (or reinserting the card after each test).
It follows (Bayes' Theorem):

P(X | Y) = P(Y | X) P(X) / P(Y)
Important result. Informally: knowing the probability of Y given X, together with the marginal probabilities of X and of Y, we can derive the probability of X given Y. We will see an example soon.
Probabilities: Bayes' Theorem
Summing up:
SUM RULE
P(A or B) = P(A ∪ B) = P(A) + P(B) with A, B disjoint;
P(A or B) = P(A) + P(B) - P(A ∩ B) otherwise.
CONDITIONAL PROBABILITY OF X GIVEN Y
P(X=A | Y=A)
PRODUCT RULE
P(X=A, Y=A) = P(X=A | Y=A) * P(Y=A)
INDEPENDENT EVENTS
P(X,Y) = P(X) P(Y), i.e. P(X | Y) = P(X)
BAYES' THEOREM
P(X | Y) = P(Y | X) P(X) / P(Y)
Probabilities
We have Box and Fruit, random variables. In the red box (B=r) there are 2 apples (a) and 6 oranges (o). In the blue box (B=b) there are 3 apples and 1 orange.
- If I pick a fruit from the red box, what would you expect?
- How can you express this?
- Map all conditional probabilities of fruit | box.
- WORKING hypothesis: P(B=r) = 4/10, P(B=b) = 6/10
Apples and oranges in boxes [Bishop]
P(F=a) = ?
It is the sum of the events "pick an apple from red" or "pick an apple from blue":
P( (F=a, B=r) or (F=a, B=b) )
= P(F=a, B=r) + P(F=a, B=b)                  (sum rule)
= P(F=a | B=r) P(B=r) + P(F=a | B=b) P(B=b)  (product rule)
= 0.25 * 0.4 + 0.75 * 0.6 = 0.55
Hence P(F=o) = 1 - 0.55 = 0.45 (sum rule)
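The marginalisation above can be sketched in a few lines of Python (the dictionary encoding is mine, not from the slides):

```python
# Conditional probabilities of fruit given box, from the box contents
p_fruit_given_box = {
    "r": {"a": 2 / 8, "o": 6 / 8},   # red box: 2 apples, 6 oranges
    "b": {"a": 3 / 4, "o": 1 / 4},   # blue box: 3 apples, 1 orange
}
p_box = {"r": 0.4, "b": 0.6}         # the working hypothesis

# Sum rule over boxes, product rule inside each term
p_apple = sum(p_fruit_given_box[b]["a"] * p_box[b] for b in p_box)
print(round(p_apple, 2))       # 0.55
print(round(1 - p_apple, 2))   # P(F=o) = 0.45
```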
Apples and oranges in boxes [Bishop]
Let us try to "invert" our reasoning.
Suppose the two boxes have the same probability. If I observe an orange, on which box would you bet?
Apples and oranges in boxes [Bishop]
Can we make this more precise, even in the original case, where the boxes have associated probabilities? How? Which probability are we looking for?

P(B=r | F=o) = P(F=o | B=r) P(B=r) / P(F=o)    (Bayes' theorem)
             = (0.75 * 0.4) / 0.45 = 0.6666...
P(B=b | F=o) ?
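The "inversion" can be written out as a short sketch (the variable names are mine); it answers both posteriors at once:

```python
p_box = {"r": 0.4, "b": 0.6}
p_orange_given_box = {"r": 6 / 8, "b": 1 / 4}

# Marginal P(F=o) by the sum and product rules
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)  # 0.45

# Posterior over boxes, given that an orange was observed (Bayes' theorem)
posterior = {b: p_orange_given_box[b] * p_box[b] / p_orange for b in p_box}
print(round(posterior["r"], 4))  # 0.6667
print(round(posterior["b"], 4))  # 0.3333
```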
Apples and oranges in boxes [Bishop]
Are B and F independent ?
P(F=o | B=r) = 0.75 ≠ 0.45 = P(F=o), so no: B and F are not independent.
Apples and oranges in boxes [Bishop]
A note on the prior/posterior "inversion" we are going through...
Suppose the two boxes have the same probability. If I observe an orange, on which box would you bet? Probably on the red one (posterior 0.75). Remember that, without any knowledge of the fruit, one knows that the blue box is the more probable one (0.6). Via Bayes' theorem, a subsequent observation (F=o) modifies our prior probability into a posterior probability.
Apples and oranges in boxes [Bishop]
P(rain) = 0.4
P(grass is wet) = 0.6 (rain, watering, ...)
P(R, W) ?
Independent? P(W | R) = 1 ≠ 0.6 = P(W). No.
Easily, P(W | R) = 1, so P(R, W) = P(W | R) * P(R) = 1 * 0.4 = 0.4
P(R | W) ? P(R | W) = P(W, R)/P(W) = 0.4 / 0.6 ≈ 0.67
NOTE: P(A | B) cannot be calculated from P(A) and P(B) alone.
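The grass-and-rain numbers can be checked in a couple of lines (a sketch; the names are mine). Note how the conditional P(W | R) is the extra ingredient beyond P(R) and P(W):

```python
p_rain = 0.4
p_wet = 0.6
p_wet_given_rain = 1.0                  # rain always wets the grass

p_joint = p_wet_given_rain * p_rain     # product rule: P(R, W) = 0.4
p_rain_given_wet = p_joint / p_wet      # P(R | W) = 0.4 / 0.6

print(round(p_joint, 3))                # 0.4
print(round(p_rain_given_wet, 3))       # 0.667
```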
Grass and rain
Thomas Bayes (1701 - 1761) Logic and theology at Edinburgh. FRS 1742. Probability relevant for gambling and the new concept of insurance. "Essay towards solving a problem in the doctrine of chances" (1764 - three years after his death). [Then Laplace independently rediscovered the theory in a general form].
(Informally) An interpretation of probability as measure of uncertainty, based on so-far observed facts that can be revised. Remember the case of oranges and boxes.
Pierre-Simon Laplace (1749-1827) "Probability theory is nothing else but common sense reduced to calculation". Discussion about the same ideas of Bayes (inverse probability calculation), with applications to life expectancy, jurisprudence, planetary masses, error estimation (1812).
Probabilities
Outline.
1. Introduction
2. Probability
3. Bayesian classification
• Bayesian Classifiers are statistical classifiers – based on Bayes' Theorem (see following slides)
• They can predict the probability that a particular sample is a member of a particular class
• Perhaps the simplest Bayesian Classifier is known as the Naïve Bayesian Classifier based on an independence assumption (see later on …)
• In very simple terms, this means that we assume the values taken by one variable are not influenced by the values taken by another: no relationship exists between them
• Although the independence assumption is often a bold one to make, performance is still often comparable to Decision Tree and Neural Network classifiers (explore on Wikipedia, and the references therein, about "the surprisingly good performance of Naive Bayes in classification")
What is a Bayesian Classifier?
Bayes' Classifier: An Example
Examples of things we can derive from our dataset:
• 4 males took advantage of the Mag.(azine) Promo, and these 4 males represent 2/3 of the total male population,
• 3/4 of females purchased the Mag. Promo.

Mag. Promo   TV Promo   LI Promo   C.C. Ins   Sex
Y            N          N          N          M
Y            Y          Y          N          F
N            N          N          N          M
Y            Y          Y          Y          M
Y            N          Y          N          F
N            N          N          N          F
Y            N          Y          Y          M
N            Y          N          N          M
Y            N          N          N          M
Y            Y          Y          Y          F

For our example, let's use sex as the output attribute whose value is to be predicted.
Bayes' Classifier: An Example
Suppose we want to classify a new instance (or customer), called Lee. We are told the following holds true for our new customer, i.e. this is our evidence E:

Mag. Promo = Y and TV Promo = Y and LI Promo = N and C.C. Ins. = N

We want to know if Lee is male (H1) or female (H2). Note that there was no example of YYNN in the data. We apply Bayes' classifier and compute a probability for each hypothesis.

Mag. Promo   TV Promo   LI Promo   C.C. Ins   Sex
Y            N          N          N          M
Y            Y          Y          N          F
N            N          N          N          M
Y            Y          Y          Y          M
Y            N          Y          N          F
N            N          N          N          F
Y            N          Y          Y          M
N            Y          N          N          M
Y            N          N          N          M
Y            Y          Y          Y          F
• Firstly we list the distribution of the output attribute values for each input attribute.
• This is done using a distribution table.

Bayes' Classifier: An Example

              Mag Promo     TV Promo      LI Promo      C.C. Ins
              M      F      M      F      M      F      M      F
Y             4      3      2      2      2      3      2      1
N             2      1      4      2      4      1      4      3
Ratio Y/Tot   4/6    3/4    2/6    2/4    2/6    3/4    2/6    1/4
Ratio N/Tot   2/6    1/4    4/6    2/4    4/6    1/4    4/6    3/4
So for example, 4 males answered Y to the Mag Promo
2 out of the total of 6 males answered Y to the LI Promo
Ratio for Y/T and N/T sum to 1 for each column
H1: Lee is male
P(sex = M | E) = P(E | sex = M) P(sex = M) / P(E)
Starting with P(E | sex = M) … This is
P(Mag. Promo = Y, TV Promo = Y, LI Promo = N, C.C. Ins = N | sex = M)

We have (mathematical justification, writing E = E1 ∩ E2 ∩ E3 ∩ E4):

P( E1 ∩ E2 ∩ E3 ∩ E4 | M ) P(M)
= P( E1 ∩ E2 ∩ E3 ∩ E4 ∩ M )                        [product rule: P(A|B) P(B) = P(A ∩ B)]
= P( E1 | E2 ∩ E3 ∩ E4 ∩ M ) P( E2 ∩ E3 ∩ E4 ∩ M )  [product rule again]
= P( E1 | M ) P( E2 ∩ E3 ∩ E4 ∩ M )                 (*)
= …
= P( E1 | M ) P( E2 | M ) P( E3 | M ) P( E4 | M ) P( M )

(*) Assumption: E1, … E4 are conditionally independent given the class M, i.e. the information added by knowing that E2, … E4 have happened does not change P(E1 | M), and can be dropped. This is not always correct -- it is an approximation -- but it often works well (and fast!).
Bayes' Classifier: Back to hypotheses H1 and H2
H1: Lee is male
P(sex = M | E) = P(E | sex = M) P(sex = M) / P(E)
Then we can calculate the conditional probability values for each piece of evidence as
explained:
P(Mag. Promo = Y | sex = M) = 4/6 P(TV Promo = Y | sex = M) = 2/6 P(LI Promo = N | sex = M) = 4/6 P(C.C. Ins = N | sex = M) = 4/6
and P(sex = M) = 6/10 = 3/5
These values are easily obtained from our distribution table. It follows:
P(E | sex = M) P(sex = M) = (4/6) * (2/6) * (4/6) * (4/6) * (3/5) = 8/81 * 3/5
Bayes' Classifier: Back to hypotheses H1 and H2
Analogously for H2: Lee is female
P(sex = F | E) = P(E | sex = F) P(sex = F) / P(E)
And we have
P(Mag. Promo = Y | sex = F) = 3/4 P(TV Promo = Y | sex = F) = 2/4 P(LI Promo = N | sex = F) = 1/4 P(C.C. Ins = N | sex = F) = 3/4
P(sex = F) = 2/5
It follows:
P(sex = F | E) = (9/128) * (2/5) / P(E)
Bayes' Classifier: Back to hypotheses H1 and H2
Finally,

P(sex = F | E) = (9/128) * (2/5) / P(E)  <  (8/81) * (3/5) / P(E) = P(sex = M | E)

Hence, Bayes' classifier tells us that Lee is most likely a male.
Calculating also the value of P(E), i.e. the product of the (assumed conditionally independent!) probabilities of
Mag. Promo = Y, TV Promo = Y, LI Promo = N and C.C. Ins = N:
P(E) = (7/10) * (4/10) * (5/10) * (7/10) = 0.098
we have:
P(sex = F | E) ≈ 0.287 < 0.605 ≈ P(sex = M | E)
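The whole computation can be reproduced exactly with rational arithmetic; a sketch (the row encoding and the `score` helper are mine, not from the slides):

```python
from fractions import Fraction as F

# Rows of the promotion dataset: (Mag, TV, LI, CC, Sex)
data = [
    ("Y","N","N","N","M"), ("Y","Y","Y","N","F"), ("N","N","N","N","M"),
    ("Y","Y","Y","Y","M"), ("Y","N","Y","N","F"), ("N","N","N","N","F"),
    ("Y","N","Y","Y","M"), ("N","Y","N","N","M"), ("Y","N","N","N","M"),
    ("Y","Y","Y","Y","F"),
]
evidence = ("Y", "Y", "N", "N")     # Lee: Mag=Y, TV=Y, LI=N, CC=N

def score(sex):
    """Unnormalised posterior P(E | sex) * P(sex) under naive independence."""
    rows = [r for r in data if r[4] == sex]
    prior = F(len(rows), len(data))
    lik = F(1)
    for i, v in enumerate(evidence):   # multiply per-attribute conditionals
        lik *= F(sum(1 for r in rows if r[i] == v), len(rows))
    return lik * prior

print(score("M"))                 # 8/135  (= 8/81 * 3/5)
print(score("F"))                 # 9/320  (= 9/128 * 2/5)
print(score("M") > score("F"))    # True: Lee is classified as male
```

Comparing the two scores is enough to classify: the common factor 1/P(E) cancels, so P(E) never needs to be computed for the decision itself.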
Bayes' Classifier: Back to hypotheses H1 and H2
1. What proportion of Glasgow customers buy books?
2. What proportion of all customers buy DVDs?
3. Given a new customer that we know buys Videos, is it more likely that they live in Glasgow or Stirling?
Classifying according to further evidence!
CDs Books DVDs Videos Region
Y Y N N Stirling
Y N Y N Glasgow
Y N Y Y Glasgow
Y Y Y N Glasgow
N N Y N Stirling
N Y Y Y Stirling
Y N Y N Stirling
Y Y Y Y Glasgow
Items Bought from Amazon
After a bit of feature extraction ...
1. P(Books = Y | Region = G) = 2/4 = 1/2
2. P(DVDs = Y) = 7/8
3. ??

              CDs        Books      DVDs       Videos
              G    S     G    S     G    S     G    S
Y             4    2     2    2     4    3     2    1
N             0    2     2    2     0    1     2    3
Ratio Y/Tot   1    1/2   1/2  1/2   1    3/4   1/2  1/4
Ratio N/Tot   0    1/2   1/2  1/2   0    1/4   1/2  3/4
Items Bought from Amazon
P(Glasgow | Videos) = P(Videos | Glasgow) P(Glasgow) / P(Videos) = (1/2 * 1/2) / (3/8) = 2/3
Items Bought from Amazon
P(Stirling | Videos) = P(Videos | Stirling) P(Stirling) / P(Videos) = (1/4 * 1/2) / (3/8) = 1/3
Items Bought from Amazon
P(Stirling | Videos) = P(Videos | Stirling) P(Stirling) / P(Videos) = (1/4 * 1/2) / (3/8) = 1/3
P(Glasgow | Videos) = P(Videos | Glasgow) P(Glasgow) / P(Videos) = (1/2 * 1/2) / (3/8) = 2/3
... most likely from Glasgow!
Note: P(Stirling | Videos) + P(Glasgow | Videos) = 1
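The two posteriors can be checked numerically with a short sketch (the dictionary names are mine; the numbers come from the Amazon table):

```python
p_region = {"Glasgow": 0.5, "Stirling": 0.5}          # 4 customers each
p_videos_given = {"Glasgow": 2 / 4, "Stirling": 1 / 4}

# Marginal P(Videos = Y) by the sum and product rules
p_videos = sum(p_videos_given[r] * p_region[r] for r in p_region)   # 3/8

# Posterior over regions given Videos = Y (Bayes' theorem)
posterior = {r: p_videos_given[r] * p_region[r] / p_videos for r in p_region}

print(round(posterior["Glasgow"], 4))     # 0.6667
print(round(posterior["Stirling"], 4))    # 0.3333
print(round(sum(posterior.values()), 4))  # 1.0 -- they sum to one
```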
Exercise. Is it true in general that P(a|r) + P(b|r) = 1? Note that in our case a ∪ b = T, the whole sample space. (One either comes from Glasgow or from Stirling, but not from both places: indeed a ∩ b = ∅, too.) Is it true when assuming a ∪ b = T and a ∩ b = ∅?
P(a|r) + P(b|r) = P(a,r)/P(r) + P(b,r)/P(r)   (by def.)
                = P(a ∪ b, r)/P(r)            (by the sum rule, since a ∩ b = ∅)
                = P(T, r)/P(r) = P(r)/P(r) = 1
Note: Revisit the Naïve Bayesian Classifier example about promotions. Does the above result hold there? If not, why? How is the situation different here?
Items Bought from Amazon
Why use Bayesian Classifiers?
• There are several classification methods; none has been found to be superior to all others in every case (i.e. on every data set drawn from a particular domain of interest)
• Methods can be compared based on: – accuracy – interpretability of the results – robustness of the method with different datasets – training time – scalability
An Important Remark
Mind the difference between
calculating probabilities from a known set of outcomes (for example, the probability of tossing heads when we know the only outcomes are heads and tails), and estimating the probability of an event from data, where the probability is not directly expressed by the data.
The probabilities calculated from data are estimates, not true values. If we tossed a coin 10 times to generate data, we might easily get 6 heads and 4 tails. Without knowing about how coins work, we would estimate the probability of getting heads as 6/10 – not a bad estimate, but incorrect. The more data we have, the more reliable our estimates get.
Results can be dramatically sensitive to the specific evidence you have. E.g. suppose your evidence gives the estimate P(Head) = 0.6; then you can estimate the probability of getting 6 heads in a row, even if that never happened in your data:
0.6 * 0.6 * 0.6 * 0.6 * 0.6 * 0.6 ≈ 0.047, i.e. about 1 in 21 tries! (With P(Head) = 0.5 it is ≈ 0.016.)
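The remark can be illustrated in two lines (a sketch; the coin and its data are the slide's hypothetical example):

```python
est_heads = 6 / 10    # estimated from 10 tosses that happened to give 6 heads
true_heads = 0.5      # the value we "know" for a fair coin

# Probability of six heads in a row under each value of P(Head)
print(round(est_heads ** 6, 3))    # 0.047  (about 1 in 21)
print(round(true_heads ** 6, 3))   # 0.016
```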
Bayesian Belief Networks
Modern pattern recognition is based on probabilities, mainly on the sum and product rules.
Everything could be treated algebraically.
However, graphical models offer advantages, in terms of:
• visualisation, e.g. of structure and relationships;
• communication: they are easy to grasp;
• expressiveness, e.g. graphical manipulations corresponding to mathematical operations.
Probabilistic graphical models.
• What are they?
– Bayesian Belief Networks (BBNs) are a way of modelling probabilities based on data or knowledge to allow probabilities of new events to be estimated
• What are they used for?
– Intelligent decision aids, data fusion, intelligent diagnostic aids, automated free text understanding, data mining
• Where did they come from?
– Cross fertilization of ideas between the artificial intelligence, decision analysis, and statistic communities
Bayesian Belief Networks
Definition of a Bayesian Network
• Factored joint probability distribution as a directed graph:
• structure for representing knowledge about uncertain variables • computational architecture for computing (propagating) the impact of evidence
on beliefs
• Knowledge structure:
• random variables are depicted as nodes • arcs represent probabilistic dependence between variables • conditional probabilities encode the strength of the dependencies
• Computational architecture:
• computes posterior probabilities given evidence about selected nodes • exploits probabilistic independence for efficient computation
[Figure: a two-node network, B → F]
[Network structure:
 Patient Information: Visit to Asia, Smoking
 Medical Difficulties: Tuberculosis, Lung Cancer, Bronchitis, Tuberculosis or Cancer
 Diagnostic Tests: XRay Result, Dyspnea]
Example from Medical Diagnostics
• Relationship knowledge is modeled by deterministic functions, logic and conditional probability distributions
[Figure: the network, with two of its tables shown.]

Tuberculosis or Cancer (deterministic OR):
Tuberculosis   Lung Cancer   Tub or Can
Present        Present       True
Present        Absent        True
Absent         Present       True
Absent         Absent        False

Dyspnea, conditioned on Tub or Can and Bronchitis:
Tub or Can   Bronchitis   Dyspnea Present   Dyspnea Absent
True         Present      0.90              0.10
True         Absent       0.70              0.30
False        Present      0.80              0.20
False        Absent       0.10              0.90
Example from Medical Diagnostics
• Propagation algorithm processes relationship information to provide likelihood information for occurrence of each state for each node
Beliefs after propagation (%):
Tuberculosis:            Present 1.04,   Absent 99.0
XRay Result:             Abnormal 11.0,  Normal 89.0
Tuberculosis or Cancer:  True 6.48,      False 93.5
Lung Cancer:             Present 5.50,   Absent 94.5
Dyspnea:                 Present 43.6,   Absent 56.4
Bronchitis:              Present 45.0,   Absent 55.0
Visit To Asia:           Visit 1.00,     No Visit 99.0
Smoking:                 Smoker 50.0,    NonSmoker 50.0
Example from Medical Diagnostics
• As a finding is entered, the propagation algorithm updates the beliefs attached to each relevant node in the network
• Interviewing the patient produces the information that "Visit to Asia" is "Visit"
• This finding propagates through the network and the belief functions of several nodes are updated
Updated beliefs (%):
Tuberculosis:            Present 5.00,   Absent 95.0
XRay Result:             Abnormal 14.5,  Normal 85.5
Tuberculosis or Cancer:  True 10.2,      False 89.8
Lung Cancer:             Present 5.50,   Absent 94.5
Dyspnea:                 Present 45.0,   Absent 55.0
Bronchitis:              Present 45.0,   Absent 55.0
Visit To Asia:           Visit 100,      No Visit 0
Smoking:                 Smoker 50.0,    NonSmoker 50.0
Example from Medical Diagnostics
Updated beliefs (%):
Tuberculosis:            Present 5.00,   Absent 95.0
XRay Result:             Abnormal 18.5,  Normal 81.5
Tuberculosis or Cancer:  True 14.5,      False 85.5
Lung Cancer:             Present 10.0,   Absent 90.0
Dyspnea:                 Present 56.4,   Absent 43.6
Bronchitis:              Present 60.0,   Absent 40.0
Visit To Asia:           Visit 100,      No Visit 0
Smoking:                 Smoker 100,     NonSmoker 0
• Further interviewing of the patient produces the finding "Smoking" is "Smoker"
• This information propagates through the network
Example from Medical Diagnostics
Updated beliefs (%):
Tuberculosis:            Present 0.12,   Absent 99.9
XRay Result:             Abnormal 0,     Normal 100
Tuberculosis or Cancer:  True 0.36,      False 99.6
Lung Cancer:             Present 0.25,   Absent 99.8
Dyspnea:                 Present 52.1,   Absent 47.9
Bronchitis:              Present 60.0,   Absent 40.0
Visit To Asia:           Visit 100,      No Visit 0
Smoking:                 Smoker 100,     NonSmoker 0
• Finished with interviewing the patient, the physician begins the examination
• The physician now moves to specific diagnostic tests such as an X-Ray, which results in a "Normal" finding, which propagates through the network
• Note that the information from this finding propagates backward and forward through the arcs
Example from Medical Diagnostics
Updated beliefs (%):
Tuberculosis:            Present 0.19,   Absent 99.8
XRay Result:             Abnormal 0,     Normal 100
Tuberculosis or Cancer:  True 0.56,      False 99.4
Lung Cancer:             Present 0.39,   Absent 99.6
Dyspnea:                 Present 100,    Absent 0
Bronchitis:              Present 92.2,   Absent 7.84
Visit To Asia:           Visit 100,      No Visit 0
Smoking:                 Smoker 100,     NonSmoker 0
• The physician also determines that the patient is having difficulty breathing; the finding "Present" is entered for "Dyspnea" and is propagated through the network
• The doctor might now conclude that the patient has bronchitis and does not have tuberculosis or lung cancer
Example from Medical Diagnostics
Applications
• Industrial
  • Processor Fault Diagnosis - by Intel
  • Auxiliary Turbine Diagnosis - GEMS by GE
  • Diagnosis of space shuttle propulsion systems - VISTA by NASA/Rockwell
  • Situation assessment for nuclear power plant - NRC
• Military
  • Automatic Target Recognition - MITRE
  • Autonomous control of unmanned underwater vehicle - Lockheed Martin
  • Assessment of Intent
• Medical Diagnosis
  • Internal Medicine
  • Pathology diagnosis - Intellipath by Chapman & Hall
  • Breast Cancer Manager with Intellipath
• Commercial
  • Financial Market Analysis
  • Information Retrieval
  • Software troubleshooting and advice - Windows 95 & Office 97
  • Pregnancy and Child Care - Microsoft
  • Software debugging - American Airlines' SABRE online reservation system
Glandular fever
• Suppose we know that on average, 1% of the population have had glandular fever.
P(had_GF) = 0.01
• Suppose we have a test for having had glandular fever such that:
– For a person who has had GF the test would give a positive result with probability 0.977
– For a person who has not had GF the test would give a negative result with probability 0.926
Q: How could this information be represented as a BBN Q: How could the BBN be used to find out new information?
Step 1: "feature extraction"
List information we are given, determine information we can deduce from that information. • "on average, 1% of the population have had glandular fever"
P(had GF) = 0.01 => P(not had GF) = 0.99
• "for a person who has had GF the test would give a positive result with probability 0.977"
Note this is not "a person has had GF"! It is
P(+ve test | person has had GF) = 0.977
from which we have that the probability of having had GF but receiving a -ve test result is
P(-ve test | person has had GF) = 1 - P(+ve test | person has had GF) = 0.023
Step 1: "feature extraction"
• “for a person who has not had GF the test would give a negative result with probability 0.926”
P(-ve test | person has not had GF) = 0.926
from which
P(+ve test | person has not had GF) = 1 - P(-ve test | person has not had GF) = 0.074
Summing up:
P(had GF) = 0.01 P(not had GF) = 0.99
P(+ve test | had GF) = 0.977 P(-ve test | had GF) = 0.023
P(+ve test | not had GF) = 0.074 P(-ve test | not had GF) = 0.926
Define the structure of the BBN. Important issue! Nodes are random variables. Two choices:
1. Top-level nodes. These have no probabilistic dependencies (no parents). They could be understood as observations.
2. Dependency relationships. These describe our interpretation of the dependencies in our model.
Both choices contribute to the structure of the network. Different choices are generally possible. The final structure must not have cycles in the dependency relationship, i.e. the directed graph must be acyclic (a DAG), in order to guarantee efficient propagation algorithms.
In our example:
• Nodes: Had_GF, Test_Result
• Relationships: the Had_GF node influences the state of the Test_Result node
Step 2: BBN construction
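The acyclicity requirement can be checked mechanically; a minimal sketch (the `is_dag` helper is mine, using Kahn's topological-sort algorithm, not anything from the slides):

```python
def is_dag(edges):
    """Return True iff the directed graph given as (parent, child) pairs has no cycle."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    for _, child in edges:
        indeg[child] += 1
    ready = [n for n in nodes if indeg[n] == 0]   # nodes with no remaining parents
    seen = 0
    while ready:
        n = ready.pop()
        seen += 1
        for parent, child in edges:               # "remove" n's outgoing arcs
            if parent == n:
                indeg[child] -= 1
                if indeg[child] == 0:
                    ready.append(child)
    return seen == len(nodes)                     # a cycle leaves nodes unprocessed

print(is_dag([("Had_GF", "Test_Result")]))            # True: valid BBN structure
print(is_dag([("A", "B"), ("B", "C"), ("C", "A")]))   # False: cycle, not allowed
```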
[Figure: a model with 1 parent node related to 1 child node]
Finally,
3. Establish values for Conditional Probability Tables (CPTs), i.e. a representation of the conditional probability of the values of the random variable represented by a node, conditioned on the parent random variable(s).
Application of a formula. Values must be consistent! E.g. each row (over all the possible values) must sum to 1.

              Test +ve (True)   Test -ve (False)
Had GF True   0.977 (97.7%)     0.023 (2.3%)
Had GF False  0.074 (7.4%)      0.926 (92.6%)
Step 2: BBN construction
Had GF node (no parents):
True   0.01 (1%)
False  0.99 (99%)

Test +ve node (parent: Had GF):
              Test +ve (True)   Test -ve (False)
Had GF True   0.977 (97.7%)     0.023 (2.3%)
Had GF False  0.074 (7.4%)      0.926 (92.6%)
Step 2: BBN construction
Use the network to determine new information.
Examples of information we may wish to determine:
Q1: Given a person has had GF, what is the probability of a negative test result?
Q2: What is the probability of a +ve test result?
Q3: Given a positive test, what is the probability that the person has had GF?
We will start by looking at the calculations that these questions require. This is actually computed by the network, once it has been properly designed.
Step 3: Use of the BBN
Q1: Given a person has had GF, what is the probability of a negative test?
Well, this is easy, as we have already determined this information in Step 2. Formulated in terms of probability, we wish to find out P(-ve test | has had GF).
Looking back a few slides, P(-ve test | has had GF) = 0.023.
Step 3: Use of the BBN
Q2: What is the probability of a +ve test result?
Well, we need to think of all the possible situations that could happen which would lead to a +ve test result.
Situation 1. +ve test result and had GF: P(+ve test result ∩ had GF)
Situation 2. +ve test result and not had GF: P(+ve test result ∩ not had GF)
We don't know the above information, BUT we can now use what we know of conditional probability to calculate it…
Step 3: Use of the BBN
Q2: What is the probability of a +ve test result? Again from P(A ∩ B) = P(B|A) P(A):
P(+ve test ∩ had GF) = P(had GF | +ve test) * P(+ve test) P(+ve test ∩ not had GF) = P(not had GF | +ve test) * P(+ve test)
Still, P(+ve test) has not yet been defined. BUT from P(A ∩ B) = P(B ∩ A), then
P(had GF ∩ +ve test) = P(+ve test | had GF) * P(had GF) = 0.977 * 0.01 = 0.00977
P(not had GF ∩ +ve test) = P(+ve test | not had GF) * P(not had GF) = 0.074 * 0.99 = 0.07326
Step 3: Use of the BBN
Q2: What is the probability of a +ve test result? Finally, by the sum rule:
P(+ve test) = P(+ve test ∩ had GF) + P(+ve test ∩ not had GF) = 0.00977 + 0.07326 = 0.08303
If we chose a random person and gave them a test, they would have a 0.08 probability of showing positive.
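The Q2 calculation, and with the same numbers the Q3 posterior, can be sketched as follows (variable names are mine):

```python
p_gf = 0.01               # P(had GF)
p_pos_given_gf = 0.977    # P(+ve test | had GF)
p_pos_given_not = 0.074   # P(+ve test | not had GF)

# Sum rule over the two disjoint situations, product rule inside each term
p_pos = p_pos_given_gf * p_gf + p_pos_given_not * (1 - p_gf)
print(round(p_pos, 5))            # 0.08303

# The same numbers answer Q3 via Bayes' theorem
p_gf_given_pos = p_pos_given_gf * p_gf / p_pos
print(round(p_gf_given_pos, 3))   # 0.118
```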
Step 3: Use of the BBN
Q3: Given a positive test, what is the probability that the person has had GF?
P(has had GF | +ve test) ?
Again, this isn't a piece of information that has already been defined. We can calculate the answer by exploiting the network and Bayes' theorem:

P(A | B) = P(B | A) P(A) / P(B)

Given that, from Q2, we have P(+ve test) = 0.08303:

P(has had GF | +ve test) = P(+ve test | has had GF) P(has had GF) / P(+ve test) = (0.977 * 0.01) / 0.08303 ≈ 0.118
Step 3: Use of the BBN
Q3: Given a positive test, what is the probability that the person has had GF?
P(has had GF | +ve test) = 0.118
So the test isn't so useful after all:
– remember that 7.4% of those who've never had GF get a positive result
– and they far outnumber those people who have had it and got a positive result!
Step 3: Use of the BBN
A BBN model with a hierarchy of 3 nodes
Had GF node (no parents):
True   0.01 (1%)
False  0.99 (99%)

Test +ve node (parent: Had GF):
              Test +ve (True)   Test -ve (False)
Had GF True   0.977 (97.7%)     0.023 (2.3%)
Had GF False  0.074 (7.4%)      0.926 (92.6%)
Facts we deduced from the last time…
P(-ve test | has had GF) = 0.023
P(+ve test) = 0.08303
P(had GF | +ve test) = 0.118
Extension to the Glandular Fever Model
• Suppose we are informed that the school nurse sends home 80% of students that have a positive GF test.
• She also sends home 5% of students for other medical reasons (i.e. students that have not had a positive GF test).
Q1. How do we incorporate this new information into our network?
Q2. What is the probability of being sent home?
Q3. Given that a child is sent home, what is the probability of them having had a negative test?
Extension to the Glandular Fever Model
Q1. How do we incorporate new information?
Had GF node (prior):
  True   0.01 (1%)
  False  0.99 (99%)

Test +ve node, CPT P(Test result | Had GF):
                 Test +ve (True)   Test -ve (False)
  Had GF True    0.977 (97.7%)     0.023 (2.3%)
  Had GF False   0.074 (7.4%)      0.926 (92.6%)

Sent home node, CPT P(Sent home | Test result):
                   Sent home True   Sent home False
  Test +ve True    0.8 (80%)        0.2 (20%)
  Test +ve False   0.05 (5%)        0.95 (95%)
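The extended network is fully specified by these three small tables. A sketch in Python (the dictionary layout and variable names are my own choice): once the CPTs are written down, any marginal can be read off by summing the full joint.

```python
# Prior for the root node: P(Had GF)
p_gf = {True: 0.01, False: 0.99}
# CPT: P(Test +ve | Had GF)
p_test = {True:  {True: 0.977, False: 0.023},
          False: {True: 0.074, False: 0.926}}
# CPT: P(Sent home | Test +ve)
p_home = {True:  {True: 0.8,  False: 0.2},
          False: {True: 0.05, False: 0.95}}

# e.g. the marginal P(sent home) by summing the joint over the other nodes
p_home_true = sum(p_gf[g] * p_test[g][t] * p_home[t][True]
                  for g in (True, False) for t in (True, False))
print(round(p_home_true, 3))  # 0.112
```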
We need to add up all the combinations of scenarios which would lead to being sent home:
- due to a +ve test
- due to another reason

P(home) = P(home | +ve test) * P(+ve test) + P(home | -ve test) * P(-ve test)
        = 0.8 * P(+ve test) + 0.05 * P(-ve test)

We need to calculate P(+ve test) and P(-ve test). We already know
P(+ve test) = P(+ve test | had GF) * P(had GF) + P(+ve test | not had GF) * P(not had GF)
            = 0.00977 + 0.07326 = 0.08303
from which P(-ve test) = 1 - 0.08303 = 0.91697.
And finally
P(home) = 0.8 * P(+ve test) + 0.05 * P(-ve test)
        = 0.8 * 0.08303 + 0.05 * 0.91697 = 0.1122725 = 0.112 (3 d.p.)
Q2. What is the probability of being sent home?
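The same two-step calculation can be sketched in Python (variable names are mine):

```python
# P(+ve test) by total probability over Had GF
p_pos = 0.977 * 0.01 + 0.074 * 0.99   # = 0.08303
p_neg = 1 - p_pos                     # = 0.91697

# P(home) by total probability over the test outcome
p_home = 0.8 * p_pos + 0.05 * p_neg
print(round(p_home, 3))  # 0.112
```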
We’ve already done a similar calculation to this before:
P(-ve test | home) = P(home | -ve test) * P(-ve test) / P(home)
                   = (0.05 * 0.91697) / 0.1122725 = 0.408 (3 d.p.)
Q3. Given that a child is sent home, what is the probability of them having had a negative test?
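As a sketch in Python, reusing the quantities computed in Q2:

```python
p_neg = 0.91697           # P(-ve test), from Q2
p_home = 0.1122725        # P(home), from Q2
p_home_given_neg = 0.05   # P(home | -ve test), from the CPT

# Bayes: P(-ve test | home) = P(home | -ve test) * P(-ve test) / P(home)
p_neg_given_home = p_home_given_neg * p_neg / p_home
print(round(p_neg_given_home, 3))  # 0.408
```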
Things to remember about calculations in BBNs
1. Know parent information, want to find out child node information: use conditional probability.
2. Know child information, want to find out parent node information: use Bayes’ theorem.
3. If a node has more than one parent, remember that the parents are independent (unless they share an arc). Thus P(parent A ∩ parent B) = P(parent A) * P(parent B).
A BBN model with 2 parents and 1 child
• Diane does her shopping each week. The bill for her shop is sometimes under £30, sometimes £30 or over. Two factors influence the cost of her shopping: whether she takes her 2-year-old son with her, and whether she takes her 40-year-old husband with her.
• If we know Diane has gone shopping by herself, the likelihood that the bill will be less than £30 is 90%. If we know that Diane was accompanied only by her husband, the likelihood that the bill will be less than £30 is 70%, and if we know that Diane took only her son, the likelihood that the bill will be less than £30 is 80%. Given we know both son and husband accompanied Diane to the shops, then the likelihood that the bill is under £30 reduces to 60%.
• 50% of the time Diane’s husband accompanies her to the shops. 60% of the time Diane is accompanied by her son.
How do you imagine a BBN could represent the above information?
Diane’s shopping model (2p 1c)
Questions:
Q1. What is the probability of the bill being under £30?
Q2. Given that the bill is under £30, what is the probability that Diane’s husband (with or without her son) accompanied her to the shops?
Diane’s shopping model (2p 1c)
Firstly, let’s list what we know, ...
P(under_30 | no husband ∩ no son) = 0.9
P(under_30 | husband ∩ no son) = 0.7
P(under_30 | no husband ∩ son) = 0.8
P(under_30 | husband ∩ son) = 0.6
P(husband) = 0.5
P(son) = 0.6
... and what we can deduce ( 1 - P(...) ):
P(not_under_30 | no husband ∩ no son) = 0.1
P(not_under_30 | husband ∩ no son) = 0.3
P(not_under_30 | no husband ∩ son) = 0.2
P(not_under_30 | husband ∩ son) = 0.4
P(no_husband) = 0.5
P(no_son) = 0.4
Diane’s shopping model (2p 1c)
CPT for P(bill under £30 | Husband, Son):
                       Under £30 True   Under £30 False
  Husband T, Son T       60%              40%
  Husband T, Son F       70%              30%
  Husband F, Son T       80%              20%
  Husband F, Son F       90%              10%
Diane’s shopping model (2p 1c)
Firstly, list situations that lead to bill being under £30
P(under_30) = P(under_30 | no husband ∩ no son) * P(no husband ∩ no son)
            + P(under_30 | no husband ∩ son) * P(no husband ∩ son)
            + P(under_30 | husband ∩ no son) * P(husband ∩ no son)
            + P(under_30 | husband ∩ son) * P(husband ∩ son)
What do we know about the presence of Diane’s husband and the presence of her son? Are these two events dependent on each other in any way? No: these events are independent. Independent events are those where the occurrence of one event does not affect the occurrence of the other:
P(A | B) = P(A)
Therefore, if A and B are independent events,
P(A, B) = P(A | B) * P(B) = P(A) * P(B)
Q1. What is the probability of the bill being under £30?
Firstly, list situations that lead to bill being under £30
P(under_30) = P(under_30 | no husband ∩ no son) * P(no husband ∩ no son)
            + P(under_30 | no husband ∩ son) * P(no husband ∩ son)
            + P(under_30 | husband ∩ no son) * P(husband ∩ no son)
            + P(under_30 | husband ∩ son) * P(husband ∩ son)
P(no husband ∩ no son) = P(no husband) * P(no son) = 0.5 * 0.4 = 0.2
P(no husband ∩ son) = P(no husband) * P(son) = 0.5 * 0.6 = 0.3
P(husband ∩ no son) = P(husband) * P(no son) = 0.5 * 0.4 = 0.2
P(husband ∩ son) = P(husband) * P(son) = 0.5 * 0.6 = 0.3
Q1. What is the probability of the bill being under £30?
Firstly, list situations that lead to bill being under £30
P(under_30) = P(under_30 | no husband ∩ no son) * P(no husband ∩ no son)
            + P(under_30 | no husband ∩ son) * P(no husband ∩ son)
            + P(under_30 | husband ∩ no son) * P(husband ∩ no son)
            + P(under_30 | husband ∩ son) * P(husband ∩ son)
P(under_30) = (0.9 * 0.2) + (0.8 * 0.3) + (0.7 * 0.2) + (0.6 * 0.3) = 0.74
Q1. What is the probability of the bill being under £30?
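The sum above can be written as a loop over the four parent combinations (a sketch; the independence of husband and son supplies the joint weights, and the names are my own):

```python
p_husband, p_son = 0.5, 0.6
# CPT: P(under £30 | husband present, son present)
cpt = {(True, True): 0.6, (True, False): 0.7,
       (False, True): 0.8, (False, False): 0.9}

# marginalise over the parents, weighting each row by its joint prior
p_under30 = sum(p * (p_husband if h else 1 - p_husband)
                  * (p_son if s else 1 - p_son)
                for (h, s), p in cpt.items())
print(round(p_under30, 2))  # 0.74
```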
Q2. Given that the bill is under £30, what is the probability that Diane’s husband (with or without her son) accompanied her to the shops?

Two scenarios:
P(husband | under £30) = P(husband ∩ son | under £30) + P(husband ∩ no son | under £30)

P(husband ∩ son | under £30) {using Bayes’}
  = P(under £30 | husband ∩ son) * P(husband ∩ son) / P(under £30)
  = (0.6 * 0.3) / 0.74 = 0.243... (recurring)

P(husband ∩ no son | under £30) {using Bayes’}
  = P(under £30 | husband ∩ no son) * P(husband ∩ no son) / P(under £30)
  = (0.7 * 0.2) / 0.74 = 0.189... (recurring)

Overall:
P(husband | under £30) = 0.243... + 0.189... = 0.432 (3 d.p.)
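A sketch of the same calculation in Python (variable names are mine):

```python
p_under30 = 0.74
# joint priors, using the independence of the two parents
p_h_s = 0.5 * 0.6    # P(husband ∩ son)
p_h_ns = 0.5 * 0.4   # P(husband ∩ no son)

# Bayes on each scenario, then sum (the scenarios are disjoint)
p_husband_given_under30 = (0.6 * p_h_s + 0.7 * p_h_ns) / p_under30
print(round(p_husband_given_under30, 3))  # 0.432
```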
BBNs:
• Development of propagation algorithms, followed by availability of easy-to-use commercial software
• Growing number of creative applications, e.g. dementia diagnosis, cancer care symptom modelling, likelihood of car purchase, ...
• Different from other knowledge-based systems tools because uncertainty is handled in a mathematically rigorous yet efficient and simple way
• Different from other probabilistic analysis tools because of
  - the network representation of problems,
  - the use of Bayesian statistics, and
  - the synergy between these.
• Issue: How to build a network?
Why are BBNs interesting?
Outline
1. Introduction
2. Probability
3. Bayesian classification
4. Bayesian Belief Networks