Bayesian decision theory: A framework for making decisions when uncertainty exists
Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press (V1.0)
Modeling data as random variables
Example: coin toss. Given sufficient knowledge, we could use Newton's laws of motion to calculate the result of each toss with minimal uncertainty.
In conjunction with such a model, analysis of experimental trajectories would probably reveal why the coin is unfair if heads and tails do not occur with equal probability.
Alternative: accept doubt about the result of the toss. Treat the result as a random variable X governed by P(X = x), and use P(X = x) to make a rational decision about the result of the next toss. Assume that we are not interested in why the coin is unfair if that is the case: "the reason is in the data."
Statistical Analysis of Coin-Toss Data
• Let heads = 1, tails = 0. Boolean random variables obey Bernoulli statistics:
  P(x) = po^x (1 − po)^(1 − x), where po is the probability of heads.
• Given a sample of N tosses, an unbiased estimator of po is the fraction of tosses that show heads.
• Prediction of the next toss: heads if the estimate of po exceeds ½, tails otherwise.
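A minimal sketch of this estimator and decision rule. The simulated data, the value of true_po, and the variable names are illustrative, not from the notes:

```python
import numpy as np

# A simulated experiment: the true p_o is only used to generate data.
rng = np.random.default_rng(0)
true_po = 0.6
tosses = rng.binomial(1, true_po, size=100)    # heads = 1, tails = 0

po_hat = tosses.mean()                         # fraction of heads: unbiased estimate of p_o
prediction = "heads" if po_hat > 0.5 else "tails"
print(f"estimated p_o = {po_hat:.2f}, predict {prediction} for the next toss")
```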
Review: Bayes' Rule for binary classification
P(C|x) = p(x|C) P(C) / p(x)
posterior = class likelihood × prior / normalization (evidence)
The prior is information relevant to classification that is independent of the attributes x. The class likelihood is the probability that a member of class C has attributes x. Assign the client with attributes x to class C if P(C|x) > 0.5.
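A small sketch of the binary rule. The prior and likelihood values are hypothetical numbers chosen for illustration:

```python
def posterior_c1(prior_c1, lik_c1, lik_c2):
    """Bayes' rule for two classes: P(C1|x) = p(x|C1) P(C1) / p(x)."""
    evidence = lik_c1 * prior_c1 + lik_c2 * (1.0 - prior_c1)   # p(x), the normalization
    return lik_c1 * prior_c1 / evidence

# Hypothetical numbers: prior P(C1) = 0.3, likelihoods p(x|C1) = 0.8, p(x|C2) = 0.1.
p = posterior_c1(0.3, 0.8, 0.1)
print(f"P(C1|x) = {p:.2f} ->", "assign to C1" if p > 0.5 else "assign to C2")
```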
Review: Bayes’ Rule: K>2 Classes
P(Ci|x) = p(x|Ci) P(Ci) / p(x) = p(x|Ci) P(Ci) / Σk=1..K p(x|Ck) P(Ck)
with P(Ci) ≥ 0 and Σi=1..K P(Ci) = 1.
Assign the client with attributes x to class Ci if P(Ci|x) = maxk P(Ck|x).
If the class likelihood is modeled as a d-dimensional Gaussian with mean μi and covariance Σi:
p(x|Ci) = 1 / ((2π)^(d/2) |Σi|^(1/2)) exp(−(1/2) (x − μi)^T Σi^(−1) (x − μi))
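A direct NumPy transcription of this density, as a sketch. The dimension, mean, and covariance below are placeholders I chose, not values from the notes:

```python
import numpy as np

def gaussian_likelihood(x, mu, sigma):
    """Evaluate the d-dimensional Gaussian density p(x|C) with mean mu and covariance sigma."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_likelihood(np.array([0.5, -1.0]), mu, sigma))
```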
Review: Estimating priors and class likelihoods from data
The fraction of examples that fall in a class is an estimate of its prior. If we assume members of a class are Gaussian distributed, then a mean and a covariance parameterize each class likelihood.
With class labels ri^t (ri^t = 1 if example t belongs to class Ci, 0 otherwise), the estimators are:
P(Ci) = (Σt ri^t) / N
mi = (Σt ri^t x^t) / (Σt ri^t)
Si = (Σt ri^t (x^t − mi)(x^t − mi)^T) / (Σt ri^t)
and the estimated class likelihood becomes
p(x|Ci) = 1 / ((2π)^(d/2) |Si|^(1/2)) exp(−(1/2) (x − mi)^T Si^(−1) (x − mi))
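A sketch of these plug-in estimators using per-example class indicators. The function name and the toy arrays are mine, added for illustration:

```python
import numpy as np

def estimate_class_params(X, y, i):
    """Plug-in estimates of P(Ci), mi, and Si for class i from a labeled sample (X, y)."""
    r = (y == i)                       # r_i^t as a boolean indicator per example
    prior = r.mean()                   # sum_t r_i^t / N
    Xi = X[r]
    mean = Xi.mean(axis=0)             # m_i
    diff = Xi - mean
    cov = diff.T @ diff / r.sum()      # S_i (maximum-likelihood covariance estimate)
    return prior, mean, cov

# Example with a toy two-class data set:
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [2.8, 0.7]])
y = np.array([0, 0, 1, 1])
print(estimate_class_params(X, y, 0))
```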
Review: naïve Bayes classification
A simpler model results from assuming that the components of x are independent random variables: the covariance matrix is then diagonal, and p(x|C) is the product of one-dimensional probabilities for each component of x. Each class is then characterized by a set of means and variances, one pair per attribute component.
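A sketch of a Gaussian naïve Bayes classifier built on that independence assumption. The function names, the small variance floor, and the overall structure are my choices, not the notes':

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per-class prior plus per-component mean and variance (diagonal covariance)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_naive_bayes(params, x):
    """Choose the class maximizing log P(C) + sum_j log p(x_j|C)."""
    scores = {}
    for c, (prior, mean, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)
```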
• Actions: αi is the action of assigning x to Ci, one of K classes.
• Loss λik occurs if we take αi when x belongs to Ck.
• Expected risk (Duda and Hart, 1973):
R(αi|x) = Σk=1..K λik P(Ck|x)
Choose αi if R(αi|x) = mink R(αk|x).
Minimizing risk given attributes x
Special case: correct decisions incur no loss and all errors have equal cost (the "0/1 loss function"):
λik = 0 if i = k, 1 if i ≠ k
R(αi|x) = Σk=1..K λik P(Ck|x) = Σk≠i P(Ck|x) = 1 − P(Ci|x)
For minimum risk, choose the most probable class
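A short sketch of the minimum-risk rule. The loss matrix and posteriors below are invented for illustration; with the 0/1 loss shown, the rule reduces to choosing the most probable class:

```python
import numpy as np

def min_risk_action(posteriors, loss):
    """R(a_i|x) = sum_k loss[i, k] * P(Ck|x); pick the action with minimum expected risk."""
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

posteriors = np.array([0.7, 0.3])            # P(C1|x), P(C2|x)
zero_one_loss = np.array([[0.0, 1.0],
                          [1.0, 0.0]])       # lambda_ik = 0 if i = k, 1 otherwise
action, risks = min_risk_action(posteriors, zero_one_loss)
print(action, risks)                         # action 0 (choose C1), risks = [0.3, 0.7]
```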
Add rejection option: don’t assign a class
Define an additional action αK+1, reject, with loss
λik = 0 if i = k, λ if i = K+1 (reject), 1 otherwise, where 0 < λ < 1.
R(αK+1|x) = Σk=1..K λ P(Ck|x) = λ   (risk of making no assignment)
R(αi|x) = Σk≠i P(Ck|x) = 1 − P(Ci|x)   (risk of choosing Ci)
Making some assignment is worthwhile only if its risk, 1 − P(Ci|x), is below λ.
Choose Ci if P(Ci|x) > P(Ck|x) for all k ≠ i and P(Ci|x) > 1 − λ; otherwise, reject.
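A sketch of this reject rule. The value of lambda and the posterior vectors are illustrative only:

```python
import numpy as np

def classify_with_reject(posteriors, lam):
    """Assign the most probable class only if its posterior exceeds 1 - lambda; else reject."""
    i = int(np.argmax(posteriors))
    return i if posteriors[i] > 1.0 - lam else "reject"

print(classify_with_reject(np.array([0.55, 0.25, 0.20]), lam=0.2))   # reject: 0.55 <= 0.8
print(classify_with_reject(np.array([0.85, 0.10, 0.05]), lam=0.2))   # 0: confident enough
```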
Example of risk minimization with λ11 = λ22 = 0, λ12 = 10, and λ21 = 1 (recall that loss λik occurs if we take αi when x belongs to Ck):
R(α1|x) = λ11 P(C1|x) + λ12 P(C2|x) = 10 P(C2|x)
R(α2|x) = λ21 P(C1|x) + λ22 P(C2|x) = P(C1|x)
Choose C1 if R(α1|x) < R(α2|x), which is true if 10 P(C2|x) < P(C1|x), which becomes P(C1|x) > 10/11 using the normalization P(C1|x) + P(C2|x) = 1.
The consequence of erroneously assigning an instance to C1 is so bad that we choose C1 only when we are virtually certain it is correct.
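A quick numeric check of this example: with the loss matrix above, the decision flips at P(C1|x) = 10/11. The grid of posterior values is mine, chosen to bracket the threshold:

```python
import numpy as np

loss = np.array([[0.0, 10.0],     # lambda_11, lambda_12
                 [1.0,  0.0]])    # lambda_21, lambda_22

for p1 in (0.85, 10 / 11, 0.95):
    posteriors = np.array([p1, 1.0 - p1])
    r1, r2 = loss @ posteriors    # R(alpha_1|x), R(alpha_2|x)
    choice = "C1" if r1 < r2 else "C2"
    print(f"P(C1|x) = {p1:.3f}: R1 = {r1:.3f}, R2 = {r2:.3f} -> choose {choice}")
```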
Bayes’ classifier based on neighbors
Consider a data set with N examples, Ni of which belong to class Ci; estimate the prior as P(Ci) = Ni/N.
Given a new example x, draw a hyper-sphere of volume V in attribute space, centered on x and containing precisely K training examples, irrespective of their class.
Suppose this sphere contains ni examples from class Ci. Then the class likelihood is estimated as p(x|Ci) = ni / (Ni V), so p(x|Ci) P(Ci) = (ni / (Ni V)) (Ni / N) = ni / (N V).
P(Ci|x) = p(x|Ci) P(Ci) / Σk=1..K p(x|Ck) P(Ck) = (ni / (N V)) / Σk (nk / (N V)) = ni / K
Using Bayes' rule, we find the posteriors P(Ck|x) = nk / K.
Assign x to the class with highest posterior, which is the class with the highest representation among the K training examples in the hyper-sphere centered on x
K = 1 (the nearest neighbor rule): assign x to the class of its nearest neighbor in the training data.
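A minimal KNN classifier following this construction, as a sketch; the function name, Euclidean distance metric, and default K are my choices:

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=5):
    """Approximate P(Ci|x) by ni/K over the K nearest training examples and pick the mode."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbor_labels = y_train[np.argsort(dists)[:K]]
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    return classes[np.argmax(counts)], counts / K    # predicted class, estimated posteriors
```

With K = 1 this reduces to the nearest neighbor rule described above.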
Bayes' classifier based on K nearest neighbors (KNN)
Usually choose K from a range of values based on validation error.
In 2D, we can visualize the classification by applying KNN to every point in the (x1, x2) plane. As K increases, expect fewer islands and smoother boundaries.
Analysis of binary classification: beyond the confusion matrix
Quantities defined by binary confusion matrix
Let C1 be the positive class, C2 the negative class, and N the number of instances.
Error rate = (FP + FN) / N = 1 − accuracy
False positive rate = FP / (FP + TN) = fraction of C2 instances misclassified
True positive rate = TP / (TP + FN) = fraction of C1 instances correctly classified
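These rates in code, as a small sketch; the confusion-matrix counts are hypothetical:

```python
def binary_rates(TP, FP, FN, TN):
    """Error rate, false positive rate, and true positive rate from confusion-matrix counts."""
    N = TP + FP + FN + TN
    return {
        "error_rate": (FP + FN) / N,   # 1 - accuracy
        "fp_rate": FP / (FP + TN),     # fraction of C2 (negative) instances misclassified
        "tp_rate": TP / (TP + FN),     # fraction of C1 (positive) instances correctly classified
    }

print(binary_rates(TP=40, FP=5, FN=10, TN=45))
```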
Receiver operating characteristic (ROC) curve
Let C1 be the positive class, and let q be the threshold on P(C1|x) for assignment of x to C1.
If q is near 1, assignments to C1 are rare but have a high probability of being correct, so both the FP-rate and the TP-rate are small.
As q decreases, both the FP-rate and the TP-rate increase.
For every value of q, the pair (FP-rate, TP-rate) is a point on the ROC curve.
ROC curves
[Figure: the diagonal corresponds to chance alone; a curve slightly above the diagonal indicates marginal success.]
Drawing ROC curves
Assume C1 is the positive class. Rank all examples by decreasing P(C1|x). Moving through them in that order, step up by 1/(number of positive examples) for each positive example and step right by 1/(number of negative examples) for each negative example.
If all examples are correctly classified, the ROC curve passes through the upper left corner.
If P(C1|x) is not correlated with the class labels, the ROC curve will be close to the diagonal.
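A sketch of that ranking procedure; the scores and labels are toy values I made up for illustration:

```python
import numpy as np

def roc_points(scores, labels):
    """Walk examples by decreasing P(C1|x): step up per positive, right per negative."""
    order = np.argsort(-scores)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    fpr, tpr, points = 0.0, 0.0, [(0.0, 0.0)]
    for t in order:
        if labels[t] == 1:
            tpr += 1.0 / n_pos
        else:
            fpr += 1.0 / n_neg
        points.append((fpr, tpr))
    return points

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3])   # P(C1|x) for five examples
labels = np.array([1, 1, 0, 1, 0])             # 1 = positive class C1
print(roc_points(scores, labels))
```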
[Figure: ROC curves for the full and reduced attribute sets. Performance with the reduced attribute set is slightly improved; the number of misclassified malignant cases decreased by 2.]