Bayesian Decision Theory (Sections 2.1-2.2)
Transcript of Bayesian Decision Theory (Sections 2.1-2.2)
Bayesian Decision Theory
(Sections 2.1-2.2)
• Decision problem posed in probabilistic terms
• Bayesian Decision Theory–Continuous Features
• All the relevant probability values are known
Probability Density
Jain CSE 802, Spring 2013
Course Outline

MODEL INFORMATION
• COMPLETE → Bayes Decision Theory → “Optimal” Rules
• INCOMPLETE
  – Supervised Learning
    • Parametric Approach → Plug-in Rules
    • Nonparametric Approach → Density Estimation, Geometric Rules (K-NN, MLP)
  – Unsupervised Learning
    • Parametric Approach → Mixture Resolving
    • Nonparametric Approach → Cluster Analysis (Hard, Fuzzy)
Introduction
• From sea bass vs. salmon example to “abstract” decision making problem
• State of nature; a priori (prior) probability
• The state of nature (which type of fish will be observed next) is unpredictable, so it is a random variable
• The catch of salmon and sea bass is equiprobable: P(ω1) = P(ω2) (uniform priors)
• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• Prior prob. reflects our prior knowledge about how likely we are to observe a sea bass or salmon; these probabilities may depend on time of the year or the fishing area!
• Bayes decision rule with only the prior information
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
• Error rate = min {P(ω1), P(ω2)}
• Suppose now we have a measurement or feature on the state of nature - say the fish lightness value
• Use of the class-conditional probability density
• p(x | ω1) and p(x | ω2) describe the difference in the lightness feature between the populations of sea bass and salmon
Amount of overlap between the densities determines the “goodness” of feature
• Maximum likelihood decision rule
• Assign input pattern x to ω1 if
p(x | ω1) > p(x | ω2), otherwise to ω2
• How does the feature x influence our attitude (prior) concerning the true state of nature?
• Bayes decision rule
• Posterior probability, likelihood, evidence
• P(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)
• Bayes formula
P(ωj | x) = p(x | ωj) P(ωj) / p(x)
where
p(x) = Σj p(x | ωj) P(ωj)   (sum over j = 1, 2)
• Posterior = (Likelihood × Prior) / Evidence
• Evidence p(x) can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1
• p(x | ωj) is called the likelihood of ωj with respect to x; the category ωj for which p(x | ωj) is large is more likely to be the true category
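The Bayes formula above is easy to check numerically. Below is a minimal sketch (plain NumPy/SciPy, with hypothetical Gaussian class-conditional densities for the lightness feature and made-up priors) that evaluates the posteriors and verifies that they sum to 1.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x | w1), p(x | w2) for the
# lightness feature; Gaussian shapes are assumed purely for illustration.
likelihoods = [norm(loc=4.0, scale=1.0).pdf,   # p(x | w1), e.g. sea bass
               norm(loc=7.0, scale=1.5).pdf]   # p(x | w2), e.g. salmon
priors = np.array([2/3, 1/3])                  # P(w1), P(w2); must sum to 1

def posteriors(x):
    """Bayes formula: P(wj | x) = p(x | wj) P(wj) / p(x)."""
    joint = np.array([lik(x) for lik in likelihoods]) * priors  # p(x | wj) P(wj)
    evidence = joint.sum()                                      # p(x) = sum_j p(x | wj) P(wj)
    return joint / evidence

post = posteriors(x=5.0)
print(post, post.sum())   # posteriors for w1 and w2; the sum is 1 by construction
```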
• P(ω1 | x) is the probability of the state of nature being ω1 given that feature value x has been observed
• Decision based on the posterior probabilities is called the optimal Bayes decision rule
For a given observation (feature value) x:
if P(ω1 | x) > P(ω2 | x) decide ω1
if P(ω1 | x) < P(ω2 | x) decide ω2
To justify the above rule, calculate the probability of error:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1
• So, for a given x, we can minimize the prob. of error: decide ω1 if
P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
• Thus, for each observation x, the Bayes decision rule minimizes the probability of error
• Unconditional error: P(error) is obtained by integrating P(error | x) over all x, weighted by p(x)
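These two statements can be illustrated with a short numerical sketch (hypothetical 1D Gaussian class-conditional densities and equal priors are assumed): at each x we decide the class with the larger posterior, P(error | x) is the smaller posterior, and the unconditional error follows by integrating P(error | x) p(x) over x.

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])                        # P(w1), P(w2)
pdfs = [norm(4.0, 1.0).pdf, norm(7.0, 1.5).pdf]      # assumed p(x | w1), p(x | w2)

xs = np.linspace(-5.0, 20.0, 20001)                  # grid covering essentially all the mass
cond = np.vstack([f(xs) for f in pdfs])              # p(x | wj) on the grid
joint = cond * priors[:, None]                       # p(x | wj) P(wj)
evidence = joint.sum(axis=0)                         # p(x)
posterior = joint / evidence                         # P(wj | x)

# Bayes rule decides argmax_j P(wj | x); the error it incurs at x is the smaller posterior.
p_error_given_x = posterior.min(axis=0)              # P(error | x) = min[P(w1|x), P(w2|x)]
p_error = np.trapz(p_error_given_x * evidence, xs)   # unconditional P(error)
print(p_error)
```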
• Optimal Bayes decision rule
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
• Special cases:
(i) P(ω1) = P(ω2): decide ω1 if p(x | ω1) > p(x | ω2), otherwise ω2
(ii) p(x | ω1) = p(x | ω2): decide ω1 if P(ω1) > P(ω2), otherwise ω2
Bayesian Decision Theory – Continuous Features
• Generalization of the preceding formulation
• Use of more than one feature (d features)
• Use of more than two states of nature (c classes)
• Allowing other actions besides deciding on the state of nature
• Introduce a loss function which is more general than the probability of error
• Allowing actions other than classification primarily allows the possibility of rejection
• Refusing to make a decision when it is difficult to decide between two classes or in noisy cases!
• The loss function specifies the cost of each action
• Let {ω1, ω2, …, ωc} be the set of c states of nature
(or “categories”)
• Let {α1, α2, …, αa} be the set of a possible actions
• Let λ(αi | ωj) be the loss incurred for taking action αi when the true state of nature is ωj
• A general decision rule α(x) specifies which action to take for every possible observation x
Conditional Risk
Overall risk:
R = expected value of R(α(x) | x) with respect to p(x)
Minimizing R ⟺ minimizing R(αi | x) for every x, i = 1, …, a
Conditional risk:
R(αi | x) = Σj λ(αi | ωj) P(ωj | x)   (sum over j = 1, …, c)
For a given x, suppose we take the action αi; if the true state is ωj, we will incur the loss λ(αi | ωj). P(ωj | x) is the prob. that the true state is ωj, but any one of the c states is possible for the given x.
Select the action αi for which R(αi | x) is minimum
The overall risk R is minimized and the resulting risk is called the Bayes risk; it is the best performance that can be achieved!
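In matrix form the conditional risk is just the loss matrix applied to the vector of posteriors; the sketch below (hypothetical loss values, including a third "reject" action not present in the slides) selects the action with minimum R(αi | x).

```python
import numpy as np

# Rows = actions a_i, columns = states w_j: loss[i, j] = lambda(a_i | w_j).
# The values are hypothetical; the third row models a "reject" action.
loss = np.array([[0.00, 1.00],    # a1: decide w1
                 [1.00, 0.00],    # a2: decide w2
                 [0.25, 0.25]])   # a3: reject

def bayes_action(posterior):
    """posterior: array of P(w_j | x). Returns the min-risk action and all risks."""
    cond_risk = loss @ posterior       # R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x)
    return int(cond_risk.argmin()), cond_risk

action, risks = bayes_action(np.array([0.6, 0.4]))
print(action, risks)   # with these numbers the reject action (index 2) has the lowest risk
```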
• Two-category classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Bayes decision rule is stated as:
if R(α1 | x) < R(α2 | x)
take action α1: “decide ω1”
This results in the equivalent rule:
decide ω1 if:
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
and decide ω2 otherwise
Likelihood ratio:
The preceding rule is equivalent to the following rule:
if  p(x | ω1) / p(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
then take action α1 (decide ω1); otherwise take action α2 (decide ω2)
Note that the posterior probabilities are scaled by the loss differences.
Interpretation of the Bayes decision rule:
“If the likelihood ratio of class ω1 and class ω2 exceeds a threshold value (that is independent of the input pattern x), the optimal action is to decide ω1”
Maximum likelihood decision rule: the threshold value is 1; this corresponds to the 0-1 loss function and equal class prior probabilities
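A minimal sketch of this threshold rule, with hypothetical Gaussian likelihoods, priors, and losses (all values below are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf   # assumed p(x | w1), p(x | w2)
P1, P2 = 0.7, 0.3                                 # assumed priors P(w1), P(w2)
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0           # hypothetical losses lambda_ij

# Threshold on the likelihood ratio; note that it does not depend on x.
threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)

def decide(x):
    ratio = p1(x) / p2(x)                         # likelihood ratio p(x|w1) / p(x|w2)
    return "w1" if ratio > threshold else "w2"

print(threshold, decide(0.5), decide(3.0))
```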
Bayesian Decision Theory (Sections 2.3-2.5)
• Minimum Error Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density
Minimum Error Rate Classification
• Actions are decisions on classes. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error or the error rate
• Zero-one (0-1) loss function: no loss for correct decision and a unit loss for any error
λ(αi, ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c
The conditional risk can now be simplified as:
R(αi | x) = Σj λ(αi | ωj) P(ωj | x)   (sum over j = 1, …, c)
          = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
“The risk corresponding to the 0-1 loss function is the average probability of error”
• Minimizing the risk requires maximizing the posterior probability P(ωi | x), since
R(αi | x) = 1 − P(ωi | x)
• For minimum error rate:
• Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
• Decision boundaries and decision regions
• If λ is the 0-1 loss function, then the threshold involves only the priors:
Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
Then decide ω1 if: p(x | ω1) / p(x | ω2) > θλ
If λ11 = λ22 = 0 and λ12 = λ21 = 1 (the 0-1 loss), then θλ = P(ω2) / P(ω1) = θa
If λ11 = λ22 = 0, λ12 = 1 and λ21 = 2, then θλ = P(ω2) / (2 P(ω1)) = θb
Classifiers, Discriminant Functions and Decision Surfaces
• Many different ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• Classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) for all j ≠ i
Network Representation of a Classifier
• Bayes classifier can be represented in this way, but the choice of discriminant function is not unique
• gi(x) = −R(αi | x)
(max. discriminant corresponds to min. risk!)
• For the minimum error rate, we take gi(x) = P(ωi | x)
(max. discriminant corresponds to max. posterior!)
gi(x) = p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)
(ln: natural logarithm!)
•Effect of any decision rule is to divide the feature space into c decision regions
if gi(x) > gj(x) for all j ≠ i, then x is in Ri
(region Ri means: assign x to ωi)
• The two-category case
• Here a classifier is a “dichotomizer” that has two discriminant functions g1 and g2
Let g(x) ≡ g1(x) − g2(x)
Decide ω1 if g(x) > 0; otherwise decide ω2
• So, a “dichotomizer” computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not.
• Computation of g(x) = g1(x) – g2(x)
g(x) = P(ω1 | x) − P(ω2 | x)
An equivalent choice is:
g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]
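Both forms of g(x) give the same sign for every x, hence the same decision; the sketch below checks this on a hypothetical pair of 1D Gaussian classes.

```python
import numpy as np
from scipy.stats import norm

P1, P2 = 0.4, 0.6                                   # assumed priors
p1, p2 = norm(0.0, 1.0).pdf, norm(1.5, 0.8).pdf     # assumed class-conditional densities

def g_posterior(x):
    """g(x) = P(w1 | x) - P(w2 | x)."""
    j1, j2 = p1(x) * P1, p2(x) * P2
    return (j1 - j2) / (j1 + j2)

def g_log(x):
    """g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2)."""
    return np.log(p1(x) / p2(x)) + np.log(P1 / P2)

for x in (-1.0, 0.3, 2.0):
    print(x, np.sign(g_posterior(x)) == np.sign(g_log(x)))   # same decision either way
```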
The Normal Density
• Univariate density: N(μ, σ²)
• Normal density is analytically tractable
• Continuous density
• A number of processes are asymptotically Gaussian
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted versions of a single typical or prototype pattern (Central Limit Theorem)

p(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − μ) / σ)² ]

where: μ = mean (or expected value) of x, σ² = variance (or expected squared deviation) of x
• Multivariate density: N(μ, Σ)
• Multivariate normal density in d dimensions:

p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − μ)t Σ⁻¹ (x − μ) ]

where:
x = (x1, x2, …, xd)t (t stands for the transpose of a vector)
μ = (μ1, μ2, …, μd)t is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are the determinant and inverse of Σ, respectively
• The covariance matrix Σ is always symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive
• Multivariate normal density is completely specified by [d + d(d+1)/2] parameters
• If variables x1 and x2 are statistically independent, then the covariance of x1 and x2 is zero.
Multivariate Normal density
r² = (x − μ)t Σ⁻¹ (x − μ)   (squared Mahalanobis distance from x to μ)
Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and the shape by the covariance matrix
The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of Σ
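A short sketch that evaluates the multivariate normal density directly from the formula above, together with the squared Mahalanobis distance r² (an arbitrary 2D μ and Σ are assumed):

```python
import numpy as np

mu = np.array([1.0, 2.0])                          # assumed mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                     # assumed covariance (positive definite)
Sigma_inv = np.linalg.inv(Sigma)
d = len(mu)

def mvn_pdf(x):
    """p(x) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))."""
    diff = x - mu
    r2 = diff @ Sigma_inv @ diff                   # squared Mahalanobis distance
    const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * r2) / const, r2

print(mvn_pdf(np.array([2.0, 2.5])))               # (density, r^2) at one point
print(mvn_pdf(np.array([0.0, 1.5])))               # another point, different r^2
```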
Transformation of Normal Variables
Linear combinations of jointly normally distributed random variables are normally distributed
Coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one
Bayesian Decision Theory (Sections 2.6-2.9)
• Discriminant Functions for the Normal Density
• Bayes Decision Theory – Discrete Features
Discriminant Functions for the Normal Density
• The minimum error-rate classification can be achieved by the discriminant function
gi(x) = ln p(x | ωi) + ln P(ωi)
• In the case of multivariate normal densities:

gi(x) = −(1/2) (x − μi)t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
• Case 1: Σi = σ²I (I is the identity matrix)
Features are statistically independent and each feature has the same variance

gi(x) = wit x + wi0   (linear discriminant function)

where:
wi = μi / σ²
wi0 = −(1/(2σ²)) μit μi + ln P(ωi)
(wi0 is called the threshold for the i-th category!)
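For this case the classifier is a set of inner products; a minimal sketch that builds wi and wi0 from assumed means, a shared σ², and priors, and assigns x to the class with the largest gi(x):

```python
import numpy as np

sigma2 = 1.5                                       # shared variance sigma^2 (assumed)
means = np.array([[0.0, 0.0],                      # mu_1 (assumed)
                  [3.0, 1.0]])                     # mu_2 (assumed)
priors = np.array([0.5, 0.5])

W = means / sigma2                                                   # w_i = mu_i / sigma^2
w0 = -(means * means).sum(axis=1) / (2 * sigma2) + np.log(priors)    # w_i0 (thresholds)

def classify(x):
    g = W @ x + w0                                 # g_i(x) = w_i^t x + w_i0 (linear machine)
    return int(g.argmax())

print(classify(np.array([0.5, 0.2])), classify(np.array([2.5, 1.0])))
```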
• A classifier that uses linear discriminant functions is called “a linear machine”
• The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations:
gi(x) = gj(x)
• The hyperplane separating Ri and Rj
is orthogonal to the line linking the means!
x0 = (1/2)(μi + μj) − [σ² / ||μi − μj||²] ln [P(ωi) / P(ωj)] (μi − μj)

if P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)
• Case 2: Σi = Σ (covariance matrices of all classes are identical but otherwise arbitrary!)
• Hyperplane separating Ri and Rj:

x0 = (1/2)(μi + μj) − [ ln (P(ωi) / P(ωj)) / ((μi − μj)t Σ⁻¹ (μi − μj)) ] (μi − μj)

• The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!
• To classify a feature vector x, measure the squared Mahalanobis distance from x to each of the c means; assign x to the category of the nearest mean
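With equal priors this case reduces to the nearest-mean rule in Mahalanobis distance stated above; a small sketch under assumed parameters:

```python
import numpy as np

means = [np.array([0.0, 0.0]),
         np.array([2.0, 2.0]),
         np.array([-2.0, 3.0])]                    # assumed class means mu_i
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                     # shared covariance (assumed)
Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    # Squared Mahalanobis distance from x to each class mean; equal priors assumed.
    d2 = [(x - m) @ Sigma_inv @ (x - m) for m in means]
    return int(np.argmin(d2))                      # assign x to the nearest mean

print(classify(np.array([1.8, 1.5])), classify(np.array([-1.0, 2.0])))
```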
Discriminant Functions for 1D Gaussian
• Case 3: Σi = arbitrary
• The covariance matrices are different for each category
In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids

gi(x) = xt Wi x + wit x + wi0

where:
Wi = −(1/2) Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −(1/2) μit Σi⁻¹ μi − (1/2) ln |Σi| + ln P(ωi)
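The general case is just the quadratic form above evaluated per class; a sketch with two assumed Gaussian classes having different covariance matrices:

```python
import numpy as np

classes = [  # (mu_i, Sigma_i, P(w_i)) -- assumed parameters
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 0.5]]), 0.5),
]

def g(x, mu, Sigma, prior):
    Si = np.linalg.inv(Sigma)
    Wi = -0.5 * Si                                  # quadratic term W_i
    wi = Si @ mu                                    # linear term w_i
    wi0 = -0.5 * mu @ Si @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    return int(np.argmax([g(x, *c) for c in classes]))

print(classify(np.array([0.5, 0.5])), classify(np.array([2.5, 2.0])))
```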
Discriminant Functions for the Normal Density
Decision Regions for Two-Dimensional Gaussian Data
x2 = 3.514 − 1.125 x1 + 0.1875 x1²
Error Probabilities and Integrals
• 2-class problem
• There are two types of errors
• Multi-class problem – simpler to compute the prob. of being correct (more ways to be wrong than to be right)
Error Probabilities and Integrals
Bayes optimal decision boundary in 1-D case
Error Bounds for Normal Densities
• The exact calculation of the error for the general Gaussian case (case 3) is extremely difficult
• However, in the 2-category case the general error can be approximated analytically to give us an upper bound on the error
Error Rate of Linear Discriminant Function (LDF)
• Assume a 2-class problem with
p(x | ω1) ~ N(μ1, Σ),  p(x | ω2) ~ N(μ2, Σ)
gi(x) = ln P(ωi) − (1/2) (x − μi)t Σ⁻¹ (x − μi)
• Due to the symmetry of the problem (identical Σ), the two types of errors have the same form
• Decide x ∈ ω1 if g1(x) > g2(x), i.e., if
(μ2 − μ1)t Σ⁻¹ x + (1/2) (μ1t Σ⁻¹ μ1 − μ2t Σ⁻¹ μ2) < ln [P(ω1) / P(ω2)] ≡ t
Error Rate of LDF
• Let
h(x) = (μ2 − μ1)t Σ⁻¹ x + (1/2) (μ1t Σ⁻¹ μ1 − μ2t Σ⁻¹ μ2)
• Compute the expected values & variances of h(x) when x ∈ ω1 & x ∈ ω2
E[h(x) | x ∈ ω1] = (μ2 − μ1)t Σ⁻¹ μ1 + (1/2) (μ1t Σ⁻¹ μ1 − μ2t Σ⁻¹ μ2)
                = −(1/2) (μ2 − μ1)t Σ⁻¹ (μ2 − μ1) = −Δ/2
where
Δ = (μ2 − μ1)t Σ⁻¹ (μ2 − μ1) = squared Mahalanobis distance between μ1 & μ2
Error Rate of LDF
• Similarly,
E[h(x) | x ∈ ω2] = (μ2 − μ1)t Σ⁻¹ μ2 + (1/2) (μ1t Σ⁻¹ μ1 − μ2t Σ⁻¹ μ2)
                = +(1/2) (μ2 − μ1)t Σ⁻¹ (μ2 − μ1) = +Δ/2
• The variance is the same under both classes:
Var[h(x) | x ∈ ωi] = (μ2 − μ1)t Σ⁻¹ Σ Σ⁻¹ (μ2 − μ1) = Δ
• Since h(x) is a linear function of the normally distributed x:
p(h(x) | x ∈ ω1) ~ N(−Δ/2, Δ)
p(h(x) | x ∈ ω2) ~ N(+Δ/2, Δ)
Error Rate of LDF
P(ε1) = P(g2(x) > g1(x) | x ∈ ω1) = P(h(x) > t | x ∈ ω1)
      = ∫ p(h | ω1) dh   (integral from h = t to ∞)
      = [1 / √(2πΔ)] ∫ exp[ −(h + Δ/2)² / (2Δ) ] dh   (integral from h = t to ∞)
      = 1/2 − (1/2) erf( t / √(2Δ) + √(2Δ) / 4 )
Error Rate of LDF
• Similarly, for the second type of error:
P(ε2) = P(h(x) < t | x ∈ ω2) = 1/2 − (1/2) erf( √(2Δ) / 4 − t / √(2Δ) )
where
t = ln [P(ω1) / P(ω2)]  and  erf(r) = (2 / √π) ∫ e^(−x²) dx   (integral from x = 0 to r)
• Total probability of error:
P(ε) = P(ω1) P(ε1) + P(ω2) P(ε2)
Error Rate of LDF
• If P(ω1) = P(ω2) = 1/2, then t = 0 and
P(ε1) = P(ε2) = P(ε) = 1/2 − (1/2) erf( √(2Δ) / 4 )
• The Mahalanobis distance Δ is a good measure of separation between classes:
(i) No class separation: μ1 = μ2, so Δ = (μ2 − μ1)t Σ⁻¹ (μ2 − μ1) = 0, erf(0) = 0, and P(ε) = 1/2
(ii) Perfect class separation: Δ = (μ2 − μ1)t Σ⁻¹ (μ2 − μ1) → ∞, erf(∞) = 1, and P(ε) → 0
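The closed-form error rate above is easy to check by simulation; the sketch below assumes equal priors, draws samples from two Gaussians with an assumed shared Σ, applies the rule h(x) < t with t = 0, and compares the empirical error with 1/2 − 1/2 erf(√(2Δ)/4).

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])   # assumed means
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.5]])                           # assumed shared covariance
Si = np.linalg.inv(Sigma)

delta = (mu2 - mu1) @ Si @ (mu2 - mu1)                   # squared Mahalanobis distance
p_err_theory = 0.5 - 0.5 * erf(np.sqrt(2 * delta) / 4)   # equal priors => t = 0

# h(x) = (mu2 - mu1)^t Si x + 0.5 (mu1^t Si mu1 - mu2^t Si mu2); decide w1 if h(x) < 0.
a = Si @ (mu2 - mu1)
b = 0.5 * (mu1 @ Si @ mu1 - mu2 @ Si @ mu2)

n = 100_000
h1 = rng.multivariate_normal(mu1, Sigma, n) @ a + b      # h(x) for samples from w1
h2 = rng.multivariate_normal(mu2, Sigma, n) @ a + b      # h(x) for samples from w2
p_err_empirical = 0.5 * (np.mean(h1 >= 0) + np.mean(h2 < 0))
print(p_err_theory, p_err_empirical)                     # the two should be close
```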
Chernoff Bound
• To derive a bound for the error, we need the following inequality:
min[a, b] ≤ a^β b^(1−β)   for a, b ≥ 0 and 0 ≤ β ≤ 1
• Assume the conditional probabilities are normal; the resulting bound e^(−k(β)) and the expression for k(β) are given below
Chernoff Bound
The Chernoff bound for P(error) is found by determining the value of β that minimizes exp(−k(β))
Error Bounds for Normal Densities
• Bhattacharyya Bound
• Assume β = 1/2:
  – computationally simpler
  – slightly less tight bound
• Now, Eq. (73) has the form given below
When the two covariance matrices are equal, k(1/2) is proportional to the squared Mahalanobis distance between the two means
Error Bounds for Gaussian Distributions
Chernoff Bound
Bhattacharyya Bound (β = 1/2)
2-category, 2D data
True error using numerical integration = 0.0021
Best Chernoff error bound is 0.008190
Bhattacharyya error bound is 0.008191
P(error) ≤ P(ω1)^β P(ω2)^(1−β) ∫ p(x | ω1)^β p(x | ω2)^(1−β) dx,   0 ≤ β ≤ 1

∫ p(x | ω1)^β p(x | ω2)^(1−β) dx = e^(−k(β))

k(β) = [β(1 − β) / 2] (μ2 − μ1)t [(1 − β)Σ1 + βΣ2]⁻¹ (μ2 − μ1) + (1/2) ln { |(1 − β)Σ1 + βΣ2| / (|Σ1|^(1−β) |Σ2|^β) }

P(error) ≤ √(P(ω1) P(ω2)) ∫ √(p(x | ω1) p(x | ω2)) dx = √(P(ω1) P(ω2)) e^(−k(1/2))

k(1/2) = (1/8) (μ2 − μ1)t [(Σ1 + Σ2) / 2]⁻¹ (μ2 − μ1) + (1/2) ln { |(Σ1 + Σ2) / 2| / √(|Σ1| |Σ2|) }
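The Bhattacharyya bound is straightforward to evaluate from the class parameters; a sketch (the means, covariances, and priors below are assumptions, not the 2D example quoted above):

```python
import numpy as np

def bhattacharyya_bound(mu1, Sigma1, mu2, Sigma2, P1, P2):
    """Upper bound sqrt(P1 P2) exp(-k(1/2)) on the Bayes error for two Gaussians."""
    Sigma = 0.5 * (Sigma1 + Sigma2)
    dmu = mu2 - mu1
    k_half = 0.125 * dmu @ np.linalg.inv(Sigma) @ dmu \
        + 0.5 * np.log(np.linalg.det(Sigma)
                       / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return np.sqrt(P1 * P2) * np.exp(-k_half)

mu1, Sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, Sigma2 = np.array([2.0, 1.0]), np.array([[1.5, 0.3], [0.3, 0.8]])
print(bhattacharyya_bound(mu1, Sigma1, mu2, Sigma2, 0.5, 0.5))
```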
Neyman-Pearson Rule
“Classification, Estimation and Pattern recognition” by Young and Calvert
Signal Detection Theory
We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal x in the detector has mean μ1 (μ2) when the pulse is absent (present):
p(x | ω1) ~ N(μ1, σ²)
p(x | ω2) ~ N(μ2, σ²)
The detector uses a threshold x* to determine the presence of the pulse
Discriminability: ease of determining whether the pulse is present or not
d′ = |μ2 − μ1| / σ
For a given threshold, define hit, false alarm, miss and correct rejection:
P(x > x* | x ∈ ω2): hit
P(x > x* | x ∈ ω1): false alarm
P(x < x* | x ∈ ω2): miss
P(x < x* | x ∈ ω1): correct rejection
Receiver Operating Characteristic (ROC)
• Experimentally compute hit and false alarm rates for fixed x*
• Changing x* will change the hit and false alarm rates
• A plot of hit and false alarm rates is called the ROC curve
Performance shown at different operating points
Operating Characteristic
• In practice, distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted
• Vary a single control parameter for the decision rule and plot the resulting hit and false alarm rates
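Sweeping the control parameter (here the threshold x*) and recording (false alarm, hit) pairs traces the ROC curve; a minimal sketch for the two-Gaussian signal detection model above, with assumed μ1, μ2, σ:

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0                       # pulse absent / present (assumed)
d_prime = abs(mu2 - mu1) / sigma                      # discriminability d'

thresholds = np.linspace(-4.0, 6.0, 201)              # sweep the threshold x*
false_alarm = 1 - norm.cdf(thresholds, mu1, sigma)    # P(x > x* | w1)
hit = 1 - norm.cdf(thresholds, mu2, sigma)            # P(x > x* | w2)

# Each (false alarm, hit) pair is one operating point; together they form the ROC curve.
for fa, h in list(zip(false_alarm, hit))[::50]:
    print(f"P(false alarm) = {fa:.3f}   P(hit) = {h:.3f}   (d' = {d_prime})")
```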
Bayes Decision Theory – Discrete Features
• Components of x are binary or integer valued; x can take only one of m discrete values
v1, v2, …,vm
• Case of independent binary features for 2-category problem
Let x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)
• The discriminant function in this case is:
g(x) = Σi wi xi + w0   (sum over i = 1, …, d)

where:
wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],  i = 1, …, d
and:
w0 = Σi ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]   (sum over i = 1, …, d)

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
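The weights depend only on pi, qi, and the priors; a small sketch that builds wi and w0 from assumed values and classifies a binary feature vector:

```python
import numpy as np

p = np.array([0.8, 0.8, 0.8])          # p_i = P(x_i = 1 | w1) (assumed)
q = np.array([0.5, 0.5, 0.5])          # q_i = P(x_i = 1 | w2) (assumed)
P1, P2 = 0.5, 0.5                      # equal priors (assumed)

w = np.log(p * (1 - q) / (q * (1 - p)))                       # w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)      # w_0

def decide(x):                         # x is a 0/1 feature vector of length d
    g = w @ x + w0
    return "w1" if g > 0 else "w2"

print(w, w0)                           # here w_i = ln 4 = 1.3863 and w_0 = 3 ln 0.4 = -2.75
print(decide(np.array([1, 1, 0])), decide(np.array([1, 0, 0])))
```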
Bayesian Decision for Three-dimensional Binary Data
Decision boundary for 3D binary features. Left figure shows the case when pi=.8 and qi=.5. Right figure shows case when p3=q3 (Feature 3 is not providing any discriminatory information) so decision surface is parallel to x3 axis
• Consider a 2-class problem with three independent binary features; class priors are equal and pi = 0.8, qi = 0.5, i = 1, 2, 3
• wi = ln 4 = 1.3863
• w0 = 3 ln(0.2/0.5) ≈ −2.75
• Decision surface g(x) = 0 is shown below
Handling Missing Features
• Suppose it is not possible to measure a certain feature for a given pattern
• Possible solutions:
• Reject the pattern
• Approximate the missing feature
• Mean of all the available values for the missing feature
• Marginalize over the distribution of the missing feature
Other Topics
• Compound Bayes Decision Theory & Context
– Consecutive states of nature might not be statistically independent; in sorting two types of fish, arrival of next fish may not be independent of the previous fish
– Can we exploit such statistical dependence to gain improved performance (use of context)
– Compound decision vs. sequential compound decision problems
– Markov dependence
• Sequential Decision Making
– Feature measurement process is sequential (as in medical diagnosis)
– Feature measurement cost
– Minimize the no. of features to be measured while achieving a sufficient accuracy; minimize a combination of feature measurement cost & classification accuracy
Context in Text Recognition