Machine Learning
Transcript of Machine Learning
![Page 1: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/1.jpg)
Machine Learning
Central Problem of Pattern Recognition: Supervised and Unsupervised Learning
• Classification
• Bayesian Decision Theory
• Perceptrons and SVMs
• Clustering
Visual Computing: Joachim M. Buhmann — Machine Learning 143/196
![Page 2: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/2.jpg)
Machine Learning – What is the Challenge? Find optimal structure in data and validate it!
38 March 2006 Joachim M. Buhmann / Institute for Computational Science
Concept for Robust Data Analysis

[Block diagram: data (vectors, relations, images, …) enter a structure definition (costs, risk, …); structure optimization (multiscale analysis, stochastic approximation) operates on a quantization of the solution space (information/rate distortion theory) and regularizes the statistical and computational complexity; structure validation (statistical learning theory) closes feedback loops back to the earlier stages.]
![Page 3: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/3.jpg)
The Problem of Pattern Recognition
Machine Learning (like statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximative methods which yield good performance on average are particularly important.
• Representation of objects ⇒ data representation
• What is a pattern? Definition/modeling of structure.
• Optimization: search for preferred structures
• Validation: are the structures indeed in the data, or are they explained by fluctuations?
![Page 4: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/4.jpg)
Literature
• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)
• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)
• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)
• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)
• Larry Wasserman, All of Statistics. (1st ed. 2004, corr. 2nd printing, ISBN 0-387-40272-1) Springer Verlag (2004)
![Page 5: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/5.jpg)
The Classification Problem
![Page 6: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/6.jpg)
![Page 7: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/7.jpg)
Classification as a Pattern Recognition Problem
Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to classification examples. Distinguish conceptually between "objects" o ∈ O and "data" x ∈ X!

Data: pairs of feature vectors and class labels, Z = {(x_i, y_i) : 1 ≤ i ≤ n}, x_i ∈ ℝᵈ, y_i ∈ {1, …, k}

Definitions: feature space X with x_i ∈ X ⊂ ℝᵈ; class labels y_i ∈ {1, …, k}

Classifier: mapping c : X → {1, …, k}

k-class problem: what is y_{n+1} ∈ {1, …, k} for x_{n+1} ∈ ℝᵈ?
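A minimal sketch of such a mapping c : X → {1, …, k}: a nearest-class-mean rule trained on labelled pairs Z. The rule and the toy data are illustrative assumptions, not part of the lecture.

```python
# Hypothetical sketch: a classifier is any map c : X -> {1, ..., k}.
# Here: assign x to the class whose training mean is closest.

def class_means(Z):
    """Z: list of (feature_vector, label). Returns {label: mean vector}."""
    sums, counts = {}, {}
    for x, y in Z:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(x, means):
    """c(x): label of the nearest class mean (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(means, key=lambda y: sqdist(x, means[y]))

# two toy classes in R^2
Z = [([0.0, 0.0], 1), ([1.0, 0.0], 1), ([5.0, 5.0], 2), ([6.0, 5.0], 2)]
means = class_means(Z)
print(classify([0.2, 0.1], means))  # -> 1
print(classify([5.5, 4.9], means))  # -> 2
```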
![Page 8: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/8.jpg)
Example of Classification
![Page 9: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/9.jpg)
Histograms of Length Values
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l∗ will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
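The choice of l∗ can be sketched as a brute-force search over candidate thresholds, counting training errors; the fish lengths below are made up for illustration.

```python
# Sketch (with made-up counts): choose the threshold l* that minimises
# the number of misclassified training fish when only one feature is used.
# Rule: predict "salmon" below the threshold, "sea bass" at or above it.

def best_threshold(salmon, bass, candidates):
    """Return (l_star, errors): the threshold minimising training errors.
    salmon, bass: lists of feature values; candidates: thresholds to try."""
    best = None
    for t in candidates:
        # errors: salmon at/above t plus sea bass below t
        errs = sum(1 for x in salmon if x >= t) + sum(1 for x in bass if x < t)
        if best is None or errs < best[1]:
            best = (t, errs)
    return best

salmon = [5, 7, 8, 9, 11, 12]     # hypothetical lengths
bass = [10, 13, 14, 16, 18, 20]
l_star, errors = best_threshold(salmon, bass, range(5, 21))
print(l_star, errors)  # -> 13 1
```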
![Page 10: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/10.jpg)
Histograms of Skin Brightness Values
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x∗ marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 11: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/11.jpg)
Linear Classification
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 12: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/12.jpg)
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 13: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/13.jpg)
Optimized Non-Linear Classification
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Occam’s razor argument: Entia non sunt multiplicanda praeter necessitatem!
![Page 14: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/14.jpg)
Regression (see Introduction to Machine Learning)
Question: Given a feature (vector) x_i and a corresponding noisy measurement of a function value y_i = f(x_i) + noise, what is the unknown function f(·) in a hypothesis class H?

Data: Z = {(x_i, y_i) ∈ ℝᵈ × ℝ : 1 ≤ i ≤ n}

Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?
![Page 15: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/15.jpg)
The Regression Function
Questions: (i) What is the statistically optimal estimate of a function f : ℝᵈ → ℝ and (ii) which algorithm achieves this goal most efficiently?

Solution to (i): the regression function

y(x) = E[y | X = x] = ∫_Ω y p(y | X = x) dy

Figure: nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.
![Page 16: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/16.jpg)
Examples of linear and nonlinear regression
[Figure panels: linear regression vs. nonlinear regression.]

How should we measure the deviations? [Figure panels: vertical offsets vs. perpendicular offsets.]
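Ordinary least squares answers this with vertical offsets: it minimises ∑ᵢ (yᵢ − a·xᵢ − b)². A closed-form sketch (toy data assumed):

```python
# Sketch: ordinary least squares fits a line by minimising the *vertical*
# offsets sum_i (y_i - a*x_i - b)^2; a and b have a closed-form solution.

def ols_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# noise-free check: points on y = 2x + 1 are recovered exactly
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]
a, b = ols_line(xs, ys)
print(a, b)  # -> 2.0 1.0
```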
![Page 17: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/17.jpg)
Core Questions of Pattern Recognition: Unsupervised Learning

No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.

Examples for unsupervised learning:

1. data clustering, vector quantization: as in classification we search for a partitioning of objects into groups, but explicit labelings are not available.
2. hierarchical data analysis: search for tree structures in data
3. visualisation, dimension reduction

Semisupervised learning: some of the data are labeled, most of them are unlabeled.
![Page 18: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/18.jpg)
Modes of Learning
Reinforcement Learning: weakly supervised learning; action chains are evaluated at the end. Example: Backgammon; the neural network TD-Gammon gained the world championship! Quite popular in robotics.

Active Learning: data are selected according to their expected information gain. Example: information filtering.

Inductive Learning: the learning algorithm extracts logical rules from the data. Inductive Logic Programming is a popular subarea of Artificial Intelligence.
![Page 19: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/19.jpg)
Vectorial Data
[Scatter plot of points labelled A–T in the plane.] Data of 20 Gaussian sources in ℝ²⁰, projected onto two dimensions with Principal Component Analysis.
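A sketch of such a projection. Illustrative assumptions: power iteration with deflation stands in for a proper eigendecomposition, and the toy data live in ℝ³ rather than ℝ²⁰.

```python
# Sketch: project data onto its top-2 principal components (PCA).
# Pure Python, for illustration only: leading eigenvectors of the
# covariance matrix via power iteration with deflation.

import random

def mat_vec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def top_component(C, iters=200):
    """Leading eigenvector of a symmetric matrix C via power iteration."""
    random.seed(0)
    v = [random.random() for _ in C]
    for _ in range(iters):
        w = mat_vec(C, v)
        norm = sum(x * x for x in w) ** 0.5
        if norm < 1e-12:          # (near-)zero matrix: keep current direction
            break
        v = [x / norm for x in w]
    return v

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]       # covariance matrix
    v1 = top_component(C)
    lam1 = sum(w * x for w, x in zip(mat_vec(C, v1), v1))
    C2 = [[C[a][b] - lam1 * v1[a] * v1[b] for b in range(d)] for a in range(d)]
    v2 = top_component(C2)        # deflation gives the second component
    return [(sum(r[j] * v1[j] for j in range(d)),
             sum(r[j] * v2[j] for j in range(d))) for r in Xc]

# toy data in R^3: most variance along the first axis, some along the second
X = [[float(i), float(j), 0.0] for i in range(5) for j in range(3)]
proj = pca_2d(X)
print(proj[0])  # first point, expressed in the top-2 PCA coordinates
```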
![Page 20: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/20.jpg)
Relational Data
Pairwise dissimilarity of 145 globins which have been selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.
![Page 21: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/21.jpg)
Scales for Data
Nominal or categorical scale: qualitative, but without quantitative measurements, e.g. a binary scale F = {0, 1} (presence or absence of properties like "kosher") or the taste categories "sweet, sour, salty and bitter".

Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races!?)
![Page 22: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/22.jpg)
Quantitative scale:
• interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).
• ratio scale: the zero value of the scale carries information but not the measurement unit (Kelvin scale).
• absolute scale: absolute values are meaningful (grades of final exams).
![Page 23: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/23.jpg)
Machine Learning: Topic Chart
• Core problems of pattern recognition
• Bayesian decision theory
• Perceptrons and Support vector machines
• Data clustering
![Page 24: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/24.jpg)
Bayesian Decision Theory
The Problem of Statistical Decisions

Task: n objects have to be partitioned into the classes {1, …, k}, the doubt class D and the outlier class O.

D: doubt class (→ new measurements required)
O: outlier class, definitively none of the classes 1, 2, …, k

Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.

Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y). Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
![Page 25: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/25.jpg)
Structure of the feature space X
• X ⊂ Rd
• X = X₁ × X₂ × · · · × X_d with X_i ⊆ ℝ or X_i finite.

Remark: in most situations we can define the feature space as subsets of ℝᵈ or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.

Class density / likelihood: p_y(x) := P(X = x | Y = y) is the probability of a feature value x given a class y.

Parametric statistics: estimate the parameters of the class densities p_y(x).

Non-parametric statistics: minimize the empirical risk.
![Page 26: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/26.jpg)
Motivation of Classification

Given are labeled data Z = {(x_i, y_i) : i ≤ n}.

Questions:

1. What are the class boundaries?
2. What are the class-specific densities p_y(x)?
3. How many modes or parameters do we need to model p_y(x)?
4. ...

Figure: quadratic SVM classifier for five classes. White areas are ambiguous regions.
![Page 27: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/27.jpg)
Thomas Bayes and his Terminology
The State of Nature is modelled as a random variable!
prior: P{model}
likelihood: P{data | model}
posterior: P{model | data}
evidence: P{data}

Bayes Rule: P{model | data} = P{data | model} · P{model} / P{data}
![Page 28: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/28.jpg)
Ronald A. Fisher and Frequentism
Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics together with Jerzy Neyman & Karl Pearson.

British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.

Maximum likelihood method

Fisher information: a measure for the information content of densities.
Sampling theory
Hypothesis testing
![Page 29: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/29.jpg)
Bayesianism vs. Frequentist Inference¹

Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement.

Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.

The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability.

Frequentism rejects degree-of-belief interpretations of mathematical probability as in Bayesianism, and assigns probabilities only to random events according to their relative frequencies of occurrence.

¹ See http://encyclopedia.thefreedictionary.com/
![Page 30: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/30.jpg)
Bayes Rule for Known Densities and Parameters
Assume that we know how the features are distributed for the different classes, i.e., the class conditional densities and their parameters are known. What is the best classification strategy in this situation?

Classifier: c : X → {1, …, k, D}. The assignment function c maps the feature space X to the set of classes {1, …, k, D}. (Outliers are neglected.)

Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.
![Page 31: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/31.jpg)
Error count: The indicator function

∑_{x∈X} I_{c(x)≠y}

counts the classifier mistakes. Note that this error count is a random variable!

Expected errors, also called expected risk, define the quality of a classifier:

R(c) = ∑_{y≤k} P(y) E_{P(x)}[I_{c(x)≠y} | Y = y] + terms from D

Remark: The rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then the classifier with minimal expected risk will win on average against any other classification rule ("Dutch books")!
![Page 32: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/32.jpg)
The Loss Function
Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.

⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.

0-1 loss: all classes are treated the same!

L_{0−1}(y, z) = 0 if z = y (correct decision),
                1 if z ≠ y and z ≠ D (wrong decision),
                d if z = D (no decision)
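The 0-1 loss with doubt option can be written directly; the doubt cost d (0 < d < 1) is a free parameter, and the value 0.2 below is an arbitrary choice for illustration.

```python
# Sketch of the 0-1 loss with a doubt option: cost 0 for a correct decision,
# 1 for a wrong one, and d for abstaining (z == DOUBT).

DOUBT = "D"

def loss_01(y, z, d=0.2):
    if z == y:
        return 0.0          # correct decision
    if z == DOUBT:
        return d            # no decision
    return 1.0              # wrong decision

print(loss_01(1, 1), loss_01(1, 2), loss_01(1, DOUBT))  # -> 0.0 1.0 0.2
```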
![Page 33: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/33.jpg)
• Weighted classification costs L(y, z) ∈ ℝ₊ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y) ((z, y) ∼ (pancreas cancer, gastritis)).
Conditional risk function of the classifier is the expected loss for class y:

R(c, y) = E_x[L(y, c(x)) | Y = y]
        = ∑_{z≤k} L(y, z) P{c(x) = z | Y = y} + L(y, D) P{c(x) = D | Y = y}
        = P{c(x) ≠ y ∧ c(x) ≠ D | Y = y} + d · P{c(x) = D | Y = y}
        = pmc(y) + d · pd(y),

where pmc(y) is the probability of misclassification and pd(y) the probability of doubt (the last two lines use the 0-1 loss).
![Page 34: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/34.jpg)
Total risk of the classifier (π_y := P(Y = y)):

R(c) = ∑_{z≤k} π_z pmc(z) + d ∑_{z≤k} π_z pd(z) = E_C[R(c, C)]

Asymptotic average loss:

lim_{n→∞} (1/n) ∑_{j≤n} L(c_j, c(x_j)) = lim_{n→∞} R̂_n(c) = R(c),

where {(x_j, c_j) | 1 ≤ j ≤ n} is a random sample set of size n. This formula can be interpreted as the expected loss with the empirical distribution as probability model.
![Page 35: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/35.jpg)
Posterior class probability
Posterior: Let

p(y|x) ≡ P{Y = y | X = x} = π_y p_y(x) / ∑_z π_z p_z(x)

be the posterior of the class y given X = x.

(The "partition of one" π_y p_y(x) / ∑_z π_z p_z(x) results from the normalization ∑_z p(z|x) = 1.)

Likelihood: The class conditional density p_y(x) is the probability of observing data X = x given class Y = y.

Prior: π_y is the probability of class Y = y.
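A sketch of the "partition of one": posteriors computed from priors and likelihood values at a fixed x. The numerical values are made up for illustration.

```python
# Sketch: turn priors pi_y and class likelihoods p_y(x) into posteriors
# p(y|x) = pi_y * p_y(x) / sum_z pi_z * p_z(x)  ("partition of one").

def posteriors(priors, likelihoods):
    """priors: {y: pi_y}; likelihoods: {y: p_y(x)} at a fixed x."""
    evidence = sum(priors[y] * likelihoods[y] for y in priors)
    return {y: priors[y] * likelihoods[y] / evidence for y in priors}

post = posteriors({1: 2 / 3, 2: 1 / 3}, {1: 0.10, 2: 0.40})
print(post)  # the posteriors sum to 1
```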
![Page 36: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/36.jpg)
Bayes Optimal Classifier
Theorem 1. The classification rule which minimizes the total risk for 0−1 loss is

c(x) = y  if p(y|x) = max_{z≤k} p(z|x) > 1 − d,
c(x) = D  if p(y|x) ≤ 1 − d for all y.

Generalization to arbitrary loss functions:

c(x) = y  if ∑_z L(z, y) p(z|x) = min_{ρ≤k} ∑_z L(z, ρ) p(z|x) ≤ d,
c(x) = D  otherwise.

Bayes classifier: Select the class with the highest π_y p_y(x) value if it exceeds the cost of not making a decision, i.e., π_y p_y(x) > (1 − d) p(x).
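Theorem 1 as a few lines of code: pick the maximum-posterior class if its posterior exceeds 1 − d, otherwise return doubt. The posterior values below are illustrative.

```python
# Sketch of the 0-1-loss Bayes rule with doubt cost d: answer the class
# with the largest posterior if it exceeds 1 - d, otherwise abstain.

DOUBT = "D"

def bayes_classify(post, d):
    """post: {class: p(y|x)}; returns the risk-minimising decision."""
    y_best = max(post, key=post.get)
    return y_best if post[y_best] > 1 - d else DOUBT

print(bayes_classify({1: 0.70, 2: 0.30}, d=0.4))  # 0.70 > 0.6 -> 1
print(bayes_classify({1: 0.55, 2: 0.45}, d=0.4))  # 0.55 <= 0.6 -> 'D'
```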
![Page 37: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/37.jpg)
Proof: Calculate the total expected loss R(c):

R(c) = E_X[E_Y[L_{0−1}(Y, c(x)) | X = x]]
     = ∫_X E_Y[L_{0−1}(Y, c(x)) | X = x] p(x) dx,  with p(x) = ∑_{z≤k} π_z p_z(x).

Minimize the conditional expectation value, since only it depends on c:

c(x) = argmin_{c∈{1,…,k,D}} E[L_{0−1}(Y, c) | X = x]
     = argmin_{c∈{1,…,k,D}} ∑_{z≤k} L_{0−1}(z, c) p(z|x)
     = argmin_{c∈{1,…,k}} (1 − p(c|x))  if d > min_c (1 − p(c|x)), else D
     = argmax_{c∈{1,…,k}} p(c|x)  if 1 − d < max_c p(c|x), else D
![Page 38: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/38.jpg)
Outliers
• Modeling by an outlier class π_O with density p_O(x).

• "Novelty detection": classify a measurement as an outlier if
  π_O p_O(x) ≥ max{(1 − d) p(x), max_z π_z p_z(x)}.

• The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!

• The outlier class is often modeled by a uniform distribution. Attention: the normalization of a uniform distribution does not exist in many feature spaces!
  ⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
![Page 39: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/39.jpg)
Class Conditional Densities and Posteriors for 2 Classes

Class-conditional probability density functions and posterior probabilities for the priors P(y₁) = 2/3 and P(y₂) = 1/3.

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωᵢ. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 2.2. Posterior probabilities for the particular priors P(ω₁) = 2/3 and P(ω₂) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω₂ is roughly 0.08, and that it is in ω₁ is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 40: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/40.jpg)
Likelihood Ratio for 2 Class Example
FIGURE 2.3. The likelihood ratio p(x|ω₁)/p(x|ω₂) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω₂ as ω₁ patterns more than the converse, we get the larger threshold θb, and hence R₁ becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 41: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/41.jpg)
Discriminant Functions g_l

FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gᵢ(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

• Discriminant function: g_y(x) = P{Y = y | X = x}
• Class decision: g_y(x) > g_z(x) ∀ z ≠ y ⇒ class y.
• Different discriminant functions can yield the same decision: g_y(x) = log P{x|y} + log π_y; minimize implementation problems!
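A sketch of the last point: deciding via g_y(x) = log p(x|y) + log π_y yields the same argmax as the posterior itself, since the evidence p(x) is common to all classes. The 1-D unit-variance Gaussian likelihoods and the class means are assumptions for illustration.

```python
# Sketch: decide by argmax_y of the discriminant log p(x|y) + log pi_y.

import math

def decide(x, priors, likelihood):
    """argmax_y [log p(x|y) + log pi_y]; likelihood(x, y) -> p(x|y)."""
    return max(priors,
               key=lambda y: math.log(likelihood(x, y)) + math.log(priors[y]))

# hypothetical 1-D Gaussian class densities with unit variance
def gauss_lik(x, y, means={1: 0.0, 2: 4.0}):
    return math.exp(-0.5 * (x - means[y]) ** 2) / math.sqrt(2 * math.pi)

print(decide(0.5, {1: 0.5, 2: 0.5}, gauss_lik))  # nearer mean 0 -> 1
print(decide(3.5, {1: 0.5, 2: 0.5}, gauss_lik))  # nearer mean 4 -> 2
```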
![Page 42: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/42.jpg)
Example for Discriminant Functions
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R₂ is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 43: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/43.jpg)
Adaptation of Discriminant Functions g_l

[Figure: the classifier network of Fig. 2.5, extended by a MAX stage whose output is compared against a teacher signal.] The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.
![Page 44: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/44.jpg)
Example Discriminant Functions: Normal Distributions

The likelihood of class y is Gaussian distributed:

p_y(x) = (2π)^{−d/2} |Σ_y|^{−1/2} exp(−½ (x − μ_y)ᵀ Σ_y⁻¹ (x − μ_y))

Special case: Σ_y = σ²I

g_y(x) = log p_y(x) + log π_y = −‖x − μ_y‖²/(2σ²) + log π_y + const.
![Page 45: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/45.jpg)
⇒ Decision surface between class z and y:

−‖x − μ_z‖²/(2σ²) + log π_z = −‖x − μ_y‖²/(2σ²) + log π_y
−‖x‖² + 2x·μ_z − ‖μ_z‖² + 2σ² log π_z = −‖x‖² + 2x·μ_y − ‖μ_y‖² + 2σ² log π_y
⇒ 2x·(μ_z − μ_y) − ‖μ_z‖² + ‖μ_y‖² + 2σ² log(π_z/π_y) = 0

Linear decision rule: wᵀ(x − x₀) = 0 with

w = μ_z − μ_y,  x₀ = ½(μ_z + μ_y) − σ² log(π_z/π_y) · (μ_z − μ_y)/‖μ_z − μ_y‖²
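The derived w and x₀ can be computed directly; with equal priors the log term vanishes and the boundary passes through the midpoint of the means. The numbers below are toy values for illustration.

```python
# Sketch: linear decision rule between two spherical Gaussian classes
# (Sigma = sigma^2 I): decide by the sign of w^T (x - x0), with
# w = mu_z - mu_y and x0 as derived above.

import math

def linear_boundary(mu_z, mu_y, sigma2, pi_z, pi_y):
    w = [a - b for a, b in zip(mu_z, mu_y)]
    nw2 = sum(v * v for v in w)
    shift = sigma2 * math.log(pi_z / pi_y) / nw2
    x0 = [0.5 * (a + b) - shift * v for a, b, v in zip(mu_z, mu_y, w)]
    return w, x0

def side(x, w, x0):
    """> 0: class z side of the hyperplane; < 0: class y side."""
    return sum(wi * (xi - x0i) for wi, xi, x0i in zip(w, x, x0))

# equal priors: the boundary passes through the midpoint of the means
w, x0 = linear_boundary([2.0, 0.0], [0.0, 0.0], sigma2=1.0, pi_z=0.5, pi_y=0.5)
print(x0)                            # midpoint of the means
print(side([1.5, 0.3], w, x0) > 0)   # point nearer mu_z -> True
```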
![Page 46: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/46.jpg)
Decision Surface for Gaussians in 1, 2, 3 Dimensions

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωᵢ) and the boundaries for the case P(ω₁) = P(ω₂). In the three-dimensional case, the grid plane separates R₁ from R₂. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 47: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/47.jpg)
FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 48: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/48.jpg)
Multi Class Case
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Decision regions for four Gaussian distributions. Even for such a small number of classes the discriminant functions show a complex form.
![Page 49: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/49.jpg)
Example: Gene Expression Data
The expression of genes is measured for various patients. The expression profiles provide information about the metabolic state of the cells, meaning that they could be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.

[Figure: gene-expression heat map (genes × samples) with true and predicted labels for the classes ALL B-cell, ALL T-cell and AML.]
![Page 50: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/50.jpg)
Parametric Models for Class Densities
If we knew the prior probabilities and the class conditional probabilities, then we could calculate the optimal classifier. But we don't!

Task: Estimate p(y|x; θ) from samples Z = {(x₁, y₁), …, (x_n, y_n)} for classification.

Data are sorted according to their classes: X_y = {X_{1y}, …, X_{n_y,y}}, where X_{iy} ∼ P{X | Y = y; θ_y}.

Question: How can we use the information in the samples to estimate θ_y?

Assumption: classes can be separated and treated independently! X_y is not informative w.r.t. θ_z, z ≠ y.
![Page 51: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/51.jpg)
Maximum Likelihood Estimation Theory
Likelihood of the data set: P{X_y | θ_y} = ∏_{i≤n_y} p(x_{iy} | θ_y)

Estimation principle: Select the parameters θ_y which maximize the likelihood, that means

θ̂_y = arg max_{θ_y} P{X_y | θ_y}

Procedure: Find the extreme value of the log-likelihood function:

∇_{θ_y} log P{X | θ_y} = 0  ⇔  ∂/∂θ_y ∑_{i≤n} log p(x_i | θ_y) = 0
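A numerical sketch of the principle for a 1-D Gaussian with known σ: a coarse grid search over μ recovers the sample mean as the maximiser of the log-likelihood. The data values are made up.

```python
# Sketch: maximum likelihood for the mean of a 1-D Gaussian (sigma fixed).
# The log-likelihood sum_i log p(x_i | mu) is maximised at the sample mean;
# here we confirm this numerically by a grid search over mu.

import math

def log_lik(xs, mu, sigma=1.0):
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - 0.5 * math.log(2 * math.pi * sigma ** 2) for x in xs)

xs = [1.0, 2.0, 3.0, 6.0]
grid = [i / 100 for i in range(0, 801)]            # mu in [0, 8]
mu_hat = max(grid, key=lambda m: log_lik(xs, m))
print(mu_hat, sum(xs) / len(xs))  # both are the sample mean 3.0
```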
![Page 52: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/52.jpg)
Remark
Bias of an estimator: bias(θ̂_n) = E[θ̂_n] − θ.

Consistent estimator: A point estimator θ̂_n of a parameter θ is consistent if θ̂_n → θ in probability.

Asymptotic normality of maximum likelihood estimates: (θ̂_n − θ)/√V[θ̂_n] ⇝ N(0, 1).

Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution P{θ_y | X_y} (details of the density do not have to be modelled since they might not influence the posterior).
![Page 53: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/53.jpg)
Example: Multivariate Normal Distribution
Expectation values of a normal distribution and its estimation; the class index has been omitted for legibility (θ_y → θ).

log p(x_i | θ) = −½ (x_i − μ)ᵀ Σ⁻¹ (x_i − μ) − (d/2) log 2π − ½ log |Σ|

∂/∂μ ∑_{i≤n} log p(x_i | θ) = ½ ∑_{i≤n} Σ⁻¹(x_i − μ) + ½ ∑_{i≤n} ((x_i − μ)ᵀ Σ⁻¹)ᵀ = 0

⇒ Σ⁻¹ ∑_{i≤n} (x_i − μ) = 0  ⇒  μ̂_n = (1/n) ∑_i x_i  (estimator for μ)

The average-value formula results from the quadratic form.

Unbiasedness: E[μ̂_n] = (1/n) ∑_{i≤n} E[x_i] = E[x] = μ
![Page 54: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/54.jpg)
ML estimation of the variance (1d case)

∂/∂σ² ∑_{i≤n} log p(x_i | θ) = ∂/∂σ² ( −∑_{i≤n} ‖x_i − μ‖²/(2σ²) − (n/2) log(2πσ²) )
 = ½ ∑_{i≤n} σ⁻⁴ ‖x_i − μ‖² − (n/2) σ⁻² = 0

⇒ σ̂²_n = (1/n) ∑_{i≤n} ‖x_i − μ‖²

Multivariate case: Σ̂_n = (1/n) ∑_{i≤n} (x_i − μ)(x_i − μ)ᵀ

Σ̂_n is biased, i.e., E[Σ̂_n] ≠ Σ, if μ is unknown.
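The bias can be checked numerically: averaging the ML variance estimator (which divides by n and plugs in the sample mean) over many simulated samples of size n gives roughly ((n − 1)/n)σ² rather than σ². The simulation parameters below are arbitrary.

```python
# Sketch: the ML variance estimator is biased when mu is replaced by the
# sample mean; the average over many trials is close to (n-1)/n * sigma^2.

import random

def sigma2_ml(xs):
    mu = sum(xs) / len(xs)                       # sample mean plugged in
    return sum((x - mu) ** 2 for x in xs) / len(xs)

random.seed(1)
n, trials, sigma = 5, 20000, 1.0
avg = sum(sigma2_ml([random.gauss(0.0, sigma) for _ in range(n)])
          for _ in range(trials)) / trials
print(avg)  # close to (n-1)/n = 0.8, not 1.0
```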