Machine Learning
Transcript of Machine Learning
![Page 1: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/1.jpg)
Machine Learning
Central Problem of Pattern Recognition: Supervised and Unsupervised Learning
• Classification
• Bayesian Decision Theory
• Perceptrons and SVMs
• Clustering
Visual Computing: Joachim M. Buhmann — Machine Learning 143/196
![Page 2: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/2.jpg)
Machine Learning – What is the Challenge? Find optimal structure in data and validate it!
38 March 2006 Joachim M. Buhmann / Institute for Computational Science
Concept for Robust Data Analysis

[Block diagram: data (vectors, relations, images, …) enter a structure definition (costs, risk, …); structure optimization (multiscale analysis, stochastic approximation) operates on a quantization of the solution space (information/rate distortion theory) and regularizes the statistical and computational complexity; structure validation (statistical learning theory) closes feedback loops back to the earlier stages.]
![Page 3: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/3.jpg)
The Problem of Pattern Recognition
Machine Learning (like statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximative methods which yield good performance on average are particularly important.
• Representation of objects ⇒ data representation
• What is a pattern? Definition/modeling of structure.
• Optimization: search for preferred structures
• Validation: are the structures indeed in the data, or are they explained by fluctuations?
![Page 4: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/4.jpg)
Literature
• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)
• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)
• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)
• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)
• Larry Wasserman, All of Statistics. (1st ed. 2004, corr. 2nd printing, ISBN 0-387-40272-1) Springer Verlag (2004)
![Page 5: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/5.jpg)
The Classification Problem
![Page 6: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/6.jpg)
![Page 7: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/7.jpg)
Classification as a Pattern Recognition Problem
Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to classification examples. Distinguish conceptually between "objects" o ∈ O and "data" x ∈ X!

Data: pairs of feature vectors and class labels, Z = {(x_i, y_i) : 1 ≤ i ≤ n}, x_i ∈ ℝᵈ, y_i ∈ {1, …, k}

Definitions: feature space X with x_i ∈ X ⊂ ℝᵈ; class labels y_i ∈ {1, …, k}

Classifier: mapping c : X → {1, …, k}

k-class problem: what is y_{n+1} ∈ {1, …, k} for x_{n+1} ∈ ℝᵈ?
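A minimal sketch of such a mapping c : X → {1, …, k}: a nearest-class-mean rule trained on labelled pairs Z. The rule and the toy data are illustrative assumptions, not part of the lecture.

```python
# Hypothetical sketch: a classifier is any map c : X -> {1, ..., k}.
# Here: assign x to the class whose training mean is closest.

def class_means(Z):
    """Z: list of (feature_vector, label). Returns {label: mean vector}."""
    sums, counts = {}, {}
    for x, y in Z:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(x, means):
    """c(x): label of the nearest class mean (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(means, key=lambda y: sqdist(x, means[y]))

# two toy classes in R^2
Z = [([0.0, 0.0], 1), ([1.0, 0.0], 1), ([5.0, 5.0], 2), ([6.0, 5.0], 2)]
means = class_means(Z)
print(classify([0.2, 0.1], means))  # -> 1
print(classify([5.5, 4.9], means))  # -> 2
```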
![Page 8: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/8.jpg)
Example of Classification
![Page 9: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/9.jpg)
Histograms of Length Values
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l∗ will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
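The choice of l∗ can be sketched as a brute-force search over candidate thresholds, counting training errors; the fish lengths below are made up for illustration.

```python
# Sketch (with made-up counts): choose the threshold l* that minimises
# the number of misclassified training fish when only one feature is used.
# Rule: predict "salmon" below the threshold, "sea bass" at or above it.

def best_threshold(salmon, bass, candidates):
    """Return (l_star, errors): the threshold minimising training errors.
    salmon, bass: lists of feature values; candidates: thresholds to try."""
    best = None
    for t in candidates:
        # errors: salmon at/above t plus sea bass below t
        errs = sum(1 for x in salmon if x >= t) + sum(1 for x in bass if x < t)
        if best is None or errs < best[1]:
            best = (t, errs)
    return best

salmon = [5, 7, 8, 9, 11, 12]     # hypothetical lengths
bass = [10, 13, 14, 16, 18, 20]
l_star, errors = best_threshold(salmon, bass, range(5, 21))
print(l_star, errors)  # -> 13 1
```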
![Page 10: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/10.jpg)
Histograms of Skin Brightness Values
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x∗ marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 11: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/11.jpg)
Linear Classification
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 12: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/12.jpg)
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 13: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/13.jpg)
Optimized Non-Linear Classification
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Occam’s razor argument: Entia non sunt multiplicanda praeter necessitatem!
![Page 14: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/14.jpg)
Regression (see Introduction to Machine Learning)
Question: Given a feature (vector) x_i and a corresponding noisy measurement of a function value y_i = f(x_i) + noise, what is the unknown function f(·) in a hypothesis class H?

Data: Z = {(x_i, y_i) ∈ ℝᵈ × ℝ : 1 ≤ i ≤ n}

Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?
![Page 15: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/15.jpg)
The Regression Function
Questions: (i) What is the statistically optimal estimate of a function f : ℝᵈ → ℝ and (ii) which algorithm achieves this goal most efficiently?

Solution to (i): the regression function

y(x) = E[y | X = x] = ∫_Ω y p(y | X = x) dy

Figure: nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.
![Page 16: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/16.jpg)
Examples of linear and nonlinear regression
[Figure panels: linear regression vs. nonlinear regression.]

How should we measure the deviations? [Figure panels: vertical offsets vs. perpendicular offsets.]
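Ordinary least squares answers this with vertical offsets: it minimises ∑ᵢ (yᵢ − a·xᵢ − b)². A closed-form sketch (toy data assumed):

```python
# Sketch: ordinary least squares fits a line by minimising the *vertical*
# offsets sum_i (y_i - a*x_i - b)^2; a and b have a closed-form solution.

def ols_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# noise-free check: points on y = 2x + 1 are recovered exactly
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]
a, b = ols_line(xs, ys)
print(a, b)  # -> 2.0 1.0
```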
![Page 17: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/17.jpg)
Core Questions of Pattern Recognition: Unsupervised Learning

No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.

Examples for unsupervised learning:

1. data clustering, vector quantization: as in classification we search for a partitioning of objects into groups, but explicit labelings are not available.
2. hierarchical data analysis: search for tree structures in data
3. visualisation, dimension reduction

Semisupervised learning: some of the data are labeled, most of them are unlabeled.
![Page 18: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/18.jpg)
Modes of Learning
Reinforcement Learning: weakly supervised learning; action chains are evaluated at the end. Example: Backgammon; the neural network TD-Gammon gained the world championship! Quite popular in robotics.

Active Learning: data are selected according to their expected information gain. Example: information filtering.

Inductive Learning: the learning algorithm extracts logical rules from the data. Inductive Logic Programming is a popular subarea of Artificial Intelligence.
![Page 19: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/19.jpg)
Vectorial Data
[Scatter plot of points labelled A–T in the plane.] Data of 20 Gaussian sources in ℝ²⁰, projected onto two dimensions with Principal Component Analysis.
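A sketch of such a projection. Illustrative assumptions: power iteration with deflation stands in for a proper eigendecomposition, and the toy data live in ℝ³ rather than ℝ²⁰.

```python
# Sketch: project data onto its top-2 principal components (PCA).
# Pure Python, for illustration only: leading eigenvectors of the
# covariance matrix via power iteration with deflation.

import random

def mat_vec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def top_component(C, iters=200):
    """Leading eigenvector of a symmetric matrix C via power iteration."""
    random.seed(0)
    v = [random.random() for _ in C]
    for _ in range(iters):
        w = mat_vec(C, v)
        norm = sum(x * x for x in w) ** 0.5
        if norm < 1e-12:          # (near-)zero matrix: keep current direction
            break
        v = [x / norm for x in w]
    return v

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]       # covariance matrix
    v1 = top_component(C)
    lam1 = sum(w * x for w, x in zip(mat_vec(C, v1), v1))
    C2 = [[C[a][b] - lam1 * v1[a] * v1[b] for b in range(d)] for a in range(d)]
    v2 = top_component(C2)        # deflation gives the second component
    return [(sum(r[j] * v1[j] for j in range(d)),
             sum(r[j] * v2[j] for j in range(d))) for r in Xc]

# toy data in R^3: most variance along the first axis, some along the second
X = [[float(i), float(j), 0.0] for i in range(5) for j in range(3)]
proj = pca_2d(X)
print(proj[0])  # first point, expressed in the top-2 PCA coordinates
```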
![Page 20: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/20.jpg)
Relational Data
Pairwise dissimilarity of 145 globins which have been selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.
![Page 21: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/21.jpg)
Scales for Data
Nominal or categorical scale: qualitative, but without quantitative measurements, e.g. a binary scale F = {0, 1} (presence or absence of properties like "kosher") or the taste categories "sweet, sour, salty and bitter".

Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races!?)
![Page 22: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/22.jpg)
Quantitative scale:
• interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).
• ratio scale: the zero value of the scale carries information but not the measurement unit (Kelvin scale).
• absolute scale: absolute values are meaningful (grades of final exams).
![Page 23: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/23.jpg)
Machine Learning: Topic Chart
• Core problems of pattern recognition
• Bayesian decision theory
• Perceptrons and Support vector machines
• Data clustering
![Page 24: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/24.jpg)
Bayesian Decision Theory
The Problem of Statistical Decisions

Task: n objects have to be partitioned into the classes {1, …, k}, the doubt class D and the outlier class O.

D: doubt class (→ new measurements required)
O: outlier class, definitively none of the classes 1, 2, …, k

Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.

Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y). Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
![Page 25: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/25.jpg)
Structure of the feature space X
• X ⊂ Rd
• X = X₁ × X₂ × · · · × X_d with X_i ⊆ ℝ or X_i finite.

Remark: in most situations we can define the feature space as subsets of ℝᵈ or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.

Class density / likelihood: p_y(x) := P(X = x | Y = y) is the probability of a feature value x given a class y.

Parametric statistics: estimate the parameters of the class densities p_y(x).

Non-parametric statistics: minimize the empirical risk.
![Page 26: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/26.jpg)
Motivation of Classification

Given are labeled data Z = {(x_i, y_i) : i ≤ n}.

Questions:

1. What are the class boundaries?
2. What are the class-specific densities p_y(x)?
3. How many modes or parameters do we need to model p_y(x)?
4. ...

Figure: quadratic SVM classifier for five classes. White areas are ambiguous regions.
![Page 27: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/27.jpg)
Thomas Bayes and his Terminology
The State of Nature is modelled as a random variable!
prior: P{model}
likelihood: P{data | model}
posterior: P{model | data}
evidence: P{data}

Bayes Rule: P{model | data} = P{data | model} · P{model} / P{data}
![Page 28: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/28.jpg)
Ronald A. Fisher and Frequentism
Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics together with Jerzy Neyman & Karl Pearson.

British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.

Maximum likelihood method

Fisher information: a measure for the information content of densities.
Sampling theory
Hypothesis testing
![Page 29: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/29.jpg)
Bayesianism vs. Frequentist Inference¹

Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement.

Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.

The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability.

Frequentism rejects degree-of-belief interpretations of mathematical probability as in Bayesianism, and assigns probabilities only to random events according to their relative frequencies of occurrence.

¹ See http://encyclopedia.thefreedictionary.com/
![Page 30: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/30.jpg)
Bayes Rule for Known Densities and Parameters
Assume that we know how the features are distributed for the different classes, i.e., the class conditional densities and their parameters are known. What is the best classification strategy in this situation?

Classifier: c : X → {1, …, k, D}. The assignment function c maps the feature space X to the set of classes {1, …, k, D}. (Outliers are neglected.)

Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.
![Page 31: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/31.jpg)
Error count: The indicator function

∑_{x∈X} I_{c(x)≠y}

counts the classifier mistakes. Note that this error count is a random variable!

Expected errors, also called expected risk, define the quality of a classifier:

R(c) = ∑_{y≤k} P(y) E_{P(x)}[I_{c(x)≠y} | Y = y] + terms from D

Remark: The rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then the classifier with minimal expected risk will win on average against any other classification rule ("Dutch books")!
![Page 32: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/32.jpg)
The Loss Function
Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.

⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.

0-1 loss: all classes are treated the same!

L_{0−1}(y, z) = 0 if z = y (correct decision),
                1 if z ≠ y and z ≠ D (wrong decision),
                d if z = D (no decision)
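The 0-1 loss with doubt option can be written directly; the doubt cost d (0 < d < 1) is a free parameter, and the value 0.2 below is an arbitrary choice for illustration.

```python
# Sketch of the 0-1 loss with a doubt option: cost 0 for a correct decision,
# 1 for a wrong one, and d for abstaining (z == DOUBT).

DOUBT = "D"

def loss_01(y, z, d=0.2):
    if z == y:
        return 0.0          # correct decision
    if z == DOUBT:
        return d            # no decision
    return 1.0              # wrong decision

print(loss_01(1, 1), loss_01(1, 2), loss_01(1, DOUBT))  # -> 0.0 1.0 0.2
```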
![Page 33: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/33.jpg)
• Weighted classification costs L(y, z) ∈ ℝ₊ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y) ((z, y) ∼ (pancreas cancer, gastritis)).
Conditional risk function of the classifier is the expected loss for class y:

R(c, y) = E_x[L(y, c(x)) | Y = y]
        = ∑_{z≤k} L(y, z) P{c(x) = z | Y = y} + L(y, D) P{c(x) = D | Y = y}
        = P{c(x) ≠ y ∧ c(x) ≠ D | Y = y} + d · P{c(x) = D | Y = y}
        = pmc(y) + d · pd(y),

where pmc(y) is the probability of misclassification and pd(y) the probability of doubt (the last two lines use the 0-1 loss).
![Page 34: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/34.jpg)
Total risk of the classifier (π_y := P(Y = y)):

R(c) = ∑_{z≤k} π_z pmc(z) + d ∑_{z≤k} π_z pd(z) = E_C[R(c, C)]

Asymptotic average loss:

lim_{n→∞} (1/n) ∑_{j≤n} L(c_j, c(x_j)) = lim_{n→∞} R̂_n(c) = R(c),

where {(x_j, c_j) | 1 ≤ j ≤ n} is a random sample set of size n. This formula can be interpreted as the expected loss with the empirical distribution as probability model.
![Page 35: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/35.jpg)
Posterior class probability
Posterior: Let

p(y|x) ≡ P{Y = y | X = x} = π_y p_y(x) / ∑_z π_z p_z(x)

be the posterior of the class y given X = x.

(The "partition of one" π_y p_y(x) / ∑_z π_z p_z(x) results from the normalization ∑_z p(z|x) = 1.)

Likelihood: The class conditional density p_y(x) is the probability of observing data X = x given class Y = y.

Prior: π_y is the probability of class Y = y.
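A sketch of the "partition of one": posteriors computed from priors and likelihood values at a fixed x. The numerical values are made up for illustration.

```python
# Sketch: turn priors pi_y and class likelihoods p_y(x) into posteriors
# p(y|x) = pi_y * p_y(x) / sum_z pi_z * p_z(x)  ("partition of one").

def posteriors(priors, likelihoods):
    """priors: {y: pi_y}; likelihoods: {y: p_y(x)} at a fixed x."""
    evidence = sum(priors[y] * likelihoods[y] for y in priors)
    return {y: priors[y] * likelihoods[y] / evidence for y in priors}

post = posteriors({1: 2 / 3, 2: 1 / 3}, {1: 0.10, 2: 0.40})
print(post)  # the posteriors sum to 1
```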
![Page 36: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/36.jpg)
Bayes Optimal Classifier
Theorem 1. The classification rule which minimizes the total risk for 0−1 loss is

c(x) = y  if p(y|x) = max_{z≤k} p(z|x) > 1 − d,
c(x) = D  if p(y|x) ≤ 1 − d for all y.

Generalization to arbitrary loss functions:

c(x) = y  if ∑_z L(z, y) p(z|x) = min_{ρ≤k} ∑_z L(z, ρ) p(z|x) ≤ d,
c(x) = D  otherwise.

Bayes classifier: Select the class with the highest π_y p_y(x) value if it exceeds the cost of not making a decision, i.e., π_y p_y(x) > (1 − d) p(x).
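Theorem 1 as a few lines of code: pick the maximum-posterior class if its posterior exceeds 1 − d, otherwise return doubt. The posterior values below are illustrative.

```python
# Sketch of the 0-1-loss Bayes rule with doubt cost d: answer the class
# with the largest posterior if it exceeds 1 - d, otherwise abstain.

DOUBT = "D"

def bayes_classify(post, d):
    """post: {class: p(y|x)}; returns the risk-minimising decision."""
    y_best = max(post, key=post.get)
    return y_best if post[y_best] > 1 - d else DOUBT

print(bayes_classify({1: 0.70, 2: 0.30}, d=0.4))  # 0.70 > 0.6 -> 1
print(bayes_classify({1: 0.55, 2: 0.45}, d=0.4))  # 0.55 <= 0.6 -> 'D'
```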
![Page 37: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/37.jpg)
Proof: Calculate the total expected loss R(c):

R(c) = E_X[E_Y[L_{0−1}(Y, c(x)) | X = x]]
     = ∫_X E_Y[L_{0−1}(Y, c(x)) | X = x] p(x) dx,  with p(x) = ∑_{z≤k} π_z p_z(x).

Minimize the conditional expectation value, since only it depends on c:

c(x) = argmin_{c∈{1,…,k,D}} E[L_{0−1}(Y, c) | X = x]
     = argmin_{c∈{1,…,k,D}} ∑_{z≤k} L_{0−1}(z, c) p(z|x)
     = argmin_{c∈{1,…,k}} (1 − p(c|x))  if d > min_c (1 − p(c|x)), else D
     = argmax_{c∈{1,…,k}} p(c|x)  if 1 − d < max_c p(c|x), else D
![Page 38: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/38.jpg)
Outliers
• Modeling by an outlier class π_O with density p_O(x).

• "Novelty detection": classify a measurement as an outlier if
  π_O p_O(x) ≥ max{(1 − d) p(x), max_z π_z p_z(x)}.

• The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!

• The outlier class is often modeled by a uniform distribution. Attention: the normalization of a uniform distribution does not exist in many feature spaces!
  ⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
![Page 39: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/39.jpg)
Class Conditional Densities and Posteriors for 2 Classes

Class-conditional probability density functions and posterior probabilities for the priors P(y₁) = 2/3 and P(y₂) = 1/3.

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωᵢ. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 2.2. Posterior probabilities for the particular priors P(ω₁) = 2/3 and P(ω₂) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω₂ is roughly 0.08, and that it is in ω₁ is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 40: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/40.jpg)
Likelihood Ratio for 2 Class Example
FIGURE 2.3. The likelihood ratio p(x|ω₁)/p(x|ω₂) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω₂ as ω₁ patterns more than the converse, we get the larger threshold θb, and hence R₁ becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 41: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/41.jpg)
Discriminant Functions g_l

FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gᵢ(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

• Discriminant function: g_y(x) = P{Y = y | X = x}
• Class decision: g_y(x) > g_z(x) ∀ z ≠ y ⇒ class y.
• Different discriminant functions can yield the same decision: g_y(x) = log P{x|y} + log π_y; minimize implementation problems!
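A sketch of the last point: deciding via g_y(x) = log p(x|y) + log π_y yields the same argmax as the posterior itself, since the evidence p(x) is common to all classes. The 1-D unit-variance Gaussian likelihoods and the class means are assumptions for illustration.

```python
# Sketch: decide by argmax_y of the discriminant log p(x|y) + log pi_y.

import math

def decide(x, priors, likelihood):
    """argmax_y [log p(x|y) + log pi_y]; likelihood(x, y) -> p(x|y)."""
    return max(priors,
               key=lambda y: math.log(likelihood(x, y)) + math.log(priors[y]))

# hypothetical 1-D Gaussian class densities with unit variance
def gauss_lik(x, y, means={1: 0.0, 2: 4.0}):
    return math.exp(-0.5 * (x - means[y]) ** 2) / math.sqrt(2 * math.pi)

print(decide(0.5, {1: 0.5, 2: 0.5}, gauss_lik))  # nearer mean 0 -> 1
print(decide(3.5, {1: 0.5, 2: 0.5}, gauss_lik))  # nearer mean 4 -> 2
```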
![Page 42: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/42.jpg)
Example for Discriminant Functions
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R₂ is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 43: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/43.jpg)
Adaptation of Discriminant Functions g_l

[Figure: the classifier network of Fig. 2.5, extended by a MAX stage whose output is compared against a teacher signal.] The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.
![Page 44: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/44.jpg)
Example Discriminant Functions: Normal Distributions

The likelihood of class y is Gaussian distributed:

p_y(x) = (2π)^{−d/2} |Σ_y|^{−1/2} exp(−½ (x − μ_y)ᵀ Σ_y⁻¹ (x − μ_y))

Special case: Σ_y = σ²I

g_y(x) = log p_y(x) + log π_y = −‖x − μ_y‖²/(2σ²) + log π_y + const.
![Page 45: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/45.jpg)
⇒ Decision surface between class z and y:

−‖x − μ_z‖²/(2σ²) + log π_z = −‖x − μ_y‖²/(2σ²) + log π_y
−‖x‖² + 2x·μ_z − ‖μ_z‖² + 2σ² log π_z = −‖x‖² + 2x·μ_y − ‖μ_y‖² + 2σ² log π_y
⇒ 2x·(μ_z − μ_y) − ‖μ_z‖² + ‖μ_y‖² + 2σ² log(π_z/π_y) = 0

Linear decision rule: wᵀ(x − x₀) = 0 with

w = μ_z − μ_y,  x₀ = ½(μ_z + μ_y) − σ² log(π_z/π_y) · (μ_z − μ_y)/‖μ_z − μ_y‖²
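The derived w and x₀ can be computed directly; with equal priors the log term vanishes and the boundary passes through the midpoint of the means. The numbers below are toy values for illustration.

```python
# Sketch: linear decision rule between two spherical Gaussian classes
# (Sigma = sigma^2 I): decide by the sign of w^T (x - x0), with
# w = mu_z - mu_y and x0 as derived above.

import math

def linear_boundary(mu_z, mu_y, sigma2, pi_z, pi_y):
    w = [a - b for a, b in zip(mu_z, mu_y)]
    nw2 = sum(v * v for v in w)
    shift = sigma2 * math.log(pi_z / pi_y) / nw2
    x0 = [0.5 * (a + b) - shift * v for a, b, v in zip(mu_z, mu_y, w)]
    return w, x0

def side(x, w, x0):
    """> 0: class z side of the hyperplane; < 0: class y side."""
    return sum(wi * (xi - x0i) for wi, xi, x0i in zip(w, x, x0))

# equal priors: the boundary passes through the midpoint of the means
w, x0 = linear_boundary([2.0, 0.0], [0.0, 0.0], sigma2=1.0, pi_z=0.5, pi_y=0.5)
print(x0)                            # midpoint of the means
print(side([1.5, 0.3], w, x0) > 0)   # point nearer mu_z -> True
```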
![Page 46: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/46.jpg)
Decision Surface for Gaussians in 1, 2, 3 Dimensions

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωᵢ) and the boundaries for the case P(ω₁) = P(ω₂). In the three-dimensional case, the grid plane separates R₁ from R₂. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 47: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/47.jpg)
FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 48: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/48.jpg)
Multi Class Case
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Decision regions for four Gaussian distributions. Even for such a small number of classes the discriminant functions show a complex form.
![Page 49: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/49.jpg)
Example: Gene Expression Data
The expression of genes is measured for various patients. The expression profiles provide information about the metabolic state of the cells, meaning that they could be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.

[Figure: gene-expression heat map (genes × samples) with true and predicted labels for the classes ALL B-cell, ALL T-cell and AML.]
![Page 50: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/50.jpg)
Parametric Models for Class Densities
If we knew the prior probabilities and the class conditional probabilities, then we could calculate the optimal classifier. But we don't!

Task: Estimate p(y|x; θ) from samples Z = {(x₁, y₁), …, (x_n, y_n)} for classification.

Data are sorted according to their classes: X_y = {X_{1y}, …, X_{n_y,y}}, where X_{iy} ∼ P{X | Y = y; θ_y}.

Question: How can we use the information in the samples to estimate θ_y?

Assumption: classes can be separated and treated independently! X_y is not informative w.r.t. θ_z, z ≠ y.
![Page 51: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/51.jpg)
Maximum Likelihood Estimation Theory
Likelihood of the data set: P{X_y | θ_y} = ∏_{i≤n_y} p(x_{iy} | θ_y)

Estimation principle: Select the parameters θ_y which maximize the likelihood, that means

θ̂_y = arg max_{θ_y} P{X_y | θ_y}

Procedure: Find the extreme value of the log-likelihood function:

∇_{θ_y} log P{X | θ_y} = 0  ⇔  ∂/∂θ_y ∑_{i≤n} log p(x_i | θ_y) = 0
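A numerical sketch of the principle for a 1-D Gaussian with known σ: a coarse grid search over μ recovers the sample mean as the maximiser of the log-likelihood. The data values are made up.

```python
# Sketch: maximum likelihood for the mean of a 1-D Gaussian (sigma fixed).
# The log-likelihood sum_i log p(x_i | mu) is maximised at the sample mean;
# here we confirm this numerically by a grid search over mu.

import math

def log_lik(xs, mu, sigma=1.0):
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - 0.5 * math.log(2 * math.pi * sigma ** 2) for x in xs)

xs = [1.0, 2.0, 3.0, 6.0]
grid = [i / 100 for i in range(0, 801)]            # mu in [0, 8]
mu_hat = max(grid, key=lambda m: log_lik(xs, m))
print(mu_hat, sum(xs) / len(xs))  # both are the sample mean 3.0
```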
![Page 52: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/52.jpg)
Remark
Bias of an estimator: bias(θ̂_n) = E[θ̂_n] − θ.

Consistent estimator: A point estimator θ̂_n of a parameter θ is consistent if θ̂_n → θ in probability.

Asymptotic normality of maximum likelihood estimates: (θ̂_n − θ)/√V[θ̂_n] ⇝ N(0, 1).

Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution P{θ_y | X_y} (details of the density do not have to be modelled since they might not influence the posterior).
![Page 53: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/53.jpg)
Example: Multivariate Normal Distribution
Expectation values of a normal distribution and its estimation; the class index has been omitted for legibility (θ_y → θ).

log p(x_i | θ) = −½ (x_i − μ)ᵀ Σ⁻¹ (x_i − μ) − (d/2) log 2π − ½ log |Σ|

∂/∂μ ∑_{i≤n} log p(x_i | θ) = ½ ∑_{i≤n} Σ⁻¹(x_i − μ) + ½ ∑_{i≤n} ((x_i − μ)ᵀ Σ⁻¹)ᵀ = 0

⇒ Σ⁻¹ ∑_{i≤n} (x_i − μ) = 0  ⇒  μ̂_n = (1/n) ∑_i x_i  (estimator for μ)

The average-value formula results from the quadratic form.

Unbiasedness: E[μ̂_n] = (1/n) ∑_{i≤n} E[x_i] = E[x] = μ
![Page 54: Machine Learning](https://reader034.fdocuments.us/reader034/viewer/2022052301/544fd411af7959070a8b9b9b/html5/thumbnails/54.jpg)
ML estimation of the variance (1d case)

∂/∂σ² ∑_{i≤n} log p(x_i | θ) = ∂/∂σ² ( −∑_{i≤n} ‖x_i − μ‖²/(2σ²) − (n/2) log(2πσ²) )
 = ½ ∑_{i≤n} σ⁻⁴ ‖x_i − μ‖² − (n/2) σ⁻² = 0

⇒ σ̂²_n = (1/n) ∑_{i≤n} ‖x_i − μ‖²

Multivariate case: Σ̂_n = (1/n) ∑_{i≤n} (x_i − μ)(x_i − μ)ᵀ

Σ̂_n is biased, i.e., E[Σ̂_n] ≠ Σ, if μ is unknown.
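The bias can be checked numerically: averaging the ML variance estimator (which divides by n and plugs in the sample mean) over many simulated samples of size n gives roughly ((n − 1)/n)σ² rather than σ². The simulation parameters below are arbitrary.

```python
# Sketch: the ML variance estimator is biased when mu is replaced by the
# sample mean; the average over many trials is close to (n-1)/n * sigma^2.

import random

def sigma2_ml(xs):
    mu = sum(xs) / len(xs)                       # sample mean plugged in
    return sum((x - mu) ** 2 for x in xs) / len(xs)

random.seed(1)
n, trials, sigma = 5, 20000, 1.0
avg = sum(sigma2_ml([random.gauss(0.0, sigma) for _ in range(n)])
          for _ in range(trials)) / trials
print(avg)  # close to (n-1)/n = 0.8, not 1.0
```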