An Introduction to Statistical and Probabilistic Linear Models

Maximilian Mozes

Proseminar Data Mining, Fakultät für Informatik

Technische Universität München

June 07, 2017


Introduction

In statistical learning theory, linear models are used for regression and classification tasks.

- What is regression?

- What is classification?

- How can we model such concepts in a mathematical context?


Linear regression


Linear regression - basics

What is regression?

- Approximation of data using a (closed) mathematical expression

- Achieved by estimating the model parameters that yield the best approximation


Linear regression - example

- A company changes the price of their products for the nth time.

- They know how the price changes affected the consumer behavior n − 1 times before.

- Using linear regression, they can predict the consumer behavior for the nth price change.


Linear regression - basics

- Let D denote a set of n-dimensional data vectors

- Let x be an n-dimensional observation

- How can we approximate x?


Linear regression - basics

Create a linear function

y(x,w) = w0 + w1x1 + · · ·+ wnxn

that approximates x with w.


Linear regression - basics

Example for n = 2.

y(x,w) = w0 + w1x1 + w2x2
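As a minimal sketch, the n = 2 model above can be evaluated directly; the weights and the observation below are made-up values for illustration only:

```python
import numpy as np

# Hypothetical parameters and a single 2-dimensional observation.
w0 = 1.0                   # bias term
w = np.array([0.5, -2.0])  # weights w1, w2
x = np.array([4.0, 1.5])   # observation (x1, x2)

# y(x, w) = w0 + w1*x1 + w2*x2
y = w0 + w @ x
print(y)  # 1.0 + 0.5*4.0 + (-2.0)*1.5 = 0.0
```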


Linear regression - basics

Problem
The weight parameters wi only scale the raw inputs, so y is linear in x.

=⇒ significant limitation!

Idea
Use weighted non-linear functions φj instead!

y(x, w) = w0 + ∑_{j=1}^{n} wj φj(x) = wᵀφ(x),

where φ = (φ0, ..., φn)ᵀ.


Polynomial regression

Example
Let φj(x) = x^j. Then

y(x, w) = w0 + ∑_{j=1}^{n} wj x^j = w0 + w1x + w2x^2 + · · · + wnx^n


Approximation with a 2nd-order polynomial.


Approximation with a 6th-order polynomial.


Approximation with an 8th-order polynomial.


Problem

=⇒ Overfitting with higher polynomial degree
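A minimal sketch of this effect using NumPy's `polyfit` on a small synthetic dataset (the data and the chosen degrees are illustrative assumptions, not from the slides): the training error keeps shrinking as the degree grows, until a degree-9 polynomial interpolates all ten points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying smooth function (illustrative data).
x = np.linspace(0, 1, 10)
z = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def fit_poly(deg):
    """Least-squares polynomial fit; returns the training RSS."""
    coeffs = np.polyfit(x, z, deg)
    pred = np.polyval(coeffs, x)
    return np.sum((z - pred) ** 2)

# Training RSS shrinks monotonically with the degree -- a degree-9
# polynomial passes through all 10 points (RSS near zero), which is
# exactly the overfitting the slide warns about.
rss = {d: fit_poly(d) for d in (1, 3, 9)}
print(rss)
```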


Linear classification


Linear classification - basics

What is classification?

- Aims to partition the data into predefined classes

- A class contains observations with similar characteristics


Linear classification - example

- We have n different cucumbers and courgettes.

- Each record contains the weight and the texture (smooth, rough).

- We want to predict the correct label without knowing the real label.


Linear classification - basics

- Assume a two-class classification problem

- How can we categorise the data into predefined classes?


Cucumbers vs. courgettes


Discriminant functions

- Given a dataset D,

- we aim to categorise x into either class C1 or C2.

Use y(x) such that

x ∈ C1 ⇐⇒ y(x) ≥ 0 and x ∈ C2 otherwise

with
y(x) = wᵀx + w0.
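A sketch of this decision rule applied to the cucumber/courgette example; the weights w and the bias w0 are hand-picked assumptions for illustration, not learned parameters:

```python
import numpy as np

# Hypothetical, hand-picked parameters -- not learned from data.
# Feature vector x = (weight in grams, texture: 0 = smooth, 1 = rough).
w = np.array([0.01, -1.0])
w0 = -3.0

def classify(x):
    """Return C1 if y(x) >= 0, else C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([350.0, 0.0])))  # heavy and smooth -> C1
print(classify(np.array([150.0, 1.0])))  # light and rough  -> C2
```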


Decision boundary H:

H := {x ∈ D : y(x) = 0}


Non-linearity

However, it often occurs that the data are not linearly separable:


Solution
Use non-linear functions instead!


=⇒ y(x) = wᵀφ(x),

where φ(x) = (φ0(x), φ1(x), ..., φM−1(x))ᵀ.


Common classification algorithms

Other commonly-used classification algorithms:

- Naive Bayes classifier

- Logistic regression (Cox, 1958)

- Support vector machines (Vapnik and Lerner, 1963)


Summary

- Linear models: linear combinations of weighted (non-)linear functions

- Regression: approximation of a given set of data using a closed mathematical representation

- Classification: categorisation of data according to individual characteristics and common patterns


Further readings

- J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science Vol. 12, No. 3, 162-176.

- D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. 2012.

- C. M. Bishop, Pattern Recognition and Machine Learning. Springer. 2006.

- T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer. 2009.

- K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press. 2012.

- A. Y. Ng and M. I. Jordan, 2002. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems.


References


- J. Aldrich, 1997. R. A. Fisher and the Making of Maximum Likelihood 1912-1922. Statistical Science Vol. 12, No. 3, 162-176.

- D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. 2012.

- C. M. Bishop, Pattern Recognition and Machine Learning. Springer. 2006.

- D. R. Cox, 1958. The regression analysis of binary sequences. Journal of the Royal Statistical Society Vol. XX, No. 2.


- T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer. 2009.

- K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press. 2012.

- F. Rosenblatt, 1958. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review Vol. 65, No. 6.


Thank you for your attention.

Questions?


Backup slides


Sum of least squares


Parameter estimation

Estimating the weight parameters
How to choose the wi?

=⇒ Find the set of parameters that maximize

p(D |w)


Sum of least squares

- Method to optimize the weights w

- Aims to minimize the residual sum of squares (RSS) by optimizing the weights

- Minimize

RSS(w) = ∑_{i=1}^{N} (zi − y(xi, w))²
       = ∑_{i=1}^{N} (zi − w0 − ∑_{j=1}^{M−1} xij wj)²


- RSS can be simplified by using an N × M matrix X with the xi as rows

- Then

RSS(w) = (z − Xw)ᵀ(z − Xw),

where z is a vector of target values


- Taking the derivatives leads to

∂RSS/∂w = −2 Xᵀ(z − Xw)

∂²RSS/(∂w ∂wᵀ) = 2 XᵀX

- Setting the first derivative to zero and solving for w results in

w = (XᵀX)⁻¹Xᵀz
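The closed-form solution can be sketched in NumPy on synthetic data (the data and true weights are assumptions for illustration); `np.linalg.solve` is used rather than forming the inverse explicitly, which is the numerically preferable way to evaluate the same formula:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: N observations, first column of X is all ones for w0.
N = 50
true_w = np.array([2.0, -1.0, 0.5])
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
z = X @ true_w + 0.01 * rng.standard_normal(N)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T z,
# computed by solving the normal equations (X^T X) w = X^T z.
w_hat = np.linalg.solve(X.T @ X, X.T @ z)
print(w_hat)  # close to true_w
```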


- Note: the approach assumes XᵀX to be positive definite

- Therefore, X is assumed to have full column rank


- The approximation function can be described as

zi = y(xi,w) + ε,

where ε represents the data noise

- RSS provides a measure of the prediction error ED, defined as

ED(w) = (1/2) ∑_{i=1}^{N} (zi − wᵀφ(xi))²,

where φ = (φ0, ..., φM−1)ᵀ


Maximum Likelihood Estimation


Maximum Likelihood Estimation (MLE)

- Introduced by Fisher (1922)

- Commonly-used method to optimize the model parameters


Goal
Find the optimal parameters ŵ such that p(D | w) is maximized, i.e.

ŵ := argmax_w p(D | w).

For target values zi, the p(zi | xi, w) are assumed to be independent, such that

p(D | w) = ∏_{i=1}^{N} p(zi | xi, w)

holds for all zi ∈ D.


The product makes the equation unwieldy. Let's simplify it using the log!

log p(D | w) = ∑_{i=1}^{N} log p(zi | xi, w)

We can now solve ∇ log p(D | w) = 0 for w.

=⇒ The resulting ŵ are the optimal weight parameters!


- Assumption: the data noise ε follows a Gaussian distribution

- Then p(D | w) can be described as

p(z | x, w, β) = N(z | y(x, w), β⁻¹),

where β is the model's inverse variance


- For z we get

log p(z | X, w, β) = ∑_{i=1}^{N} log N(zi | wᵀφ(xi), β⁻¹)


- It follows that

∇ log p(z | X, w, β) = β · ∑_{i=1}^{N} (zi − wᵀφ(xi)) · φ(xi)ᵀ

- Setting the gradient to zero and solving for w leads to

wML = (ΦᵀΦ)⁻¹Φᵀz


- Φ is called the design matrix and is given by

Φ = ( φ0(x1)  φ1(x1)  · · ·  φM−1(x1)
      φ0(x2)  φ1(x2)  · · ·  φM−1(x2)
        ...      ...    ...      ...
      φ0(xN)  φ1(xN)  · · ·  φM−1(xN) )
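A sketch of building Φ for the polynomial basis φj(x) = x^j (an assumed example choice of basis) and computing wML; the pseudo-inverse evaluates (ΦᵀΦ)⁻¹Φᵀ in a numerically robust way:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic targets from a cubic with a little noise (illustrative).
M = 4
x = np.linspace(-1, 1, 30)
z = 1.0 + 2.0 * x - 3.0 * x**3 + 0.01 * rng.standard_normal(x.size)

# Design matrix Phi: row i holds (phi_0(x_i), ..., phi_{M-1}(x_i)),
# here with the polynomial basis phi_j(x) = x**j (so phi_0 = 1).
Phi = np.column_stack([x**j for j in range(M)])

# Maximum-likelihood weights w_ML = (Phi^T Phi)^{-1} Phi^T z.
w_ml = np.linalg.pinv(Phi) @ z
print(w_ml)  # approximately (1, 2, 0, -3)
```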


Regularization and ridge regression


Ridge regression

I Method to prevent overfitting

I Add regularization term compensating thebiased prediction

45

Page 101:

Ridge regression

I Regularization term λ||w||₂²

I The error function E(w) becomes

E(w) = (1/2) ∑_{i=1}^N (zi − w>φ(xi))² + λ||w||₂²

I Minimizing E(w) and solving for w results in

wridge = (λI + Φ>Φ)^{-1} Φ>z
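The ridge solution above differs from the ML solution only by the λI term in the matrix being inverted. A minimal sketch, reusing the same hypothetical polynomial setup as before:

```python
import numpy as np

# Same hypothetical data and polynomial design matrix as in the ML example
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
z = np.array([1.0, 1.8, 3.1, 4.2, 5.1])
Phi = np.vander(x, 3, increasing=True)

lam = 0.1  # regularization strength lambda
I = np.eye(Phi.shape[1])

# Closed-form ridge solution w_ridge = (lambda*I + Phi^T Phi)^(-1) Phi^T z
w_ridge = np.linalg.solve(lam * I + Phi.T @ Phi, Phi.T @ z)

# A much larger lambda shrinks the weights further toward zero
w_strong = np.linalg.solve(100.0 * I + Phi.T @ Phi, Phi.T @ z)
print(np.linalg.norm(w_strong) < np.linalg.norm(w_ridge))  # True
```

Note that λI + Φ>Φ is always invertible for λ > 0, so ridge regression also stabilizes the computation when Φ>Φ is singular or ill-conditioned.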

Page 102:

Non-linear classification

Page 104:

Non-linearity

In our example, an appropriate non-linear feature transformation is

φ : (x1, x2)> ↦ (r · cos(x1), r · sin(x2))>,  r ∈ R.
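The effect of such a transformation can be demonstrated with hypothetical ring-shaped data, using the squared radius r² = x1² + x2² as one simple alternative non-linear feature (an assumption for this sketch, not the slide's φ):

```python
import numpy as np

# Hypothetical two-ring data: the class depends only on the distance from
# the origin, so no straight line in (x1, x2) separates the classes.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([rng.uniform(0, 1, 50),   # class 0: inner ring
                        rng.uniform(2, 3, 50)])  # class 1: outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.array([0] * 50 + [1] * 50)

# In the transformed feature r^2 = x1^2 + x2^2, a single threshold
# (i.e. a linear boundary in feature space) separates the classes.
r2 = (X ** 2).sum(axis=1)
pred = (r2 > 1.5 ** 2).astype(int)
print((pred == labels).mean())  # 1.0
```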

Page 105:

Multi-class classification

Page 108:

Multi-class classification

I The discriminant function for two-class classification can be extended to a k-class problem

I Use the k-class discriminant

yk(x) = wk>x + wk0

I x is assigned to class Cj if yj(x) > yi(x) for all i ≠ j
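The assignment rule above is simply an argmax over the k discriminant values. A minimal sketch with hypothetical weights for k = 3 classes:

```python
import numpy as np

# Hypothetical weights for k = 3 linear discriminants y_k(x) = w_k^T x + w_k0
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])   # row k is w_k
w0 = np.array([0.0, 0.0, 0.5])  # biases w_k0

def classify(x):
    # Assign x to the class whose discriminant value is largest
    scores = W @ x + w0
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.0])))  # 0: y_0 = 2 beats y_1 = 0, y_2 = -1.5
```

Because the winning class is determined by pairwise comparisons yj(x) > yi(x), the resulting decision regions are intersections of half-spaces and therefore always convex.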

Page 109:

Multi-class classification

I The decision boundary between classes Cj and Ci is then

yj(x) = yi(x),

which can be transformed to

(wj − wi)>x + wj0 − wi0 = 0

Page 111:

Perceptron algorithm

Page 113:

Perceptron

Given an input vector x and a fixed non-linear function φ(x), the class of x is estimated by

y(x) = f(w>φ(x)),

where f(t) is called the non-linear activation function:

f(t) = { +1 if t ≥ 0,
         −1 otherwise.
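The slide gives only the activation; the standard perceptron learning rule (w ← w + zi·φ(xi) on a misclassified sample) completes the picture. A minimal sketch on hypothetical toy data, with φ taken as the identity for brevity:

```python
import numpy as np

def f(t):
    # Step activation: +1 if t >= 0, else -1
    return 1 if t >= 0 else -1

def perceptron_train(X, z, epochs=20):
    # Perceptron learning rule: on a misclassified sample, w <- w + z_i * x_i
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, zi in zip(X, z):
            if f(w @ xi) != zi:
                w = w + zi * xi
    return w

# Tiny linearly separable toy data; last component is a constant bias feature
X = np.array([[ 1.0,  2.0, 1.0], [ 1.0, -2.0, 1.0],
              [-1.0,  2.0, 1.0], [-2.0, -1.0, 1.0]])
z = np.array([1, 1, -1, -1])

w = perceptron_train(X, z)
print([f(w @ xi) for xi in X] == list(z))  # True
```

On linearly separable data such as this, the perceptron convergence theorem guarantees the loop reaches a perfect classifier in finitely many updates.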

Page 114:

Probabilistic generative models

Page 116:

Probabilistic generative models

I Compute the probability p(x, z) directly instead of optimizing the weight parameters

I Apply Bayes' theorem to p(z | x)

Page 117:

Bayes' theorem

For disjoint events A1, ..., An and a given event B, the probability p(Ai | B), i ∈ {1, ..., n}, can be computed as follows:

p(Ai | B) = p(B | Ai) · p(Ai) / ∑_{j=1}^n p(B | Aj) · p(Aj)
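A small numeric sketch of this formula, with two classes and entirely hypothetical priors and likelihoods:

```python
# Two disjoint events A1, A2 with priors p(Ai) and likelihoods p(B | Ai);
# all numbers here are made up purely for illustration.
priors = [0.3, 0.7]        # p(A1), p(A2)
likelihoods = [0.9, 0.2]   # p(B | A1), p(B | A2)

# Denominator: total probability p(B) = sum_j p(B | Aj) * p(Aj)
evidence = sum(l * p for l, p in zip(likelihoods, priors))

# Posteriors p(Ai | B) via Bayes' theorem
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(round(posteriors[0], 4))  # 0.6585
```

By construction the posteriors are normalized: they sum to one over the disjoint events.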

Page 119:

Probabilistic generative models

I Consider a two-class classification problem for C1 and C2

I Then

p(C1 | x) = p(x | C1) p(C1) / (p(x | C1) p(C1) + p(x | C2) p(C2))
          = 1 / (1 + e^{−a})
          = σ(a),

where

a = log( p(x | C1) p(C1) / (p(x | C2) p(C2)) )  and  σ(a) = 1 / (1 + e^{−a})
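The step from the Bayes ratio to the sigmoid can be made explicit: dividing numerator and denominator by p(x | C1)p(C1) and substituting the definition of a gives

```latex
\frac{p(\mathbf{x}\mid C_1)\,p(C_1)}{p(\mathbf{x}\mid C_1)\,p(C_1)+p(\mathbf{x}\mid C_2)\,p(C_2)}
  = \frac{1}{1+\dfrac{p(\mathbf{x}\mid C_2)\,p(C_2)}{p(\mathbf{x}\mid C_1)\,p(C_1)}}
  = \frac{1}{1+e^{-a}} = \sigma(a),
\qquad \text{since} \quad
e^{-a} = \frac{p(\mathbf{x}\mid C_2)\,p(C_2)}{p(\mathbf{x}\mid C_1)\,p(C_1)}.
```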

Page 120:

Probabilistic discriminative models

Page 122:

Probabilistic discriminative models

I Predict the correct class by directly computing the posterior probability p(z | x)

I This makes computing p(x | z) via Bayes' theorem redundant

Page 124:

Probabilistic discriminative models

I Computing the posterior probability

p(Ck | x, θ^opt_{Ck|x})

is then achieved using MLE

I Disadvantage: the model yields only little insight into the given data ("black box")

Page 125:

Logistic regression

Page 127:

Logistic regression

I Commonly used classification algorithm for binary classification problems

I Assumption: the data noise ε follows a Bernoulli distribution

Ber(n) = p^n (1 − p)^{1−n},  n ∈ {0, 1},

as this distribution is more appropriate for a binary classification problem

Page 128:

Logistic regression

I Predict the correct class label based on the probability

p(Ck | x, w) = Ber(Ck | σ(w>x)),

where σ is a squashing function (e.g. the sigmoid)
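The prediction step can be sketched directly; the weight vector below is a hypothetical "learned" w chosen only to demonstrate the computation:

```python
import numpy as np

def sigmoid(t):
    # Squashing function sigma(t) = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(x, w):
    # p(C1 | x, w) = Ber(C1 | sigma(w^T x)): sigma(w^T x) is the
    # probability of the positive class
    return sigmoid(w @ x)

# Hypothetical weights, with the bias folded in as the last component
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 1.0, 1.0])  # input with an appended bias feature of 1

p = predict_proba(x, w)
print(p > 0.5)  # True: w^T x = 1.5 > 0, so sigma(w^T x) > 0.5
```

Thresholding the probability at 0.5 is equivalent to checking the sign of w>x, which recovers the linear decision boundary w>x = 0.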