Bayesian regularization of learning


Page 1: Bayesian regularization of learning

Bayesian regularization of learning

Sergey Shumsky NeurOK Software LLC

Page 2: Bayesian regularization of learning

Scientific methods

Induction (F. Bacon): Data → Models (machine learning)

Deduction (R. Descartes): Models → Data (mathematical modeling)

Page 3: Bayesian regularization of learning

Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 4: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 5: Bayesian regularization of learning

Problem statement

Learning is an inverse, ill-posed problem: from observed Data back to the Model.

Learning paradoxes: infinite predictions from finite data? How to optimize future predictions? How to separate the regular from the random in the data?

Regularization of learning: optimal model complexity.

Page 6: Bayesian regularization of learning

Well-posed problem

A problem h: H(h) = D is well posed if its solution is unique and stable (Hadamard, 1900s; Tikhonov, 1960s):

Uniqueness: H(h1) = H(h2) = D  ⟹  h1 = h2

Stability: lim_{i→∞} H(h_i) = H(h)  ⟹  lim_{i→∞} h_i = h

Page 7: Bayesian regularization of learning

Learning from examples

Problem: find the hypothesis h that generates the observed data D in model H, i.e. solve H(h) = D for h.

The problem is well defined if the solution is not sensitive to noise in the data (Hadamard) or to the learning procedure (Tikhonov).

Page 8: Bayesian regularization of learning

Learning is ill-posed problem

Example: Function approximation

Sensitive to noise in data

Sensitive to learning procedure

Page 9: Bayesian regularization of learning

Learning is ill-posed problem

Solution is non-unique

(Figure: several different curves h(x) pass exactly through the same data points of f(x).)

Page 10: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 11: Bayesian regularization of learning

Problem regularization. Main idea: restrict the solutions, sacrificing precision for stability.

Instead of solving H(h) = D exactly, add a penalty (stabilizer) P(h):

h = argmin_h [ ||H(h) − D|| + α P(h) ]

How to choose the stabilizer P(h) and its weight α? (See the sketch below.)
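As a concrete illustration of this trade-off (my own sketch, not from the slides; the data, the quadratic stabilizer ||h||² and the value of α are assumptions), a badly conditioned least-squares fit is stabilized by the penalty term:

```python
import numpy as np

# A deliberately ill-conditioned least-squares problem (made-up data):
# the two columns of X are nearly identical, so the unregularized
# solution of X h ~ D is extremely sensitive to noise in D.
X = np.column_stack([np.linspace(0, 1, 20),
                     np.linspace(0, 1, 20) + 1e-4])
h_true = np.array([1.0, -1.0])

def solve(alpha, seed):
    rng = np.random.default_rng(seed)
    D = X @ h_true + 0.01 * rng.standard_normal(len(X))
    # h = argmin_h ||X h - D||^2 + alpha ||h||^2
    return np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ D)

for alpha in (0.0, 1e-3):
    sols = np.array([solve(alpha, seed) for seed in range(5)])
    print(f"alpha = {alpha}: spread of solutions over noise realizations = {sols.std(axis=0)}")
```

With α = 0 the recovered coefficients swing wildly from one noise realization to the next; a small penalty makes them stable at the cost of a slight bias.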

Page 12: Bayesian regularization of learning

Statistical Learning practice

Data = learning set + validation set

Cross-validation: repeat over several learning/validation splits and average

Systematic approach to ensembles: Bayes

Page 13: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 14: Bayesian regularization of learning

Statistical Learning theory

Learning as inverse probability.

Probability theory (direct problem), H: h → D. Example, N coin tosses with Nh heads (Bernoulli, 1713):

P(D|h,H) = h^Nh (1 − h)^(N − Nh)

Learning theory (inverse problem), H: D → h (Bayes, ~1750):

P(h|D,H) = P(D|h,H) P(h|H) / P(D|H)

Page 15: Bayesian regularization of learning

Bayesian learning

Bayes rule relates hypothesis h and data D within a model H:

P(h|D,H) = P(D|h,H) P(h|H) / P(D|H)

Posterior: P(h|D,H)
Likelihood: P(D|h,H)
Prior: P(h|H)
Evidence: P(D|H)

Page 16: Bayesian regularization of learning

Coin tossing game

Uniform prior over the head probability: P(h) = 1 on [0, 1]

Posterior after N tosses with Nh heads:

P(h|D_N) = P(D_N|h) P(h) / P(D_N) ∝ h^Nh (1 − h)^(N − Nh)

Most probable hypothesis: h_MP = Nh / N

Ensemble (predictive) probability of heads on the next toss: P(heads|D_N) = (Nh + 1) / (N + 2)
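A minimal numerical sketch of this slide (my own illustration; the toss counts are made up): the posterior is computed on a grid, its maximum reproduces Nh/N, and the ensemble prediction reproduces the Laplace rule (Nh + 1)/(N + 2).

```python
import numpy as np

N, N_heads = 10, 7                 # toy data: 7 heads in 10 tosses (made up)
h = np.linspace(0, 1, 1001)
dh = h[1] - h[0]

prior = np.ones_like(h)                               # P(h) = 1 on [0, 1]
likelihood = h**N_heads * (1 - h)**(N - N_heads)      # P(D|h)
posterior = likelihood * prior
posterior /= posterior.sum() * dh                     # divide by the evidence P(D)

h_MP = h[np.argmax(posterior)]                        # most probable hypothesis
p_next = (h * posterior).sum() * dh                   # ensemble prediction of 'heads'

print(h_MP, N_heads / N)                   # ~ 0.7, 0.7
print(p_next, (N_heads + 1) / (N + 2))     # ~ 0.667, 0.667
```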

Page 17: Bayesian regularization of learning

Monte Carlo simulations (figure): the posterior P(h|D_N) is broad for N = 10 and concentrates sharply around the true head probability for N = 100.

Page 18: Bayesian regularization of learning

Bayesian regularization

Most probable hypothesis:

h_MP = argmax_h log P(h|D,H) = argmin_h [ −log P(D|h,H) − log P(h|H) ]

The first term is the learning error, the second is the regularization.

Example, function approximation with Gaussian noise of precision η:

P(D|h,H) ∝ exp( −(η/2) Σ_n (y_n − h(x_n))² ),  so the learning error is (η/2) Σ_n (y_n − h(x_n))².

Page 19: Bayesian regularization of learning

Minimal Description Length

Most probable hypothesis = shortest description. With code length L(x) = −log₂ P(x) (Rissanen, 1978):

h_MP = argmax_h log P(h|D,H) = argmin_h [ L(D|h,H) + L(h|H) ]

i.e. the code length of the data given the hypothesis plus the code length of the hypothesis itself.

Example, optimal prefix code: P(x) = 1/2, 1/4, 1/8, 1/8 → codewords 0, 10, 110, 111 with lengths −log₂ P(x) = 1, 2, 3, 3 bits.
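To make the code-length correspondence concrete, here is a small sketch (my own) reproducing the slide's prefix-code example; the symbol names are arbitrary:

```python
import math

# The slide's example: optimal prefix code for a 4-symbol source.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}

for x, p in probs.items():
    ideal = -math.log2(p)                  # L(x) = -log2 P(x)
    print(x, p, code[x], len(code[x]), ideal)

# Expected code length equals the source entropy for this dyadic distribution.
avg_len = sum(p * len(code[x]) for x, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)   # both 1.75 bits
```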

Page 20: Bayesian regularization of learning

Data Complexity

Complexity: K(D|H) = min_h L(h, D|H)

Code length: L(h, D) = coded data L(D|h) + decoding program L(h)

Decoding: H: h → D (Kolmogorov, 1965)

Page 21: Bayesian regularization of learning

Complex = Unpredictable

Prediction error ~ L(h, D) / L(D). Random data is incompressible; compression = predictability.

A program h of length L(h, D) decodes the data: H: h → D

Example: block coding

Solomonoff (1978)

Page 22: Bayesian regularization of learning

Universal prior: all 2^L programs of length L are equiprobable (Solomonoff, 1960), so together with Bayes (~1750):

P(h, D|H) = 2^(−L(h,D|H))

P(D|H) = Σ_h 2^(−L(h,D|H))

Data complexity: K(D|H) = L(D|H) = −log₂ P(D|H)

Page 23: Bayesian regularization of learning

Statistical ensemble: shorter description length.

Proof:

L(D|H) = −log₂ Σ_h 2^(−L(h,D|H)) ≤ −log₂ 2^(−L(h_MP,D|H)) = L(h_MP, D|H)

Corollary: ensemble predictions are superior to the most probable prediction.
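A quick numerical check of the inequality (my own sketch with made-up code lengths): the ensemble description length never exceeds that of the best single hypothesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical code lengths L(h, D | H) for a handful of hypotheses (made up).
L = rng.uniform(5.0, 15.0, size=8)

L_ensemble = -np.log2(np.sum(2.0 ** (-L)))   # description length of the ensemble
L_MP = L.min()                               # best single hypothesis

print(L_ensemble, L_MP)
assert L_ensemble <= L_MP                    # shorter description via the ensemble
```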

Page 24: Bayesian regularization of learning

Ensemble prediction

(Figure: a distribution P(h|H) over hypotheses h1, h2, ...; the ensemble prediction averages over all of them instead of committing to the single most probable hypothesis h_MP.)

Page 25: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 26: Bayesian regularization of learning

Model comparison

To compare models H_i, the evidence plays the role that the likelihood played for hypotheses:

Posterior over hypotheses: P(h|D,H_i) ∝ P(D|h,H_i) P(h|H_i)

Evidence for the model: P(D|H_i) = Σ_h P(D|h,H_i) P(h|H_i)

Page 27: Bayesian regularization of learning

Statistics: Bayes vs. Fisher

Fisher: maximum likelihood. Bayes: maximum posterior / maximum evidence.

For hypotheses:

h_ML = argmax_h log P(D|h,H)
h_MP = argmax_h [ log P(D|h,H) + log P(h|H) ]

For models:

H_ML = argmax_H log P(D|H)
H_MP = argmax_H [ log P(D|H) + log P(H) ]

Page 28: Bayesian regularization of learning

Historical outlook

1920s–1960s: parametric statistics, asymptotic regime N → ∞ (Fisher, 1912).

1960s–1980s: non-parametric statistics (Chentsov, 1962); regularization of ill-posed problems (Tikhonov, 1963); non-asymptotic learning (Vapnik, 1968); algorithmic complexity (Kolmogorov, 1965); statistical physics of disordered systems (Gardner, 1988).

Page 29: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 30: Bayesian regularization of learning

Statistical physics

Probability of a hypothesis corresponds to a microstate; the optimal model corresponds to a macrostate.

P(h|D,H) = (1/Z) e^(−L(h,D|H))

Z(D|H) = Σ_h e^(−L(h,D|H))

H_ML = argmin_H L(D|H),  L(D|H) = −log Z(D|H)

Page 31: Bayesian regularization of learning

Free energy

F = −log Z: a log of a sum,

L(D|H) = −log Σ_h e^(−L(h,D|H))

F = E − TS: a sum of logs, averaged over the posterior P(h|D,H):

L(D|H) = Σ_h P(h|D,H) L(h,D|H) + Σ_h P(h|D,H) log P(h|D,H)
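The "log of a sum" and "sum of logs" forms can be checked numerically; the sketch below (my own, with made-up code lengths) verifies that −log Σ_h e^(−L(h)) equals ⟨L⟩ − S when the average and the entropy are taken under P(h) ∝ e^(−L(h)).

```python
import numpy as np

L = np.array([2.0, 2.5, 4.0, 7.0])        # hypothetical L(h, D | H) values

Z = np.sum(np.exp(-L))                     # partition function
P = np.exp(-L) / Z                         # posterior P(h | D, H)

F_log_of_sum = -np.log(Z)                  # F = -log Z
energy = np.sum(P * L)                     # E = <L>
entropy = -np.sum(P * np.log(P))           # S
F_sum_of_logs = energy - entropy           # F = E - S  (T = 1)

print(F_log_of_sum, F_sum_of_logs)         # identical up to rounding
```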

Page 32: Bayesian regularization of learning

EM algorithm. Main idea

Introduce an independent (trial) distribution P̃(h) and minimize the free energy

F(P̃, H) = Σ_h P̃(h) L(h,D|H) + Σ_h P̃(h) log P̃(h) ≥ L(D|H),

with equality when P̃(h) = P(h|D,H).

Iterations:

E-step: P̃^(t+1) = argmin_P̃ F(P̃, H^t)

M-step: H^(t+1) = argmin_H F(P̃^(t+1), H)

Page 33: Bayesian regularization of learning

EM algorithm

E-step: estimate the posterior for the given model,

P̃^t(h) = P(D|h, H^t) P(h|H^t) / P(D|H^t)

M-step: update the model for the given posterior,

H^(t+1) = argmin_H Σ_h P̃^t(h) L(h, D|H)

Page 34: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 35: Bayesian regularization of learning

Bayesian regularization: Examples

Hypothesis testing

Function approximation

Data clustering

(Figures: noisy observations y for hypothesis testing, a fitted curve h(x) for function approximation, and a density P(x|H) for data clustering.)

Page 36: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 37: Bayesian regularization of learning

Hypothesis testing

Problem: noisy observations y. Is the theoretical value h0 true?

Model H: Gaussian noise around h and a Gaussian prior centered at h0,

P(y|h) ∝ exp( −(y − h)² / 2σ² )   (Gaussian noise)

P(h) ∝ exp( −(h − h0)² / 2σ0² )   (Gaussian prior)

Page 38: Bayesian regularization of learning

Optimal model: phase transition

Maximizing the evidence over the prior width gives two regimes for the description length L(D): depending on how far the sample mean ⟨y⟩ deviates from h0, the optimal confidence in h0 is either infinite (the prior collapses onto h0) or finite.

(Figure: the description length and the optimal prior width as functions of the observed deviation, showing the transition between finite and infinite confidence.)

Page 39: Bayesian regularization of learning

Threshold effect

Student-like coefficient: t_N = |⟨y⟩ − h0| √N / σ, the deviation of the sample mean from h0 measured in units of its standard error.

If t_N ≤ 1: the optimal prior collapses, P(h) = δ(h − h0), and the hypothesis h0 is accepted.

If t_N > 1: corrections to h0 survive, and the most probable value is shrunk toward h0, h_MP ≈ ⟨y⟩ (1 − 1/t_N²) (written here for h0 = 0).
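A small sketch of this threshold behaviour, assuming the reconstruction above (the shrinkage formula and the acceptance threshold t_N ≤ 1 are my reading of the garbled slide, not a verified quote); the data are made up:

```python
import numpy as np

def shrunken_estimate(y, sigma, h0=0.0):
    """Threshold effect, assuming the reconstruction above: if the
    Student-like coefficient t_N = |mean(y) - h0| * sqrt(N) / sigma
    does not exceed 1, keep h0; otherwise shrink the sample mean toward h0."""
    y = np.asarray(y)
    N = len(y)
    y_bar = y.mean()
    t = abs(y_bar - h0) * np.sqrt(N) / sigma
    if t <= 1.0:
        return h0                                  # hypothesis h0 accepted
    return h0 + (y_bar - h0) * (1.0 - 1.0 / t**2)  # correction to h0

rng = np.random.default_rng(2)
print(shrunken_estimate(0.05 + rng.normal(0, 1, 10), sigma=1.0))   # small deviation -> h0
print(shrunken_estimate(2.00 + rng.normal(0, 1, 100), sigma=1.0))  # large deviation -> shrunken mean
```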

Page 40: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 41: Bayesian regularization of learning

Function approximation

Problem: noisy data y(x); find an approximation h(x, w).

Model: exponential-family noise and a prior on the weights w, with hyperparameters β and α controlling their widths:

P(D|w, β) = Z_D⁻¹(β) exp( −β E_D(w) )   (noise)

P(w|α) = Z_W⁻¹(α) exp( −α E_W(w) )   (prior)

Page 42: Bayesian regularization of learning

Optimal model

Free energy minimization

E_D(w) = Σ_n ( y_n − h(x_n, w) )²   (data error),   E_W(w) = weight penalty

Z(α, β) = ∫ dw exp( −β E_D(w) − α E_W(w) )

Z_W(α) = ∫ dw exp( −α E_W(w) ),   Z_D(β) = ∫ dD exp( −β E_D )

L(D|α, β) = −ln Z(α, β) + ln Z_W(α) + ln Z_D(β)

The optimal hyperparameters α, β minimize this free energy.

Page 43: Bayesian regularization of learning

Saddle point approximation

Evaluate the integral around the best hypothesis w_MP:

ln Z(α, β) = ln ∫ dw exp( −β E_D(w) − α E_W(w) ) ≈ −β E_D(w_MP) − α E_W(w_MP) + (1/2) ln det( 2π A⁻¹ )

where A is the Hessian of β E_D + α E_W at w_MP. The free energy thus becomes a function of the best hypothesis w_MP plus a curvature correction.

Page 44: Bayesian regularization of learning

EM learning

E-step. Optimal hypothesis:  w_MP = argmin_w [ β_ML E_D(w) + α_ML E_W(w) ]

M-step. Optimal regularization:  (α, β)_ML = argmin_{α,β} L(D | α, β, w_MP)
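The slides do not specify the approximation model, so the sketch below is one common realization of this alternation (a MacKay-style evidence approximation for a linear-in-parameters model with quadratic E_D and E_W; the polynomial basis, the data and the update formulas are standard assumptions, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: noisy sine, polynomial basis (illustrative assumptions).
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
Phi = np.vander(x, 8, increasing=True)          # design matrix
N = len(y)

alpha, beta = 1.0, 1.0                          # initial regularization
for _ in range(50):
    # E-step analogue: most probable weights for the current alpha, beta
    A = beta * Phi.T @ Phi + alpha * np.eye(Phi.shape[1])
    w_MP = beta * np.linalg.solve(A, Phi.T @ y)
    # M-step analogue: re-estimate alpha, beta from the evidence
    # (MacKay-style update for quadratic E_D, E_W)
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam / (lam + alpha))         # effective number of parameters
    alpha = gamma / (w_MP @ w_MP)
    beta = (N - gamma) / np.sum((Phi @ w_MP - y) ** 2)

print(alpha, beta, gamma)
```

The alternation settles on a noise precision β matching the data scatter and a weight decay α that keeps only as many effective parameters as the data support.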

Page 45: Bayesian regularization of learning

Laplace prior:  E_W(w) = Σ_i |w_i|

The gradient of the regularized error is ∂E_D/∂w_i + α sgn(w_i), so at the optimum:

Pruned weights:  w_i = 0 wherever |∂E_D/∂w_i| < α

Equisensitive weights:  |∂E_D/∂w_i| = α for every surviving weight w_i ≠ 0
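This pruning condition is the familiar behaviour of L1 (lasso-style) penalties. A minimal sketch (my own illustration, with a quadratic E_D and made-up data) shows that coordinates whose data-error gradient stays below α are driven exactly to zero, while the surviving ones settle where |∂E_D/∂w_i| ≈ α:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sparse ground truth: only 3 of 20 weights are nonzero (made-up data).
N, d = 100, 20
X = rng.standard_normal((N, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(N)

alpha = 10.0                       # weight of the Laplace prior, E_W = sum_i |w_i|
step = 1.0 / np.linalg.norm(X, 2) ** 2

w = np.zeros(d)
for _ in range(2000):              # proximal gradient (ISTA) on E_D + alpha * E_W
    grad = X.T @ (X @ w - y)       # dE_D / dw for E_D = 0.5 * ||X w - y||^2
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * alpha, 0.0)   # soft threshold

grad = X.T @ (X @ w - y)
print("pruned weights:", np.sum(w == 0))
print("|dE_D/dw_i| on surviving weights:", np.abs(grad[w != 0]))  # each close to alpha
```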

Page 46: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 47: Bayesian regularization of learning

Clustering

Problem Noisy data: x

Find prototypes (mixture density approximation)

How many clusters?

Model: a mixture of M components, one per prototype h_m, with noise around each prototype:

P(x|H) = Σ_{m=1..M} P(m) P(x|m),   P(x|m) ∝ exp( −β E(x, h_m) )

Page 48: Bayesian regularization of learning

Optimal model

Free energy minimization:

L(D|H) = −Σ_n ln Σ_m P(m) P(x_n|m)

F(P̃, H) = Σ_{n,m} P̃(m|n) [ ln P̃(m|n) − ln P(m) P(x_n|m) ] ≥ L(D|H)

Iterations:

E-step: P̃^(t+1) = argmin_P̃ F(P̃, H^t)

M-step: H^(t+1) = argmin_H F(P̃^(t+1), H)

Page 49: Bayesian regularization of learning

EM algorithm

E-step: responsibilities of the clusters for each point,

P(m|x_n) = P(m) exp( −β E(x_n, h_m) ) / Σ_{m'} P(m') exp( −β E(x_n, h_{m'}) )

M-step: update the prototypes and the cluster weights,

h_m = Σ_n P(m|x_n) x_n / Σ_n P(m|x_n)

P(m) = (1/N) Σ_n P(m|x_n)
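A compact sketch of these two steps (my own illustration): the slides leave E(x, h_m) unspecified, so the squared-distance energy, the value of β and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: two clusters in the plane (made up for the illustration).
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
N, M, beta = len(X), 2, 5.0                     # beta ~ inverse noise variance

h = X[rng.choice(N, M, replace=False)]          # prototypes h_m
Pm = np.full(M, 1.0 / M)                        # cluster weights P(m)

for _ in range(50):
    # E-step: responsibilities P(m | x_n) ~ P(m) exp(-beta * E(x_n, h_m))
    E = ((X[:, None, :] - h[None, :, :]) ** 2).sum(-1)   # squared distances, (N, M)
    R = Pm * np.exp(-beta * E)
    R /= R.sum(axis=1, keepdims=True)
    # M-step: prototypes and weights from the responsibilities
    h = (R.T @ X) / R.sum(axis=0)[:, None]      # h_m = sum_n P(m|x_n) x_n / sum_n P(m|x_n)
    Pm = R.mean(axis=0)                         # P(m) = (1/N) sum_n P(m|x_n)

print(h)    # recovered prototypes near (0, 0) and (2, 2)
print(Pm)   # ~ [0.5, 0.5]
```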

Page 50: Bayesian regularization of learning

How many clusters?

The number of clusters M(β) grows as the noise scale 1/β shrinks. The optimal number of clusters minimizes the description length, with the prototypes h_m integrated out:

L(D|M) = −Σ_m ln ∫ dh_m P(D|h_m) P(h_m)

Page 51: Bayesian regularization of learning

Simulations: Uniform data

Optimal model (figure): the description length L(D|M) as a function of the number of clusters M; its minimum marks the optimal model.

Page 52: Bayesian regularization of learning

Simulations: Gaussian data

Optimal model (figure): L(D|M) versus the number of clusters M for Gaussian data; the minimum marks the optimal model.

Page 53: Bayesian regularization of learning

Simulations: Gaussian mixture

Optimal model (figure): L(D|M) versus the number of clusters M for a Gaussian mixture; the minimum marks the optimal number of clusters.

Page 54: Bayesian regularization of learning

Outline Learning as ill-posed problem

General problem: data generalization General remedy: model regularization

Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm

Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

Page 55: Bayesian regularization of learning

Summary

Learning is an ill-posed problem; the remedy is regularization.

Bayesian learning has built-in regularization (the model assumptions). Optimal model = minimal description length = minimal free energy.

Practical issue: learning algorithms with built-in optimal regularization, derived from first principles (as opposed to cross-validation).