PAC-Bayesian Learning and Neural Networks
The Binary Activated Case

Pascal Germain (1,2)

joint work with Gaël Letarte (3), Benjamin Guedj (1,2,4), François Laviolette (3)

(1) Inria Lille - Nord Europe (MODAL project-team)   (2) Laboratoire Paul Painlevé (Probability and Statistics team)
(3) Université Laval (GRAAL team)   (4) University College London (CSML group)

51es Journées de Statistique de la SFdS (Nancy)

June 6, 2019


Plan

1 PAC-Bayesian Learning

2 Standard Neural Networks

3 Binary Activated Neural Networks
   One Layer (linear predictor)
   Two Layers (shallow)
   More Layers (deep)

4 Empirical results



Setting

Assumption: learning samples are generated i.i.d. from a data distribution D.

S = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } ∼ D^n

Objective

Given a loss function ℓ : Y × Y → R, find a predictor f ∈ F that minimizes the generalization loss on D:

L_D(f) := E_{(x,y)∼D} ℓ(f(x), y)

Challenge

The learning algorithm only has access to the empirical loss on S:

L_S(f) := (1/n) ∑_{i=1}^{n} ℓ(f(x_i), y_i)


PAC-Bayesian Theory

Pioneered by Shawe-Taylor and Williamson (1997), McAllester (1999), and Catoni (2003),

the PAC-Bayesian theory gives PAC generalization guarantees to "Bayesian-like" algorithms.

PAC guarantees (Probably Approximately Correct)

With probability at least 1−δ, the loss of predictor f is less than ε:

Pr_{S∼D^n} ( L_D(f) ≤ ε(L_S(f), n, δ, ...) ) ≥ 1−δ

Bayesian flavor

Given:

A prior distribution P on F.

A posterior distribution Q on F.

Pr_{S∼D^n} ( E_{f∼Q} L_D(f) ≤ ε( E_{f∼Q} L_S(f), n, δ, P, ... ) ) ≥ 1−δ


A Classical PAC-Bayesian Theorem

PAC-Bayesian theorem (adapted from McAllester 1999; McAllester 2003)

For any distribution P on F and for any δ ∈ (0, 1], we have

Pr_{S∼D^n} ( ∀ Q on F :  E_{f∼Q} L_D(f)  ≤  E_{f∼Q} L_S(f)  +  √( [ KL(Q‖P) + ln(2√n / δ) ] / (2n) ) )  ≥  1−δ ,

where the first term on the right-hand side is the empirical loss, the square root is the complexity term, and KL(Q‖P) = E_{f∼Q} ln( Q(f)/P(f) ) is the Kullback-Leibler divergence.

Training bound

Gives generalization guarantees that do not rely on a test sample.

Valid for all posteriors Q on F.
An inspiration for designing new learning algorithms.
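
To make the bound concrete, here is a minimal Python sketch (not part of the slides) that evaluates its right-hand side for a given empirical loss, KL divergence, sample size n, and confidence δ:

```python
import math

def mcallester_bound(emp_loss, kl, n, delta):
    """Empirical loss + complexity term of the PAC-Bayesian theorem above."""
    complexity = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return emp_loss + complexity

# Example: empirical loss 0.05, KL(Q||P) = 10, n = 10000 samples, delta = 0.05.
print(mcallester_bound(emp_loss=0.05, kl=10.0, n=10_000, delta=0.05))
```

Everything on the right-hand side is computable from the training sample alone, which is what allows the bound to be turned into a training objective later in the talk.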



Standard Neural Networks (Multilayer perceptrons, or MLP)

Classification setting:
x ∈ R^d
y ∈ {−1, 1}

Architecture:

L fully connected layers

d_k denotes the number of neurons in the k-th layer

σ : R→ R is the activation function

Parameters:
W_k ∈ R^{d_k × d_{k−1}} denotes the weight matrix of layer k
θ = vec({W_k}_{k=1}^{L}) ∈ R^D

[Figure: fully connected network with inputs x_1, ..., x_d and activation σ at every neuron]

Prediction

f_θ(x) = σ( w_L · σ( W_{L−1} σ( ... σ(W_1 x) ) ) ).
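
For concreteness, here is a minimal forward-pass sketch (not from the slides); the activation is a parameter, so the same code serves for a smooth activation like tanh and for the sgn activation used in the binary activated networks below:

```python
import numpy as np

def sgn(a):
    """Sign activation as defined later in the talk: +1 if a > 0, -1 otherwise."""
    return np.where(a > 0, 1.0, -1.0)

def forward(x, weights, activation=np.tanh):
    """f_theta(x) = act(w_L . act(W_{L-1} ... act(W_1 x))) for weights [W_1, ..., W_L]."""
    h = x
    for W in weights:
        h = activation(W @ h)
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]
x = rng.normal(size=3)
print(forward(x, weights))        # tanh activations
print(forward(x, weights, sgn))   # binary activations
```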


First PAC-Bayesian bounds for Stochastic Neural Networks

“(Not) Bounding the True Error” (Langford and Caruana 2001)

Shallow networks (L = 2)

Sigmoid activation functions  [plot of the sigmoid curve]

“Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks” (Dziugaite and Roy 2017)

Deep networks (L > 2)

ReLU activation functions  [plot of the ReLU curve]

Idea: bound the expected loss of the network under a Gaussian perturbation of the weights.

Empirical loss: E_{θ′∼N(θ,Σ)} L_S(f_{θ′})  →  estimated by sampling

Complexity term: KL( N(θ,Σ) ‖ N(θ_p, Σ_p) )  →  closed form
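
The closed-form complexity term is the usual KL divergence between two Gaussians. A minimal sketch for the diagonal-covariance case (the diagonal restriction is an assumption made here for brevity; the slides do not fix the covariance structure):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), coordinate-wise closed form."""
    return 0.5 * np.sum(
        var_q / var_p + (mu_q - mu_p) ** 2 / var_p - 1.0 + np.log(var_p / var_q)
    )

# Posterior centered at the learned weights theta, prior at the initialization theta_p.
theta, theta_p = np.array([0.3, -1.2, 0.7]), np.zeros(3)
print(kl_diag_gaussians(theta, np.full(3, 0.1), theta_p, np.ones(3)))
```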


Binary Activated Neural Networks

Classification setting:
x ∈ R^d
y ∈ {−1, 1}

Architecture:

L fully connected layers

d_k denotes the number of neurons in the k-th layer

sgn(a) = 1 if a > 0 and sgn(a) = −1 otherwise

Parameters:
W_k ∈ R^{d_k × d_{k−1}} denotes the weight matrix of layer k
θ = vec({W_k}_{k=1}^{L}) ∈ R^D

[Figure: fully connected network with inputs x_1, ..., x_d and sgn activation at every neuron]

Prediction

f_θ(x) = sgn( w_L · sgn( W_{L−1} sgn( ... sgn(W_1 x) ) ) ).




One Layer

“PAC-Bayesian Learning of Linear Classifiers” (Germain et al. 2009)

f_w(x) := sgn(w · x), with w ∈ R^d.

PAC-Bayes analysis:
Space of all linear classifiers F_d := { f_v | v ∈ R^d }
Gaussian posterior Q_w := N(w, I_d) over F_d
Gaussian prior P_{w_0} := N(w_0, I_d) over F_d

Predictor

F_w(x) := E_{v∼Q_w} f_v(x) = erf( (w · x) / (√2 ‖x‖) )

[Figure: comparison of the erf(x), tanh(x), and sgn(x) curves]

Bound minimization, under the linear loss ℓ(y, y′) := (1/2)(1 − y y′):

C n L_S(F_w) + KL(Q_w ‖ P_{w_0})  =  C ∑_{i=1}^{n} (1/2) erf( −y_i (w · x_i) / (√2 ‖x_i‖) ) + (1/2) ‖w − w_0‖²   (up to an additive constant independent of w).
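
As an illustration, a minimal sketch (not the authors' implementation) of this bound-minimization objective with scipy; the toy data, the prior mean w0 = 0, and the trade-off constant C are assumptions made for the example:

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import minimize

def objective(w, X, y, w0, C):
    """C * sum of linear losses of the erf predictor + (1/2) ||w - w0||^2."""
    margins = y * (X @ w) / (np.sqrt(2) * np.linalg.norm(X, axis=1))
    return C * 0.5 * np.sum(erf(-margins)) + 0.5 * np.sum((w - w0) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=100) > 0, 1.0, -1.0)  # toy labels
w0 = np.zeros(5)
res = minimize(objective, x0=w0.copy(), args=(X, y, w0, 1.0))
print(res.x, objective(res.x, X, y, w0, 1.0))
```

In this sketch C is fixed to 1 purely for illustration; in practice it is chosen via the bound or on validation data.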



Two Layers

Posterior Q_θ = N(θ, I_D) over the family of all networks F_D = { f_θ | θ ∈ R^D }, where

f_θ(x) = sgn( w_2 · sgn(W_1 x) ).

F_θ(x) = E_{θ′∼Q_θ} f_{θ′}(x)
       = ∫_{R^{d_1×d_0}} Q_1(V_1) ∫_{R^{d_1}} Q_2(v_2) sgn( v_2 · sgn(V_1 x) ) dv_2 dV_1
       = ∫_{R^{d_1×d_0}} Q_1(V_1) erf( (w_2 · sgn(V_1 x)) / (√2 ‖sgn(V_1 x)‖) ) dV_1
       = ∑_{s∈{−1,1}^{d_1}} erf( (w_2 · s) / √(2 d_1) ) ∫_{R^{d_1×d_0}} 1[s = sgn(V_1 x)] Q_1(V_1) dV_1
       = ∑_{s∈{−1,1}^{d_1}} erf( (w_2 · s) / √(2 d_1) ) ∏_{i=1}^{d_1} [ 1/2 + (s_i/2) erf( (w^i_1 · x) / (√2 ‖x‖) ) ],

where the erf factor is F_{w_2}(s) and the product is Pr(s | x, W_1).

[Figure: two-layer binary activated network on inputs x_1, ..., x_d]
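
A minimal sketch (not from the slides) of this exact two-layer computation, enumerating all 2^{d_1} hidden sign configurations; this is only feasible for a small hidden layer, which motivates the Monte Carlo approximation below:

```python
import itertools
import numpy as np
from scipy.special import erf

def F_theta(x, W1, w2):
    """Exact F_theta(x) for a two-layer binary activated network: sum over sign vectors s."""
    d1 = W1.shape[0]
    pre = erf(W1 @ x / (np.sqrt(2) * np.linalg.norm(x)))      # erf(w1_i . x / (sqrt(2) ||x||))
    total = 0.0
    for s in itertools.product([-1.0, 1.0], repeat=d1):
        s = np.array(s)
        prob_s = np.prod(0.5 + 0.5 * s * pre)                 # Pr(s | x, W1)
        total += erf(w2 @ s / np.sqrt(2 * d1)) * prob_s       # F_{w2}(s) * Pr(s | x, W1)
    return total

rng = np.random.default_rng(0)
W1, w2, x = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=3)
print(F_theta(x, W1, w2))
```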


PAC-Bayesian Ingredients

Empirical loss

L_S(F_θ) = E_{θ′∼Q_θ} L_S(f_{θ′}) = (1/n) ∑_{i=1}^{n} [ 1/2 − (1/2) y_i F_θ(x_i) ].

Complexity term

KL(Q_θ ‖ P_{θ_0}) = (1/2) ‖θ − θ_0‖².


Derivatives

Chain rule.

∂L_S(F_θ)/∂θ = (1/n) ∑_{i=1}^{n} ∂ℓ(F_θ(x_i), y_i)/∂θ = (1/n) ∑_{i=1}^{n} (∂F_θ(x_i)/∂θ) ℓ′(F_θ(x_i), y_i),

Hidden layer partial derivatives.

∂_{w^k_1} F_θ(x) = ( x / (2^{3/2} ‖x‖) ) erf′( (w^k_1 · x) / (√2 ‖x‖) ) ∑_{s∈{−1,1}^{d_1}} s_k erf( (w_2 · s) / √(2 d_1) ) [ Pr(s | x, W_1) / Pr(s_k | x, w^k_1) ],

with erf′(x) := (2/√π) e^{−x²}.


Stochastic Approximation

F_θ(x) = ∑_{s∈{−1,1}^{d_1}} F_{w_2}(s) Pr(s | x, W_1)

Monte Carlo sampling

We generate T random binary vectors {s^t}_{t=1}^{T} according to Pr(s | x, W_1).

Prediction.

F_θ(x) ≈ (1/T) ∑_{t=1}^{T} F_{w_2}(s^t).

Derivatives.

∂_{w^k_1} F_θ(x) ≈ ( x / (2^{3/2} ‖x‖) ) erf′( (w^k_1 · x) / (√2 ‖x‖) ) (1/T) ∑_{t=1}^{T} ( s^t_k / Pr(s^t_k | x, w^k_1) ) F_{w_2}(s^t).
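
A minimal sketch (not the authors' implementation) of the Monte Carlo prediction estimate; the number of samples T and the toy parameters are illustrative:

```python
import numpy as np
from scipy.special import erf

def F_theta_mc(x, W1, w2, T=1000, rng=None):
    """Monte Carlo estimate of F_theta(x): draw T sign vectors s^t from Pr(s | x, W1),
    then average F_{w2}(s^t) = erf(w2 . s^t / sqrt(2 d1))."""
    if rng is None:
        rng = np.random.default_rng()
    d1 = W1.shape[0]
    p_plus = 0.5 + 0.5 * erf(W1 @ x / (np.sqrt(2) * np.linalg.norm(x)))  # Pr(s_i = +1 | x, W1)
    s = np.where(rng.random((T, d1)) < p_plus, 1.0, -1.0)                # T independent draws of s
    return np.mean(erf(s @ w2 / np.sqrt(2 * d1)))

rng = np.random.default_rng(0)
W1, w2, x = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=3)
print(F_theta_mc(x, W1, w2, T=10_000, rng=rng))  # approaches the exact sum as T grows
```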



More Layers

[Figure: deep binary activated network on inputs x_1, x_2 with output y]

F^{(j)}_1(x) = erf( (w^j_1 · x) / (√2 ‖x‖) ),    F^{(j)}_{k+1}(x) = ∑_{s∈{−1,1}^{d_k}} erf( (w^j_{k+1} · s) / √(2 d_k) ) ∏_{i=1}^{d_k} ( 1/2 + (1/2) s_i F^{(i)}_k(x) )
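
A minimal recursive sketch (not from the slides) of this layer-by-layer formula; it follows the recursion exactly but is exponential in the layer widths, so it is only usable at toy sizes:

```python
import itertools
import numpy as np
from scipy.special import erf

def F_deep(x, weights):
    """Apply the recursion F^{(j)}_{k+1} layer by layer for weights = [W1, ..., WL];
    returns the vector of output-layer values F^{(j)}_L(x)."""
    F = erf(weights[0] @ x / (np.sqrt(2) * np.linalg.norm(x)))   # F^{(j)}_1(x)
    for Wk in weights[1:]:
        dk = len(F)
        F_next = np.zeros(Wk.shape[0])
        for s in itertools.product([-1.0, 1.0], repeat=dk):      # sum over {-1, 1}^{d_k}
            s = np.array(s)
            prob_s = np.prod(0.5 + 0.5 * s * F)                  # product of (1/2 + s_i F^{(i)}_k / 2)
            F_next += erf(Wk @ s / np.sqrt(2 * dk)) * prob_s
        F = F_next
    return F

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
print(F_deep(np.array([0.5, -1.0]), weights))   # final layer has a single output neuron
```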


Empirical results

Model name            | Cost function                | Train split | Valid split | Model selection    | Prior
MLP-tanh              | linear loss, L2 regularized  | 80%         | 20%         | valid linear loss  | -
PBGNetℓ               | linear loss, L2 regularized  | 80%         | 20%         | valid linear loss  | random init
PBGNet                | PAC-Bayes bound              | 100%        | -           | PAC-Bayes bound    | random init
PBGNet_pre (pretrain) | linear loss (20 epochs)      | 50%         | -           | -                  | random init
PBGNet_pre (final)    | PAC-Bayes bound              | 50%         | -           | PAC-Bayes bound    | pretrain

Dataset  | MLP-tanh E_S  E_T | PBGNetℓ E_S  E_T | PBGNet E_S  E_T  Bound | PBGNet_pre E_S  E_T  Bound
ads      | 0.021  0.037      | 0.018  0.032     | 0.024  0.038  0.283    | 0.034  0.033  0.058
adult    | 0.128  0.149      | 0.136  0.148     | 0.158  0.154  0.227    | 0.153  0.151  0.165
mnist17  | 0.003  0.004      | 0.008  0.005     | 0.007  0.009  0.067    | 0.003  0.005  0.009
mnist49  | 0.002  0.013      | 0.003  0.018     | 0.034  0.039  0.153    | 0.018  0.021  0.030
mnist56  | 0.002  0.009      | 0.002  0.009     | 0.022  0.026  0.103    | 0.008  0.008  0.017
mnistLH  | 0.004  0.017      | 0.005  0.019     | 0.071  0.073  0.186    | 0.026  0.026  0.033


Perspectives

Transfer learning

Multiclass

CNN


Thank you!

https://arxiv.org/abs/1905.10259


References

Shawe-Taylor, John and Robert C. Williamson (1997). "A PAC Analysis of a Bayesian Estimator". In: COLT.

McAllester, David (1999). "Some PAC-Bayesian Theorems". In: Machine Learning 37.3.

Catoni, Olivier (2003). "A PAC-Bayesian approach to adaptive classification". In: preprint 840.

McAllester, David (2003). "PAC-Bayesian Stochastic Model Selection". In: Machine Learning 51.1.

Langford, John and Rich Caruana (2001). "(Not) Bounding the True Error". In: NIPS. MIT Press, pp. 809-816.

Dziugaite, Gintare Karolina and Daniel M. Roy (2017). "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data". In: UAI. AUAI Press.

Germain, Pascal et al. (2009). "PAC-Bayesian learning of linear classifiers". In: ICML. ACM, pp. 353-360.
