Memo: Backpropagation in Convolutional Neural Network
Hiroshi Kuwajima, 13-03-2014
1/14
2/14  Note

Purpose
The purpose of this memo is to understand and review the backpropagation algorithm in Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.

Table of Contents
In this memo, the backpropagation algorithm is explained for different neural networks in the following order.
  - Single neuron ..................... 3
  - Multi-layer neural network ........ 5
  - General cases ..................... 7
  - Convolution layer ................. 9
  - Pooling layer ..................... 11
  - Convolutional Neural Network ...... 13

Notation
This memo follows the notation of the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).
3/14  Neural Network as a Composite Function

A neural network can be decomposed into a composite function in which each function element corresponds to a differentiable operation.

Single neuron (the simplest neural network) example
A single neuron is decomposed into a composite function of an affine function element, parameterized by W and b, and an activation function element f, which we choose to be the sigmoid function. The derivatives of both the affine and sigmoid function elements w.r.t. their inputs and parameters are known. Note that the sigmoid function has no parameters, and hence no derivatives w.r.t. parameters. The sigmoid function is applied element-wise; '•' denotes the Hadamard (element-wise) product.
    h_{W,b}(x) = f(Wᵀx + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

    ∂a/∂z = a • (1 − a)   where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(−z))

    ∂z/∂x = W,   ∂z/∂W = x,   ∂z/∂b = I   where z = affine_{W,b}(x) = Wᵀx + b and I is the identity matrix
[Figure: Decomposition of a neuron. Standard network representation: inputs x1, x2, x3 and bias +1 feed the neuron output h_{W,b}(x). Composite function representation: the same inputs pass through an affine element producing z and an activation element (e.g. sigmoid) producing a = h_{W,b}(x).]
Gradient of the cost function w.r.t. the parameters, written with the error signal δ(z):

    ∇_W J(W,b; x, y) = ∂J/∂W = (∂J/∂z)(∂z/∂W) = δ(z) xᵀ

    ∇_b J(W,b; x, y) = ∂J/∂b = (∂J/∂z)(∂z/∂b) = δ(z)
4/14  Chain Rule of Error Signals and Gradients

Error signals are defined as the derivatives of any cost function J, which we choose to be the square error. Error signals are computed (propagated backward) by the chain rule of derivatives and are useful for computing the gradient of the cost function.

Single neuron example
Suppose we have m labeled training examples {(x(1), y(1)), …, (x(m), y(m))}. The square error cost function for each example is given below; the overall cost function is the sum of the cost functions over all examples. Error signals of the square error cost function for each example are propagated using the derivatives of the function elements w.r.t. their inputs. The gradient of the cost function w.r.t. the parameters for each example is computed using the error signals and the derivatives of the function elements w.r.t. their parameters. Summing the gradients over all examples gives the overall gradient.
    J(W,b; x, y) = ½ ‖y − h_{W,b}(x)‖²

    δ(a) = ∂J/∂a = −(y − a)

    δ(z) = ∂J/∂z = (∂J/∂a)(∂a/∂z) = δ(a) • a • (1 − a)
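As a concrete illustration, the single-neuron forward and backward passes can be sketched in plain Python; the function names and numeric values below are illustrative, not part of the memo.

```python
import math

def affine(W, b, x):
    # z = W^T x + b
    return sum(wi * xi for wi, xi in zip(W, x)) + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward_backward(W, b, x, y):
    z = affine(W, b, x)               # affine function element
    a = sigmoid(z)                    # activation function element
    J = 0.5 * (y - a) ** 2            # square error cost
    delta_a = -(y - a)                # delta(a) = dJ/da
    delta_z = delta_a * a * (1 - a)   # delta(z) = delta(a) . a . (1 - a)
    grad_W = [delta_z * xi for xi in x]  # grad_W J = delta(z) x
    grad_b = delta_z                     # grad_b J = delta(z)
    return J, grad_W, grad_b
```

A finite-difference check of `grad_b` against the cost confirms the chain-rule computation.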
5/14  Decomposition of Multi-Layer Neural Network

Composite function representation of a multi-layer neural network:

    h_{W,b}(x) = (sigmoid ∘ affine_{W^(2),b^(2)} ∘ sigmoid ∘ affine_{W^(1),b^(1)})(x)

    a^(1) = x,   a^(lmax) = h_{W,b}(x)

Derivatives of the function elements w.r.t. inputs and parameters:

    ∂a^(l+1)/∂z^(l+1) = a^(l+1) • (1 − a^(l+1))   where a^(l+1) = sigmoid(z^(l+1)) = 1 / (1 + exp(−z^(l+1)))

    ∂z^(l+1)/∂a^(l) = W^(l),   ∂z^(l+1)/∂W^(l) = a^(l),   ∂z^(l+1)/∂b^(l) = I   where z^(l+1) = W^(l) a^(l) + b^(l)
[Figure: Decomposition of a two-layer network. Standard network representation: inputs x1, x2, x3 and bias +1 feed Layer 1, whose outputs a1^(2), a2^(2), a3^(2) (plus bias) feed Layer 2, producing h_{W,b}(x). Composite function representation: Affine 1 → Sigmoid 1 → Affine 2 → Sigmoid 2, with intermediate vectors z^(2), a^(2), z^(3), and a^(3) = h_{W,b}(x), and a^(1) = x.]
6/14  Error Signals and Gradients in Multi-Layer NN

Error signals of the square error cost function for each example:

    δ(a^(l)) = ∂J/∂a^(l) = −(y − a^(l))                                               for l = lmax
    δ(a^(l)) = ∂J/∂a^(l) = (∂J/∂z^(l+1))(∂z^(l+1)/∂a^(l)) = (W^(l))ᵀ δ(z^(l+1))      otherwise

    δ(z^(l)) = ∂J/∂z^(l) = (∂J/∂a^(l))(∂a^(l)/∂z^(l)) = δ(a^(l)) • a^(l) • (1 − a^(l))

Gradient of the cost function w.r.t. parameters for each example:

    ∇_{W^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂W^(l)) = δ(z^(l+1)) (a^(l))ᵀ

    ∇_{b^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂b^(l)) = δ(z^(l+1))
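These formulas can be sketched for a two-layer sigmoid network in plain Python; lists of lists stand in for matrices, and all shapes and values are illustrative.

```python
import math

def sigmoid_vec(z):
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

def matvec(W, x):
    # W a, rows of W indexed by output unit
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matvec_T(W, d):
    # (W^T) d, used for delta(a^(l)) = (W^(l))^T delta(z^(l+1))
    return [sum(W[i][j] * d[i] for i in range(len(W))) for j in range(len(W[0]))]

def backprop_two_layer(W1, b1, W2, b2, x, y):
    # forward: affine 1 -> sigmoid 1 -> affine 2 -> sigmoid 2
    z2 = [zi + bi for zi, bi in zip(matvec(W1, x), b1)]
    a2 = sigmoid_vec(z2)
    z3 = [zi + bi for zi, bi in zip(matvec(W2, a2), b2)]
    a3 = sigmoid_vec(z3)
    # backward: propagate error signals
    delta_a3 = [-(yi - ai) for yi, ai in zip(y, a3)]
    delta_z3 = [da * a * (1 - a) for da, a in zip(delta_a3, a3)]
    delta_a2 = matvec_T(W2, delta_z3)
    delta_z2 = [da * a * (1 - a) for da, a in zip(delta_a2, a2)]
    # gradients: delta(z^(l+1)) (a^(l))^T and delta(z^(l+1))
    grad_W2 = [[d * a for a in a2] for d in delta_z3]
    grad_W1 = [[d * xi for xi in x] for d in delta_z2]
    return grad_W1, delta_z2, grad_W2, delta_z3
```

The returned values are (∇W1, ∇b1, ∇W2, ∇b2) for one example; summing over all examples gives the overall gradient.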
7/14  Backpropagation in General Cases

1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.
2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in the error signals backward).
3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the function elements that have parameters and whose derivatives w.r.t. parameters are known by symbolic computation.
4. Sum the gradients over all examples to get the overall gradient.
    h_θ(x) = (f^(lmax) ∘ … ∘ f_{θ^(l)}^(l) ∘ … ∘ f_{θ^(2)}^(2) ∘ f^(1))(x)
        where f^(1) = x, f^(lmax) = h_θ(x), and ∀l: ∂f^(l+1)/∂f^(l) is known

    δ^(l) = ∂J/∂f^(l) = (∂J/∂f^(l+1))(∂f^(l+1)/∂f^(l)) = δ^(l+1) ∂f^(l+1)/∂f^(l)
        where ∂J/∂f^(lmax) is known

    ∇_{θ^(l)} J(θ; x, y) = ∂J/∂θ^(l) = (∂J/∂f^(l))(∂f_{θ^(l)}^(l)/∂θ^(l)) = δ^(l) ∂f_{θ^(l)}^(l)/∂θ^(l)
        where ∂f_{θ^(l)}^(l)/∂θ^(l) is known

    ∇_{θ^(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ^(l)} J(θ; x^(i), y^(i))
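A minimal sketch of this recipe in Python, assuming each function element exposes a forward pass, a backward (error-signal) pass, and an optional parameter gradient. The class and method names are my own illustrative choices, not from the memo.

```python
import math

class Sigmoid:
    # element-wise sigmoid; has no parameters
    def forward(self, x):
        self.a = [1.0 / (1.0 + math.exp(-xi)) for xi in x]
        return self.a
    def backward(self, delta):
        # delta(z) = delta(a) . a . (1 - a)
        return [d * a * (1 - a) for d, a in zip(delta, self.a)]
    def param_grad(self, delta):
        return None  # no parameters, no parameter gradient

class Scale:
    # z = w * x, a toy parameterized element with known derivatives
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x
        return [self.w * xi for xi in x]
    def backward(self, delta):
        return [self.w * d for d in delta]
    def param_grad(self, delta):
        # dJ/dw = sum_i delta_i * x_i
        return sum(d * xi for d, xi in zip(delta, self.x))

def backprop(layers, x, y):
    # step 1: forward through the decomposition
    out = x
    for layer in layers:
        out = layer.forward(out)
    # step 2: error signal of the square error cost, propagated backward
    delta = [-(yi - oi) for yi, oi in zip(y, out)]
    grads = []
    for layer in reversed(layers):
        grads.append(layer.param_grad(delta))  # step 3: gradient where defined
        delta = layer.backward(delta)
    return out, grads[::-1]
```

Step 4 (summing over examples) would simply accumulate the per-example gradients.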
8/14  Convolutional Neural Network

A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed into function elements f^(conv) and f^(pool). Let a be the output from the previous layer.

    (f^(pool) ∘ f_W^(conv))(a)

[Figure: a → Convolution f^(conv) → Pooling f^(pool)]
9/14  Derivatives of Convolution

Discrete convolution parameterized by a feature W, and its derivatives. Let z = f^(conv) for simplicity.
    z = a ∗ W = [z_n],   z_n = (a ∗ W)[n] = Σ_{i=1}^{|W|} a_{n+i−1} W_i = Wᵀ a_{n:n+|W|−1}

    ∂z_n/∂a_{n:n+|W|−1} = W,   ∂z_n/∂a_{n+i−1} = W_i,   ∂z_{n−i+1}/∂a_n = W_i   for 1 ≤ i ≤ |W|

    ∂z_n/∂W = a_{n:n+|W|−1}

[Figure: Convolution connectivity. z_n has incoming connections from a_{n:n+|W|−1}; conversely, a_n has outgoing connections to z_{n−|W|+1:n}, i.e., all of z_{n−|W|+1:n} have derivatives w.r.t. a_n.]
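The definition of z_n above can be written directly as a short Python function (a "valid" convolution over the positions where the feature fits entirely inside a; the test values are illustrative).

```python
def conv_valid(a, W):
    # z_n = sum_i a_{n+i-1} W_i, 0-indexed; output has len(a) - len(W) + 1 entries
    k = len(W)
    return [sum(a[n + i] * W[i] for i in range(k))
            for n in range(len(a) - k + 1)]
```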
10/14  Backpropagation in Convolution

Error signals and the gradient for each example are computed by convolution, using its commutativity property and the multivariable chain rule of derivatives. Let the error signals at the convolution layer be δ(z) = δ(conv) for consistency.
    δ(a)_n = ∂J/∂a_n = (∂J/∂z)(∂z/∂a_n) = Σ_{i=1}^{|W|} (∂J/∂z_{n−i+1})(∂z_{n−i+1}/∂a_n)
           = Σ_{i=1}^{|W|} δ(z)_{n−i+1} W_i = (δ(z) ∗ W)[n]

    δ(a) = [δ(a)_n] = δ(z) ∗ W   (full convolution)

    ∇_W J(θ; x, y) = ∂J/∂W = (∂J/∂z)(∂z/∂W) = Σ_n (∂J/∂z_n)(∂z_n/∂W)
                   = Σ_n δ(z)_n a_{n:n+|W|−1} = δ(z) ∗ a = a ∗ δ(conv)   (valid convolution)
[Figure: Forward propagation (valid convolution): a ∗ W = z. Backward propagation of error signals (full convolution): δ(a) = W ∗ δ(z). Gradient (valid convolution): a ∗ δ(z) = ∂J/∂W.]
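The backward pass can be sketched in Python: error signals by full convolution of δ(z) with W, and the gradient by valid convolution of a with δ(z). The helper names are illustrative, and the formulas are implemented index-for-index as defined above.

```python
def conv_valid(a, W):
    # z_n = sum_i a_{n+i-1} W_i (valid positions only)
    k = len(W)
    return [sum(a[n + i] * W[i] for i in range(k))
            for n in range(len(a) - k + 1)]

def conv_full(d, W):
    # delta(a)_n = sum_i delta(z)_{n-i+1} W_i: pad d with |W|-1 zeros on
    # both sides, then slide the reversed filter (full convolution)
    k = len(W)
    padded = [0.0] * (k - 1) + list(d) + [0.0] * (k - 1)
    return [sum(padded[n + i] * W[k - 1 - i] for i in range(k))
            for n in range(len(padded) - k + 1)]

def conv_backward(a, W, delta_z):
    delta_a = conv_full(delta_z, W)   # error signals for the previous layer
    grad_W = conv_valid(a, delta_z)   # dJ/dW = a * delta(z)
    return delta_a, grad_W
```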
11/14  Derivatives of Pooling

A pooling layer subsamples statistics to obtain summary statistics with an aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation similar to convolution; however, g is applied to disjoint regions.

Definition: subsample. Let m be the size of the pooling region, and let a = f^(conv), z = f^(pool) for simplicity.
    z = subsample(a, g) = [z_n],   z_n = subsample(a, g)[n] = g(a_{(n−1)m+1:nm})

[Figure: Pooling. Each disjoint region a_{(n−1)m+1:nm} is aggregated by g into z_n.]

Aggregate functions g and their derivatives:

    Mean pooling:  g(x) = (Σ_{k=1}^{m} x_k) / m,   ∂g/∂x = 1/m

    Max pooling:   g(x) = max(x),   ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise

    Lp pooling:    g(x) = ‖x‖_p = (Σ_{k=1}^{m} x_k^p)^{1/p},   ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p−1} x_i^{p−1}
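The subsample operation and the three aggregate functions can be sketched in Python; pooling regions are disjoint windows of size m, and all names and values are illustrative.

```python
def subsample(a, g, m):
    # apply the aggregate function g to disjoint regions of size m
    assert len(a) % m == 0
    return [g(a[n * m:(n + 1) * m]) for n in range(len(a) // m)]

def mean_pool(x):
    return sum(x) / len(x)

max_pool = max  # built-in max is exactly the max-pooling aggregate

def lp_pool(p):
    # returns the Lp-norm aggregate g(x) = (sum_k x_k^p)^(1/p)
    return lambda x: sum(xk ** p for xk in x) ** (1.0 / p)
```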
12/14  Backpropagation in Pooling

Error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates (distributes) the error signals over the aggregate function g using its derivatives g′_n = ∂g/∂a_{(n−1)m+1:nm}. Note that g′_n can change depending on the pooling region n: in max pooling, the unit that was the max at forward propagation receives all of the error at backward propagation, and that unit differs from region to region.

Definition: upsample.
    δ(a)_{(n−1)m+1:nm} = upsample(δ(z), g′)[n] = δ(z)_n g′_n = δ(z)_n ∂g/∂a_{(n−1)m+1:nm}
                       = (∂J/∂z_n)(∂z_n/∂a_{(n−1)m+1:nm}) = ∂J/∂a_{(n−1)m+1:nm}

    δ(a) = upsample(δ(z), g′) = [δ(a)_{(n−1)m+1:nm}]

[Figure: Forward propagation (subsampling): subsample(a, g) = z maps each region a_{(n−1)m+1:nm} to z_n. Backward propagation (upsampling): δ(a) = upsample(δ(z), g′) distributes each δ(z)_n back over its region.]
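Upsample can be sketched in Python for mean and max pooling; as noted above, in max pooling the argmax unit of each region receives the entire error signal. Names and values are illustrative.

```python
def upsample_mean(delta_z, m):
    # g'_n = 1/m for every unit in region n
    return [d / m for d in delta_z for _ in range(m)]

def upsample_max(a, delta_z, m):
    # g'_n is 1 at the argmax of region n and 0 elsewhere, so the whole
    # error signal goes to the unit that was the max at forward time
    delta_a = [0.0] * len(a)
    for n, d in enumerate(delta_z):
        region = a[n * m:(n + 1) * m]
        delta_a[n * m + region.index(max(region))] = d
    return delta_a
```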
13/14  Backpropagation in Convolutional Neural Network

1. Propagate error signals through the pooling layer:

    δ(conv) = upsample(δ(pool), g′)

2. Propagate error signals through the convolution layer (full convolution), plugging in δ(conv):

    δ(a) = δ(conv) ∗ W

3. Compute the gradient ∇_W J (valid convolution):

    ∇_W J(θ; x, y) = a ∗ δ(conv)

[Figure: The upsampled error signals δ(conv) are plugged into the backward convolution δ(a) = δ(conv) ∗ W (full convolution) and into the gradient a ∗ δ(conv) = ∂J/∂W (valid convolution).]
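The three steps can be combined into one backward pass for a convolution-pooling layer with max pooling. This is a self-contained sketch with illustrative sizes and values, not code from the memo.

```python
def conv_valid(a, W):
    # z_n = sum_i a_{n+i-1} W_i (valid positions only)
    k = len(W)
    return [sum(a[n + i] * W[i] for i in range(k))
            for n in range(len(a) - k + 1)]

def conv_full(d, W):
    # full convolution: pad with |W|-1 zeros, slide the reversed filter
    k = len(W)
    padded = [0.0] * (k - 1) + list(d) + [0.0] * (k - 1)
    return [sum(padded[n + i] * W[k - 1 - i] for i in range(k))
            for n in range(len(padded) - k + 1)]

def convpool_backward(a, W, m, delta_pool):
    z = conv_valid(a, W)  # forward convolution output, needed for the argmaxes
    # 1. propagate delta(pool) through max pooling (upsample)
    delta_conv = [0.0] * len(z)
    for n, d in enumerate(delta_pool):
        region = z[n * m:(n + 1) * m]
        delta_conv[n * m + region.index(max(region))] = d
    # 2. propagate delta(conv) through the convolution (full convolution)
    delta_a = conv_full(delta_conv, W)
    # 3. gradient w.r.t. the feature W (valid convolution)
    grad_W = conv_valid(a, delta_conv)
    return delta_a, grad_W
```

A finite-difference check against a cost built on the pooled outputs verifies the combined pass.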
14/14  Remarks

References
- UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
- Chain Rule of Neural Network is Error Back Propagation, http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf

Acknowledgement
This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.