Deep neural networks: the basics
Master’s Computer Vision
Sergey Nikolenko, Alex Davydow
Harbour Space University, Barcelona
May 19, 2020
Random facts:
• On May 19, 1655, the Anglo-Spanish War began with an invasion of Jamaica; soon after taking the island the English were starving and dying of disease; still, they persevered, Jamaica was ceded at the Treaty of Madrid and soon became a hugely profitable sugar colony
• On May 19, 1910, Earth passed through the tail of Halley’s Comet; anti-comet pills and anti-comet umbrellas were widely sold to the public; Mark Twain was born two weeks after the previous perihelion and died the day after the perihelion in 1910
• On May 19, 1980, Apple announced the Apple III, the first failure of Apple; the hardware often failed, there were problems with cooling and board design; Steve Wozniak estimated that Apple “put $100 million in advertising, promotion, and research and development into a product that was 3 percent of our revenues”
• On May 19, 2018, Prince Harry and Meghan Markle married at St George’s Chapel, Windsor; it was estimated that 1.9 billion were watching the ceremony worldwide
The brain and ANN history
Human brain as a computer
• The human brain is a pretty slow computer, but it can do some pretty amazing things.
• How does it do it?
• Lots of neurons: $10^{11}$ neurons, about 7000 synapses each, so in total about $10^{14}$–$10^{15}$ connections (synapses).
Human brain as a computer
• Each neuron:
• gets a signal through the dendrites from other neurons;
• sends signals from time to time through the axon;
• synapses are connections between one neuron’s axon and other neurons’ dendrites attached to it.
Human brain as a computer
• A neuron produces spikes stochastically.
• The firing rate is only 10 to 200 Hz.
• But we can recognize a face in a couple hundred milliseconds.
• This means that sequential computations are very short.
• But the brain is heavily parallelized.
Human brain as a computer
• Here is how we process visual input:
Feature learning
• Another part of the story: feature learning.
• The brain can train very well on a very small data sample.
• And it can adapt to new data sources.
• How does it do it? Can we do the same?
Feature learning
• Systems for processing unstructured data look like
input → features → classifier
• For a long time good features had to be handmade:
• MFCC for speech recognition;
• SIFT for image processing;
• ...
• Feature learning: can we learn good features automatically?
Plasticity
• Third point: neuroplasticity.
• It appears that there is a single learning algorithm (“The Master Algorithm”) that can be applied to very different situations.
The mathematical essence
• Generally speaking, all we do in machine learning is approximation and optimization of various functions.
• For example:
• a function from a picture to what it shows;
• even simpler: a binary function from an image’s pixels to “is it a cat or not”;
• it’s a rather complicated function, and it’s defined as a dataset (values at a few points);
• how do we approximate it?
The mathematical essence
• Common approach in machine learning:
• construct a class of models, usually parametric;
• tune the parameters so that the model fits the data well;
• i.e., we optimize a certain error function or quality metric.
• Neural networks are basically universal approximators, a very wide and flexible parametric class of functions.
• The problem is how to train them well.
ANN history
• Let’s try to mimic nature: get a large highly connected thing of small simple things.
• The idea of modeling neurons with simple functions is very, very old (McCulloch, Pitts, 1943).
• Artificial neural networks (even with several layers) were proposed, among others, by Turing (1948).
• Rosenblatt, 1958: the perceptron (one artificial neuron, we will see it in a minute): a linear separating surface.
• Training by gradient descent also appeared at the same time.
ANN history
• One neuron is modeled as follows:
• A linear combination of the inputs followed by a nonlinear activation function:
$$y = h(w^\top x) = h\Big(\sum_i w_i x_i\Big).$$
• How can you train a perceptron and what will be the result?
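• For illustration, here is the forward pass of a single neuron in numpy (a minimal sketch with made-up weights and inputs):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])   # inputs (made up)
    w = np.array([0.1,  0.4, -0.2])  # weights (made up)
    y = sigmoid(w @ x)               # y = h(w^T x) with h = sigmoid
    print(y)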
ANN history
• Training by gradient descent.
• Original Rosenblatt perceptron ($h = \mathrm{id}$):
$$E_P(w) = -\sum_{x_n \in M} t_n\,(w^\top x_n), \qquad w^{(\tau+1)} = w^{(\tau)} - \eta \nabla_w E_P(w) = w^{(\tau)} + \eta\, t_n x_n,$$
where $M$ is the set of misclassified examples.
• Or, e.g., we can do binary classification by plugging in the logistic sigmoid as the activation function, $h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$:
$$E(w) = -\frac{1}{N}\sum_{i=1}^N \Big(y_i \log \sigma(w^\top x_i) + (1 - y_i)\log\big(1 - \sigma(w^\top x_i)\big)\Big).$$
• Basically logistic regression.
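• As a sketch of this logistic-regression view, here is full-batch gradient descent on the cross-entropy loss in numpy (the toy dataset and learning rate are made up; a constant feature plays the role of the bias):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # toy data: three negative points and one positive point
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [2., 2.]])
    X = np.hstack([X, np.ones((len(X), 1))])   # constant feature acts as the bias
    y = np.array([0., 0., 0., 1.])

    w = np.zeros(3)
    eta = 0.5                              # learning rate
    for _ in range(1000):
        p = sigmoid(X @ w)                 # sigma(w^T x_i) for every example
        grad = X.T @ (p - y) / len(y)      # gradient of the cross-entropy E(w)
        w -= eta * grad                    # gradient descent step

    print(w, sigmoid(X @ w))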
ANN history
• The first perceptron (Rosenblatt, 1958) learned to recognize letters.
ANN history
• Hailed as a huge success:
ANN history
• Minsky’s critique: even with a nonlinearity, one perceptron can only have a linear separating surface.
ANN history
• Well, sure! You need to make a network of perceptrons:
• More about that later (actually, the whole course is about that).
• But first let’s see what people do with a single perceptron.
ANN history
• 1960s: studies of the perceptron.
• (Minsky, Papert, 1969): XOR cannot be modeled by a perceptron.
• For some reason, this was thought to be a big problem.
• (Bryson, Ho, 1969): backpropagation.
• (Werbos, 1974): rediscovered backpropagation; (Rumelhart, Hinton, Williams, 1986) popularized it.
• Second half of the 1970s – multilayer ANNs, basically in the modern form.
• Deep models appeared in the early 1980s! The idea itself is not new.
ANN history
• But by the early 1990s neural networks did not live up to the hype (in part for technical reasons).
• John Denker, 1994: «neural networks are the second best way of doing just about anything».
• So in the 1990s, neural networks were all but forgotten again.
• Except for image processing (Yann LeCun) and a few more groups (Geoffrey Hinton, Yoshua Bengio).
ANN history
• The deep learning revolution begins in 2006: Geoffrey Hinton’s group learned how to train Deep Belief Networks (DBN).
• Main idea: unsupervised pretraining.
• Next component: more powerful computers, computations on GPUs.
• And much larger datasets.
• Then we got new regularization and optimization techniques (we’ll talk about those), and pretraining by now is not really necessary.
• These, together with the architectures, are the main components of the deep learning revolution.
Activation functions
• There are many different nonlinearities; let’s do a brief survey.
• Logistic sigmoid:
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Activation functions
• Hyperbolic tangent:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$
• Very similar in shape, but the output is zero-centered, so zero is not a saturation point (unlike the sigmoid).
Activation functions
• Heaviside step function:
$$\mathrm{step}(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{if } x \ge 0. \end{cases}$$
Activation functions
• Rectified linear unit (ReLU):
$$\mathrm{ReLU}(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{if } x \ge 0. \end{cases}$$
Activation functions
• Mathematical motivation for ReLU: a sum of sigmoids with shifted biases,
$$\sigma\!\left(x + \tfrac{1}{2}\right) + \sigma\!\left(x - \tfrac{1}{2}\right) + \sigma\!\left(x - \tfrac{3}{2}\right) + \sigma\!\left(x - \tfrac{5}{2}\right) + \dots,$$
adds up to a smooth ReLU-like curve (softplus).
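• A quick numerical check of this idea (a sketch; the sum is truncated and written in the Nair–Hinton form $\sum_{k\ge 1} \sigma(x - k + \tfrac12)$, which approximates softplus $\log(1 + e^x)$, a smooth version of ReLU):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.linspace(-5, 5, 11)
    K = 50                                           # number of shifted sigmoids
    approx = sum(sigmoid(x - k + 0.5) for k in range(1, K + 1))
    softplus = np.log1p(np.exp(x))                   # smooth ReLU
    print(np.max(np.abs(approx - softplus)))         # small approximation error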
Activation functions
• Biological motivation for ReLU:
$$f(I) = \begin{cases} \left(\tau \log \dfrac{E + RI - V_{\mathrm{reset}}}{E + RI - V_{\mathrm{th}}}\right)^{-1}, & \text{if } E + RI > V_{\mathrm{th}}, \\[4pt] 0, & \text{if } E + RI \le V_{\mathrm{th}}, \end{cases}$$
i.e., the firing rate of a leaky integrate-and-fire neuron as a function of the input current.
Activation functions
• There are several ReLU variations. Leaky ReLU and Parametric ReLU:
$$\mathrm{LReLU}(x) = \begin{cases} ax, & \text{if } x < 0, \\ x, & \text{if } x \ge 0; \end{cases}$$
in PReLU the slope $a$ is a learned parameter.
Activation functions
• Exponential linear unit:
$$\mathrm{ELU}(x) = \begin{cases} \alpha\,(e^x - 1), & \text{if } x < 0, \\ x, & \text{if } x \ge 0. \end{cases}$$
• The jury is still out, and there are probably more variations to come, but usually ReLU is just fine.
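• For reference, straightforward numpy versions of the activations above (a sketch; the slope a and the ELU alpha are illustrative defaults):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def step(x):
        return (x >= 0).astype(float)

    def relu(x):
        return np.maximum(x, 0.0)

    def leaky_relu(x, a=0.01):
        return np.where(x < 0, a * x, x)

    def elu(x, alpha=1.0):
        return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for f in (sigmoid, np.tanh, step, relu, leaky_relu, elu):
        print(f.__name__, f(z))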
Real neurons
• Do actual neurons do gradient descent?
• Probably not (we’ll see that backprop would be hard to do).
• Although Geoffrey Hinton had an interesting talk about it. But still, no.
• Hebbian learning – increase the weight between neurons that fire together:
$$\Delta w_{ij} = \eta\, x_i x_j.$$
• This led to Hopfield networks that kinda do associative memory.
• Current research: spike-timing-dependent plasticity, i.e., the weight increase depends on the actual temporal properties of the signals.
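• A minimal sketch of the Hebbian rule as an outer-product update (the activities and learning rate are made up):

    import numpy as np

    eta = 0.1                        # learning rate
    x = np.array([1.0, 0.0, 1.0])    # pre-synaptic activities
    y = np.array([1.0, 1.0])         # post-synaptic activities
    W = np.zeros((2, 3))             # weights from x to y
    W += eta * np.outer(y, x)        # strengthen weights between co-active neurons
    print(W)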
Combining perceptrons into a network
• A network of perceptrons; the outputs of one are the inputs of another.
• Hornik, 1990: a two-layer ANN can approximate any continuous function to arbitrary accuracy.
• This is just theory, but in practice deep networks are indeed more expressive – distributed representations:
Combining perceptrons into a network
• Note also how NNs are usually organized in layers.
• This is natural and suggests easy parallelization.
Combining perceptrons into a network
• So we approximate a very complicated function with a large composition of simple functions.
• How do we train it?
• Simple answer: gradient descent. But there are complications here.
Gradient descent and computational graphs
Gradient descent
• Gradient descent: take the gradient w.r.t. the weights and move against it (downhill).
• Formally: for an error function $E$, targets $y$, and a model $f$ with parameters $\theta$,
$$E(\theta) = \sum_{(x,y)\in D} E(f(x,\theta), y),$$
$$\theta_t = \theta_{t-1} - \eta \nabla E(\theta_{t-1}) = \theta_{t-1} - \eta \sum_{(x,y)\in D} \nabla E(f(x,\theta_{t-1}), y).$$
• So we need to sum over the entire dataset for every step?!..
Gradient descent
• Hence, stochastic gradient descent: update after every training sample,
$$\theta_t = \theta_{t-1} - \eta \nabla E(f(x_t, \theta_{t-1}), y_t).$$
• In practice people usually use mini-batches; this is easy to parallelize and smooths out excessive “stochasticity”.
• So far the only parameter is the learning rate $\eta$.
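• A sketch of mini-batch SGD on a made-up least-squares problem (the data, batch size, and learning rate are all illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                     # toy inputs
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + 0.1 * rng.normal(size=1000)       # toy targets

    w = np.zeros(5)
    eta, batch_size = 0.05, 32
    for epoch in range(20):
        idx = rng.permutation(len(X))                  # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]          # one mini-batch
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of squared error
            w -= eta * grad                            # SGD step

    print(w)   # should be close to w_true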
Gradient descent
• Lots of problems with η:
• We will get to them later; for now let’s concentrate on the step we certainly need: computing the derivatives.
Computational graph, fprop and bprop
• Let us represent a function as a composition of simple functions (“simple” means that we can take derivatives).
• Example: $f(x, y) = x^2 + xy + (x + y)^2$:
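• A sketch of the forward pass through this example written as a composition of simple nodes (the intermediate names a, b, s, c are mine):

    # f(x, y) = x^2 + x*y + (x + y)^2, broken into nodes with known derivatives
    x, y = 2.0, 3.0
    a = x * x        # a = x^2
    b = x * y        # b = x*y
    s = x + y        # s = x + y
    c = s * s        # c = (x + y)^2
    f = a + b + c    # f = 4 + 6 + 25 = 35
    print(f)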
Computational graph, fprop and bprop
• This way we can take the gradient with the chain rule:
(f ◦ g)′(x) = (f(g(x)))′ = f′(g(x))g′(x).
• This simply means that an increment δx results in
δf = f′(g(x))δg = f′(g(x))g′(x)δx.
• We only need to be able to take gradients, i.e., derivatives with respect to vectors:
$$\nabla_x f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}, \qquad \nabla_x (f \circ g) = \begin{pmatrix} \frac{\partial (f \circ g)}{\partial x_1} \\ \vdots \\ \frac{\partial (f \circ g)}{\partial x_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial f}{\partial g}\frac{\partial g}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial g}\frac{\partial g}{\partial x_n} \end{pmatrix} = \frac{\partial f}{\partial g}\,\nabla_x g.$$
Computational graph, fprop and bprop
• Or, if $f$ depends on $x$ in several different ways, $f = f(g_1(x), g_2(x), \dots, g_k(x))$, the increment $\delta x$ now comes into play several times:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g_1}\frac{\partial g_1}{\partial x} + \dots + \frac{\partial f}{\partial g_k}\frac{\partial g_k}{\partial x} = \sum_{i=1}^k \frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial x},$$
$$\nabla_x f = \frac{\partial f}{\partial g_1}\nabla_x g_1 + \dots + \frac{\partial f}{\partial g_k}\nabla_x g_k = \sum_{i=1}^k \frac{\partial f}{\partial g_i}\nabla_x g_i.$$
• Note that we get matrix multiplication by the Jacobian matrix:
$$\nabla_x f = \nabla_x g\,\nabla_g f, \qquad \text{where } \nabla_x g = \begin{pmatrix} \frac{\partial g_1}{\partial x_1} & \dots & \frac{\partial g_k}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial g_1}{\partial x_n} & \dots & \frac{\partial g_k}{\partial x_n} \end{pmatrix}.$$
Computational graph, fprop and bprop
• Let’s now go back to the example:
Computational graph, fprop and bprop
• Forward propagation: we compute $\frac{\partial f}{\partial x}$ by the chain rule, moving forward through the graph.
Computational graph, fprop and bprop
• Backpropagation: starting from the end node, go back as
$$\frac{\partial f}{\partial g} = \sum_{g' \in \mathrm{Children}(g)} \frac{\partial f}{\partial g'}\,\frac{\partial g'}{\partial g}.$$
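• A by-hand backward pass over the same toy graph (node names as in the earlier sketch), checked against the analytic gradient:

    # f = a + b + c with a = x^2, b = x*y, s = x + y, c = s^2
    x, y = 2.0, 3.0
    s = x + y
    # backward pass: df/dg = sum over children g' of (df/dg') * (dg'/dg)
    df_da = df_db = df_dc = 1.0                  # f = a + b + c
    df_ds = df_dc * 2 * s                        # c = s^2
    df_dx = df_da * 2 * x + df_db * y + df_ds    # x feeds a, b, and s
    df_dy = df_db * x + df_ds                    # y feeds b and s
    print(df_dx, df_dy)                          # analytic: 4x + 3y = 17, 3x + 2y = 12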
Computational graph, fprop and bprop
• Backprop is much better: we get all the derivatives in a single pass through the graph.
• Aaaand... that’s it! We can now take the gradients of any complicated composition of simple functions.
• Which is all we need to apply gradient descent!
• The libraries – PyTorch, TensorFlow, Theano – are actually automatic differentiation libraries. This is their main function.
• So you can implement lots of “classical” models in TensorFlow and train them by gradient descent.
• And living neurons can’t do that, because you would need two different outputs to compute the value and the derivative.
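• The same gradients via automatic differentiation, assuming PyTorch is available:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = torch.tensor(3.0, requires_grad=True)
    f = x**2 + x*y + (x + y)**2   # the library records the computational graph
    f.backward()                  # backprop through the graph
    print(x.grad, y.grad)         # tensor(17.), tensor(12.)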
Regularization in neural networks
Regularization in neural networks
• NNs have lots of parameters.
• Regularization is necessary.
• L2 or L1 regularization ($\lambda \sum_w w^2$ or $\lambda \sum_w |w|$) is called weight decay.
• Very easy to add, just another term in the objective function.
• Sometimes still useful.
Regularization in neural networks
• Regularization is very easy to add in all libraries. E.g., in Keras:
• W_regularizer adds a regularizer on the weight matrix of a layer;
• b_regularizer — on the bias vector;
• activity_regularizer — on the output vector.
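• For example, in modern tf.keras (where these arguments are called kernel_regularizer, bias_regularizer, and activity_regularizer; the layer sizes and regularization strengths below are made up):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            64, activation="relu", input_shape=(100,),
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),    # weight decay on W
            bias_regularizer=tf.keras.regularizers.l2(1e-4),      # ... on the bias vector
            activity_regularizer=tf.keras.regularizers.l1(1e-5),  # ... on the output vector
        ),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")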
Regularization in neural networks
• Second way: early stopping.
• Let’s just stop when the error on the validation set begins to increase!
• Also implemented out of the box in Keras, via callbacks.
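• For example, with the Keras EarlyStopping callback (a sketch; the model, data, and patience value are placeholders):

    import tensorflow as tf

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",          # watch the validation error
        patience=5,                  # stop after 5 epochs without improvement
        restore_best_weights=True,   # roll back to the best epoch
    )
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=100, callbacks=[early_stop])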
Regularization in neural networks
• Third way – the max-norm constraint.
• Let’s artificially bound the norm of the weight vector of every neuron:
$$\|w\|_2 \le c.$$
• We can do it during optimization as well: when $w$ leaves the ball of radius $c$, project it back.
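• In Keras this is a weight constraint (a sketch; the bound c = 3 is illustrative):

    import tensorflow as tf

    layer = tf.keras.layers.Dense(
        64, activation="relu",
        kernel_constraint=tf.keras.constraints.MaxNorm(max_value=3.0),  # ||w||_2 <= c
    )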
Dropout
• But there are NN-specific ways too.
• Dropout: let’s just remove some units at random with probability p! (Srivastava et al., 2014)
Dropout
• It turns out that we are sampling a huge number of networks, and each neuron gets an “average” activation from a lot of different architectures.
• How do we use a network trained like that? Should we sample a lot of networks and average them out?
Dropout
• To apply the trained network, rescale the activations so that the expected output is preserved: if units are kept with probability p, multiply the outputs by p at test time, or, equivalently, divide them by p during training (inverted dropout, which is what most libraries do).
• And you can usually take p = 1/2, there’s seldom any big difference.
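• A sketch of the “inverted dropout” convention used by most libraries (p_keep is the probability of keeping a unit; the rescaling by 1/p_keep happens at training time, so nothing changes at test time):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(a, p_keep=0.5, train=True):
        if not train:
            return a                      # test time: use the full network as is
        mask = rng.random(a.shape) < p_keep
        return a * mask / p_keep          # keep with prob p_keep, rescale by 1/p_keep

    a = np.ones(10)
    print(dropout(a))                     # roughly half zeros, the rest are 2.0
    print(dropout(a, train=False))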
Dropout
• Dropout improves everything (large NNs, convolutional NNs) a lot... but why? WTF?
Dropout
• Idea 1: we are making the units learn features by themselves, without relying on the others.
Dropout
• Similarly, this leads to sparsity:
Dropout
• Idea 2: we are kind of averaging a huge number of networks with shared weights, training each for one step.
• It’s like bootstrapping or boosting taken to the extreme, and we know xgboost rules Kaggle.
Dropout
• Idea 3: this is just like sex!
• How does sexual reproduction work?
• It’s important to get not just a good combination of genes but a stable good combination.
Dropout
• Idea 4: a neuron sends out activation a with probability 0.5.
• Let’s do it the other way around: send out 0.5 (or 1) with probability a (a/2).
• Same expectation, larger variance for small p (which is not so bad).
• And thus we get stochastic neurons that send out signals at random intervals – just like the brain!
• Also brings improvements but needs less communication between the neurons (one bit instead of a float).
• Maybe stochastic neurons work as a dropout regularizer! And maybe that’s why we are able to actually think of something.
Dropout
• Idea 5: dropout is a special form of prior.
• The most rigorous and useful idea; it helped fix dropout in recurrent networks.
• But this requires a Bayesian view of neural networks, so probably not now.
Conclusion
• Thus, dropout is a very cool method.
• But it is now losing popularity (has almost lost it altogether) due to batch normalization and new versions of gradient descent.
• We’ll talk about them later, but first let’s tackle weight initialization.
Thank you!
Thank you for your attention!