Lecture 25
-
Upload
shaayaan-sayed -
Category
Documents
-
view
212 -
download
0
description
Transcript of Lecture 25
12/1/2015
1
Deep learning and convolutional neural nets
Tues Dec 1Kristen Grauman
UT Austin
Last time
• Support vector machines (SVM)
– Basic algorithm
– Kernels
– Multi-class
– HOG + SVM for person detection
• Visualizing a feature: Hoggles
• Evaluating an object detector
– Precision/recall, IoU, Average precision
12/1/2015
2
Today
• Pyramid match kernels
• (Deep) Neural networks
• Convolutional neural networks
Recalll: Examples of kernel functions
Linear:
Gaussian RBF:
Histogram intersection:
)2
exp()(2
2
ji
ji
xx,xxK
k
jiji kxkxxxK ))(),(min(),(
j
T
iji xxxxK ),(
• Kernels go beyond vector space data
• Kernels also exist for “structured” input spaces like
sets, graphs, trees…
12/1/2015
3
Discriminative classification with
sets of features?• Each instance is unordered set of vectors
• Varying number of vectors per instance
Slide credit: Kristen Grauman
Partially matching sets of features
We introduce an approximate matching kernel that
makes it practical to compare large sets of features
based on their partial correspondences.
Optimal match: O(m3)
Greedy match: O(m2 log m)Pyramid match: O(m)
(m=num pts)
[Previous work : Indyk & Thaper, Bartal, Charikar, Agarwal &
Varadarajan, …]
Slide credit: Kristen Grauman
12/1/2015
4
Pyramid match: main idea
descriptor space
Feature space partitions
serve to “match” the local
descriptors within
successively wider regions.
Slide credit: Kristen Grauman
Pyramid match: main idea
Histogram intersection
counts number of possible
matches at a given
partitioning.Slide credit: Kristen Grauman
12/1/2015
5
Pyramid match
• For similarity, weights inversely proportional to bin size
(or may be learned)
• Normalize these kernel values to avoid favoring large sets
[Grauman & Darrell, ICCV 2005]
measures
difficulty of a
match at level
number of newly matched
pairs at level
Slide credit: Kristen Grauman
Pyramid match
optimal partial
matching
Optimal match: O(m3)
Pyramid match: O(mL)
The Py ramid Match Kernel: Ef f icient
Learning with Sets of Features. K.
Grauman and T. Darrell. Journal of
Machine Learning Research (JMLR), 8
(Apr): 725--760, 2007.
12/1/2015
6
BoW Issue:
No spatial layout preserved!
Too much? Too little?
Slide credit: Kristen Grauman
[Lazebnik, Schmid & Ponce, CVPR 2006]
• Make a pyramid of bag-of-words histograms.
• Provides some loose (global) spatial layout information
Spatial pyramid match
12/1/2015
7
[Lazebnik, Schmid & Ponce, CVPR 2006]
• Make a pyramid of bag-of-words histograms.
• Provides some loose (global) spatial layout information
Spatial pyramid match
Sum over PMKs
computed in image
coordinate space,
one per word.
• Can capture scene categories well---texture-like patterns
but with some variability in the positions of all the local
pieces.
Spatial pyramid match
12/1/2015
8
• Can capture scene categories well---texture-like patterns
but with some variability in the positions of all the local
pieces.
• Sensitive to global shifts of the view
Confusion table
Spatial pyramid match
Today
• Pyramid match kernels
• (Deep) Neural networks
• Convolutional neural networks
12/1/2015
9
Traditional Image Categorization: Training phase
Training Labels
Training
Images
Classifier Training
Training
Image Features
Trained Classifier
Jia-Bin Huang and Derek Hoiem, UIUC
Training Labels
Training
Images
Classifier Training
Training
Image Features
Trained Classifier
Image Features
Testing
Test Image
Outdoor
PredictionTrained Classifier
Traditional Image Categorization: Testing phase
12/1/2015
10
Features have been key
SIFT [Low e IJCV 04] HOG [Dalal and Triggs CVPR 05]
SPM [Lazebnik et al. CVPR 06] Textons
SURF, MSER, LBP, Color-SIFT, Color histogram, GLOH, …..
and many others:
• Each layer of hierarchy extracts features from output of previous layer
• All the way from pixels classifier
• Layers have the (nearly) same structure
• Train all layers jointly
Learning a Hierarchy of Feature Extractors
Layer 1 Layer 2 Layer 3 Simple Classifier
Image/VideoPixels
Image/video Labels
Slide: Rob Fergus
12/1/2015
11
Learning Feature Hierarchy
Goal: Learn useful higher-level features from imagesFeature representation
Input data
1st layer “Edges”
2nd layer “Object parts”
3rd layer “Objects”
Pixels
Lee et al., ICML 2009;
CACM 2011
Slide: Rob Fergus
Learning Feature Hierarchy
• Better performance
• Other domains (unclear how to hand engineer):– Kinect
– Video
– Multi spectral
• Feature computation time– Dozens of features now regularly used [e.g., MKL]
– Getting prohibitive for large datasets (10’s sec /image)
Slide: R. Fergus
12/1/2015
12
Biological neuron and Perceptrons
A biological neuron An artificial neuron (Perceptron) - a linear classifier
Jia-Bin Huang and Derek Hoiem, UIUC
Simple, Complex and Hypercomplex cells
David H. Hubel and Torsten Wiesel
David Hubel's Eye, Brain, and Vision
Suggested a hierarchy of feature detectors
in the visual cortex, with higher level features
responding to patterns of activation in lower
level cells, and propagating activation
upwards to still higher level cells.
Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
13
Hubel/Wiesel Architecture and Multi-layer Neural Network
Hubel and Weisel’s architecture Multi-layer Neural Network- A non-linear classifier
Jia-Bin Huang and Derek Hoiem, UIUC
Neuron: Linear Perceptron
Inputs are feature values
Each feature has a weight
Sum is the activation
If the activation is:
Positive, output +1
Negative, output -1
Slide credit: Pieter Abeel and Dan Klein
12/1/2015
14
Multi-layer Neural Network
• A non-linear classifier
• Training: find network weights w to minimize the error between true training labels 𝑦𝑖 and estimated labels 𝑓𝒘 𝒙𝒊
• Minimization can be done by gradient descent provided 𝑓 is differentiable
• This training method is called back-propagation
Jia-Bin Huang and Derek Hoiem, UIUC
Two-layer perceptron network
Slide credit: Pieter Abeel and Dan Klein
12/1/2015
15
Two-layer perceptron network
Slide credit: Pieter Abeel and Dan Klein
Two-layer perceptron network
Slide credit: Pieter Abeel and Dan Klein
12/1/2015
16
Learning w
Training examples
Objective: a misclassification loss
Procedure:
Gradient descent / hill climbing
Slide credit: Pieter Abeel and Dan Klein
Two-layer perceptron network
Slide credit: Pieter Abeel and Dan Klein
12/1/2015
17
Two-layer perceptron network
Slide credit: Pieter Abeel and Dan Klein
Two-layer neural network
Slide credit: Pieter Abeel and Dan Klein
12/1/2015
18
Neural network properties
Theorem (Universal function approximators): A
two-layer network with a sufficient number of neurons can approximate any continuous function to any desired accuracy
Practical considerations:
Can be seen as learning the features
Large number of neurons
Danger for overfitting
Hill-climbing procedure can get stuck in bad local
optima
Slide credit: Pieter Abeel and Dan KleinApproximation by Superpositions of Sigmoidal Function,1989
Today
• Pyramid match kernels
• (Deep) Neural networks
• Convolutional neural networks
12/1/2015
19
Convolutional Neural Networks (CNN, ConvNet, DCN)
• CNN = a multi-layer neural network with
– Local connectivity:• Neurons in a layer are only connected to a small region
of the layer before it
– Share weight parameters across spatial positions:• Learning shift-invariant filter kernels
Image credit: A. KarpathyJia-Bin Huang and Derek Hoiem, UIUC
Neocognitron [Fukushima, Biological Cybernetics 1980]
Deformation-Resistant Recognition
S-cells: (simple)
- extract local features
C-cells: (complex)
- allow for positional errors
Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
20
LeNet [LeCun et al. 1998]
Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998] LeNet-1 from 1993
Jia-Bin Huang and Derek Hoiem, UIUC
What is a Convolution?
• Weighted moving sum
Input Feature Activation Map
.
.
.
slide credit: S. Lazebnik
12/1/2015
21
Input Image
Convolution (Learned)
Non-linearity
Spatial pooling
Normalization
Convolutional Neural Networks
Feature maps
slide credit: S. Lazebnik
Input Image
Convolution (Learned)
Non-linearity
Spatial pooling
Normalization
Feature maps
Input Feature Map
.
.
.
Convolutional Neural Networks
slide credit: S. Lazebnik
12/1/2015
22
Input Image
Convolution (Learned)
Non-linearity
Spatial pooling
Normalization
Feature maps
Convolutional Neural Networks
Rectified Linear Unit (ReLU)
slide credit: S. Lazebnik
Input Image
Convolution (Learned)
Non-linearity
Spatial pooling
Normalization
Feature maps
Max pooling
Convolutional Neural Networks
slide credit: S. Lazebnik
Max-pooling: a non-linear down-sampling
Provide translation invariance
12/1/2015
23
Input Image
Convolution (Learned)
Non-linearity
Spatial pooling
Normalization
Feature maps
Convolutional Neural Networks
slide credit: S. Lazebnik
Engineered vs. learned features
Image
Feature extraction
Pool ing
Classifier
Label
Image
Convolution/pool
Convolution/pool
Convolution/pool
Convolution/pool
Convolution/pool
Dense
Dense
Dense
LabelConvolutional filters are trained in a
supervised manner by back-propagating
classification error
Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
24
SIFT Descriptor
Image Pixels Apply
oriented filters
Spatial pool
(Sum)
Normalize to unit length
Feature Vector
Lowe [IJCV 2004]
slide credit: R. Fergus
Spatial Pyramid Matching
SIFTFeatures
Filter with Visual Words
Multi-scalespatial pool
(Sum)
Max
Classifier
Lazebnik, Schmid,
Ponce [CVPR 2006]
slide credit: R. Fergus
12/1/2015
25
AlexNet
• Similar framework to LeCun’98 but:• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)• More data (106 vs. 103 images)• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
A. Krizhevsky, I . Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
Jia-Bin Huang and Derek Hoiem, UIUC
Applications
• Handwritten text/digits
– MNIST (0.17% error [Ciresan et al. 2011])
– Arabic & Chinese [Ciresan et al. 2012]
• Simpler recognition benchmarks
– CIFAR-10 (9.3% error [Wan et al. 2013])
– Traffic sign recognition• 0.56% error vs 1.16% for humans [Ciresan et al. 2011]
Slide: R. Fergus
12/1/2015
26
Application: ImageNet
[Deng et al. CVPR 2009]
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon Turk
https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus
AlexNet [Krizhevsky et al., NIPS 2012]
• 7 hidden layers, 650,000 neurons, 60,000,000 parameters
• Trained on 2 GPUs for a week
• Similar model as LeCun’98 but:
- Bigger model (8 layers)
- More data (106 vs 103 images)
- GPU implementation (50x speedup over CPU)
https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus
12/1/2015
27
ImageNet Classification 2012
• Krizhevsky et al. -- 16.4% error (top-5)
• Next best (non-convnet) – 26.2% error
35
30
25
20
15
10
5
0SuperVision ISI Oxford INRIA Amsterdam
Top
-5e
rro
rra
te%
https://sites.google.com/site/deeplearningcvpr2014 Slide: R. Fergus
ImageNet Classification 2013 Results
• http://www.image-net.org/challenges/LSVRC/2013/results.php
0.17
0.16
0.15
0.14
0.13
0.12
0.11
0.1
Test
err
or
(to
p-5)
https://sites.google.com/site/deeplearningcvpr2014
12/1/2015
28
ImageNet Challenge 2012-2014
Team Year Place Error (top-5) External data
SuperVision – Toronto(7 layers)
2012 - 16.4% no
SuperVision 2012 1st 15.3% ImageNet 22k
Clarifai – NYU (7 layers) 2013 - 11.7% no
Clarifai 2013 1st 11.2% ImageNet 22k
VGG – Oxford (16 layers) 2014 2nd 7.32% no
GoogLeNet (19 layers) 2014 1st 6.67% no
Human expert* 5.1%Error (top-5)
5.33%
4.94%
4.82%
Best non-convnet in 2012: 26.2%
Jia-Bin Huang and Derek Hoiem, UIUC
Industry Deployment
• Used in Facebook, Google, Microsoft
• Image Recognition, Speech Recognition, ….
• Fast at test time
Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR’14
Slide: R. Fergus
12/1/2015
29
Beyond classification
• Detection
• Segmentation
• Regression
• Pose estimation
• Matching patches
• Synthesis
and many more…
Jia-Bin Huang and Derek Hoiem, UIUC
R-CNN: Regions with CNN features
• Trained on ImageNet classification
• Finetune CNN on PASCAL
RCNN [Girshick et al. CVPR 2014]Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
30
Labeling Pixels: Semantic Labels
Fully Convolutional Netw orks for Semantic Segmentation [Long et al. CVPR 2015] Jia-Bin Huang and Derek Hoiem, UIUC
Labeling Pixels: Edge Detection
DeepEdge: A Multi-Scale Bifurcated Deep Netw ork for Top-Dow n Contour Detection
[Bertasius et al. CVPR 2015]Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
31
CNN for Regression
DeepPose [Toshevand Szegedy CVPR 2014]Jia-Bin Huang and Derek Hoiem, UIUC
CNN as a Similarity Measure for Matching
FaceNet [Schroff et al. 2015]Stereo matching [Zbontar and LeCun CVPR 2015]
Compare patch [Zagoruyko and Komodakis 2015]
Match ground and aerial images
[Lin et al. CVPR 2015]Flow Net [Fischer et al 2015]Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
32
CNN for Image Generation
Learning to Generate Chairs w ith Convolutional Neural Netw orks [Dosovitskiy et al. CVPR 2015]Jia-Bin Huang and Derek Hoiem, UIUC
Chair Morphing
Learning to Generate Chairs w ith Convolutional Neural Netw orks [Dosovitskiy et al. CVPR 2015]Jia-Bin Huang and Derek Hoiem, UIUC
12/1/2015
33
Recap
• Pyramid match kernels: example of structured input data for kernel-based classifiers (SVM)
• Neural networks / multi-layer perceptrons– View of neural networks as learning hierarchy of
features
• Convolutional neural networks– Architecture of network accounts for image structure
– “End-to-end” recognition from pixels
– Together with big (labeled) data and lots of computation major success on benchmarks, image classification and beyond
Coming up
Thursday
• Visual attributes and learning to rank
Final exam
• Dec 9, 2-5 pm, in JGB 2.216
• Comprehensive
• Closed book
• Two pages of notes allowed