UNSUPERVISED DEEP LEARNING
Erez Aharonov, Noam Eilon
Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University
Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng
2012
Outline
• Short introduction - Unsupervised Learning
• Overview
• Training deep autoencoders
• Model Architecture
• Parallelism and ASGD
• Results
Supervised Learning
(Diagram: the World provides Input Data to a Learning Machine, which produces Outputs; the objective compares the outputs against a Target defined by external rewards.)
Unsupervised Learning
(Diagram: the World provides Input Data to a Learning Machine, which produces Outputs; the objective compares the outputs against a Target defined by intrinsic rewards.)
Three Kinds of Learning
Supervised Learning: input is X (data) and Y (labels); goal: learn a function mapping X to Y; limitation: availability of labeled data; examples: classification, segmentation, object detection, image captioning.
Unsupervised Learning: input is X (data) only; goal: learn structure; limitation: complexity and size; examples: feature learning, generative models.
Reinforcement Learning: input is the current state and a reward; goal: optimize the reward; limitation: training the model; examples: policies, decisions, games.
Overview
• Building high-level, class-specific feature detectors from unlabeled data.
• How can a perceptual system build itself by looking at the world? How much prior structure is necessary?
• Could a network learn, in an unsupervised way, to be sensitive to high-level concepts such as human faces or cats?
• Inspiration: "grandmother neurons", neurons that represent a complex but specific concept or object.
"Invariant visual representation by single neurons in the human brain," Quian Quiroga et al.
Main concept: Deep Autoencoders
• Hierarchy of representations with increasing level of abstraction.
• Each module transforms its input representation into a higher-level one.
• High-level features are more global and more invariant.
• Low-level features are shared among categories.
(Diagram: inputs x1…x5 are encoded into a hidden representation and decoded back into reconstructions x'1…x'5.)
Training Deep Autoencoders
End-to-end training:
• Encode and decode through all layers.
• Compute the loss between the input and its reconstruction.
(Diagram: inputs x1…x5 pass through all encoding and decoding layers to produce reconstructions x'1…x'5.)
Training Deep Autoencoders
Greedy layer-wise training (see the sketch below):
• Train each layer separately as an autoencoder.
• The input of each autoencoder is the output of the previous layer.
• Fine-tune the full network at the end.
(Diagram: each layer is trained as a shallow autoencoder on the previous layer's output, and the trained layers are then stacked.)
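A minimal sketch of greedy layer-wise pretraining, assuming fully connected autoencoder layers; the layer widths, optimizer, and sigmoid nonlinearity are illustrative choices, not taken from the paper (PyTorch):

```python
import torch
import torch.nn as nn

sizes = [784, 256, 64, 16]            # example layer widths (assumed)
data = torch.randn(1000, sizes[0])    # stand-in for real inputs

encoders = []
x = data
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):                        # train this layer alone
        recon = dec(torch.sigmoid(enc(x)))
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    x = torch.sigmoid(enc(x)).detach()          # its codes feed the next layer

# After pretraining, stack the encoders and fine-tune the full network end to end.
stacked = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders])
```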
The Network Outline
• Three encoding/decoding layers.
• A 9-layer autoencoder overall.
• All parameters in the model are trained jointly, with the objective being the sum of the objectives of the three layers.
(Diagram: a 200×200 image passes through three stacked Encode → Pool & LCN stages, each with a Decode path; 60,000 neurons.)
One layer architecture
First sublayer: Local receptive fields.
• 18×18 pixel receptive-field (RF) windows.
• 8 feature maps.
• Each neuron connects to all input channels.
• Not convolutional (weights are untied), allowing more invariance.
One layer architecture
Second sublayer: Pooling
• L2 pooling over 5×5 overlapping windows.
• H is a fixed pooling matrix.
• Pooling is done within a single feature map.
• Improves invariance to local deformations.

$y_{j,i} = \sqrt{\sum_{u,v} H_{u,v}\, g_{j+u,\,i+v}^{2}}$
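A minimal sketch of this L2 pooling (PyTorch), assuming a uniform 5×5 weighting as the stand-in for H and same-size output:

```python
import torch
import torch.nn.functional as F

def l2_pool(g, k=5):
    # square, average with uniform window weights (our stand-in for H),
    # then take the square root
    pooled = F.avg_pool2d(g ** 2, kernel_size=k, stride=1, padding=k // 2)
    return torch.sqrt(pooled + 1e-8)      # small epsilon for numerical stability

g = torch.randn(1, 8, 171, 171)           # 8 feature maps, as in the slides
y = l2_pool(g)                            # same spatial size, locally pooled
```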
One layer architecture
Third sublayer: Local contrast normalization (LCN)
• 5×5 overlapping windows.
• Connects to all input channels.

$g_{i,j,k} = h_{i,j,k} - \sum_{i',u,v} G_{u,v}\, h_{i',\,j+u,\,k+v}$

$y_{i,j,k} = \frac{g_{i,j,k}}{\max\left\{c,\ \sqrt{\sum_{i',u,v} G_{u,v}\, g_{i',\,j+u,\,k+v}^{2}}\right\}}$

where $G$ is a 5×5 Gaussian weighting window, $c$ is a small constant that prevents numerical errors, $i$ indexes the channel, and $u, v$ index positions within the window.
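A minimal sketch of LCN following the equations above (PyTorch); the Gaussian sigma and the value of c are assumptions:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(k=5, sigma=1.0):
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def lcn(h, k=5, c=0.01):
    n, ch, _, _ = h.shape
    G = gaussian_kernel(k).expand(1, ch, k, k) / ch    # weights sum to 1 over all channels
    mean = F.conv2d(h, G, padding=k // 2)              # Gaussian-weighted local mean
    g = h - mean                                       # subtractive normalization
    energy = F.conv2d(g ** 2, G, padding=k // 2)       # weighted local energy
    return g / torch.clamp(torch.sqrt(energy), min=c)  # divisive normalization

y = lcn(torch.randn(1, 8, 171, 171))
```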
Local contrast normalization
• Relatively dominant activations are preferred over uniformly high activations across all features.
• Enforces a form of local competition between adjacent features, and between features at the same spatial location in different feature maps.
• Improves optimization.
Examples: [0, 1, 0, −1] → LCN → [0, 0.5, 0, −0.5]; [10, 10, 10, 10] → LCN → [0, 0, 0, 0].
One Layer Summary
Input: a 3 × 200 × 200 image $x_i$.
First sublayer, local receptive fields ($W_1$): 18×18 pixel RF windows, not convolutional, 8 feature maps, each neuron connected to all input channels.
Second sublayer, pooling ($H$): L2 pooling over 5×5 overlapping windows, within a single feature map.
Third sublayer, local contrast normalization: pools over all features.
Output: 8 maps of size 171×171.
9 Layer structure
(Diagram: the 3 × 200 × 200 image $x_i$ passes through three stacked stages, each consisting of local receptive fields ($W_1^1, W_1^2, W_1^3$), L2 pooling ($H$), and LCN.)
The Optimization Problem
$\underset{W_1,\,W_2}{\text{minimize}} \;\sum_{i=1}^{m} \left( \left\lVert W_2 W_1^{T} x^{(i)} - x^{(i)} \right\rVert_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left(W_1^{T} x^{(i)}\right)^2} \right)$

$W_1$: encoding matrix
$W_2$: decoding matrix
$\lambda$: tradeoff between sparsity and reconstruction (0.1)
$k$: number of pooling units
$H_j$: vector of weights of the j-th pooling unit (fixed)
$\epsilon$: numerical stability constant
$m$: number of examples

ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning, Le, Q. V., et al.
The Optimization Problem
$\underset{W_1,\,W_2}{\text{minimize}} \;\sum_{i=1}^{m} \left( \left\lVert W_2 W_1^{T} x^{(i)} - x^{(i)} \right\rVert_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left(W_1^{T} x^{(i)}\right)^2} \right)$

• Global reconstruction cost: ensures the representations encode the important information in the data, i.e., that they can reconstruct the input.
• Group sparsity / spatial pooling term: applied to the outputs of the second sublayer; lower sums of activations are preferred, which encourages the pooling to group similar features together and thereby achieve invariances (see the sketch below).
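A minimal sketch of this objective on toy data (PyTorch); the dimensions are illustrative, and for simplicity $H$ is taken to be the identity pooling, which reduces the second term to an elementwise smooth sparsity penalty:

```python
import torch

m, n, k = 256, 100, 50                        # examples, input dim, features
x = torch.randn(m, n)
W1 = (0.01 * torch.randn(n, k)).requires_grad_()   # encoding matrix
W2 = (0.01 * torch.randn(n, k)).requires_grad_()   # decoding matrix
lam, eps = 0.1, 1e-8

def objective(x):
    h = x @ W1                                # W1^T x for each example
    recon = h @ W2.T                          # W2 W1^T x
    rec_cost = ((recon - x) ** 2).sum()       # global reconstruction cost
    sparsity = torch.sqrt(eps + h ** 2).sum() # sparsity term with identity H (assumed)
    return rec_cost + lam * sparsity

opt = torch.optim.Adam([W1, W2], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    objective(x).backward()
    opt.step()
```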
Feature Grouping
Forces the encodings to be organized in a topographic map by pooling together structure-correlated features that belong to the same hidden topic. More specifically, features that are near each other in the topographic map are relatively strongly dependent in the sense of mutual information.
Kavukcuoglu, Koray, Rob Fergus, and Yann LeCun. "Learning invariant features through topographic filter maps."
Training the Network
(Diagram: the 3 × 200 × 200 image $x_i$ feeds encoding weights $W_1^1, W_1^2, W_1^3$, each producing 8 LCN maps via 5×5 kernels and pooling $H$; decoding weights $W_2^1, W_2^2, W_2^3$ reconstruct the LCN maps from the prior layer. Each layer is trained with the objective above.)
Implementation
Year, deep network architecture, and parameter count:
• 2012, AlexNet: 60M
• 2014, VGGNet: 138M
• 2014, GoogLeNet: 5M
• 2012, Google autoencoder: 1.15B

Dataset: 10 million 200×200 unlabeled images from YouTube.
Training: 2,000 machines with 16,000 CPU cores for 1 week.
Parameters: 1.15B learned weights.
Model parallelism
• The network is partitioned across machines.
• Each machine stores the parameters of its partition.
• Partitions pass update messages to one another.
• Less fault tolerant (requires some recovery if any single machine fails).
• Works well for convolutional layers, less so for fully connected ones.
Large Scale Distributed Deep Networks, Dean et al.
Data parallelism
• Multiple instances (replicas) of the model.
• Each replica computes parameter updates.
• Results are communicated through a parameter server.
Large Scale Distributed Deep Networks, Dean et al.
Asynchronous Gradient Descent
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Abadi et al
Synchronous parallelism, each iteration:
• Wait for all devices to finish.
• Compute the parameter updates.
• Update the parameter server.
Asynchronous parallelism (see the sketch below):
• Each model replica runs separately.
• Updates the parameters without synchronization.
• Less accurate per update, since replicas may compute gradients against stale parameters, but avoids waiting.
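A toy sketch of asynchronous updates against a shared parameter store, with threads standing in for worker machines and a quadratic loss standing in for the network; everything here is illustrative:

```python
import threading
import numpy as np

params = np.zeros(10)                     # shared "parameter server" state
target = np.arange(10.0)                  # optimum of the toy quadratic loss

def worker(steps=2000, lr=0.01):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = params.copy()                 # fetch (possibly stale) parameters
        grad = 2 * (w - target) + rng.normal(0, 0.1, 10)  # noisy gradient
        params[:] = params - lr * grad    # apply update without waiting for others

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(params.round(2))                    # approaches [0, 1, ..., 9]
```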
Results: Detection
• Looking for a neuron that is sensitive to a high-level concept, i.e., a face/cat/body-part detector.
• Method: a test set with a known positive/negative ratio (example, faces: 37,000 images, of which 13,026 are faces).
• For each neuron, find the minimum and maximum activation values.
• Split the activation range into 20 equally spaced thresholds.
• Pick the neuron and threshold that together give the highest accuracy (see the sketch below).
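A minimal sketch of this detector search (NumPy); the array names and shapes are illustrative:

```python
import numpy as np

def best_detector(acts, labels, n_thresholds=20):
    """acts: (n_images, n_neurons) activations; labels: (n_images,) 0/1 array."""
    best = (0.0, None, None)                  # (accuracy, neuron, threshold)
    for j in range(acts.shape[1]):
        a = acts[:, j]
        for t in np.linspace(a.min(), a.max(), n_thresholds):
            acc = ((a > t) == labels).mean()  # accuracy of "fires above t"
            best = max(best, (acc, j, t), key=lambda b: b[0])
    return best

acts = np.random.randn(1000, 50)              # toy activations
labels = np.random.randint(0, 2, 1000)        # toy positives/negatives
print(best_detector(acts, labels))
```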
Results: Detection
Summary of numerical comparisons against other baselines.
(Figure: histograms of activations for faces (red) vs. non-faces (blue).)
Results: Invariance
• Method: choose 10 face images and apply distortions to them, e.g., scaling and translation.
• Out-of-plane rotation is tested using 10 images of faces rotating in 3D.
(Figure: neuron response as a function of translation, in pixels.)
Results: Visualization
(Figure: the most responsive stimuli in the test set, and the optimal stimulus found by numerical constraint optimization.)
Results – ImageNet
• Unsupervised training on YouTube and ImageNet images.
• A logistic classifier on top of the highest layer.
• Train the logistic classifiers, then fine-tune the whole network.
• The entire training was carried out on 2,000 machines for one week.
(Table: classification accuracies for this method and other state-of-the-art baselines on ImageNet.)
Summary
• This work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data.
• The network was able to learn invariances from unlabeled data.
• Object recognition on ImageNet: a significant leap, with a 70% relative improvement over the previous state of the art.
"Google Builds a Brain that Can Search for Cat Videos," Time, June 2012. "How Many Computers to Identify a Cat? 16,000," NYT, June 2012.
Unsupervised Learning of Visual Representations using Videos
Xiaolong Wang, Abhinav Gupta
Robotics Institute, Carnegie Mellon University
Published in 2015
http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Wang_Unsupervised_Learning_of_ICCV_2015_paper.pdf
Agenda
• Overview
• Patch Mining in Videos
• CNN implementation
• Results
• Discussion and Conclusion
Overview
• Do we really need millions of semantically-labeled images to learn a good representation?
• It seems humans can learn visual representations with little or no semantic supervision, yet our approaches still remain completely supervised.
Overview
• Previous work on unsupervised learning used millions of static images or frames extracted from videos.
• The most common architecture is an autoencoder, which learns representations based on its ability to reconstruct the input images.
Overview
• Previous approaches have been able to automatically learn V1-like filters from unlabeled data, but they remain far behind supervised approaches on tasks such as object detection.
Overview
• Key insight: visual tracking is one of the first capabilities that develops in infants, often before semantic representations are learned.
• Using video and tracking, we can produce patches of the same object; these should have similar visual representations in deep feature space.
Overview
http://www.aoa.org/patients-and-public/good-vision-throughout-life/childrens-vision/infant-vision-birth-to-24-months-of-age?sso=y#1
Overview
• Proposal: a Siamese triplet network with a ranking loss function to train a CNN representation.
• The ranking loss enforces that, in the final deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch.
Patch Mining in Videos
• Source of videos: YouTube.
• Estimated upload rate: roughly 300 hours of video per minute (2016).
• Tracking pipeline:
  • Obtain SURF interest points (Speeded-Up Robust Features, 2006).
  • Use Improved Dense Trajectories (IDT) to obtain motion (2013).
  • Track with a Kernelized Correlation Filter (KCF, 2014).
Patch Mining in Videos
• Patches are accepted when more than 25% and less than 75% of their SURF interest points are moving.
Siamese Triplet Network
• Three networks that share the same parameters.
• Input: images of size 227 × 227.
• Based on the AlexNet architecture.
• Two fully connected layers are stacked on the pool5 outputs, with 4,096 and 1,024 neurons respectively.
• The final output of each single network is thus a 1,024-dimensional feature vector (see the sketch below).
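A minimal sketch of one shared branch of the triplet network (PyTorch, using torchvision's AlexNet as a stand-in backbone); the layer sizes follow the slide, everything else is illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class TripletBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = alexnet().features         # conv1 .. pool5
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 1024))                 # 1024-d output space

    def forward(self, x):                          # x: (n, 3, 227, 227)
        return self.head(self.features(x))

branch = TripletBranch()                           # one module, shared weights
f_anchor = branch(torch.randn(2, 3, 227, 227))
f_pos = branch(torch.randn(2, 3, 227, 227))        # same branch = same parameters
f_neg = branch(torch.randn(2, 3, 227, 227))
```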
Ranking Loss Function
• Cosine distance in the feature space:

$D(X_1, X_2) = 1 - \frac{f(X_1) \cdot f(X_2)}{\lVert f(X_1) \rVert\, \lVert f(X_2) \rVert}$

• Goal: $D(X_i, X_i^-) > D(X_i, X_i^+)$, where
• $X_i$ is the first-frame patch,
• $X_i^+$ is the tracked last-frame patch,
• $X_i^-$ is a patch from a different video.
Ranking Loss Function
• Per triplet of patches:

$L(X_i, X_i^+, X_i^-) = \max\{0,\ D(X_i, X_i^+) - D(X_i, X_i^-) + M\}$

• Total objective:

$\min_W \ \frac{\lambda}{2}\lVert W \rVert_2^2 + \sum_{i=1}^{N} L(X_i, X_i^+, X_i^-)$

with margin $M = 0.5$ and weight decay $\lambda = 0.0005$.
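A minimal sketch of this cosine-distance ranking loss (PyTorch); the embedding head stands in for the Siamese branches above, and M and λ follow the slide:

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    return 1 - F.cosine_similarity(a, b, dim=1)

def ranking_loss(f_anchor, f_pos, f_neg, margin=0.5):
    d_pos = cosine_distance(f_anchor, f_pos)    # D(X_i, X_i+)
    d_neg = cosine_distance(f_anchor, f_neg)    # D(X_i, X_i-)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy usage: 1024-d embeddings, as in the network's final layer.
f = torch.nn.Linear(4096, 1024)                 # stand-in embedding head
a, p, n = (f(torch.randn(8, 4096)) for _ in range(3))
loss = ranking_loss(a, p, n)
# The lambda/2 * ||W||^2 term is the optimizer's weight decay:
# torch.optim.SGD(f.parameters(), lr=1e-3, weight_decay=5e-4)
```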
Patch Mining for Triplet Sampling
• Given $X_i, X_i^+$, how do we select $X_i^-$?
• Random selection:
  • For each image pair in a batch B, randomly sample K negative matches from the same batch.
  • Shuffle all the images randomly after each epoch of training.
Patch Mining for Triplet Sampling
• Given $X_i, X_i^+$, how do we select $X_i^-$?
• Hard negative mining (see the sketch below):
  • Applied after 10 epochs of training.
  • Choose the K samples in the batch with the highest loss.
  • K = 4, B = 100.
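A minimal sketch of hard negative mining for the triplet loss above (PyTorch): for each (anchor, positive) pair, score every candidate negative in the batch and keep the K with the highest loss. The batched formulation is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def hard_negatives(f_anchor, f_pos, f_batch, k=4, margin=0.5):
    a = F.normalize(f_anchor, dim=1)
    b = F.normalize(f_batch, dim=1)
    d_neg = 1 - a @ b.T                                  # (n_anchor, n_batch)
    d_pos = 1 - (a * F.normalize(f_pos, dim=1)).sum(1)   # (n_anchor,)
    losses = torch.clamp(d_pos[:, None] - d_neg + margin, min=0)
    return losses.topk(k, dim=1).indices                 # hardest K per anchor

idx = hard_negatives(torch.randn(8, 1024), torch.randn(8, 1024),
                     torch.randn(100, 1024))             # K=4 from a batch of 100
```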
Adapting for Supervised Tasks
• Method #1:
  • Based on the R-CNN paper.
  • Use the pre-trained unsupervised AlexNet-based network.
  • The parameters of the layers up to pool5 are used as initialization.
  • The two fully connected layers are initialized randomly.
  • The learning rate is 0.01 instead of 0.001 (R-CNN).
Adapting for Supervised Tasks
• Method #2, an iterative approach:
  1) Fine-tune using the PASCAL VOC data.
  2) Re-adapt to the ranking triplet task.
  3) Again transfer the convolutional parameters for re-adapting.
• The network converges after two iterations.
Implementation Details
• 100K videos into 8M patches
• 3 different networks using 1.5M, 5M and 8M patches
• Batch size: 100
• Initial learning rate: 0.001
• Random negative sampling for 150K iterations, then hard negative mining.
Implementation Details
• 1.5M patches:
  • Reduce the learning rate by a factor of 10 every 80K iterations.
  • Total: 240K iterations.
• 5M and 8M patches:
  • Reduce the learning rate by a factor of 10 every 120K iterations.
  • Total: 350K iterations.
Results: Learned features
Results: Network response
Results, no fine-tuning: Qualitative comparison
Results, no fine-tuning: Quantitative comparison
• Measurement: retrieval rate, counting the number of correct retrievals among the top-K retrievals (K = 20), as sketched below.
• pool5 features with cosine distance.

Method and top-20 retrieval rate:
• This paper's network: 40%
• ELDA on HOG: 24%
• Random AlexNet: 19%
• ImageNet-pretrained CNN: 62%
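A minimal sketch of this retrieval measurement (NumPy) on synthetic stand-in data; real evaluation would use pool5 features of labeled patches:

```python
import numpy as np

def topk_retrieval_rate(feats, labels, k=20):
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)               # exclude the query itself
    nn = np.argsort(-sim, axis=1)[:, :k]         # top-K most similar patches
    return (labels[nn] == labels[:, None]).mean()

feats = np.random.randn(500, 9216)               # pool5 in AlexNet is 9216-d
labels = np.random.randint(0, 10, 500)
print(topk_retrieval_rate(feats, labels))
```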
Results, with fine-tuning: Object detection
• Follows the R-CNN pipeline.
• PASCAL VOC 2012 dataset.
• Trainval and test sets of roughly 10K images each.
• SVM classifier.
• Learning rate 0.01, reduced by a factor of 10 every 80K iterations.
• Total iterations for fine-tuning: 200K.
• 21 classes.
Results, with fine-tuning: Object detection
Results, with fine-tuning: Object detection
• Without using a single image from ImageNet, just 100K unlabeled videos and the VOC 2012 dataset, an ensemble of AlexNet networks achieves 52% mAP.
• The ImageNet-supervised counterpart is an ensemble achieving 54.4% mAP.
Results, with fine-tuning: Surface Normal Estimation
Results, with fine-tuning: Surface Normal Estimation
Results, with fine-tuning: Surface Normal Estimation
• A 227 × 227 image as input.
• The output of the network is 20 × 20 pixels.
• Each output pixel is represented by a distribution over 20 code-words, learned using K-means.
• The output dimension is therefore 20 × 20 × 20 = 8,000.
• Two fully connected layers with 4,096 and 8,000 neurons on top of pool5.
Discussion and Conclusion
• Much more data is available.
• Might be as close as 2.5% in mAP to supervised networks.
• A greater boost when using an ensemble of networks.
• Can be generalized to different tasks.
• Does it mimic the human brain?
Questions