Large Image Databases and Small Codes for Object Recognition
Rob Fergus (NYU), Antonio Torralba (MIT), Yair Weiss (Hebrew U.), William T. Freeman (MIT)

[Image credit: Banksy, 2006]
Object Recognition
Pixels → Description of scene contents

Large image datasets
• The Internet contains billions of images
• An amazing resource; maybe we can use it for recognition?
Web image dataset
• 79.3 million images
• Collected using image search engines
• List of nouns taken from Wordnet
• 32x32 resolution
• Example, first 10 nouns: a-bomb, a-horizon, a._conan_doyle, a._e._burnside, a._e._housman, a._e._kennelly, a.e., a_battery, a_cappella_singing, a_horizon

See "80 Million Tiny Images" TR
Noisy Output from Image Search Engines
Nearest-Neighbor methods for recognition
[Figure: retrieval examples as the number of images grows from 10^5 to 10^8]
Issues
• Need lots of data
• How to find neighbors quickly?
• What distance metric to use?
• How to transfer labels from neighbors to query image?
• What happens if labels are unreliable?
Overview
1.Fast retrieval using compact codes
2.Recognition using neighbors with unreliable labels
Binary codes for images
• Want images with similar content to have similar binary codes
• Use Hamming distance between codes
  – Number of bit flips
  – E.g. Ham_Dist(10001010, 10001110) = 1; Ham_Dist(10001010, 11101110) = 3
• Semantic Hashing [Salakhutdinov & Hinton, 2007], originally for text documents
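The two Ham_Dist examples above can be checked in a few lines of Python (a minimal sketch; codes are stored as plain integers):

```python
# Hamming distance between two binary codes stored as Python ints:
# XOR the codes, then count the set bits (the number of bit flips).
def ham_dist(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# The two examples from the slide:
assert ham_dist(0b10001010, 0b10001110) == 1
assert ham_dist(0b10001010, 0b11101110) == 3
```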
Binary codes for images
• Permit fast lookup via hashing
  – Each code is a memory address
  – Find neighbors by exploring the Hamming ball around the query address
  – Lookup time depends on the radius of the ball, NOT on # data points

[Figure: query image → semantic hash function → query address; semantically similar images lie at nearby addresses in the address space. Adapted from Salakhutdinov & Hinton '07]
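A sketch of that lookup: enumerate every address within the Hamming ball and probe a hash table at each one, so the cost grows with code length and radius rather than with the number of stored images (illustrative code, not the authors' implementation):

```python
from itertools import combinations

def hamming_ball(query: int, n_bits: int, radius: int):
    """Yield every address within `radius` bit flips of `query`."""
    for r in range(radius + 1):
        for bits in combinations(range(n_bits), r):
            addr = query
            for b in bits:
                addr ^= 1 << b  # flip one bit
            yield addr

def lookup(table: dict, query: int, n_bits: int, radius: int):
    """Probe the table at each address in the ball; collect all hits."""
    hits = []
    for addr in hamming_ball(query, n_bits, radius):
        hits.extend(table.get(addr, []))
    return hits
```

For example, with 4-bit codes and radius 1, a query probes only 5 addresses no matter how many images are stored.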
Compact Binary Codes
• Google has few billion images (109)• Big PC has ~10 Gbytes (1011 bits)• Codes must fit in memory (disk too slow) Budget of 102 bits/image
• 1 Megapixel image is 107 bits• 32x32 color image is 104 bits Semantic hash function must also reduce
dimensionality
Code requirements
• Preserves the neighborhood structure of the input space
• Very compact (<10^2 bits/image)
• Fast to compute

Three approaches:
1. Locality Sensitive Hashing (LSH)
2. Boosting
3. Restricted Boltzmann Machines (RBMs)
Image representation: Gist vectors
• Pixels are not a convenient representation
• Use the GIST descriptor instead [Oliva & Torralba, IJCV 2001]
• 512 dimensions/image
• L2 distance between Gist vectors is not a bad substitute for human perceptual distance
1. Locality Sensitive Hashing (LSH)
• Gionis, Indyk & Motwani (1999)
• Take random projections of the data
• Quantize each projection with a few bits
• For our N-bit code:
  – Compute the first N PCA components of the data
  – Each random projection must be a linear combination of those N PCA components
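The projection step can be sketched as follows (a simplified sketch: one bit per projection, data assumed mean-centred, and it omits the PCA-subspace restriction described above; `lsh_codes` is an illustrative name, not the authors' code):

```python
import numpy as np

def lsh_codes(X: np.ndarray, n_bits: int, seed: int = 0) -> np.ndarray:
    """One random Gaussian projection per bit, quantised to a single
    bit by thresholding at zero. X has shape (n_images, n_dims)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, n_bits))   # random projection directions
    return (X @ R > 0).astype(np.uint8)    # n_bits per image
```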
2. Boosting
• Modified form of BoostSSC [Shakhnarovich, Viola & Darrell, 2003]
• GentleBoost with exponential loss
• Positive examples are pairs of neighbors; negative examples are pairs of unrelated images
• Each binary regression stump examines a single coordinate of the input pair, comparing both values to a learnt threshold and checking that they agree
3. Restricted Boltzmann Machine (RBM)
• Network of binary stochastic units [Hinton & Salakhutdinov, Science 2006]
• Visible units v, hidden units h, symmetric weights w
• Parameters: weights w, biases b

RBM architecture
• Learn weights and biases using Contrastive Divergence
• Convenient conditional distributions:
  p(h_j = 1 | v) = σ(b_j + Σ_i w_ij v_i)
  p(v_i = 1 | h) = σ(b_i + Σ_j w_ij h_j)
Multi-Layer RBM: non-linear dimensionality reduction
Input Gist vector (512 dimensions)
  → Layer 1: 512 units (weights w1)
  → Layer 2: 256 units (weights w2)
  → Layer 3: N units (weights w3)
Output binary code (N dimensions)
Training RBM models
• Two phases:
1. Pre-training
  – Unsupervised
  – Use Contrastive Divergence to learn weights and biases
  – Gets parameters to the right ballpark
2. Fine-tuning
  – Can make use of label information
  – No longer stochastic
  – Back-propagate error to update parameters
  – Moves parameters to a local minimum
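The pre-training phase above can be sketched as a single CD-1 parameter update for one binary RBM layer (a minimal sketch under simplifying assumptions: plain gradient step, no momentum or weight decay; `cd1_update` is an illustrative name, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_h, b_v, lr=0.1, rng=None):
    """One Contrastive Divergence (CD-1) step for a binary RBM.
    v0: batch of visible vectors, shape (batch, n_visible)."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: p(h=1|v0), then sample binary hidden states
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one reconstruction step down and back up
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Update: difference between data and reconstruction statistics
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b_h += lr * (ph0.mean(0) - ph1.mean(0))
    b_v += lr * (v0.mean(0) - pv1.mean(0))
    return W, b_h, b_v
```

Stacking layers greedily means running this loop on one layer, freezing it, and feeding its hidden activations to the next.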
Greedy pre-training (Unsupervised)
• Layer 1: input Gist vector (512 dimensions) → 512 hidden units (weights w1)
• Layer 2: activations of hidden units from layer 1 (512 dimensions) → 256 hidden units (weights w2)
• Layer 3: activations of hidden units from layer 2 (256 dimensions) → N hidden units (weights w3)
Backpropagation using Neighborhood Components Analysis objective
[Same architecture: input Gist vector (512 dims) → Layer 1 (512, w1) → Layer 2 (256, w2) → Layer 3 (N, w3) → output binary code (N dims)]
Neighborhood Components Analysis
• Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004
• Labels defined in the input space (high dimensional); points live in the output space (low dimensional)
• Here the output space is the output of the RBM; W are the RBM weights
• Pulls nearby points OF THE SAME CLASS closer; pushes nearby points OF DIFFERENT CLASSES away
• Preserves the neighborhood structure of the original, high-dimensional space
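The pull/push behaviour comes from the NCA objective of Goldberger et al.: the expected number of correctly classified points under a stochastic nearest-neighbour rule. This sketch just evaluates that objective on low-dimensional outputs (illustrative code, not the authors' implementation):

```python
import numpy as np

def nca_objective(Y, labels):
    """NCA objective on outputs Y of shape (n, d): sum over i of the
    probability mass that point i assigns to same-class neighbours,
    with p_ij proportional to exp(-||y_i - y_j||^2)."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # a point never picks itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)      # row-stochastic p_ij
    same = labels[:, None] == labels[None, :]
    return (P * same).sum()                # fine-tuning maximises this
```

Gradient ascent on this quantity, back-propagated through the RBM stack, is what pulls same-class points together and pushes different-class points apart.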
Two test datasets
1. LabelMe
  – 22,000 images (20,000 train | 2,000 test)
  – Ground truth segmentations for all images
  – Can define a ground-truth distance between images using these segmentations
  – Per-pixel labels [SUPERVISED]
2. Web data
  – 79.3 million images, collected from the Internet
  – No segmentations, so use L2 distance between GIST vectors as the ground-truth distance [UNSUPERVISED]
  – Noisy image labels only
Retrieval Experiments
Examples of LabelMe retrieval
• 12 closest neighbors under different distance metrics
LabelMe Retrieval
[Plot: % of 50 true neighbors in the retrieval set vs. size of retrieval set (0 to 20,000)]
[Plot: % of 50 true neighbors in the first 500 retrieved vs. number of bits]
Web images retrieval
[Plots: % of 50 true neighbors in the retrieval set vs. size of retrieval set]
Examples of Web retrieval
• 12 neighbors using different distance metrics
Retrieval Timings
LabelMe Recognition examples
LabelMe Recognition
Overview
1.Fast retrieval using compact codes
2.Recognition using neighbors with unreliable labels
Label Assignment
• Retrieval gives a set of nearby images; how to compute a label?
• Issues:
  – Labeling noise
  – Keywords can be very specific, e.g. "yellowfin tuna"
[Figure: query image and its neighbors, with noisy labels such as Grover Cleveland, Linnet, Birdcage, Chiefs, Casing]
Wordnet – a Lexical Dictionary
Synonyms/hypernyms (ordered by estimated frequency) of noun "aardvark", Sense 1:
aardvark, ant bear, anteater, Orycteropus afer
  => placental, placental mammal, eutherian, eutherian mammal
    => mammal
      => vertebrate, craniate
        => chordate
          => animal, animate being, beast, brute, creature
            => organism, being
              => living thing, animate thing
                => object, physical object
                  => entity
http://wordnet.princeton.edu/
Wordnet Hierarchy
• Convert the graph structure into a tree by taking the most common meaning
Wordnet Voting Scheme
• One image – one vote, propagated up the Wordnet tree (evaluated against ground truth)

Classification at Multiple Semantic Levels
• Votes at one level: Animal 6, Person 33, Plant 5, Device 3, Administrative 4, Others 22
• Votes at a higher level: Living 44, Artifact 9, Land 3, Region 7, Others 10
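The scheme can be sketched as follows, with a hypothetical toy hypernym table standing in for the real Wordnet tree (the table and function names are illustrative, not the authors' code):

```python
from collections import Counter

# Hypothetical toy hierarchy: each keyword maps to its path up the tree.
HYPERNYM_PATHS = {
    "person": ["person", "living"],
    "plant":  ["plant", "living"],
    "animal": ["animal", "living"],
    "device": ["device", "artifact"],
}

def vote(neighbor_labels):
    """One image, one vote: each neighbour's (noisy) keyword votes for
    every node on its path up the tree, so a classification can be
    read off at multiple semantic levels."""
    votes = Counter()
    for kw in neighbor_labels:
        for node in HYPERNYM_PATHS.get(kw, []):
            votes[node] += 1
    return votes

# vote(["person", "person", "plant", "device"])
# → living: 3, person: 2, plant: 1, device: 1, artifact: 1
```

Voting at coarser nodes ("living" rather than "person") is what makes the noisy, overly specific keywords usable.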
Person Recognition
• 23% of all images in the dataset contain people
• Wide range of poses: not just frontal faces
Person Recognition – Test Set
• 1016 images from Altavista using the "person" query
• High-resolution and 32x32 versions available
• Disjoint from the 79 million tiny images
Person Recognition
• Task: is there a person in the image or not?
• 64-bit RBM code trained on web data (weak image labels only)
[Plot: performance compared against Viola-Jones]
Object Classification
• Hand-designed metric
Distribution of objects in the world
[Plot: LabelMe statistics: number of labeled samples vs. object rank (log-log, 1 to 10,000)]
• 10% of the classes account for 93% of the labels!
• 1500 classes have fewer than 100 samples
(LabelMe dataset)
Conclusions
• Possible to build compact codes for retrieval
  – # bits seems to depend on the dataset
  – Much room for improvement, e.g. use JPEG coefficients as input to the RBM
• Can do interesting things with lots of data
  – What would happen with Google's ~2 billion images?
  – Video data