L EARNING F INE -GRAINED I MAGE S IMILARITY WITH D EEPR ANKING
Jiang Wang(Northwestern) Yang Song (Google) Thomas Leung (Google) Chuck Rosenberg (Google) Jingbin Wang(Google) James Philbin (Google) Bo Chen (Caltech) Ying Wu (Northwestern)
PROBLEM
Fine-grained imagesimilarity, for imageswith the same category.It is for image-searchapplication, defined bytriplets.
Query
PositiveN
egative
• image similarities are defined subtle di�er-ence.
• it is more difficult to obtain triplet trainingdata.
• we would like to train a model directly fromimages instead of rely on the hand-craftedfeatures.
A RCHITECTURE
Q P N
Triplet Sampling Layer
....Images
....
Ranking Layer
p i p i-p i
+
f(pi) f(pi ) f(pi)+ -
• a novel deep learning that can learns fine-grained image similarity model directlyfrom images.
• a multi-scale network structure.• a computationally efficient online triplet
sampling algorithm.• high quality triplet evaluation dataset.
R ELATED W ORK
• category-level image similarity: the similar-ities are purely defined by labels.
• classification deep learning models.• pairwise ranking model.
FORMULATION
The similarity of two images P and Q can bedefined according to their squared Euclidean dis-tance in the image embedding space:
D ( f (P ) , f (Q )) = �f (P ) − f (Q )�22 (1)
Triplet-based Objective: r i,j = r (pi , pj ) is pair-wise relevance score.
D ( f (pi ) , f (p+i )) < D ( f (pi ) , f (p−i )) ,
�pi , p+i , p
−i such that r (pi , p+
i ) > r (pi , p−i )
(2)
ti = ( pi , p+i , p
−i ) a triplet. The hinge loss is:
l(pi , p+i , p
−i ) =max { 0, g + D ( f (pi ) , f (p+i )) −
D ( f (pi ) , f (p−i )) }(3)
M ULTI -SCALE A RCHITECTURE
Image
225 x 225
SubSample
SubSample
Convolution
Convolution
4:1
8:1
Max pooling
Max pooling
57 X 5729 X 29
8 x 8 x 96
l2 Norm
alization
Linear Embedding
l2 N
ormalization
8 x 8: 4x4
8 x 8: 4x4
3 x 3: 2x2
7 x 7: 4x4
15 x 15 x 96
4 x 4 x 964 x 4 x 96
30744096
4096
4096
ConvNet
l2 Norm
alization
4096
T RAINING DATA
• ImageNet for pre-training. Category-levelinformation.
• Relevance training data. Fine-grained vi-sual information.
– Golden Feature, good for visual simi-larity but not so good for semantic sim-ilarity, and it is expensive to compute,
O PTIMIZATION• Asynchronized stochastic gradient algo-
rithm.• Momentum algorithm.• Dropout to avoid overfitting
Challenges:
• Cannot enumerate all the triplets, need tosample important triplets.
• Cannot load all the images into memory,need to generate triplets online.
T RIPLET S AMPLINGSampling criteria: we sample more highly rel-
evant images.Total relevance score r i :
r i =�
j :c j = c i ,j �= i
r i,j (4)
• For query image: according to total rele-vance score.
• For positive image: sample images with thesame label as the query image, sampling
probability is P (p+i ) =min { T p ,r i,i + }
Z i.
• For negative image, we have two types ofsamples:
1. in-class negative: we draw in-classnegative samples p−i with the samedistribution as the positive image. Wealso require that the margin betweenthe relevance score r i,i + and r i,i −
should be larger than T r
2. out-of-class negative: drawn uni-formly from all the images in di�erentcategories.
Online triplet sampling: reservoir sampling:Buffers for queries
Image sample
Find buffer of the query
Triplets
Query
Positive
Negative
E XPERIMENTS
Comparison with hand-crafted features:
Method Precision Score-30Wavelet 62.2% 2735Color 62.3% 2935
SIFT-like 65.5% 2863Fisher 67.2% 3064HOG 68.4% 3099
SPMKtexton1024max 66.5% 3556L1HashKPCA 76.2% 6356
OASIS 79.2% 6813Golden Features 80.3% 7165DeepRanking 85 .7% 7004
Comparison of di�erent architectures:
Method Precision Score-30ConvNet 82.8% 5772
Single-scale Ranking 84.6% 6245OASIS on Single-scale Ranking 82.5% 6263Single-Scale & Visual Feature 84.1% 6765
DeepRanking 85 .7% 7004
Comparison of di�erent sampling methods:
0 0.2 0.4 0.6 0.8 16600
6700
6800
6900
7000
7100
7200
Fraction of out−of−class negative samples
Scor
e at
30
weighted samplinguniform sampling
0 0.2 0.4 0.6 0.8 10.83
0.84
0.85
0.86
0.87
Fraction of out−of−class negative samples
Ove
rall
prec
isio
n
weighted samplinguniform sampling
R ANKING E XAMPLES
ConvN
etO
ASISD
eepR
ankingC
onvNet
OASIS
Deep
Ranking
stluseR gniknaRyreuQ
A CKNOWLEDGMENT
The work was done when the first author isworking as an intern at Google.
D ATA
High quality image triplet evaluation dataset:Available athttps://sites.google.com/site/imagesimilaritydata/
Top Related