
Transcript of: wangjiangb.github.io/pdfs/deep_ranking_poster.pdf

LEARNING FINE-GRAINED IMAGE SIMILARITY WITH DEEP RANKING

Jiang Wang (Northwestern), Yang Song (Google), Thomas Leung (Google), Chuck Rosenberg (Google), Jingbin Wang (Google), James Philbin (Google), Bo Chen (Caltech), Ying Wu (Northwestern)

PROBLEM

Fine-grained image similarity: similarity among images of the same category. It targets image-search applications and is defined by triplets.

[Figure: a sample triplet — query, positive, and negative image.]

• image similarities are defined by subtle differences.

• it is more difficult to obtain triplet training data.

• we would like to train a model directly from images instead of relying on hand-crafted features.

ARCHITECTURE

[Figure: network architecture — images enter a triplet sampling layer that emits triplets (p_i, p_i^+, p_i^-); a shared deep network computes the embeddings f(p_i), f(p_i^+), f(p_i^-), which feed the ranking layer.]

• a novel deep learning model that learns fine-grained image similarity directly from images.

• a multi-scale network structure.
• a computationally efficient online triplet sampling algorithm.
• a high quality triplet evaluation dataset.

RELATED WORK

• category-level image similarity: the similarities are purely defined by labels.

• classification deep learning models.
• pairwise ranking model.

FORMULATION

The similarity of two images P and Q can be defined according to their squared Euclidean distance in the image embedding space:

D(f(P), f(Q)) = \|f(P) - f(Q)\|_2^2    (1)

Triplet-based objective: r_{i,j} = r(p_i, p_j) is the pairwise relevance score.

D(f(p_i), f(p_i^+)) < D(f(p_i), f(p_i^-)),
\forall p_i, p_i^+, p_i^- such that r(p_i, p_i^+) > r(p_i, p_i^-)    (2)

t_i = (p_i, p_i^+, p_i^-) is a triplet. The hinge loss is:

l(p_i, p_i^+, p_i^-) = max\{0, g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))\}    (3)
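The distance of eq. (1) and the hinge loss of eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the margin value g and the toy embeddings below are made up:

```python
import numpy as np

def squared_dist(u, v):
    # Squared Euclidean distance of eq. (1): ||f(P) - f(Q)||_2^2.
    return float(np.sum((u - v) ** 2))

def triplet_hinge_loss(f_q, f_pos, f_neg, g=1.0):
    # Hinge loss of eq. (3): zero once the positive embedding is closer
    # to the query than the negative by at least the margin g.
    return max(0.0, g + squared_dist(f_q, f_pos) - squared_dist(f_q, f_neg))
```

With a well-separated triplet the loss vanishes; a negative that sits too close to the query incurs a positive penalty proportional to the margin violation.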

MULTI-SCALE ARCHITECTURE

[Figure: multi-scale architecture. A 225 x 225 input image feeds a deep ConvNet (4096-d output) and two shallow branches that subsample the image 4:1 (to 57 x 57) and 8:1 (to 29 x 29), each followed by an 8 x 8 convolution with stride 4x4 (96 filters) and max pooling (7 x 7: 4x4 and 3 x 3: 2x2), producing 15 x 15 x 96, 8 x 8 x 96, and 4 x 4 x 96 feature maps. The branch outputs are l2-normalized, combined with the ConvNet feature through a linear embedding, and l2-normalized again into a 4096-d embedding.]
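The exact wiring of the fusion step is only partly legible from the diagram. The NumPy sketch below shows one plausible reading — l2-normalize each shallow branch feature, concatenate, project through a linear embedding, l2-normalize, and join with the normalized ConvNet feature. The function names and the embedding matrix W are illustrative assumptions, not the poster's definition:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale a feature vector to unit Euclidean norm.
    return x / (np.linalg.norm(x) + eps)

def fuse_multiscale(convnet_feat, branch_feats, W):
    # Hypothetical fusion: l2-normalize each shallow branch feature,
    # concatenate, apply a linear embedding W, l2-normalize, and join
    # with the l2-normalized ConvNet feature into one embedding.
    shallow = np.concatenate([l2_normalize(f) for f in branch_feats])
    embedded = l2_normalize(W @ shallow)
    fused = np.concatenate([l2_normalize(convnet_feat), embedded])
    return l2_normalize(fused)
```

Whatever the exact topology, the final l2 normalization matters: it makes the squared Euclidean distance of eq. (1) bounded and comparable across images.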

TRAINING DATA

• ImageNet for pre-training. Category-level information.
• Relevance training data. Fine-grained visual information.
  – Golden Feature: good for visual similarity but not so good for semantic similarity, and expensive to compute.

OPTIMIZATION

• Asynchronous stochastic gradient algorithm.
• Momentum algorithm.
• Dropout to avoid overfitting.

Challenges:

• Cannot enumerate all the triplets; need to sample important triplets.
• Cannot load all the images into memory; need to generate triplets online.
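As a reminder of what the momentum algorithm in the list above does, here is a minimal sketch; the learning rate, momentum coefficient, and toy objective are illustrative, not the poster's values:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.05, mu=0.9):
    # One momentum update: the velocity accumulates an exponentially
    # decaying average of past gradients, and the weights move along it.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```

For example, iterating this step with the gradient of f(w) = w^2 drives w toward the minimum at 0, with the velocity smoothing out oscillations.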

TRIPLET SAMPLING

Sampling criteria: we sample highly relevant images more often.

Total relevance score r_i:

r_i = \sum_{j : c_j = c_i, j \neq i} r_{i,j}    (4)

• For the query image: sample according to the total relevance score.
• For the positive image: sample images with the same label as the query image, with sampling probability P(p_i^+) = min\{T_p, r_{i,i^+}\} / Z_i.
• For the negative image, we have two types of samples:
  1. in-class negative: we draw in-class negative samples p_i^- with the same distribution as the positive image. We also require that the margin between the relevance scores r_{i,i^+} and r_{i,i^-} be larger than T_r.
  2. out-of-class negative: drawn uniformly from all the images in different categories.
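The positive-sampling rule above can be sketched directly. The image ids, relevance scores, and threshold below are made-up examples:

```python
import random

def positive_weights(rel_to_query, T_p):
    # Unnormalized weights min{T_p, r_{i,i+}} from the sampling rule;
    # the cap T_p stops a few highly relevant images from dominating.
    return {img: min(T_p, r) for img, r in rel_to_query.items()}

def sample_positive(rel_to_query, T_p, rng=random):
    # Draw one positive with probability min{T_p, r_{i,i+}} / Z_i.
    weights = positive_weights(rel_to_query, T_p)
    z = sum(weights.values())  # the normalizer Z_i
    images = list(weights)
    return rng.choices(images, weights=[weights[m] / z for m in images], k=1)[0]
```

In-class negatives would reuse the same distribution, with the extra margin check r_{i,i^+} - r_{i,i^-} > T_r applied when assembling the triplet.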

Online triplet sampling uses reservoir sampling. [Figure: each incoming image sample is routed to the buffer of its query; triplets (query, positive, negative) are drawn from these per-query buffers.]
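The poster fills each query's buffer with reservoir sampling. The generic primitive — a fixed-size uniform sample from a stream of unknown length, without holding the stream in memory — looks like this; the per-query buffer bookkeeping from the figure is omitted:

```python
import random

def reservoir_sample(stream, k, rng=random):
    # Keep a uniform random sample of size k from a stream: the first k
    # items fill the reservoir, then item n replaces a random slot with
    # probability k / (n + 1).
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randrange(n + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

This is what makes the "cannot load all images into memory" challenge tractable: each buffer stays at a fixed size no matter how many images stream past.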

EXPERIMENTS

Comparison with hand-crafted features:

Method               Precision   Score-30
Wavelet                62.2%       2735
Color                  62.3%       2935
SIFT-like              65.5%       2863
Fisher                 67.2%       3064
HOG                    68.4%       3099
SPMKtexton1024max      66.5%       3556
L1HashKPCA             76.2%       6356
OASIS                  79.2%       6813
Golden Features        80.3%       7165
DeepRanking            85.7%       7004

Comparison of different architectures:

Method                          Precision   Score-30
ConvNet                           82.8%       5772
Single-scale Ranking              84.6%       6245
OASIS on Single-scale Ranking     82.5%       6263
Single-Scale & Visual Feature     84.1%       6765
DeepRanking                       85.7%       7004

Comparison of different sampling methods:

[Figure: two plots against the fraction of out-of-class negative samples (0 to 1), each comparing weighted sampling and uniform sampling. Left: Score at 30 (y-axis 6600-7200); right: overall precision (y-axis 0.83-0.87).]

RANKING EXAMPLES

[Figure: query images and their ranking results, comparing ConvNet, OASIS, and Deep Ranking.]

ACKNOWLEDGMENT

The work was done while the first author was working as an intern at Google.

DATA

High quality image triplet evaluation dataset, available at https://sites.google.com/site/imagesimilaritydata/