Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification

Andrea Frome, Yoram Singer, Fei Sha, Jitendra Malik

Nearest neighbor classification

$D(x, y)$, where $x$ and $y$ are the two images pictured on the slide

Learning a Distance Metric from Relative Comparisons

[Schultz & Joachims, NIPS ’03]

$D(a, b) < D(a, c)$

$D(x, y) = (x - y)^T (x - y)$
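As a point of reference, nearest neighbor classification with the plain squared Euclidean distance $(x - y)^T (x - y)$ can be sketched as follows. The vectors, labels, and function names are made up for illustration; real images would first be turned into feature vectors.

```python
import numpy as np

def squared_euclidean(x, y):
    """D(x, y) = (x - y)^T (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ diff)

def nearest_neighbor_label(query, train_vectors, train_labels):
    """Return the label of the training vector closest to `query`."""
    distances = [squared_euclidean(query, t) for t in train_vectors]
    return train_labels[int(np.argmin(distances))]

# Hypothetical toy data: three training vectors with class labels.
train_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_labels = ["cat", "cat", "dog"]
predicted = nearest_neighbor_label([4.0, 4.5], train_vectors, train_labels)
```

The rest of the talk replaces this fixed distance with one learned per training image.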

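Approach

• Compare images pairwise: $D_{ji}$ is the distance from image j to image i
• $d_{ji,m}$: elementary distance from the m-th feature of image j to image i
• $D_{ji} = \sum_m w_{j,m}\, d_{ji,m}$
• Constraint for images i, j, k: $D_{ji} < D_{ki}$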
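A minimal sketch of the weighted sum and the relative constraint, using invented numbers. The weights and elementary distances below are hypothetical, and each image is assumed to carry its own weight vector with one entry per feature.

```python
import numpy as np

def image_distance(w_j, d_ji):
    """D_ji = sum_m w_{j,m} * d_{ji,m}: distance from image j to image i."""
    return float(np.dot(w_j, d_ji))

# Hypothetical numbers: image j has 3 features, image k has 2.
w_j = np.array([0.5, 2.0, 0.0])    # weights for the features of image j
d_ji = np.array([0.2, 0.1, 0.9])   # elementary distances from j's features to image i
w_k = np.array([1.0, 1.0])         # weights for the features of image k
d_ki = np.array([0.4, 0.3])        # elementary distances from k's features to image i

D_ji = image_distance(w_j, d_ji)   # 0.5*0.2 + 2.0*0.1 + 0.0*0.9
D_ki = image_distance(w_k, d_ki)   # 1.0*0.4 + 1.0*0.3
constraint_satisfied = D_ji < D_ki
```

Note that a zero weight (the third feature of image j here) removes a feature from the distance entirely, however large its elementary distance is.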

Core

• Which weights $w_{j,m}$ for image j make $D_{ji} < D_{ki}$ hold?

Derivations

• Notation
• Large-margin formulation
• Dual problem
• Solution

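Notation

$D_{ji} = \sum_m w_{j,m}\, d_{ji,m} \;\Rightarrow\; D_{ji} = w_j \cdot d_{ji}$

$D_{ki} > D_{ji} \;\Rightarrow\; w_k \cdot d_{ki} > w_j \cdot d_{ji}$

For triplet $(i, j, k)$: $w_k \cdot d_{ki} - w_j \cdot d_{ji} \ge 1$

$W = [\, w_1, w_2, \dots, w_k, \dots, w_j, \dots \,]$

$X_{ijk} = [\, 0, 0, \dots, d_{ki}, \dots, -d_{ji}, \dots \,]$

$w_k \cdot d_{ki} - w_j \cdot d_{ji} \ge 1 \;\Rightarrow\; W \cdot X_{ijk} \ge 1$

$\sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$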
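The stacking trick can be checked numerically. The sketch below builds one long vector W from per-image weight vectors and a sparse vector for a single triplet, then confirms that the inner product equals $w_k \cdot d_{ki} - w_j \cdot d_{ji}$. The helper `build_X`, the image indices, and all numbers are hypothetical.

```python
import numpy as np

def build_X(num_features, j, k, d_ji, d_ki):
    """Build X_ijk: d_ki in image k's block, -d_ji in image j's block, zeros elsewhere.

    num_features[n] is the number of features (and weights) of image n.
    """
    offsets = np.concatenate(([0], np.cumsum(num_features)))
    X = np.zeros(offsets[-1])
    X[offsets[k]:offsets[k + 1]] = d_ki
    X[offsets[j]:offsets[j + 1]] = -np.asarray(d_ji)
    return X

num_features = [2, 3, 2]   # hypothetical: images 0, 1 and 2
w = [np.array([1.0, 1.0]), np.array([0.5, 2.0, 0.0]), np.array([1.0, 3.0])]
W = np.concatenate(w)      # all per-image weight vectors stacked into one

j, k = 1, 2
d_ji = np.array([0.2, 0.1, 0.9])
d_ki = np.array([0.4, 0.3])
X_ijk = build_X(num_features, j, k, d_ji, d_ki)

lhs = float(W @ X_ijk)                     # W . X_ijk
rhs = float(w[k] @ d_ki - w[j] @ d_ji)     # w_k . d_ki - w_j . d_ji
hinge = max(0.0, 1.0 - lhs)                # [1 - W . X_ijk]_+
```

With these numbers the margin is exactly 1, so the hinge term is zero up to rounding.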

Large-margin formulation

$\sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$

$\frac{1}{2}\|W\|^2 + C \sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$

Soft-margin SVM
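To make the objective concrete, here is a small sketch that evaluates the regularizer plus C times the summed hinge loss for a given W. The matrix X holds one triplet vector per row; the values are invented for illustration.

```python
import numpy as np

def soft_margin_objective(W, X, C):
    """0.5 * ||W||^2 + C * sum over triplets of [1 - W . X_ijk]_+ (one row of X per triplet)."""
    margins = X @ W
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * float(W @ W) + C * float(hinge.sum())

W = np.array([1.0, 2.0])
X = np.array([[1.0, 1.0],     # margin 3.0  -> loss 0
              [0.5, 0.0],     # margin 0.5  -> loss 0.5
              [-1.0, 0.0]])   # margin -1.0 -> loss 2.0
objective = soft_margin_objective(W, X, C=10.0)
```

Only triplets with margin below 1 contribute to the loss term, which is why easy triplets can be dropped without changing the solution much.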

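Derivation

$L(W, \xi, \alpha, \lambda, \mu) = \frac{1}{2}\|W\|^2 + C \sum_{ijk} \xi_{ijk} + \sum_{ijk} \alpha_{ijk} (1 - \xi_{ijk} - W \cdot X_{ijk}) - \sum_{ijk} \lambda_{ijk} \xi_{ijk} - \mu \cdot W$

with multipliers $\alpha_{ijk} \ge 0$ (margin constraints), $\lambda_{ijk} \ge 0$ (slacks $\xi_{ijk} \ge 0$) and $\mu \ge 0$ (weights $W \ge 0$)

$\frac{\partial L}{\partial \xi_{ijk}} = C - \alpha_{ijk} - \lambda_{ijk} = 0 \;\Rightarrow\; \alpha_{ijk} \le C$

$\frac{\partial L}{\partial W} = W - \sum_{ijk} \alpha_{ijk} X_{ijk} - \mu = 0 \;\Rightarrow\; W = \sum_{ijk} \alpha_{ijk} X_{ijk} + \mu$

Dual

$F(\alpha_{abc}) = \alpha_{abc} - \frac{1}{2} \left\| \alpha_{abc} X_{abc} + \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right\|^2$ (terms constant in $\alpha_{abc}$ dropped)

$\frac{\partial F}{\partial \alpha_{abc}} = 1 - \left( \alpha_{abc} X_{abc} + \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right) \cdot X_{abc} = 0$

$\alpha_{abc} = \frac{1 - \left( \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right) \cdot X_{abc}}{\|X_{abc}\|^2} = \frac{1 - \left( \sum_{ijk} \alpha_{ijk} X_{ijk} - \alpha_{abc} X_{abc} + \mu \right) \cdot X_{abc}}{\|X_{abc}\|^2}$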
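A closed-form update for one dual variable at a time suggests a coordinate ascent loop. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes the dual to be $\sum \alpha_{ijk} - \frac{1}{2}\|\sum \alpha_{ijk} X_{ijk} + \mu\|^2$ with $0 \le \alpha_{ijk} \le C$ and $\mu \ge 0$ (the multiplier for non-negative weights), clips each update to the box, and resets $\mu$ to its best value $\max(-\sum \alpha_{ijk} X_{ijk}, 0)$ after every step. The function name, sweep count, and toy data are hypothetical.

```python
import numpy as np

def dual_coordinate_ascent(X, C, sweeps=50):
    """Maximise sum(alpha) - 0.5 * ||X^T alpha + mu||^2 over 0 <= alpha <= C, mu >= 0."""
    n, dim = X.shape
    alpha = np.zeros(n)
    V = np.zeros(dim)                        # running value of sum_ijk alpha_ijk X_ijk
    mu = np.zeros(dim)
    sq_norms = np.einsum("ij,ij->i", X, X)   # ||X_abc||^2 for every triplet
    for _ in range(sweeps):
        for t in range(n):
            if sq_norms[t] == 0.0:
                continue
            rest = V - alpha[t] * X[t] + mu              # everything except triplet t
            new_alpha = (1.0 - rest @ X[t]) / sq_norms[t]
            new_alpha = min(max(new_alpha, 0.0), C)      # box constraint 0 <= alpha <= C
            V += (new_alpha - alpha[t]) * X[t]
            alpha[t] = new_alpha
            mu = np.maximum(-V, 0.0)                     # best mu >= 0 for the current alpha
    W = V + mu                                           # equals max(V, 0), so W >= 0
    return W, alpha, mu

# Hypothetical toy problem: three triplets over a 3-dimensional W.
X = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -0.5],
              [0.5, 0.0, 1.0]])
W, alpha, mu = dual_coordinate_ascent(X, C=1.0)
```

Keeping the running sum V avoids recomputing the full sum over triplets at every step, so one update costs time proportional to the length of W.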

Details – Features and descriptors

• Find ~400 features per image
• Compute geometric blur descriptor

Descriptors

• Geometric blur

Descriptors

• Two sizes of geometric blur (42 pixels and 70 pixels)
– Each is 204 dimensions (4 orientations and 51 samples each)

• HSV histograms of 42-pixel patches

Choosing triplets

• Caltech101 – at 15 images per class
– 31.8 million triplets
– Many are easy to satisfy

• For each image j, for each feature
– Find the N images I with closest features
– For each negative example i in I, form triplets (j, k, i)

• Eliminates ~ half of triplets
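The triplet count and the pruning step can be sketched as below. The count assumes 101 categories and that every image is paired with each same-class image and each different-class image, which reproduces the 31.8 million figure. The selection function is a hypothetical simplification: it takes the set I of close images as given (pooled over features) and pairs every same-class image k with every negative i found in I.

```python
from itertools import product

def count_all_triplets(num_classes, per_class):
    """Every (reference, same-class, different-class) combination."""
    num_images = num_classes * per_class
    positives = per_class - 1             # other images of the same class
    negatives = num_images - per_class    # images of all other classes
    return num_images * positives * negatives

def select_triplets(j, labels, closest_images):
    """Triplets (j, k, i) for focal image j.

    closest_images is the set I of image indices whose features are closest
    to the features of j (assumed to be computed elsewhere).
    k ranges over images of j's class, i over negatives that appear in I.
    """
    positives = [k for k, lab in enumerate(labels) if lab == labels[j] and k != j]
    hard_negatives = [i for i in sorted(closest_images) if labels[i] != labels[j]]
    return [(j, k, i) for k, i in product(positives, hard_negatives)]

total = count_all_triplets(num_classes=101, per_class=15)   # 1515 * 14 * 1500
labels = ["a", "a", "a", "b", "b", "c"]                     # hypothetical class labels
triplets = select_triplets(0, labels, closest_images={1, 3, 5})
```

Negatives that never appear among the close images (image 4 here) produce no triplets, which is where the savings come from.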

Choosing C

• Train with multiple values of C, testing on a held-out part of the training set
• Choose whichever gives the best results

• For each C, run online version of the training algorithm
– Make one sweep through training triplets
– For each misclassified triplet (i,j,k), update weights for the three images
– Choose C which gets the most right answers
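The selection loop might look like the sketch below. The slides do not give the online update rule, so this uses a passive-aggressive style step whose size is capped by C and projects the weights back to non-negative values; that choice, the function names, and the toy triplet vectors are all assumptions made for illustration.

```python
import numpy as np

def one_sweep(X_train, C):
    """One online pass over the training triplets (one row of X_train per triplet)."""
    W = np.zeros(X_train.shape[1])
    for x in X_train:
        sq = float(x @ x)
        if W @ x <= 0.0 and sq > 0.0:              # triplet is misclassified
            tau = min(C, (1.0 - W @ x) / sq)       # step size capped by C (assumed rule)
            W = np.maximum(W + tau * x, 0.0)       # keep the weights non-negative
    return W

def holdout_accuracy(W, X_holdout):
    """Fraction of held-out triplets ranked correctly, i.e. W . X_ijk > 0."""
    return float(np.mean(X_holdout @ W > 0.0))

def choose_C(X_train, X_holdout, candidates):
    """Return the candidate C with the most correct held-out triplets, plus all scores."""
    scores = {C: holdout_accuracy(one_sweep(X_train, C), X_holdout) for C in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical triplet vectors.
X_train = np.array([[1.0, -0.5], [0.5, 0.5]])
X_holdout = np.array([[1.0, -1.0], [0.2, 0.1]])
candidates = [0.01, 1.0]
best_C, scores = choose_C(X_train, X_holdout, candidates)
```

A single sweep is much cheaper than solving the full problem for every C, at the cost of ranking the candidates with only an approximate solution.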

Results

• At 15 training examples per class: 63.2% (~3% improvement)
• At 20 training examples per class: 66.6% (~5% improvement)

Results

• Confusion matrix

Hardest categories: crocodile, cougar_body, cannon, bass

Questions

• Is there any disadvantage to a non-metric distance function?

• Could the images be embedded in a metric space?
• Why not learn everything?
– Include a feature for each image pixel
– Include multiple types of descriptors

• Could this be used to do unsupervised learning for sets of tagged images (e.g., for image segmentation)?

• Can you learn a single distance per class?
