Andrea Frome, Yoram Singer, Fei Sha, Jitendra Malik
Learning Globally-Consistent Local Distance Functions for Shape-Based
Image Retrieval and Classification
Andrea Frome, Yoram Singer, Fei Sha, Jitendra Malik
Goal
Nearest neighbor classification
$D(x, y)$
Learning a Distance Metric from Relative Comparisons
[Schultz & Joachims, NIPS ’03]
$D(x, y) < D(x, z)$
$D(x, y) = (x - y)^T (x - y)$
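A minimal sketch of nearest neighbor classification with a pluggable distance D. The squared Euclidean distance, the toy vectors, and the names `sq_euclidean` and `nearest_neighbor_label` are illustrative assumptions, not taken from the slides.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def sq_euclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance (x - y)^T (x - y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest_neighbor_label(
    query: Vector,
    training: List[Tuple[Vector, str]],
    dist: Callable[[Vector, Vector], float] = sq_euclidean,
) -> str:
    """Return the label of the training example closest to the query."""
    _, best_label = min(training, key=lambda item: dist(query, item[0]))
    return best_label


# Hypothetical toy data: (feature vector, class label).
training_set = [([0.0, 0.0], "cat"), ([1.0, 1.0], "dog"), ([0.9, 0.2], "dog")]
predicted = nearest_neighbor_label([0.1, 0.2], training_set)
```

Any other distance, including a learned one, could be passed as `dist` without changing the classifier.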
Approach
image i, image j, image k
$d_{ji,m}$
$D_{ji} = \sum_m w_{j,m} d_{ji,m}$
$D_{ji} < D_{ki}$
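The weighted sum and the comparison between the two distances can be sketched as follows. All numbers are made-up toy values and `image_distance` is a hypothetical helper name. Following the Notation slide, D_ki is computed with image k's own weights.

```python
from typing import Sequence


def image_distance(weights: Sequence[float], elem_dists: Sequence[float]) -> float:
    """D_ji = sum over m of w_{j,m} * d_{ji,m}."""
    if len(weights) != len(elem_dists):
        raise ValueError("need one weight per elementary distance")
    return sum(w * d for w, d in zip(weights, elem_dists))


# Toy values (assumptions): image j has 3 features, image k has 3 features.
w_j = [0.5, 2.0, 0.0]   # weights of image j's features
d_ji = [1.0, 0.2, 3.0]  # elementary distances from j's features to image i
w_k = [1.0, 1.0, 1.0]   # weights of image k's features
d_ki = [0.9, 0.8, 0.7]  # elementary distances from k's features to image i

D_ji = image_distance(w_j, d_ji)
D_ki = image_distance(w_k, d_ki)
triplet_satisfied = D_ji < D_ki  # the relation the approach wants to hold
```

A zero weight, as for the third feature of image j here, removes that feature from the distance entirely.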
Core
$w_{j,m}$ ?
image i, image j, image k
$D_{ji} < D_{ki}$
Derivations
• Notation
• Large-margin formulation
• Dual problem
• Solution
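Notation
$D_{ji} = \sum_m w_{j,m} d_{ji,m} \;\Rightarrow\; D_{ji} = w_j \cdot d_{ji}$
$D_{ki} > D_{ji} \;\Rightarrow\; w_k \cdot d_{ki} > w_j \cdot d_{ji}$
for triplet i, j, k
$w_k \cdot d_{ki} - w_j \cdot d_{ji} \ge 1$
$W = [\, w_1 \;\; w_2 \;\; \dots \;\; w_k \;\; \dots \;\; w_j \;\; \dots \,]$
$X_{ijk} = [\, 0 \;\; 0 \;\; \dots \;\; d_{ki} \;\; \dots \;\; -d_{ji} \;\; \dots \,]$
$w_k \cdot d_{ki} - w_j \cdot d_{ji} \ge 1 \;\Rightarrow\; W \cdot X_{ijk} \ge 1$
$\sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$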
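To make the stacking concrete, here is a small sketch that builds W and one X_ijk and checks that W · X_ijk equals w_k · d_ki - w_j · d_ji. The image ids, weights, and helper names are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def block_offsets(weights_by_image: Dict[int, List[float]]) -> Dict[int, int]:
    """Start position of each image's weight block inside the stacked vector W."""
    offsets, position = {}, 0
    for image_id in sorted(weights_by_image):
        offsets[image_id] = position
        position += len(weights_by_image[image_id])
    return offsets


def stack_weights(weights_by_image: Dict[int, List[float]]) -> List[float]:
    """W = [w_1 w_2 ...]: concatenate the per-image weight vectors."""
    return [w for image_id in sorted(weights_by_image) for w in weights_by_image[image_id]]


def triplet_vector(offsets, total_len, j, k, d_ji, d_ki) -> List[float]:
    """X_ijk: d_ki in image k's block, -d_ji in image j's block, zeros elsewhere."""
    x = [0.0] * total_len
    for m, d in enumerate(d_ki):
        x[offsets[k] + m] = d
    for m, d in enumerate(d_ji):
        x[offsets[j] + m] = -d
    return x


# Toy data (assumptions): image i = 0, image j = 1, image k = 2.
weights = {0: [0.3], 1: [0.5, 2.0], 2: [1.0, 1.0, 1.0]}
d_ji = [1.0, 0.2]
d_ki = [0.9, 0.8, 0.7]

W = stack_weights(weights)
X_ijk = triplet_vector(block_offsets(weights), len(W), j=1, k=2, d_ji=d_ji, d_ki=d_ki)
stacked_margin = dot(W, X_ijk)
direct_margin = dot(weights[2], d_ki) - dot(weights[1], d_ji)
```

Because X_ijk is zero outside the blocks of images j and k, a sparse representation would likely be preferable at realistic sizes.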
Large-margin formulation
$\sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$
$\frac{1}{2} \|W\|^2 + C \sum_{i,j,k} [\, 1 - W \cdot X_{ijk} \,]_+$
Soft-margin SVM
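The regularized hinge-loss objective can be evaluated directly. This is a sketch with made-up vectors. `objective` is a hypothetical name, and the value of C is arbitrary.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def objective(W: Sequence[float], triplet_vectors: List[Sequence[float]], C: float) -> float:
    """0.5 * ||W||^2 + C * sum over triplets of [1 - W . X_ijk]_+ ."""
    regularizer = 0.5 * dot(W, W)
    hinge_total = sum(max(0.0, 1.0 - dot(W, X)) for X in triplet_vectors)
    return regularizer + C * hinge_total


# Toy example (assumed values): the first triplet meets the margin, the second
# falls short of it by 0.5.
W = [1.0, 0.0]
Xs = [[2.0, 0.0], [0.5, 1.0]]
value = objective(W, Xs, C=10.0)  # 0.5 + 10 * (0 + 0.5)
```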
Derivation
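$\min_{W, \xi} \; \frac{1}{2}\|W\|^2 + C \sum_{ijk} \xi_{ijk}$ subject to $W \cdot X_{ijk} \ge 1 - \xi_{ijk}$, $\xi_{ijk} \ge 0$, $W \ge 0$
$\alpha_{ijk} \ge 0, \quad \lambda_{ijk} \ge 0, \quad \mu \ge 0$ (multipliers for the margin constraints, for $\xi_{ijk} \ge 0$, and for $W \ge 0$)
$L(W, \xi, \alpha, \lambda, \mu) = \frac{1}{2}\|W\|^2 + C \sum_{ijk} \xi_{ijk} + \sum_{ijk} \alpha_{ijk} (1 - \xi_{ijk} - W \cdot X_{ijk}) - \sum_{ijk} \lambda_{ijk} \xi_{ijk} - \mu \cdot W$
$\frac{\partial L}{\partial \xi_{ijk}} = C - \alpha_{ijk} - \lambda_{ijk} = 0 \;\Rightarrow\; 0 \le \alpha_{ijk} \le C$
$\frac{\partial L}{\partial W} = W - \sum_{ijk} \alpha_{ijk} X_{ijk} - \mu = 0 \;\Rightarrow\; W = \sum_{ijk} \alpha_{ijk} X_{ijk} + \mu$
Dual
$F(\alpha_{abc}) = -\frac{1}{2} \left\| \alpha_{abc} X_{abc} + \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right\|^2 + \alpha_{abc}$ (terms that do not depend on $\alpha_{abc}$ are dropped)
$\frac{\partial F}{\partial \alpha_{abc}} = 1 - \alpha_{abc} \|X_{abc}\|^2 - \left( \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right) \cdot X_{abc} = 0$
$\alpha_{abc} = \frac{1 - \left( \sum_{ijk \ne abc} \alpha_{ijk} X_{ijk} + \mu \right) \cdot X_{abc}}{\|X_{abc}\|^2} = \frac{1 - \left( \sum_{ijk} \alpha_{ijk} X_{ijk} + \mu - \alpha_{abc} X_{abc} \right) \cdot X_{abc}}{\|X_{abc}\|^2}$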
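One way to turn the per-multiplier solution on the Dual slide into a procedure is coordinate ascent: update one dual variable at a time and clip it to [0, C]. The sketch below rests on several assumptions that the slides do not state: non-negativity of W is enforced by taking the positive part of the running sum, the sweep count is fixed, and the toy triplet vectors are invented. It is an illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dual_coordinate_ascent(
    xs: List[Sequence[float]], C: float, sweeps: int = 20
) -> Tuple[List[float], List[float]]:
    """Return (W, alphas) after a fixed number of sweeps over the triplets."""
    alphas = [0.0] * len(xs)
    v = [0.0] * len(xs[0])  # running sum of alpha_ijk * X_ijk
    for _ in range(sweeps):
        for t, x in enumerate(xs):
            sq_norm = dot(x, x)
            if sq_norm == 0.0:
                continue  # an all-zero triplet vector carries no constraint
            # Assumption: W is the positive part of v (mu absorbs negatives).
            w = [max(0.0, c) for c in v]
            # Unconstrained optimum for this coordinate, then clip to [0, C].
            new_alpha = alphas[t] + (1.0 - dot(w, x)) / sq_norm
            new_alpha = min(C, max(0.0, new_alpha))
            delta = new_alpha - alphas[t]
            v = [c + delta * xc for c, xc in zip(v, x)]
            alphas[t] = new_alpha
    return [max(0.0, c) for c in v], alphas


# Toy triplet vectors (assumed), each playing the role of one X_abc.
toy_xs = [[1.0, -0.5], [0.5, 1.0]]
W, alphas = dual_coordinate_ascent(toy_xs, C=10.0)
margins = [dot(W, x) for x in toy_xs]
```

On this toy problem both margins reach 1 after a few sweeps. Convergence in general depends on how the W ≥ 0 constraint is handled, which this sketch simplifies.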
Details – Features and descriptors
• Find ~400 features per image
• Compute geometric blur descriptor
Descriptors
• Geometric blur
• Two sizes of geometric blur (42 pixels and 70 pixels)
– Each is 204 dimensions (4 orientations and 51 samples each)
• HSV histograms of 42-pixel patches
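As an illustration of the color descriptor, the following sketch computes a normalized HSV histogram for one patch using Python's standard `colorsys` module. The bin count, the choice of concatenating one histogram per channel, and the function name `hsv_histogram` are assumptions, since the slides do not specify how the histograms are built.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each in 0..255


def hsv_histogram(patch: Sequence[Sequence[Pixel]], bins: int = 4) -> List[float]:
    """Concatenated, normalized histograms of the H, S and V channels."""
    hist = [0.0] * (3 * bins)
    n_pixels = 0
    for row in patch:
        for r, g, b in row:
            hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for channel, value in enumerate(hsv):
                # Values lie in [0, 1]; the value 1.0 goes into the last bin.
                index = min(int(value * bins), bins - 1)
                hist[channel * bins + index] += 1.0
            n_pixels += 1
    if n_pixels == 0:
        return hist
    return [count / n_pixels for count in hist]


# A 2x2 pure-red patch stands in for a real 42-pixel patch.
red_patch = [[(255, 0, 0), (255, 0, 0)], [(255, 0, 0), (255, 0, 0)]]
red_hist = hsv_histogram(red_patch)
```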
Choosing triplets
• Caltech101 – at 15 images per class
– 31.8 million triplets
– Many are easy to satisfy
• For each image j, for each feature
– Find the N images I with closest features
– For each negative example i in I, form triplets (j, k, i)
• Eliminates ~ half of triplets
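The pruning rule can be sketched as below. The slide does not say which images play the role of k, so this sketch assumes k ranges over all other images with the same label as j. The data layout (`elem[j][m][i]` as a precomputed elementary distance) and all values are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

# elem[j] is a list with one entry per feature of image j; each entry maps a
# candidate image id to the elementary distance for that feature (assumed layout).
ElemDists = Dict[int, List[Dict[int, float]]]


def select_triplets(
    elem: ElemDists, labels: Dict[int, Hashable], n_closest: int
) -> Set[Tuple[int, int, int]]:
    """Keep triplets (j, k, i) only for negatives i that rank among the
    n_closest images for at least one feature of image j."""
    triplets = set()
    for j, per_feature in elem.items():
        # Assumption: k is any other image sharing j's label.
        positives = [k for k in labels if k != j and labels[k] == labels[j]]
        for dist_to_image in per_feature:
            closest = sorted(dist_to_image, key=dist_to_image.get)[:n_closest]
            for i in closest:
                if labels[i] != labels[j]:  # i is a negative example
                    for k in positives:
                        triplets.add((j, k, i))
    return triplets


# Toy data: images 0 and 1 share a class, images 2 and 3 form another class.
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
toy_elem = {0: [{1: 0.1, 2: 0.5, 3: 0.9}, {1: 0.3, 2: 0.2, 3: 0.8}]}
kept = select_triplets(toy_elem, toy_labels, n_closest=2)
```

Image 3 is never among the two closest images for any feature of image 0, so the triplet (0, 1, 3) is dropped as easy to satisfy.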
Choosing C
• Train with multiple values of C, testing on a held-out part of the training set
• Choose whichever gives the best results
• For each C, run online version of the training algorithm
– Make one sweep through training triplets
– For each misclassified triplet (i,j,k), update weights for the three images
– Choose C which gets the most right answers
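A possible shape for this selection loop is sketched here. The slides do not give the online update rule, so the sketch borrows a clipped per-triplet step as a stand-in, treats a triplet as misclassified when its margin is below 1, and counts a held-out triplet as right when W · X > 0. The candidate values of C and the toy vectors are made up.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def one_sweep(train_xs: List[Sequence[float]], C: float) -> List[float]:
    """Single online pass; the step for each triplet is capped at C (assumed rule)."""
    v = [0.0] * len(train_xs[0])
    for x in train_xs:
        w = [max(0.0, c) for c in v]  # keep weights non-negative
        margin = dot(w, x)
        sq_norm = dot(x, x)
        if margin >= 1.0 or sq_norm == 0.0:
            continue  # triplet already satisfied with margin: no update
        step = min(C, (1.0 - margin) / sq_norm)
        v = [c + step * xc for c, xc in zip(v, x)]
    return [max(0.0, c) for c in v]


def count_right(w: Sequence[float], xs: List[Sequence[float]]) -> int:
    """Number of triplets ordered correctly, i.e. with W . X > 0."""
    return sum(1 for x in xs if dot(w, x) > 0.0)


def choose_c(
    train_xs: List[Sequence[float]],
    heldout_xs: List[Sequence[float]],
    candidates: Sequence[float],
) -> Tuple[float, Dict[float, int]]:
    scores = {c: count_right(one_sweep(train_xs, c), heldout_xs) for c in candidates}
    best = max(candidates, key=lambda c: scores[c])  # first best value wins ties
    return best, scores


# Toy triplet vectors (assumed) for the training and held-out splits.
train = [[1.0, -0.5], [0.5, 1.0]]
heldout = [[1.0, 0.2], [-0.2, 1.0]]
best_c, scores = choose_c(train, heldout, candidates=[0.01, 1.0, 10.0])
```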
Results
• At 15 training examples per class: 63.2% (~3% improvement)
• At 20 training examples per class: 66.6% (~5% improvement)
• Confusion matrix
– Hardest categories: crocodile, cougar_body, cannon, bass
Questions
• Is there any disadvantage to a non-metric distance function?
• Could the images be embedded in a metric space?
• Why not learn everything?
– Include a feature for each image pixel
– Include multiple types of descriptors
• Could this be used to do unsupervised learning for sets of tagged images (e.g., for image segmentation)?
• Can you learn a single distance per class?