Near Duplicate Image Detection: min-Hash and tf-idf weighting Ondřej Chum Center for Machine Perception Czech Technical University in Prague co-authors: James Philbin and Andrew Zisserman

Page 1:

Near Duplicate Image Detection: min-Hash and tf-idf weighting

Ondřej Chum

Center for Machine Perception

Czech Technical University in Prague

co-authors: James Philbin and Andrew Zisserman

Page 2:

Outline

• Near duplicate detection and large databases (find all groups of near duplicate images in a database)
• min-Hash review
• Novel similarity measures
• Results on TrecVid 2006
• Results on the University of Kentucky database (Nister & Stewenius)
• Beyond near duplicates

Page 3:

Scalable Near Duplicate Image Detection

• Images perceptually (almost) identical but not identical (noise, compression level, small motion, small occlusion)
• Similar images of the same object / scene
• Large databases
• Fast – linear in the number of duplicates
• Store small constant amount of data per image

Page 4:

Image Representation

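[Figure: image representation pipeline]

Feature detector → SIFT descriptor [Lowe’04] → vector quantization against a visual vocabulary

Bag of words: word frequencies, e.g. (0, 4, 0, 2, ...)

Set of words: word presence only, e.g. (0, 1, 0, 1, ...)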
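The last two steps of this pipeline can be sketched as follows, assuming descriptors are already extracted and the visual vocabulary is a list of centroids. The 2-D toy descriptors, the four-word vocabulary and the `quantize` helper are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def quantize(descriptor, vocabulary):
    """Return the index of the nearest vocabulary centroid (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda w: sum((d - c) ** 2 for d, c in zip(descriptor, vocabulary[w])),
    )

# Hypothetical toy data: 2-D "descriptors" stand in for 128-D SIFT vectors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
descriptors = [(0.9, 0.1), (1.1, -0.1), (0.8, 0.2), (1.0, 0.0), (0.9, 1.2), (1.1, 0.9)]

words = [quantize(d, vocabulary) for d in descriptors]
counts = Counter(words)

# Bag of words keeps the frequency of each visual word ...
bag_of_words = [counts.get(w, 0) for w in range(len(vocabulary))]
# ... while the set of words keeps only its presence.
set_of_words = [1 if c > 0 else 0 for c in bag_of_words]

print(bag_of_words)
print(set_of_words)
```

The toy data is chosen so that the two vectors start like the ones shown on the slide.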

Page 5:

min-Hash

Image similarity is measured as a set overlap (using the min-Hash algorithm):

$\mathrm{sim}(A_1, A_2) = \frac{|A_1 \cap A_2|}{|A_1 \cup A_2|}$

Spatially related images share visual words.

Min-Hash is a locality sensitive hashing (LSH) function $m$ that selects an element $m(A_1)$ from set $A_1$ and an element $m(A_2)$ from set $A_2$ so that

$P\{m(A_1) = m(A_2)\} = \mathrm{sim}(A_1, A_2)$
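A small simulation can illustrate this property: the fraction of independent hash functions on which two sets share a min-Hash approaches their set overlap. This is a minimal sketch; the integer vocabulary, the two sets, the seed and the number of hash functions are arbitrary assumptions chosen for the demonstration.

```python
import random

def set_overlap(a1, a2):
    """Exact set overlap |A1 ∩ A2| / |A1 ∪ A2|."""
    return len(a1 & a2) / len(a1 | a2)

def random_hash(vocabulary, rng):
    """One hash function: an independent Un(0,1) value per word."""
    return {word: rng.random() for word in vocabulary}

def min_hash(word_set, hash_values):
    """The word of the set with the smallest hash value."""
    return min(word_set, key=hash_values.__getitem__)

rng = random.Random(0)            # fixed seed, for repeatability only
vocabulary = range(200)
a1 = set(range(0, 120))
a2 = set(range(60, 180))

n_hashes = 2000
hashes = [random_hash(vocabulary, rng) for _ in range(n_hashes)]
agreements = sum(min_hash(a1, h) == min_hash(a2, h) for h in hashes)

exact = set_overlap(a1, a2)
estimate = agreements / n_hashes
print(f"exact overlap {exact:.3f}, min-Hash estimate {estimate:.3f}")
```

In practice only the min-Hashes are stored per image, which is what keeps the per-image data small and constant.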

Page 6:

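min-Hash

Vocabulary: A B C D E F

Set A = {A, B, C}    Set B = {B, C, D}    Set C = {A, E, F}

Each hash function assigns every word an independent random value ~ Un(0,1). Ranking the values gives an ordering of the vocabulary, and the min-Hash of a set is its first word in that ordering.

      Ordering (A B C D E F)    min-Hash: Set A   Set B   Set C
f1:   3 6 2 5 4 1                         C       C       F
f2:   1 2 6 3 5 4                         A       B       A
f3:   3 2 1 6 4 5                         C       C       A
f4:   4 3 5 6 1 2                         B       B       E

Example values ~ Un(0,1) for A..F: 0.19 0.31 0.94 0.55 0.88 0.63, which rank as 1 2 6 3 5 4 (f2). A second row of values shown on the slide: 0.07 0.75 0.59 0.22 0.90 0.41.

Estimated overlap (exact overlap in parentheses):

overlap (A,B) = 3/4 (1/2)
overlap (A,C) = 1/4 (1/5)
overlap (B,C) = 0 (0)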
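The example on this slide can be replayed in a few lines. The four orderings below are an assumption: they are read as ranks of the words A..F, in the order that reproduces the min-Hashes listed on the slide (C C F, A B A, C C A, B B E).

```python
from fractions import Fraction

VOCAB = "ABCDEF"
# Assumed rank of each word A..F under the four hash functions.
ORDERINGS = {
    "f1": (3, 6, 2, 5, 4, 1),
    "f2": (1, 2, 6, 3, 5, 4),
    "f3": (3, 2, 1, 6, 4, 5),
    "f4": (4, 3, 5, 6, 1, 2),
}
SETS = {"A": set("ABC"), "B": set("BCD"), "C": set("AEF")}

def min_hash(word_set, ordering):
    """Word of the set that comes first in the given ordering."""
    rank = dict(zip(VOCAB, ordering))
    return min(word_set, key=rank.__getitem__)

# One min-Hash per hash function for every set.
signatures = {
    name: [min_hash(words, ordering) for ordering in ORDERINGS.values()]
    for name, words in SETS.items()
}

def estimated_overlap(x, y):
    """Fraction of hash functions on which the two sets share a min-Hash."""
    agree = sum(p == q for p, q in zip(signatures[x], signatures[y]))
    return Fraction(agree, len(ORDERINGS))

def exact_overlap(x, y):
    return Fraction(len(SETS[x] & SETS[y]), len(SETS[x] | SETS[y]))

for x, y in (("A", "B"), ("A", "C"), ("B", "C")):
    print(f"overlap({x},{y}) = {estimated_overlap(x, y)} ({exact_overlap(x, y)})")
```

With only four hash functions the estimate is coarse, which is why the experiments later in the talk use hundreds of min-Hashes.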

Page 7:

min-Hash Retrieval

[Figure: two min-Hash sequences (Q A E V Y J ... and C A E V Q Z ...) grouped into sketches and stored in k hash tables; a sketch collision between images A and B is marked]

A sketch is an s-tuple of min-Hashes.

s – size of the sketch
k – number of hash tables

Probability of sketch collision: $\mathrm{sim}(A, B)^s$

Probability of retrieval (at least one sketch collision): $1 - (1 - \mathrm{sim}(A, B)^s)^k$
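The retrieval step can be sketched with one hash table per sketch position: two images become a candidate pair as soon as they fall into the same bucket of any table. The code below is a hedged illustration, not the authors' implementation. The grouping of consecutive min-Hashes into sketches, the choice s = 2, and the third image "C" are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def sketches(min_hashes, s):
    """Split a list of min-Hashes into consecutive s-tuples, one per hash table."""
    return [tuple(min_hashes[i:i + s]) for i in range(0, len(min_hashes) - s + 1, s)]

def candidate_pairs(signatures, s):
    """Return all image pairs with at least one sketch collision."""
    tables = defaultdict(list)            # (table index, sketch) -> image ids
    for image_id, min_hashes in signatures.items():
        for table, sketch in enumerate(sketches(min_hashes, s)):
            tables[(table, sketch)].append(image_id)
    pairs = set()
    for bucket in tables.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs

signatures = {
    "A": ["Q", "A", "E", "V", "Y", "J"],
    "B": ["C", "A", "E", "V", "Q", "Z"],
    "C": ["Q", "K", "E", "T", "Y", "M"],   # hypothetical unrelated image
}
print(candidate_pairs(signatures, s=2))
```

Keying the buckets by table index as well as by sketch keeps sketches from different tables from colliding with each other. A larger s makes each collision more selective, and a larger k raises the chance that similar images collide at least once.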

Page 8:

Probability of Retrieving an Image Pair

[Plot: probability of retrieval against similarity (set overlap) for s = 3, k = 512; regions marked for unrelated images, images of the same object, and near duplicate images]
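The shape of this curve can be tabulated directly from the two probabilities, using sim^s for a single sketch collision and 1 - (1 - sim^s)^k for at least one collision in k tables. The sample similarity values below are arbitrary assumptions; s = 3 and k = 512 are the settings named on the slide.

```python
def p_sketch_collision(sim, s):
    """Probability that all s min-Hashes of one sketch agree."""
    return sim ** s

def p_retrieval(sim, s, k):
    """Probability of at least one sketch collision among k sketches."""
    return 1.0 - (1.0 - p_sketch_collision(sim, s)) ** k

for sim in (0.05, 0.1, 0.2, 0.35, 0.5):
    print(f"sim = {sim:.2f}   P(retrieval) = {p_retrieval(sim, s=3, k=512):.4f}")
```

The steep rise between low and moderate similarity is what separates unrelated pairs from near duplicates.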

Page 9:

More Complex Similarity Measures

Page 10:

Document / Image / Object Retrieval

Term Frequency – Inverse Document Frequency (tf-idf) weighting scheme

• Term frequency: the frequency of the words is recorded, t = (0, 4, 0, 2, ...) (good for repeated structures, textures, etc.)

• Inverse document frequency: words common to many documents are less informative

$\mathrm{idf}_w = \log \frac{\#\,\text{documents}}{\#\,\text{docs containing } X_w}$

[1] Baeza-Yates, Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[2] Sivic, Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV’03.
[3] Nister, Stewenius. Scalable recognition with a vocabulary tree. CVPR’06.
[4] Philbin, Chum, Isard, Sivic, Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR’07.
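A minimal sketch of the idf weight, computed as the logarithm of the number of documents divided by the number of documents containing the word. The four toy documents and the `idf_weights` helper name are assumptions for illustration.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Map each word to log(N / N_w): N documents, N_w of them containing the word."""
    docs = [set(d) for d in documents]          # presence only, duplicates ignored
    n_docs = len(docs)
    doc_freq = Counter(word for d in docs for word in d)
    return {word: math.log(n_docs / n_w) for word, n_w in doc_freq.items()}

docs = [["A", "B", "C"], ["B", "C", "D"], ["A", "E", "F"], ["C"]]
idf = idf_weights(docs)

for word in sorted(idf):
    print(word, round(idf[word], 3))
```

Word C occurs in three of the four documents and gets the lowest weight, while words seen in a single document get the highest.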

Page 11:

More Complex Similarity Measures

Weighted set overlap:
• Set of words representation
• Different importance of visual words: importance dw of word Xw

Weighted histogram intersection:
• Bag of words representation (frequency is recorded)
• Histogram intersection similarity measure
• Different importance of visual words: importance dw of word Xw
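The formulas for these measures did not survive in the transcript. The sketch below assumes the natural weighted forms: the weighted set overlap divides the summed importance dw of the shared words by that of all words in the union, and the weighted histogram intersection divides the sum of dw·min(t1w, t2w) by the sum of dw·max(t1w, t2w). Function names and the toy weights are illustrative assumptions.

```python
def weighted_set_overlap(a1, a2, d):
    """Sum of d_w over the intersection divided by the sum of d_w over the union."""
    union = a1 | a2
    if not union:
        return 0.0
    return sum(d[w] for w in a1 & a2) / sum(d[w] for w in union)

def weighted_histogram_intersection(t1, t2, d):
    """Sum of d_w * min(t1_w, t2_w) divided by the sum of d_w * max(t1_w, t2_w)."""
    words = set(t1) | set(t2)
    num = sum(d[w] * min(t1.get(w, 0), t2.get(w, 0)) for w in words)
    den = sum(d[w] * max(t1.get(w, 0), t2.get(w, 0)) for w in words)
    return num / den if den else 0.0

# Hypothetical importance weights, e.g. idf values.
d = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.0}
unit = dict.fromkeys(d, 1.0)

set_a, set_b = {"A", "B", "C"}, {"B", "C", "D"}
hist_a = {"A": 2, "B": 1, "C": 3}
hist_b = {"B": 2, "C": 3, "D": 1}

print(weighted_set_overlap(set_a, set_b, unit))   # plain set overlap
print(weighted_set_overlap(set_a, set_b, d))
print(weighted_histogram_intersection(hist_a, hist_b, unit))
print(weighted_histogram_intersection(hist_a, hist_b, d))
```

With unit weights both functions reduce to the unweighted set overlap and histogram intersection.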

Page 12:

Word Weighting for min-Hash

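For a hash function with values ~ Un(0,1) (set overlap similarity), all words Xw have the same chance to be a min-Hash.

For the weighted hash function, the probability of Xw being a min-Hash is proportional to dw.

[Figure: the words of A ∪ B (A, C, E, J, Q, R, V, Y, Z) labelled with their weights dA, dC, dE, dJ, dQ, dR, dV, dY, dZ]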
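One construction with this property, assumed here because the slide's formula is not preserved in the transcript, gives word Xw the hash value -log(x) / dw with x ~ Un(0,1). The values are then exponentially distributed with rate dw, so the smallest one belongs to Xw with probability dw divided by the sum of all weights in the set. The weights, the seed and the helper names below are illustrative.

```python
import math
import random
from collections import Counter

def make_weighted_hash(weights, rng):
    """One hash function: value -log(x) / d_w per word, with x drawn from (0, 1]."""
    return {w: -math.log(1.0 - rng.random()) / d for w, d in weights.items()}

def min_hash(word_set, hash_values):
    """The word of the set with the smallest hash value."""
    return min(word_set, key=hash_values.__getitem__)

weights = {"A": 1.0, "C": 2.0, "E": 0.5, "J": 0.5}   # hypothetical importances d_w
rng = random.Random(1)                               # fixed seed, for repeatability only

n_hashes = 20000
counts = Counter()
for _ in range(n_hashes):
    h = make_weighted_hash(weights, rng)
    counts[min_hash(set(weights), h)] += 1

total = sum(weights.values())
for w, d in weights.items():
    print(f"{w}: observed {counts[w] / n_hashes:.3f}, expected {d / total:.3f}")
```

Using 1 - rng.random() keeps x strictly positive, so the logarithm is always defined.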

Page 13:

Histogram Intersection Using min-Hash

Idea: represent a histogram as a set, use min-Hash set machinery

Visual words: A B C D

tA = (2, 1, 3, 0)    tB = (0, 2, 3, 1)

Bag of words A / set A’: A1 A2 B1 C1 C2 C3

Bag of words B / set B’: B1 B2 C1 C2 C3 D1

min-Hash vocabulary: A1 A2 B1 B2 C1 C2 C3 D1

A’ ∪ B’: A1 A2 B1 B2 C1 C2 C3 D1

Set overlap of A’ and B’ is a histogram intersection of A and B
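The equivalence can be checked on the slide's own histograms: each occurrence of a word becomes a distinct element (A1, A2, ...), after which the ordinary set overlap equals the histogram intersection normalised by the sum of per-word maxima. The helper name `histogram_to_set` is an assumption.

```python
def histogram_to_set(hist):
    """Turn {word: count} into a set of (word, occurrence index) elements."""
    return {(word, i) for word, count in hist.items() for i in range(1, count + 1)}

t_a = {"A": 2, "B": 1, "C": 3, "D": 0}
t_b = {"A": 0, "B": 2, "C": 3, "D": 1}

set_a = histogram_to_set(t_a)      # A1 A2 B1 C1 C2 C3
set_b = histogram_to_set(t_b)      # B1 B2 C1 C2 C3 D1

set_overlap = len(set_a & set_b) / len(set_a | set_b)
hist_intersection = (
    sum(min(t_a[w], t_b[w]) for w in t_a) / sum(max(t_a[w], t_b[w]) for w in t_a)
)

print(sorted(set_a | set_b))
print(set_overlap, hist_intersection)
```

Because the expanded sets are ordinary sets, the min-Hash machinery applies to them unchanged, over a vocabulary that has one entry per word occurrence.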

Page 14:

Results

• Quality of the retrieval

• Speed – the number of documents considered as near-duplicates

Page 15:

TRECVid Challenge

• 165 hours of news footage, different channels, different countries

• 146,588 key-frames, 352×240 pixels

• No ground truth on near duplicates

Page 16:

Min-Hash on TrecVid

• DoG features
• vocabulary of 64,635 visual words
• 192 min-Hashes, 3 min-Hashes per sketch, 64 sketches
• similarity threshold 35%

• Examples of images with 24 – 45 near duplicates
• # common results / set overlap only / weighted set overlap only
• Quality of the retrieval appears to be similar

Page 17:

Comparison of Similarity Measures

Images only sharing uninformative visual words do not generate sketch collisions for the proposed similarity measures

[Plot: number of sketch collisions against image pair similarity, for set overlap, weighted set overlap, and weighted histogram]

Page 18:

University of Kentucky Dataset

• 10,200 images in groups of four
• Querying by each image in turn
• Average number of correct retrievals in top 4 is measured
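The score described here can be sketched as follows: for every query, count how many of its top four results belong to the query's own group of four, then average over all queries. The toy rankings and the convention that image i belongs to group i // 4 are assumptions for illustration. The query image itself counts as a correct result, so the maximum score is 4.

```python
def kentucky_score(rankings, group_of):
    """Average number of top-4 results that share the query's group."""
    total = 0
    for query, ranked in rankings.items():
        total += sum(group_of(r) == group_of(query) for r in ranked[:4])
    return total / len(rankings)

def group_of(image_id):
    """Hypothetical layout: consecutive ids form the groups of four."""
    return image_id // 4

# Hypothetical ranked result lists for four query images.
rankings = {
    0: [0, 1, 2, 3],    # 4 correct
    1: [1, 0, 4, 2],    # 3 correct
    4: [4, 5, 0, 1],    # 2 correct
    5: [5, 6, 7, 4],    # 4 correct
}
print(kentucky_score(rankings, group_of))
```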

Page 19:

Evaluation

Vocabulary sizes 30k and 100k
Number of min-Hashes 512, 640, 768, and 896
2 min-Hashes per sketch
Number of sketches 0.5, 1, 2, and 3 times the number of min-Hashes

Score on average:
  weighted histogram intersection 4.6 % better than weighted set overlap
  weighted set overlap 1.5 % better than set overlap

Number of considered documents on average:
  weighted histogram intersection 1.7 times fewer than weighted set overlap
  weighted set overlap 1.5 times fewer than set overlap

Absolute numbers for weighted histogram intersection:

          min-Hashes  sketches  score 30k  score 100k  docs 30k  docs 100k
Usable    640         640       2.928      2.889       488.2     117.6
Best      896         2688      3.090      3.166       1790.8    452.8

Retrieval tf-idf flat scoring [Nister & Stewenius]: score 3.16
Number of considered documents (non-zero tf-idf): 10,089.9 (30k) and 9,659.4 (100k)

Page 20:

Query Examples

Query image:

Results

Set overlap, weighted set overlap, weighted histogram intersection

Page 21:

Beyond Near Duplicate Detection

Page 22:

Discovery of Spatially Related Images

Find and match ALL groups (clusters) of spatially related images in a large database, using only visual information, i.e. not using (Flickr) tags, EXIF info, GPS, ...

Chum, Matas: Large Scale Discovery of Spatially Related Images, TR May 2008 available at http://cmp.felk.cvut.cz/~chum/Publ.htm

Page 23:

Probability of Retrieving an Image Pair

[Plot: probability of retrieval against similarity (set overlap); regions marked for images of the same object and near duplicate images]

Page 24:

Image Clusters as Connected Components

Randomized clustering method:

1. Seed Generation – hashing (fast, low recall)
   • characterize images by pseudo-random numbers stored in a hash table
   • time complexity equal to the sum of second moments of a Poisson random variable; linear for database size $D \approx 2^{40}$

2. Seed Growing – retrieval (thorough, high recall)
   • complete the clusters only for cluster members
   • c << D, complexity O(cD)
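Reading clusters off as connected components can be sketched with a union-find structure: every matched image pair is an edge, and images linked through any chain of edges end up in the same cluster. This is an illustrative sketch only. The edge list stands in for pairs produced by the seed generation and seed growing steps, and the function name is an assumption.

```python
def connected_components(num_images, edges):
    """Group image ids 0..num_images-1 into clusters linked by the given pairs."""
    parent = list(range(num_images))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for image in range(num_images):
        clusters.setdefault(find(image), []).append(image)
    # Images without any match are not reported as clusters.
    return [members for members in clusters.values() if len(members) > 1]

# Hypothetical matched pairs among seven images.
edges = [(0, 1), (1, 2), (4, 5)]
print(connected_components(7, edges))
```

Images 0 and 2 share a cluster without being matched directly, which is the point of growing seeds into full components.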

Page 25:

Clustering of 100k Images

Images downloaded from Flickr

Includes 11 Oxford landmarks with manually labelled ground truth:

All Souls, Ashmolean, Balliol, Bodleian, Christ Church, Cornmarket, Hertford, Keble, Magdalen, Pitt Rivers, Radcliffe Camera

Page 26:

Results on 100k Images

Number of images: 104,844
Timing: 17 min + 16 min = 0.019 sec / image

Component Recall (CR), from Chum, Matas, TR, May 2008:

                   Good   OK    Unrelated   CR
All Souls          24     54    0           97.44
Ashmolean          12     13    0           68.00
Balliol            5      7     0           33.33
Bodleian           13     11    1           95.83
Christ Church      51     27    0           89.74
Cornmarket         5      4     0           66.67
Hertford           35     19    1           96.30
Keble              6      1     0           85.71
Magdalen           13     41    0           5.56
Pitt Rivers        3      3     0           100
Radcliffe Camera   105    116   0           98.64

Page 27:

Results on 100k Images

Number of images: 104,844
Timing: 17 min + 16 min = 0.019 sec / image

Component Recall (CR)

CR (TR 2008): Chum, Matas, TR, May 2008
CR (BMVC 2008): Philbin, Sivic, Zisserman, BMVC 2008

                   Good   OK    Unrelated   CR (TR 2008)   CR (BMVC 2008)
All Souls          24     54    0           97.44          96
Ashmolean          12     13    0           68.00          60
Balliol            5      7     0           33.33          33
Bodleian           13     11    1           95.83          96
Christ Church      51     27    0           89.74          71
Cornmarket         5      4     0           66.67          67
Hertford           35     19    1           96.30          65
Keble              6      1     0           85.71          57
Magdalen           13     41    0           5.56           20
Pitt Rivers        3      3     0           100            100
Radcliffe Camera   105    116   0           98.64          98

5,062?

Page 28:

Conclusions

• New similarity measures were derived for the min-Hash framework
  – Weighted set overlap
  – Histogram intersection
  – Weighted histogram intersection

• Experiments show that the similarity measures are superior to the state of the art
  – in the quality of the retrieval (up to 7% on University of Kentucky dataset)
  – in the speed of the retrieval (up to 2.5 times)

• min-Hash is a very useful tool for randomized image clustering

Page 29:

Thank you!