
Indexing in Large Scale Image Collections: Scaling Properties, Parameter Tuning, and Benchmark

Mohamed Aly
Computational Vision Lab, Electrical Engineering, Caltech, Pasadena, CA 91125, USA
vision.caltech.edu/malaa
[email protected]

Mario Munich
Evolution Robotics, Pasadena, CA 91106, USA
www.evolution.com
[email protected]

Pietro Perona
Computational Vision Lab, Electrical Engineering, Caltech, Pasadena, CA 91125, USA
vision.caltech.edu
[email protected]

October 2010

Abstract

Indexing quickly and accurately in a large collection of images has become an important problem with many applications. Given a query image, the goal is to retrieve matching images in the collection. We compare the structure and properties of seven different methods based on the two leading approaches: voting from matching of local descriptors vs. matching histograms of visual words, including some new methods. We derive theoretical estimates of how the memory and computational cost scale with the number of images in the database. We evaluate these properties empirically on four real-world datasets with different statistics. We discuss the pros and cons of the different methods and suggest promising directions for future research.

1 Introduction

Indexing in a large scale collection of images has become an important problem with widespread applications, especially with the increasing popularity of smart phones. There are currently several applications that allow the user to take a photo with the smart phone camera and search a database of stored images, e.g. Google Goggles^1, Snaptell^2, and the Barnes and Noble app^3. These image collections typically include images of book covers, CD/DVD covers, retail products, and buildings and landmark images. The sizes of the databases involved vary from 10^5 to 10^7 images, but they can conceivably reach billions of images. The ultimate goal is to identify the database image containing the object depicted in a probe image, e.g. an image of a book cover from a different viewpoint and scale. The correct image can then be presented to the user, together with some revenue-generating information, e.g. sponsor ads or referral links.

This application poses three main challenges: storage, computational cost, and recognition performance. For example, if we consider 1 billion images and store only 100KB per image, we need to store 100 TB of data just for the feature descriptors, and naively searching for the nearest feature in a database of 10^12 features takes 4 minutes on a supercomputer with 1 TFLOPS. Clearly we need clever ways of indexing these huge databases of images that reduce storage requirements, run time, or both, so that search becomes practical for user-interactive applications.

In this work we explore how the memory usage, computational cost, and recognition performance of the two leading approaches scale with the number of images in the database. These two approaches are based on extracting local features from the images, e.g. SIFT [17] features, and then indexing these features or some information extracted therefrom. This information is then used to find the best match of a probe image in the database images. The first approach, the full representation [17] approach, stores the exact features and uses efficient data structures to quickly search the huge number of database features and identify candidate images that are further verified for geometric consistency. We consider different variants of this approach, which use different data structures to perform the fast approximate search in the features database. The second approach, the bag-of-words [22, 10, 9, 20, 11, 14, 15] approach, stores only occurrence counts of vector quantized features, and uses efficient search techniques to get candidate image matches. We consider two variants, the exact inverted file method [25, 20] and the approximate Min-Hash method [7, 6, 9]. We did not implement and test recent improvements, e.g. [14, 21, 11, 15, 24, 8], since their scaling properties are essentially the same.

^1 www.google.com/mobile/goggles/
^2 www.snaptell.com
^3 www.barnesandnoble.com/iphone


Algorithm 1 Basic Algorithm

1. Extract local features {f_ij}_j from every image i
2. Store some function of these features g(f_ij) and build some data structure d(g) based on g()
3. Given the probe image q, extract its local features and compute g(f_qj)
4. Search through the data structure d(g) for "nearest neighbors"
5. Every nearest neighbor votes for the database image i it comes from, accumulating to its score s_i
6. Sort the database images based on their score s_i
7. Post-process the sorted list of images to enforce some geometric consistency and obtain a final list of sorted images s'_i. The geometric consistency check is done using a RANSAC algorithm to fit an affine transformation between the query image q and the database image i.
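As an illustration of steps 4-6, here is a minimal Python sketch of the voting and ranking loop; the nearest_neighbors method and the image_of_feature mapping are hypothetical stand-ins for whichever data structure d(g) and bookkeeping an implementation actually uses.

```python
from collections import defaultdict

def rank_images(probe_features, index, image_of_feature):
    """Steps 4-6 of Algorithm 1: each (approximate) nearest neighbor of a probe
    feature votes for the database image it came from."""
    scores = defaultdict(int)                   # s_i, one score per database image
    for f in probe_features:                    # g(f_qj) for each probe feature
        for nn in index.nearest_neighbors(f):   # step 4: search the data structure d(g)
            scores[image_of_feature[nn]] += 1   # step 5: vote for the owning image i
    # step 6: sort database images by decreasing score s_i
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```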

We provide the following contributions:

1. We provide a benchmark of seven different methods based on the two leading approaches: local feature matching (full representation) and visual words histogram matching (bag of words). We compare recognition rate and matching time as the number of images increases. We also include in the benchmark three new methods: using spherical LSH hash functions [23] and using Hierarchical K-Means [19] for matching local features. We implemented these methods, and we plan to make this code publicly available for further comparisons.

2. We provide theoretical estimates of the storage requirement and computational cost of these methods as a function of the database size. We also explore how these methods are amenable to parallelization.

3. We empirically evaluate these properties on four real world datasets with different statistics, and

4. We present a number of new empirical findings for some of these methods. In particular, we show that:

(a) using L1 distance with L1 normalized histograms in the inverted file method is superior to the standard way of using dot product with tf-idf weighted histograms [20] and to using binarized histograms [15].

(b) using spherical LSH hash functions [23] provides comparable recognition performance to the standard L2 LSH [2] function but faster search time.

5. Based on the benchmark, we suggest promising directions for future research on efficient image matching in large databases.

2 Methods Overview

In this work we consider the two leading approaches for image matching: Full Representation (FR) and Bag-of-Words (BoW) representations. Both are based on extracting local features from images, e.g. SIFT features. They can all be seen as representing the same approach with varying degrees of approximation to improve speed and/or storage requirements. The basic idea is described in Algorithm 1. In the following, we provide only a brief description of each algorithm for brevity; please refer to the relevant references for further details.

2.1 Full Representation (FR)

This follows the same general approach outlined in Algorithm 1. The specifics are:

1. The function g(f_ij) = f_ij is the identity function, i.e. the full features are stored.

2. The data structure d({f_ij}_ij) tries to facilitate faster nearest neighbor lookup at the expense of some additional storage. Four general algorithms are considered here:

(a) Exhaustive: just use brute force search to get the nearest neighbor for each query image feature f_qj.


(b) Kd-Tree: a number of randomized Kd-trees [3, 17] are built for the database features {f_ij}_ij to allow for logarithmic retrieval time.

(c) LSH: a number of locality sensitive hash functions [2, 16] are extracted from the database features {f_ij}_ij, and are arranged in a set of tables. Each table has a set of hash functions, which are then concatenated to get the index of the bucket within the table where the feature should go. All features with the same hash value go to the same bucket. Three different hash functions are considered (a minimal sketch appears after this list):

i. L2: this approximates the Euclidean distance [2], where the hash function is h(x) = floor((<x, r> + b) / w), where <.,.> is the dot product, r is a random unit vector, b is a random offset, and w is the bin width. For normalized feature vectors, it basically projects the feature onto a random direction and then returns the bin number where the projection lies.

ii. Spherical-Simplex: this approximates distances on the hypersphere [23], where the hash function is h(x) = argmin_i ||x - y_i||, where y_i are the vertices of a random simplex inscribed in the unit hypersphere. The hash value is the index of the nearest vertex of the simplex.

iii. Spherical-Orthoplex: this approximates distances on the hypersphere [23], where the hash function is h(x) = argmin_i ||x - y_i||, where y_i are the vertices of a random orthoplex inscribed in the unit hypersphere. The hash value is the index of the nearest vertex of the orthoplex.

(d) Hierarchical K-Means: a hierarchical decomposition [19] is built from the database features {f_ij}_ij. At each level of the tree, the K-means algorithm is performed to obtain a clustering of the data in the current node, and the process is repeated recursively until the maximum allowed depth is reached.
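To make the hash functions above concrete, here is a minimal numpy sketch of the L2 and spherical-orthoplex hashes; the bin width, dimensions, and random generator are illustrative choices, the orthoplex is built from a random orthonormal basis, and this is a sketch of the idea rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 128, 0.25                          # feature dimension and L2-LSH bin width (illustrative)

# L2 LSH: h(x) = floor((<x, r> + b) / w)
r = rng.standard_normal(d)
r /= np.linalg.norm(r)                    # random unit direction
b = rng.uniform(0, w)                     # random offset

def hash_l2(x):
    return int(np.floor((x @ r + b) / w))

# Spherical-Orthoplex LSH: nearest of the 2d vertices {+/- q_i} of a randomly
# rotated orthoplex (cross-polytope) inscribed in the unit hypersphere
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis

def hash_orthoplex(x):
    p = Q.T @ x                             # projections onto the d random axes
    i = int(np.argmax(np.abs(p)))           # for unit vectors, the nearest vertex has the largest |projection|
    return 2 * i + (0 if p[i] >= 0 else 1)  # index among the 2d vertices

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # features are assumed unit-normalized
print(hash_l2(x), hash_orthoplex(x))
```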

2.2 Bag-of-Words Representation (BoW)

The differences from Algorithm 1 are:

1. The function g(f_ij) represents a clustering of the input features. For pre-processing, a "dictionary" is built from the database images by clustering features into representative "visual words". Then, each image is represented by a histogram of occurrences of these visual words {h_i}_i, and the original local feature vectors are thrown away. This is inspired by text search applications [25].

2. We consider two data structures that try to perform faster search through the database histograms:

(a) Inverted File: an inverted file [4, 25] structure is built from the database images.

(b) Min-Hash: a number of locality sensitive hash functions [6, 7, 5, 9] are extracted from the database histograms {h_i}_i, and are arranged in a set of tables. The histograms are binarized (counts are ignored), and each image is represented as a "set" of visual words {b_i}_i. The hash function is defined as h(b) = min pi(b), where pi is a random permutation of the numbers {1, ..., W} and W is the number of words in the dictionary (see the sketch below).
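A minimal sketch of the min-hash function on a binarized histogram; following Appendix A.2.2, the random permutation is represented implicitly as pi(x) = (a*x + b) mod W. The parameter values below are illustrative, and a should be coprime with W for this to be a true permutation.

```python
import random

W = 1_000_000                        # dictionary size (illustrative)

def make_minhash(rng):
    # random permutation of {0, ..., W-1}, represented implicitly as pi(x) = (a*x + b) mod W
    a = rng.randrange(1, W)          # should be coprime with W for a true permutation
    b = rng.randrange(W)
    def h(word_set):
        # h(b_i) = min over the visual words present in the image of pi(word)
        return min((a * word + b) % W for word in word_set)
    return h

rng = random.Random(0)
h = make_minhash(rng)
image_words = {12, 503, 99871}       # binarized histogram: the set of visual words in an image
print(h(image_words))
```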

3 Theoretical Scaling Properties

3.1 Description

We provide theoretical estimates of how the storage and computational cost of these methods scale with the number of images in the database. We present the definitions of the parameters in Table 1 and a summary of the results in Table 2, with details in Appendix A. We note that these calculations are based on minimum theoretical storage and average case scenarios. We also note that we compute the distance between vectors using the dot product, which is equivalent to the Euclidean distance since we assume feature vectors are normalized. We do not consider any compression technique that might decrease storage (e.g. run-length encoding). Moreover, actual run times of these algorithms will differ from the average case presented here, for example for the inverted file method or the spherical LSH methods, see Figure 4.

3.2 Discussion

3.2.1 Storage and Run-time

Figure 1a shows how storage requirements and run-time scale with the number of images in the database, assuming they are implemented on one machine with infinite storage and a 10 GFLOPS processor. We note the following:


Figure 1: Theoretical Scaling Properties. Refer to Tables 1-2. (a) Theoretical storage (TB) vs. number of images (left), and run time per image (msec) vs. number of images (right), assuming a single machine with infinite memory; curves: exhaustive, kd-tree, lsh-l2, lsh-sph-sim, lsh-sph-orth, hkm, bow-inv-file, bow-min-hash. (b) Advanced parallelization scheme for Kd-trees: the root machine stores the top of the tree, while leaf machines store the leaves of the tree. (c) Time per image vs. the minimum number of machines M required, for different dataset sizes; each point represents a dataset size from 10^6 to 10^9 images, going left to right; kd-tree-adv is the advanced parallelization scheme (curves: kd-tree, kd-tree-adv, lsh-l2, bow-inv-file).


Parameter | Description | Typical Value
I | no. of images | 10^9
s | bytes per feature dimension | 1
d | feature dimension | 128
F | # features per image | 1,000
M | # machines | varies
C | main memory per machine | 50 GB
T_kdt | # kd-trees | 4
L | length of ranked lists | 100
B_kdt | # backtracking steps | 250
T_lsh | # LSH tables | 4
H_l2 | # hash functions for LSH-L2 | 50
B_lsh | # LSH buckets | 10^6
H_sph | # hash functions for LSH-Spherical | 5
D | depth of HKM tree | 7
k | branching factor of HKM | 10
W | # words for BoW | 10^6
T_mh | # tables for Min-Hash | 50
H_mh | # hash functions for Min-Hash | 1
B_mh | # buckets in Min-Hash | 10^6

Table 1: Parameter definitions and typical values

• BoW methods take one order of magnitude less storage than FR methods, due to the fact that we don't need to store the feature vectors.

• Run-time for Kd-trees and Min-Hash grows very slowly with the database size.

• Inverted file and LSH methods have asymptotically similar run-time. The theoretical run time increases linearly with the number of images. However, in practice, inverted files have a much smaller run time than the average case depicted above.

3.2.2 Parallelization

The parallelization considered here is the simplest: for every method, we determine the minimum number of machines, M, that can fit the required storage in their main memory, assuming machines with 50GB of memory. Then we split the images evenly across these machines, and each machine takes a copy of the probe image and searches its own share of images. Finally, all the machines communicate their ranked lists of images (of length L) and produce a final list of candidate matches that is further geometrically checked, as in the sketch below.
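A minimal sketch of the final merging step in this simple scheme: each of the M machines returns a ranked list of length L with scores, and a central process merges them into one candidate list for geometric verification (names and toy data are illustrative).

```python
import heapq

def merge_ranked_lists(per_machine_lists, L=100):
    """Each machine returns [(image_id, score), ...] sorted by decreasing score.
    Merge the M lists and keep the top-L candidates for geometric verification."""
    merged = heapq.merge(*per_machine_lists, key=lambda kv: -kv[1])
    return list(merged)[:L]

# Example: three machines, each holding its own share of the database
lists = [[(3, 41), (17, 9)], [(8, 30), (2, 12)], [(5, 25), (40, 11)]]
print(merge_ranked_lists(lists, L=4))
```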

More sophisticated parallelization techniques are possible that take advantage of the properties of the method at hand. For example, in the case of Kd-trees, one such advanced approach is to store the top part of the Kd-tree on one machine, and divide the bottom subtrees across other machines, see Fig. 1b. For 10^12 features, we have 40 levels in the Kd-tree, so we can store up to 30 levels on one root machine and split the bottom 10 levels evenly across the leaf machines. Given a probe image, we first query the root machine and get the list of branches (and hence subtrees) that we need to search with the backtracking. Then the appropriate leaf machines process these query features and update their ranked lists of images. This approach has the advantage of significant speed up and better utilization of the machines at hand, since not all the machines will be working on the same query feature at the same time; rather, they will be processing different features from the probe image concurrently, see Figure 1c. However, it also has some drawbacks: 1) the geometric verification step is more complicated, as the node machines will not, in general, have all the features of any particular image, and will have to collect these features from other node machines; 2) the root machine will become a bottleneck for processing input images, but that can be alleviated by adding more replicas of the root machine to process more features concurrently; 3) some features will be stored more than once when using more than one kd-tree, since, in general, features will not fall into the subtrees of one particular leaf machine.

Figure 1b shows a schematic of the advanced parallelization scheme for Kd-Trees, while Figure 1c shows the time per image vs. the minimum number of machines M for different dataset sizes from 10^6 to 10^9. We note the following:

• The advanced parallelization method provides significant speed ups starting at 10^8 images. It might seem paradoxical that increasing the dataset size decreases the run time; however, it makes sense when we notice that adding more machines allows us to interleave the processing of more features concurrently, and thus allows faster processing. This, however, comes at the cost of more storage, see Figure 1c.

• We also note that many of the FR methods have parameters that affect the trade off between run time and accuracy, e.g. the number of hash functions or the bin size for LSH. These parameters have to be tuned for every database size under consideration, and we should not use the same settings when enlarging the database. This poses a problem for these methods, which have to be continuously updated, as opposed to Kd-trees, which adapt to the current database size and need very little tuning.


Method | Storage (bytes) | Ex. (TB) | Comp. (FLOP/im) | Ex. (GFLOP/im)
Exhaustive | (sd + 4)IF | 132 | F^2 I (2d + 1) | 256 x 10^6
Kd-Tree | IF(sd + 4 + 2T_kdt + T_kdt ceil(log2 IF)/8) | 160 | B_kdt F (2d + 1 + log2 FI) | 0.074
LSH-L2 | IF(sd + 4 + T_lsh ceil(log2 IF)/8) | 152 | F T_lsh (H_l2 (2d + 2) + (FI/B_lsh)(2d + 1)) | 1028
LSH-Sim | IF(sd + 4 + T_lsh ceil(log2 IF)/8) | 152 | F T_lsh (H_sph (2d^2 + 3d) + (FI/B_lsh)(2d + 1)) | 1028
LSH-Orth | IF(sd + 4 + T_lsh ceil(log2 IF)/8) | 152 | F T_lsh (H_sph (2d^2 + 3d) + (FI/B_lsh)(2d + 1)) | 1028
HKM | IF(sd + 4) + ((k^D - 1)/(k - 1)) k s d | 132 | F D (2d + k) + (F^2 I / k^D)(2d + 1) | 25.7
Inverted File | W s d + FI(5 + ceil(log2 I)/8) | 9 | F B_kdt (2d + 1 + log2 W) + F(2 + min(IF/W, I)) | 1
Min-Hash | W s d + FI(4 + ceil(log2 W)/8) + T_mh ceil(log2 I)/8 | 7 | F B_kdt (2d + 1 + log2 W) + 4F T_mh H_mh + T_mh I / B_mh | 0.07

(a) Theoretical storage requirement (Storage) and computational cost on a single machine with infinite memory (Comp.)

Method | Parallel (FLOP/im) | Ex. (GFLOP/im)
Exhaustive | (F^2 I / M)(2d + 1) + L(M - 1) | 128 x 10^3
Kd-Tree | F B_kdt (2d + 1 + log2(FI/M)) + L(M - 1) | 0.071
Kd-Tree-adv | F B_kdt log2(C/(4 T_kdt)) + (F B_kdt / M)(2d B_kdt + B_kdt) + F log2(4FIT_kdt/C) + L(min(F B_kdt, M) - 1) | 0.012
LSH-L2 | F T_lsh (H_l2 (2d + 2) + (FI/(M B_lsh))(2d + 1)) + L(M - 1) | 85
LSH-Sim | F T_lsh (H_sph (2d^2 + 3d) + (FI/(M B_lsh))(2d + 1)) + L(M - 1) | 85
LSH-Orth | F T_lsh (H_sph (2d^2 + 3d) + (FI/(M B_lsh))(2d + 1)) + L(M - 1) | 85
HKM | F D (2d + k) + (F^2 I / (M k^D))(2d + 1) + L(M - 1) | 0.021
Inverted File | F B_kdt (2d + 1 + log2 W) + F(2 + IF/(MW)) | 0.075
Min-Hash | F B_kdt (2d + 1 + log2 W) + 4F T_mh H_mh + T_mh I / B_mh | 0.07

(b) Theoretical parallel computational cost with the minimum required number of machines M. Kd-Tree-adv is the advanced parallelization scheme, see Fig. 1b.

Table 2: Theoretical Scaling Properties. Refer to Figure 1.
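To make the formulas concrete, the following sketch plugs the typical values of Table 1 into three of the storage expressions above (exhaustive/FR, Kd-Tree, and inverted file); the printed numbers should reproduce the example column up to rounding.

```python
from math import ceil, log2

# typical values from Table 1
I, F, s, d = 10**9, 10**3, 1, 128      # images, features/image, bytes/dim, dimensions
T_kdt, W = 4, 10**6                    # number of kd-trees, BoW dictionary size

TB = 10**12                            # bytes per terabyte (decimal, as in the text)

S_ex  = (s*d + 4) * I * F                                      # exhaustive: features + feature info
S_kdt = I*F * (s*d + 4 + 2*T_kdt + T_kdt * ceil(log2(I*F))/8)  # plus T_kdt kd-trees
S_inv = W*s*d + F*I * (5 + ceil(log2(I))/8)                    # inverted file + feature info

print(f"exhaustive ~ {S_ex/TB:.0f} TB, kd-tree ~ {S_kdt/TB:.0f} TB, inverted file ~ {S_inv/TB:.0f} TB")
```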

4 Evaluation Details

4.1 Datasets

We have two kinds of datasets:

1. Distractors: A set of images that constitutes the bulk of the database to be searched. It represents the clutter, or distractor, images that the algorithm must go through in order to find the right image. In the actual setting, this would include all the objects of interest, e.g. book covers, CD covers, etc.

2. Probe: A set of labeled images used for benchmarking purposes. This has two types of images per object:

(a) Model Image: represents the ground truth image to be retrieved for that object.

(b) Probe Images: used for querying the database, representing the object in the model image from different view points, lighting conditions, scales, etc.

4.1.1 Distractor Datasets

• D1: Caltech-Covers: A set of ~100K images of CD/DVD covers used in [1].

• D2: Flickr-Buildings: A set of ~1M images of buildings collected from flickr.com.

• D3: Image-net: A set of ~400K images of "objects" collected from image-net.org, specifically images under the synsets: instrument, furniture, and tools.

• D4: Flickr-Geo: A set of ~1M geo-tagged images collected from flickr.com.

Figure 2 shows some examples of images from these distractor sets.


Figure 2: Example distractor images. Each row depicts a different set: D1, D2, D3, and D4, respectively.

Probe set | total | #model | #probe
P1 | 485 | 97 | 388
P2 | 750 | 125 | 525
P3 | 720 | 80 | 640
P4 | 957 | 233 | 724

Table 3: Probe Sets

4.1.2 Probe Sets

• P1: CD Covers: A set of 5x97 = 485 images of CD/DVD covers. The model images come from freecovers.net while the probe images come from the dataset used in [19]. This was also used in [1].

• P2: Pasadena-Buildings: A set of 6x125 = 750 images of buildings around Pasadena, CA from [1]. The model image is image2 (frontal view in the afternoon), and the probe images are images taken at two different times of day from different viewpoints.

• P3: ALOI: A set of 9x80 = 720 images of 3D objects from the ALOI collection [13] with different illuminations and view points. We use the first 80 objects, with the frontal view of each object as the model image, and four orientations and four illuminations as the probe images.

• P4: INRIA-Holidays: A set of 957 images, which forms a subset of images from [14], with groups of at least 3 images. There are 233 model images and 724 probe images. The first image in each group is the model image, and the rest are the probe images.

Figure 3 shows some examples of images from these probe sets. Table 3 summarizes the properties of these probe sets.

4.2 Setup

We used four different evaluation scenarios, where in each we use a specific distractor/probe set pair. Table 4 displays the list of scenarios used. Evaluation was done by increasing the size of the dataset from 100, 1k, 10k, 50k, and 100k, up to 400k images. For each such size, we add all the model images to the specified number of distractor images, e.g. for 1k images, we have 1000 images from the distractor set in addition to all images in the probe set.

Performance is measured as the percentage of probe images correctly matched to their ground truth model image, i.e. whether the correct model image is the first ranked image returned.


Figure 3: Example probe images. Each row depicts a different set: P1, P2, P3, and P4, respectively. Each row shows two model images and 2 or 3 of its probe images. Details in Table 3.

Scenario | Distractor | Probe
1 | D1 | P1
2 | D2 | P2
3 | D3 | P3
4 | D4 | P4

Table 4: Evaluation Scenarios. See Sec. 4.2.

In addition, we measure performance before and after the geometric consistency check, which is done using a RANSAC algorithm [12] to fit an affine transformation between the probe and the model images. Ranking before the geometric check is based on the number of nearest features in the image, while ranking after is based on the number of inliers of the affine transform.

We want to emphasize the difference between the setup used here and the setup used in other "image retrieval"-like papers [14, 11, 20]. In our setup, we have only ONE correct ground truth image to be retrieved and several probe images, while in the other setting there are a number of images that are considered correct retrievals. Our setting is like the dual of the image retrieval setting. In fact, in our probe sets, we invert the roles of model and probe images, e.g. P1 and P4 above.

We use SIFT [17] feature descriptors with Hessian-affine [18] feature detectors. We used the binary available from tinyurl.com/vgg123. All experiments were performed on machines with dual Intel Quad-Core Xeon E5420 2.5GHz processors and 32GB of RAM. We implemented all the algorithms using a mix of Matlab and Mex/C++ scripts^4.

4.3 Parameter Tuning

All of the methods we compare have different parameters that affect their run time, storage, and recognition performance. We performed parameter tuning in two steps: first, a quick tuning to get a set of promising parameters using a subset of probe images from scenario 1; then we ran experiments using this set of parameters for the four scenarios with sizes from 100 up to 10K images. Based on these results, we chose the settings that made most sense in terms of their recognition performance and run time. Please see the details in Appendix B. Table 5 summarizes the parameters chosen for the full benchmark.

5 Results and Discussion

Figure 4 shows the recognition performance for different dataset sizes, before and after the geometric consistency check. Figure 5 shows recognition performance vs. time for one dataset size (10K images), before and after geometric checks. We note the following:

^4 available at vision.caltech.edu/malaa/software/research/image-search


Method | Parameters
Kd-Trees | T = 1 tree, B = 100 backtracking steps
LSH-L2 | T = 4 tables, H = 25 hash functions, bin size w = 0.25
LSH-Sim | T = 4 tables, H = 5 hash functions
LSH-Orth | T = 4 tables, H = 5 hash functions
HKM | tree depth D = 5, branching factor k = 10
Inverted File | tf-idf weighting, l2 normalization, cos distance; raw histograms, l1 normalization, l1 distance; binary histograms, l2 normalization, cos distance
Min-Hash | T = 100 tables, H = 1 hash function

Table 5: The chosen method parameters for the full experiments. See Sec. 4.3 and Appendix B.

• Full representation (FR) methods based on (approximately) matching the local features provide superior recognition performance to bag-of-words (BoW) methods.

• BoW methods provide nearly constant search time, compared to a linear increase for most FR methods, with the exception of Kd-trees (before exhausting all the main memory, see the big jump in Figure 4 for 50K images), which provide a logarithmic increase in search time.

• FR methods take an order of magnitude more memory than BoW methods, e.g. we can easily fit up to 400K images for BoW while for some scenarios we can only fit up to 50K images for some FR methods.

• It is important to use diverse datasets for measuring and quantifying the performance of different methods. While the trend is similar for all four scenarios, we notice that scenarios 1 and 3 are the easiest, while scenarios 2 and 4 are much more difficult. We believe that is the case because images in scenarios 2 and 4 have much more clutter than the other two scenarios, which makes the recognition task much harder.

• We notice that BoW techniques can provide acceptable accuracy in some situations, e.g. scenarios 1 & 3. This suggests that for some applications we can use BoW methods, which have the advantage of near constant search time and lower storage requirements.

• FR methods pose a trade off between recognition rate and search time. In particular, for scenarios 2 & 4, we notice that LSH methods are generally better than Kd-trees in terms of recognition rate but inferior in terms of search time, which rises sharply with database size. This suggests that we can trade off extra search time for a better recognition rate.

• Spherical LSH methods provide a recognition rate comparable to the LSH-L2 method, but they provide better search time.

• Using the novel combination of l1 normalization with l1 distance in the inverted file method provides better performance than the standard way of using tf-idf weighting with l2 normalization and dot product. Using binary histograms also outperforms the standard tf-idf scheme, as reported in [15].

• Kd-trees provide the best overall trade off between recognition performance and computational cost. We notice that their run time grows very slowly with the number of images while giving excellent performance. Moreover, with smart and more complicated parallelization schemes, like the one described in §3.2.2, we can have significant speedup when running on multiple machines. The only drawback, which is shared with all FR methods, is the larger storage requirement.

• The geometric verification step is in general not necessary, and gives performance comparable to that beforehand, except in the case of the Min-Hash method. This is especially true for hard scenarios, e.g. 2 & 4, where there are a lot of noise features, i.e. features belonging to clutter around the object, like trees around a house. These features get matched to noise features in the probe image and thus make the geometric step harder.

• We believe the two main promising directions for further research in large scale image indexing are:

1. Devising ways to reduce the run time and storage requirements of FR methods. This includes developing better features that take less storage, finding ways to store fewer features in the database, and compressing feature information using projection methods (e.g. PCA, random projections, etc.).

2. Devising ways to improve the recognition performance of BoW methods. This includes finding better ways to generate the visual word dictionaries and encoding geometric information in the BoW representation.


6 Summary

We explored the problem of indexing in large scale image databases both theoretically and experimentally on four real world benchmark datasets. We compared seven methods of the two leading approaches in terms of their storage, recognition performance, and computational cost. We reported new results for LSH and inverted file methods, and we suggested promising directions for future research.

Figure 4: Recognition Performance and Time vs. Dataset Size. The first two rows show recognition performance (%) before and after the geometric step. The lower two rows show total processing time per image (sec/image) before and after the geometric step. Every column represents a different experimental scenario (Scenarios 1-4); see Tables 4 and 5. Methods: kdt, lsh-l2, lsh-sim, lsh-orth, hkm, invf-none-l1-l1, invf-tfidf-l2-cos, invf-bin-l2-l2, minhash.


Figure 5: Recognition Performance vs. Time. The left column shows recognition performance vs. total matching time per image before the geometric step, while the right column shows performance vs. time after the geometric step, for a dataset size of 10K images. Each row corresponds to a different scenario; see Table 4. Methods: kdt, lsh-l2, lsh-sim, lsh-orth, hkm, invf-none-l1-l1, invf-tfidf-l2-cos, invf-bin-l2-l2, minhash.


A Theoretical Scaling Properties

A.1 Full Representation

A.1.1 Exhaustive Search

Description

• Here we do the nearest neighbor search using exhaustive linear search through the set of database features {{f_ij}_j}_i, i.e. for every feature of the query image f_qj, we find the feature that minimizes the Euclidean distance: n_qj = argmin_ij ||f_qj - f_ij||, where n_qj is the index of the nearest feature, and i is the image in the database containing this feature.

• For every database image, we count the number of such nearest neighbors s_i, and for all images with a preset number of nearest neighbors (10 in our experiments), we do the RANSAC affine step to obtain s'_i, where s'_i = # inliers of the affine transformation.

Storage The storage requirements for this method are just the full feature vectors and the feature information:

• Feature vectors:

S_fv = s d I F

where s is the number of bytes per feature dimension, d is the dimension of the feature vector, I is the number of images in the database, and F is the average number of features per image.

• Feature information:

S_fi = 4 I F

where for every feature we need to store its location (x and y position in the image) in addition to its scale and orientation. We assume we can discretize this information and fit each value into 1 byte.

• Total:

S_exhaustive = (sd + 4) I F

• Example: for the prototypical case with s = 1, d = 128 dimensions, I = 10^9 images, and F = 10^3 features/image:

S_ex = 132 x 10^9 x 10^3 bytes = 132 TB

Speed

• The main bottleneck is the nearest neighbor search. For every feature, we have to search through the whole feature database to find the nearest feature:

T_nn,ex = F(2FId + FI) = F^2 I (2d + 1)

where for every query feature we need to perform d multiplications and d additions per database feature, in addition to finding the minimum value among FI values.

• Example:

T_nn,ex ≈ 256 x 10^3 TFLOP/image

Parameters None
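A minimal numpy sketch of this exhaustive nearest-neighbor step for one probe image; the brute-force squared-distance computation corresponds to the F^2 I (2d + 1) cost counted above (array shapes are toy sizes).

```python
import numpy as np

def exhaustive_match(probe_feats, db_feats, db_image_ids):
    """probe_feats: (Fq, d), db_feats: (N, d) with N = I*F, db_image_ids: (N,) array.
    Returns, for each probe feature, the id of the image owning its nearest feature."""
    # squared Euclidean distances; for unit-normalized features this is
    # equivalent to maximizing the dot product
    d2 = ((probe_feats**2).sum(1)[:, None]
          - 2 * probe_feats @ db_feats.T
          + (db_feats**2).sum(1)[None, :])
    nn = d2.argmin(axis=1)              # n_qj = argmin_ij ||f_qj - f_ij||
    return db_image_ids[nn]
```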


Parallelization

• Divide the ~100 TB of data across ~2000 machines with ~50GB of memory each. For every query feature, search the 2000 machines simultaneously.

• Every machine processes the images it is storing, and produces a final sorted list of geometrically checked images. This list is then broadcast or sent to a central machine, which merges the lists into one and returns the result.

• Speedup: having more machines here corresponds to a linear speedup of the search process (in the number of images). Consider having M machines; then each machine will perform

T^p_nn,ex = (F^2 I / M)(2d + 1) + L(M - 1)

where the second term is the time to merge M sorted lists of length L.

• Example: with L = 100,

T^p_nn,ex = 10^3 x 128 x 10^9 + 100 x 1999 ≈ 128 TFLOP/image

A.1.2 KD-Trees

Description

• A number of randomized Kd-trees are built for the database features {{f_ij}_j}_i.

• The nearest neighbor search is performed by going down the set of Kd-trees using a single priority queue, and the result is an "approximate" nearest neighbor.

Storage

• We still need to store all the features and their information:

S_ex = (sd + 4) I F

• In addition, there is storage for the trees. For a tree with n leaves, there are 2n - 1 total nodes, i.e. n - 1 internal nodes and n leaves. Each tree here has IF leaves (features). Storage for each tree is:

S_tr = 2IF + IF ceil(log2 IF)/8 = IF(2 + ceil(log2 IF)/8)

where the first term is for the internal nodes, where we need to store the dimension and the threshold values (assumed to be discretized to single bytes), and the second term is for the leaves, where we need to store the index of the feature each contains.

• Total:

S_kdt = S_ex + T S_tr = IF(sd + 4 + 2T + T ceil(log2 IF)/8)

• Example: with T = 4,

S_kdt = 10^12 (128 + 4 + 8 + 4 x 5) = 160 x 10^12 bytes = 160 TB


Speed

• Searching through the Kd-trees involves comparison operations at each level of the tree, in addition to exhaustive search over the final list of features, typically ~250, corresponding to the number of backtracking steps:

T_nn,kdt = F B (2d + 1 + log2 FI)

where the second term is for finding the minimum among the B distance values, and the third term is the number of comparisons performed.

• Example: with B = 250,

T_nn,kdt = 250 x 10^3 (2 x 128 + 1 + log2 10^12) = 250 x 10^3 x 296 ≈ 74 MFLOP/image

Parameters There are two parameters that affect the speed, accuracy, and storage of the Kd-trees:

• Number of trees T: having more trees increases accuracy and storage, without much affecting speed due to the single search queue.

• Number of backtracking steps B: this poses a trade-off between accuracy and speed. Having more steps gives more accuracy but takes more time.
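For illustration only, the following sketch uses scipy's exact cKDTree as a stand-in for the T randomized Kd-trees with B backtracking steps described above, and shows how matches vote for their owning images; sizes and the feature-to-image layout are toy assumptions, not the authors' setup.

```python
import numpy as np
from scipy.spatial import cKDTree

# stand-in illustration: a single exact kd-tree from scipy; the method above uses
# T randomized kd-trees searched jointly with a bounded number B of backtracking steps
F, d = 1_000, 128
db = np.random.rand(10 * F, d).astype(np.float32)   # features of 10 database images (toy size)
tree = cKDTree(db)

probe = np.random.rand(F, d).astype(np.float32)     # F features of one probe image
_, nn = tree.query(probe, k=1)                      # nearest database feature per probe feature
votes = np.bincount(nn // F, minlength=10)          # each match votes for its owning image
print(votes.argmax(), votes)
```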

Parallelization There are two ways to parallelize Kd-tree operations. The first is a straightforward extension of the exhaustive search case:

• Divide the ~160 TB of data across ~3200 machines with ~50GB of memory each. For every query feature, search the machines simultaneously.

• Combine the search results as in the exhaustive case, by having a local sorted list of images and then broadcasting these lists to get a global sorted list, which is then geometrically verified.

• Speedup: having more machines here corresponds to a logarithmic speedup of the search process. Consider having M machines; then each machine will perform

T^p_nn,kdt = F(2dB + B + B log2(FI/M)) + L(M - 1)

where L is the size of the list on each machine, and the last term is the time required to merge M sorted lists of size L.

• Example: with B = 250,

T^p_nn,kdt = 10^3 (64 x 10^3 + 7.5 x 10^3) + 100 x 3199 ≈ 71 MFLOP/image

The second is:

• Instead of having independent Kd-trees on each machine, there will be a single (or multiple) root machine(s) with copies of the top part of the Kd-trees. For example, with 50GB of memory available, we can fit 25 billion internal nodes. Therefore, for storing 4 trees, we can store ~6 billion nodes per tree, i.e. 3 billion leaves, which corresponds to storing the top log2(3 x 10^9) ≈ 30 levels for each of the 4 trees.

• The rest of the trees are then divided among the other ~3200 machines (called leaf machines), which also store the feature vectors and information.

• The root machine(s) will get query features and compute an initial list of nodes to explore based on the top of the tree. Then, they dispatch each feature to all leaf machines that have the relevant part of the tree. Once a leaf machine gets a feature, it handles backtracking within its own subtree and updates its ranked list of images.


• Speedup: we have two parts for the running time, one on the root machine and one on the leaf machines:

T^p_nn,kdt,r = F B log2(C/(4T))

T^p_nn,kdt,l = (FB/M)(2dB + B) + F log2(4FIT/C) + L(min(FB, M) - 1)

where C is the memory capacity of each machine, T^p_nn,kdt,r is the time to search the root machine, and T^p_nn,kdt,l is the time for the leaf machines: the first term is the time taken to search all F image features concurrently, where each feature is sent to at most B machines (so in the worst case FB machines process them, each backtracking step going to a different leaf machine) and the B inside it is the time to get the minimum; the second term is the time to go down a subtree excluding the top part stored on the root machine; and the last term is the time to merge the min(FB, M) sorted lists (equal to the number of machines processing the image).

• Example:

T^p_nn,kdt,2 = T^p_nn,kdt,r + T^p_nn,kdt,l = 10^3 x 250 x 30 + (10^3 x 250 / 3200)(256 x 250 + 250) = 7.5 x 10^6 + 5 x 10^6 ≈ 12.5 MFLOP/image

• The motivations for this distinction between root and leaf machines are:

– Most of the storage for the tree is in the leaves. Therefore we can store most of the top of the tree on a single machine, which then dispatches jobs to the node machines.

– The time spent searching the top of the tree is smaller than that spent searching the lower parts of the tree, so we gain more speedup by splitting the bottom part of the tree instead of splitting the whole tree.

– Searching the tree will generally not activate all of the node machines, i.e. the list of explored nodes will not span all the node machines. Therefore, the central machine can process more features, and node machines can interleave computations of different features simultaneously. This offers more speedup over the first parallelization approach above. So, even though the search time per feature is similar to that in the first case, interleaving computations makes the effective time much smaller.

• However, a drawback of this approach is the extra storage requirement. That is because the lower parts of the trees will generally not store the same features for different trees, and therefore each machine will have to store the features in the subtrees it is holding; as an upper bound, the extra storage will be proportional to the number of kd-trees, i.e. we will need to store each feature K times, where K is the number of trees.

A.1.3 Locality Sensitive Hashing

Description

• A number of LSH tables are computed, where each consists of a number of hash functions.

• The nearest neighbor search is performed by computing the hash value for the query feature in all tables, and then processing the returned list exhaustively.

Storage

• We still need to store all the features and their information:

S_ex = (sd + 4) I F

• In addition, there is storage for the tables. In each table, we need to store the hash functions, in addition to the indices of all the features available:

S_ta = IF ceil(log2 IF)/8 + S_h

where S_h is the storage for the hash functions, which depends on the specific hash function used and is negligible compared to the first term.


• Total:

S_lsh = S_ex + T S_ta = IF(sd + 4 + T ceil(log2 IF)/8)

• Example: with T = 4,

S_lsh = 10^12 (128 + 4 + 4 x 5) = 152 x 10^12 bytes = 152 TB

Speed

• Searching through the LSH index involves computing the hash function values, and then exhaustively checking the features that share the same hash value. The time can be represented as

T_nn,lsh = F(T_h + (FI/B)(2d + 1) T)

where the first term is the time to compute the hash functions:

– L2: T_h,l2 = T H (2d + 2), where we have H hash functions, and for each we need to compute a dot product (one addition and one multiplication per dimension) plus another addition and a division.

– Spherical-Simplex: T_h,sim = T H (2d(d + 1) + d) = T H (2d^2 + 3d), where we need to compute the product of a (d+1) x d matrix and a d-dimensional vector, and then get the closest vertex.

– Spherical-Orthoplex: T_h,orth = T H (2d^2 + 2d) = 2 T H (d^2 + d), where we need to multiply a d x d matrix with a d-dimensional vector, and then choose the nearest vertex among the 2d choices, where we know that one vertex is just the opposite sign of the other.

• The second part above is the time to compute the exhaustive nearest neighbors, and it depends on the size of the LSH bucket, which is not constant and grows linearly with the total number of features, in addition to varying a lot for the same number of features. We assume we have a preset number of unique bucket values B, and then the average bucket size is FI/B.

• Example: with H = 50, T = 4, B = 10^6,

T_nn,lsh,l2 = F(T H (2d + 2) + (FI/B)(2d + 1) T) = 10^3 (200 x 258 + 4 x 10^6 x 257) = 51.6 x 10^6 + 1028 x 10^9 ≈ 1028 GFLOP/image

With H = 5 for the spherical hashes,

T_nn,lsh,sim = F(T H (2d^2 + 3d) + (FI/B)(2d + 1) T) = 10^3 (20 (2 x 128^2 + 3 x 128) + 1028 x 10^6) = 663 x 10^6 + 1028 x 10^9 ≈ 1028 GFLOP/image

T_nn,lsh,orth = F(2 T H (d^2 + d) + (FI/B)(2d + 1) T) = 10^3 (40 (128^2 + 128) + 1028 x 10^6) = 660 x 10^6 + 1028 x 10^9 ≈ 1028 GFLOP/image


Parameters There are in general two parameters for any LSH method:

• T, the number of tables. Generally, having more tables improves performance but increases run time as well (in the case of variable B checks).

• H, the number of hash functions. Having more functions increases run time.

However, L2 LSH includes one more parameter, w, the bin size for the projection.

Parallelization There are two ways to parallelize LSH operations. The first is a straightforward extension of the exhaustive search case:

• Divide the ~152 TB of data across ~3000 machines with ~50GB of memory each. For every query feature, search the machines simultaneously.

• Combine the search results as in the exhaustive case.

• Speedup: having more machines here corresponds to a linear speedup. Consider having M machines; then each machine will perform

T^p_nn,lsh ≈ F (FI/(MB)) T (2d + 1) + L(M - 1)

• Example:

T^p_nn,lsh,l2 = F T H (2d + 2) + (F^2 I/(MB)) T (2d + 1) + L(M - 1) = 51.6 x 10^6 + 85 x 10^9 + 3 x 10^5 ≈ 85 GFLOP/image

The second is similar to the Kd-trees:

• Instead of having independent LSH indexes on different machines, there will be a central machine that computes all the hash functions and has a list of which machines store which buckets.

• The rest of the machines store the features that belong to a subset of the buckets.

• The central machine(s) get the query features, compute the hash values (buckets), and then dispatch each feature to the machines containing its bucket.

• Speedup: we have two parts for the running time, one on the central machine and one on the node machines:

S_m2,c = F T_h,x

S_m2,n = F T (FI/B)(2d + 1)

where S_m2,c is the time to compute the hash values on the central machine (T_h,x being the hash-computation cost of the chosen hash function x), while S_m2,n is the time taken by the node machines to search the corresponding buckets. This method works if S_m2,c is much smaller than S_m2,n; on the other hand, the central machine becomes a bottleneck if the two times are comparable.

A.1.4 Hierarchical K-Means (HKM)

Description HKM builds a hierarchical clustering tree of the features. At each node, k-means is performed on the features in that node, and the features are then split into the child nodes depending on their distance to the cluster centers. This continues until the maximum depth is reached (see the sketch below).
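A minimal sketch of building such a tree with scikit-learn's KMeans and descending it at query time; k and the depth are the parameters discussed below, the toy sizes are illustrative, and this is a sketch of the idea rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hkm(features, idx, k=10, depth=2):
    """Recursively run k-means at each node; leaves keep the (global) indices
    of the database features that fell into them."""
    if depth == 0 or len(idx) <= k:
        return {"leaf": idx}
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(features[idx])
    children = [build_hkm(features, idx[km.labels_ == c], k, depth - 1) for c in range(k)]
    return {"kmeans": km, "children": children}

def hkm_lookup(node, x):
    """Descend the tree to a leaf; the caller then searches the leaf's features exhaustively."""
    while "kmeans" in node:
        c = int(node["kmeans"].predict(x[None, :])[0])   # nearest cluster center at this level
        node = node["children"][c]
    return node["leaf"]                                   # candidate feature indices

# usage sketch (toy sizes)
db = np.random.rand(2000, 128).astype(np.float32)
tree = build_hkm(db, np.arange(len(db)), k=10, depth=2)
candidates = hkm_lookup(tree, db[0])
```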


Storage

• We still need to store all the features and their information:

S_ex = (sd + 4) I F

• In addition, there is storage for the clustering tree. For a tree with depth D and branching factor k, there are k^D leaf nodes, which store the actual features. The number of internal nodes is

n = sum_{i=0}^{D-1} k^i = (k^D - 1)/(k - 1)

and for each internal node we need to store the k cluster centers, therefore the total storage is:

S_hkm = ((k^D - 1)/(k - 1)) k s d + (sd + 4) I F

• Example: with k = 10 branches and D = 7,

S_hkm = 1.1 x 10^6 x 10 x 128 + 132 x 10^12 = 1.42 x 10^9 + 132 x 10^12 ≈ 132 TB

Speed

• At each level of the clustering tree, we need to find the nearest cluster, in addition to finding the nearest feature in the leaf node:

T_nn,hkm = F(D(2d + k) + (FI/k^D)(2d + 1))

where the second term comes from searching the features in the leaf node, assuming each leaf node has the same share of features.

• Example: with k = 10 and D = 7,

T_hkm = 10^3 (7 x 266 + 10^5 x 257) ≈ 25.7 GFLOP/image

Parameters There are two parameters that affect the amount of storage and the running time:

• D, the maximum depth of the tree.

• k, the branching factor of the tree.

Parallelization There are two ways to parallelize HKM operations. The first is a straightforward extension of the exhaustive search case:

• Divide the ~132 TB of data across ~2650 machines with ~50GB of memory each. For every query feature, search the machines simultaneously.

• Combine the search results as in the exhaustive case.

• Speedup: having more machines here corresponds to less search time. Consider having M machines; then each machine will perform

T^p_nn,hkm = F D (2d + k) + (F^2 I/(M k^D))(2d + 1) + L(M - 1)


• Example:

T_nn,hkm = 10^3 x 7 x 266 + 3.77 x 10^4 + 100 x 2649 ≈ 2.1 MFLOP/image

The second is similar to the Kd-trees:

• Instead of having independent HKM trees on different machines, there will be a central machine that stores the upper part of the HKM tree (the internal nodes).

• The rest of the machines, the node machines, store subsets of the leaf nodes and their features.

• The central machine(s) get the query features, go through the top of the tree, and then dispatch the features to the appropriate node machines.

• Speedup: we have two parts for the running time, one on the central machine and one on the node machines:

S_m2,c = D(2d + k)

S_m2,n = 2d (FI/k^D)

where S_m2,c is the time to go through the top of the HKM tree on the central machine, while S_m2,n is the time taken by the node machines to search the leaf nodes.

A.2 Bag-of-Words (BoW) Representation

In this approach, we only store histogram counts of visual words for each image, in addition to storing the feature information (location).

A.2.1 Inverted File

Description The inverted file is a data structure designed to allow fast searching through a large index of documents represented by histograms of word occurrences. Instead of having a "forward" file, in which a list is stored for every document listing the words it contains, the opposite is done, i.e. a list is stored for each word listing the documents that contain this word. This has the advantage of very fast search through the index, because only documents with shared words are considered during the search process (see the sketch below).
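A minimal sketch of an inverted file over visual-word histograms: one posting list per word, and scoring touches only the images that share words with the query (ids and counts are toy values).

```python
from collections import defaultdict

class InvertedFile:
    def __init__(self):
        self.postings = defaultdict(list)         # word -> list of (image_id, count)

    def add_image(self, image_id, word_counts):
        # word_counts: {visual_word: count} for one database image
        for w, c in word_counts.items():
            self.postings[w].append((image_id, c))

    def query(self, word_counts):
        # accumulate a similarity score only over images sharing words with the query
        scores = defaultdict(float)
        for w, qc in word_counts.items():
            for image_id, dc in self.postings[w]:
                scores[image_id] += qc * dc        # dot-product style accumulation
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# usage sketch
inv = InvertedFile()
inv.add_image(0, {3: 2, 17: 1})
inv.add_image(1, {17: 3, 42: 1})
print(inv.query({17: 1, 42: 2}))
```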

Storage

• We need to store the feature information (e.g. location) to allow the spatial consistency step afterwards:

S_fi = 4 F I

• We also need storage for the inverted file:

S_if = W s d + F I (ceil(log2 I)/8 + 1)

where the first term is the storage for the W words, and the second term is the storage for the lists of images, where we have a total of FI features and for each we need to store the image id and the feature count.

• For large dictionaries of visual words, ~1M visual words, image histograms tend to be binary, i.e. there is a one-to-one mapping between features and visual words, so we do not need to store counts:

S_if,bin = W s d + F I ceil(log2 I)/8


• Total:

S_t = S_if + S_fi = 4FI + Wsd + FI(ceil(log2 I)/8 + 1) = Wsd + FI(5 + ceil(log2 I)/8)

S_t,bin = Wsd + FI(4 + ceil(log2 I)/8)

• Example: with W = 10^6, F = 10^3, I = 10^9,

S_t = 10^6 x 128 + 10^12 (5 + 4) ≈ 9 TB

S_t,bin ≈ 8 TB

Speed There are two components of the run time:

• Time for computing the visual words for every feature, which is usually done with either Kd-trees or HKM:

T_vw,kdt = 2dB + B + B log2 W = B(2d + 1 + log2 W)

T_vw,hkm = D(2d + k)

• Time for searching the inverted file, which is hard to estimate precisely because it depends on the amount of overlap between the query image and the database images. Assuming image features are distributed evenly among the visual words, each query feature will encounter roughly IF/W images per word, and so the time would be:

T_if = 2 + min(IF/W, I)

where the first term is for the normalization of the histogram, and the second is for computing the distance function, e.g. the L2 distance, between these values.

• Total time:

T_bow,if,kdt = F(B(2d + 1 + log2 W) + 2 + min(IF/W, I))

T_bow,if,hkm = F(D(2d + k) + 2 + min(IF/W, I))

• Example: with B = 250 and W = 10^6,

T_bow,if,kdt = 10^3 (250(257 + 20) + 2 + 10^12/10^6) ≈ 1 GFLOP/image

With D = 6 and k = 10,

T_bow,if,hkm = 10^3 (6(256 + 10) + 2 + 10^12/10^6) ≈ 1 GFLOP/image


Parameters There are a number of parameters:

• W the number of visual words in the dictionary. Generally, the larger the dictionary, the better the performance and the

faster the actual search, because the number of overlapping images is smaller.

• The weighting of the histogram counts:

– none: use raw histogram counts

– binary: binarize the histogram counts

– tf-idf: use the term frequency-inverse document frequency weighting to emphasize more informative words

• The normalization of the histograms:

– none: use un-normalized histograms

– L1: use L1 normalization, i.e. $\sum_i h_i = 1$

– L2: use L2 normalization, i.e. $\sum_i h_i^2 = 1$

• The distance function (a small sketch of these options follows this list):

– L1: use the distance $d_{l1}(h_i, h_j) = \sum_k |h_{ik} - h_{jk}|$

– L2: use the distance $d_{l2}(h_i, h_j) = \sqrt{\sum_k (h_{ik} - h_{jk})^2}$

– Cosine: use the distance $d_{cos}(h_i, h_j) = 1 - \sum_k h_{ik} h_{jk}$
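A minimal Python sketch of these normalization and distance options (hypothetical helper names; dense vectors for clarity, whereas the actual inverted-file search only visits shared words):

```python
import numpy as np

def normalize(h, norm="l2"):
    """Normalize a histogram: 'none', 'l1' (sum to 1), or 'l2' (unit length)."""
    h = np.asarray(h, dtype=float)
    if norm == "l1":
        return h / max(h.sum(), 1e-12)
    if norm == "l2":
        return h / max(np.linalg.norm(h), 1e-12)
    return h

def distance(hi, hj, kind="l2"):
    """Histogram distances listed above: L1, L2, or cosine."""
    hi, hj = np.asarray(hi, float), np.asarray(hj, float)
    if kind == "l1":
        return np.abs(hi - hj).sum()
    if kind == "l2":
        return np.sqrt(((hi - hj) ** 2).sum())
    return 1.0 - (hi * hj).sum()      # cosine distance on already-normalized inputs

a = normalize([2, 0, 1, 0], "l2")
b = normalize([1, 1, 1, 0], "l2")
print(distance(a, b, "l1"), distance(a, b, "l2"), distance(a, b, "cos"))
```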

Parallelization Parallelization is straightforward:

• Divide the storage of 9 TB among 180 machines, with about 50 GB each.

• Given a query image, all machines are searched simultaneously, each producing a ranked list of the images it stores.

• The top images per machine are broadcast, and each machine compiles a list of the top 100 images.

• The top 100 images are checked for spatial consistency by the machines that store them, and the results are broadcast again to get a final ranked list of 100 images.

• The results are returned to the user.

• Speedup: the time to search in parallel on the different machines will be smaller, but not by as much as the theoretical numbers above suggest, because these calculations assume the worst-case run time, which is not usually reached. With M machines:

$T_{bow,if,kdt} = F\left(B(2d + 1 + \log_2 W) + 2 + \min\left(\frac{IF}{MW}, \frac{I}{M}\right)\right)$

$T_{bow,if,hkm} = F\left(D(2d + k) + 2 + \min\left(\frac{IF}{MW}, \frac{I}{M}\right)\right)$

A.2.2 Min-Hash

Description The min-hash is a type of LSH. The histograms are binarized, i.e. represented as a "set" of visual words without word counts. The hash function is simply the minimum of a random permutation $\pi$ of the numbers $0, \dots, W-1$ applied to the image's word set: $h(b_i) = \min \pi(b_i)$.
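A minimal sketch of this hash in Python, assuming the linear permutation $\pi(x) = (ax + b) \bmod W$ mentioned in the Speed paragraph below; the helper names, and the assumption that this map is a valid permutation (e.g. W prime, a nonzero), are illustrative.

```python
import random

# Illustrative min-hash sketch: h(b_i) = min over the image's visual words of pi(word),
# with pi(x) = (a*x + b) mod W (a permutation when W is prime and a != 0, which we
# simply assume here for illustration).

def make_minhash(W, num_functions, seed=0):
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, W), rng.randrange(W)) for _ in range(num_functions)]
    def signature(word_set):
        return tuple(min((a * x + b) % W for x in word_set) for a, b in coeffs)
    return signature

sig = make_minhash(W=1_000_003, num_functions=2)      # H = 2 hash functions per table
img1 = {5, 17, 900}                                   # binarized histogram = set of words
img2 = {5, 17, 901}                                   # near-duplicate word set
print(sig(img1), sig(img2))                           # similar sets often share a signature
```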

Storage

• We need to store the feature information (e.g. location) to allow the spatial consistency step afterwards

S f i = 4FI


• We also need storage for the LSH tables:

$S_{lsh,mh} = W s_d + FI\,\frac{\log_2 W}{8} + T\left(\frac{\log_2 I}{8} + S_h\right)$

where the first term is the storage for the W words, the second is for storing the sparse histograms, and the third term is the storage for the T LSH tables, where for each we need to store indices to the images in the different buckets plus the negligible $S_h$ to store the hash functions themselves.

• Total (dropping the negligible $S_h$ in the last step):

$S_t = S_{lsh,mh} + S_{fi} = W s_d + FI\,\frac{\log_2 W}{8} + T\left(\frac{\log_2 I}{8} + S_h\right) + 4FI = W s_d + FI\left(4 + \frac{\log_2 W}{8}\right) + T\,\frac{\log_2 I}{8}$

• Example: with $W = 10^6$, $F = 10^3$, $I = 10^9$, and $T = 50$,

$S_t = 10^6 \times 128 + 10^{12}(4 + 3) + 50 \times 4 \approx 7$ TB

Speed There are two components of the run time:

• Time for computing the visual word for every feature, which is usually done with either Kd-trees or with HKM:

$T_{vw,kdt} = F(2dB + B + B\log_2 W) = FB(2d + 1 + \log_2 W)$

$T_{vw,hkm} = FD(2d + k)$

• Time for computing the hash functions and for the exhaustive search over images within the bucket:

$T_{mh} = FTH \times (3 + 1) + \frac{TI}{B}$

where the first term is for computing the hash functions (the permutation is computed as $\pi(x) = ax + b \bmod W$, and so requires three operations), and the second term is for computing the distance to the images in that bucket, assuming the I images are evenly distributed over B unique buckets, i.e. I/B images per bucket.

• Total time (note that B denotes the Kd-tree backtracking steps in the first term and the number of unique buckets in the last term):

$T_{bow,mh,kdt} = FB(2d + 1 + \log_2 W) + 4FTH + \frac{TI}{B}$

$T_{bow,mh,hkm} = FD(2d + k) + 4FTH + \frac{TI}{B}$

• Example: with $B = 10^6$ buckets, $W = 10^6$, $T = 50$, $H = 1$,

$T_{bow,mh,kdt} = 10^3 \times 250(257 + 20) + 200 \times 10^3 + \frac{50 \times 10^9}{10^6} \approx 70$ MFLOP/image

and with $D = 6$ and $k = 10$,

$T_{bow,mh,hkm} = 10^3\left(6(256 + 10) + 200 + 50\right) \approx 2$ MFLOP/image


Parameters There are a number of parameters:

• W the number of visual words in the dictionary. Generally, the larger the dictionary, the better the performance and the

faster the actual search, because the number of overlapping images is smaller.

• T the number of hash tables

• H the number of hash functions

Parallelization Parallelization is straightforward:

• Divide the storage of 7 TB among 140 machines, with about 50 GB each.

• Given a query image, all machines are searched simultaneously, each producing a ranked list of the images it stores.

• The top images per machine are broadcast, and each machine compiles a list of the top 100 images.

• The top 100 images are checked for spatial consistency by the machines that store them, and the results are broadcast again to get a final ranked list of 100 images.

• The results are returned to the user.

• Speedup: computations become faster with more machines, as the number of images per bucket goes down:

$T_{mh} = FTH \times (3 + 1) + \frac{TI}{BM}$

B Parameter Tuning

Some of the algorithms have a large number of parameter combinations that greatly affect their performance, in terms of trading off speed against recognition performance. In order to get a quick feel for which combinations of parameters to try out, we performed some quick experiments to estimate the speed and recognition rate of various combinations. The tuning is done with a random subset of 100 images from Scenario 1. The matching rate is the percentage of nearest-neighbor features that are actually in the ground truth model image. Based on these quick assessments, we chose certain combinations and performed further experiments on all the datasets.

B.1 Kd-Tree

For Kd-trees we have two parameters: T the number of trees and B the number of backtracking steps. Figure 6(a) shows results for trying 1, 4, and 8 trees with 50, 100, 250, and 500 backtracking steps. Using 8 trees takes much more memory with minimal gain in matching rate, while 50 and 500 backtracking steps give a poor trade-off between time and matching rate. Based on the plot, we decided to go forward with 1 and 4 trees and 100 and 250 backtracking steps. Figure 6(b) shows results for running these parameters on the four scenarios. We chose the combination {T,B} = {1,100} as the representative from this method in later comparisons.

We notice the following: 1) we generally do not need the geometric consistency check, as the recognition performance before it is very comparable to the recognition performance after it; 2) the geometric check time generally decreases as the database size grows. This is intuitive: as the number of features in the database increases, the number of matches for the distractor images goes down, and most of them end up with too few matches to undergo the geometric step.

B.2 Lsh

B.2.1 Lsh-L2

Here we have three parameters: T the number of tables, H the number of hash functions, and w the bin size. We tried different combinations of 1, 4, or 8 tables, 10, 25, 50, or 100 hash functions, and bin sizes of 0.01, 0.1, 0.25, 0.5, or 1. Based on the results in figure 7(a), we decided to use (T,H,w) = {4,10,0.1}, {4,25,0.25}, {4,50,0.5}, and {4,100,1} for the rest of the

shown in figure 7(b). We notice that they provide very similar recognition performance and searching time. We chose the

combination (T,H,w) = {4,25,0.25} as the representative from this method in later comparisons.


Figure 6: Kd-trees Parameter Tuning. (a) Quick tuning for Kd-trees: percentage of correct matches vs. time per feature (ms) for 100, 1k, and 10k distractors, comparing exhaust-l2 with kdt-t{1,4,8}-c{50,100,250,500}. (b) Results for the selected parameters (kdt-t{1,4}-c{100,250}) on the four scenarios: performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


Figure 7: Lsh L2 Parameter Tuning. (a) Quick tuning for Lsh-L2: percentage of correct matches vs. time per feature (ms) for 100, 1k, and 10k distractors, comparing exhaust-l2 with lsh-l2-t{1,4,8}-f{10,25,50,100}-w{.01,.1,.25,.5,1}. (b) Results for the selected parameters (lsh-l2-t4-f10-w.1, t4-f25-w.25, t4-f50-w.5, t4-f100-w1) on the four scenarios: performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


B.2.2 Lsh Spherical Simplex

Here we have two parameters: T the number of tables, and H the number of hash functions. We tried using 1 and 4 tables with

3, 5, or 7 hash functions. Results are in figure 8(a). We then decided to use 4 tables with 5 and 7 hash functions, whose results

are in figure 8(b). We notice that the recognition performance of using 5 functions is consistently better, while having similar

search time. We hence chose the combination {T,H} = {4,5} as the representative from this method in later comparisons.

B.2.3 Lsh Spherical Orthoplex

Here we have two parameters: T the number of tables, and H the number of hash functions. We tried using 1 and 4 tables

with 3, 5, or 7 hash functions. Results are in figure 9(a). We then decided to use 4 tables with 5 and 7 hash functions, whose

results are in figure 9(b). Again, we notice that the recognition performance of using 5 functions is consistently better, while

having similar search time. We hence chose the combination {T,H} = {4,5} as the representative from this method in later

comparisons.

B.3 Hierarchical K-Means

HKM has two parameters: D the depth of the tree and k the branching factor. We tried depths of 5, 6, and 7 with a branching factor of 10. Results for the quick tuning are in figure 10(a), and full results are in figure 10(b). In the quick tuning all the combinations are quite similar, so we ran them all on the four scenarios. We notice that they have similar performance, with a depth of 5 yielding slightly better results. We chose this combination {D,k} = {5,10} as the representative of the method in later comparisons.

B.4 Inverted File

We have several parameters:

• The number of words in the dictionary. We try using $10^4$, $10^5$, and $10^6$ words.

• The way to generate the dictionary; we compare two popular ways: a vocabulary tree built with the Hierarchical K-Means procedure (HKM) [19], and a flat dictionary built using Approximate K-Means, which employs a set of random Kd-trees to perform the nearest neighbor step (AKM) [20]. The dictionaries are built using features from $10^5$ images from each distractor set.

• Weighting the histogram, normalization of the histogram, and distance function. We used the following combinations:

{w,n,d}={none,l1,l1}, {none,l2,l2}, {tf-idf,l2,cos}, {bin,l1,l1}, and {bin,l2,l2}.

Results of the tuning are shown in figure 11. From these results, we decided to go ahead with three combinations: {tf-idf,l2,cos}, {bin,l2,l2}, and {none,l1,l1} (a small sketch of these weightings appears at the end of this subsection). The first is the standard way in the literature to use inverted file search. The second has also been shown to work competitively with larger dictionaries [15], while the third is a novel combination. We note that increasing the number of words in the dictionary enhances performance, and that HKM is a bit worse than AKM, so we go ahead and use the AKM dictionary with 1M words. Figure 12 shows results of these chosen parameters on the four scenarios.

We notice several points:

• Using the novel combination {none,l1,l1} is comparable to or outperforms the other two combinations on most scenarios, especially before the geometric step.

• Using binary histograms generally outperforms the standard tf-idf weighting of the histograms.

• Results before the geometric consistency check are generally better than results afterwards. We believe this is because of the way the geometric check is performed: feature matches are based on visual words, which provides a crude way of matching features between the two images and is not one-to-one when more than one feature has the same visual word.
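As a rough illustration of these weighting choices, the following Python sketch (hypothetical helper names; the idf definition is the usual $\log(I/\mathrm{df})$ form, which is assumed here) applies the {tf-idf, l2}, {bin, l2}, and {none, l1} options to a toy histogram.

```python
import numpy as np

def weight_histogram(counts, doc_freq=None, n_images=None, weighting="none", norm="l1"):
    """Apply one of the weighting/normalization combinations discussed above (illustrative).

    counts: raw visual-word counts for one image (length-W vector).
    doc_freq: per-word document frequencies, needed only for tf-idf.
    """
    h = np.asarray(counts, dtype=float)
    if weighting == "bin":
        h = (h > 0).astype(float)
    elif weighting == "tfidf":
        idf = np.log(n_images / np.maximum(doc_freq, 1))   # assumed idf definition
        h = h * idf
    if norm == "l1":
        h = h / max(h.sum(), 1e-12)
    elif norm == "l2":
        h = h / max(np.linalg.norm(h), 1e-12)
    return h

counts = [3, 0, 1, 1]
df, I = np.array([900, 5, 40, 400]), 1000
print(weight_histogram(counts, weighting="none", norm="l1"))
print(weight_histogram(counts, weighting="bin", norm="l2"))
print(weight_histogram(counts, df, I, weighting="tfidf", norm="l2"))
```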

B.5 Min-Hash

Here we have similar parameters pertaining to the dictionary: the number of words and how the dictionary is generated. In addition, we have two more parameters: T the number of hash tables and H the number of hash functions. We try 1, 5, 25, and 100 tables with 1, 2, and 3 hash functions. Based on the results in figure 13, which show these combinations for the different dictionary sizes and generation methods, we chose to go ahead with the akm-1M dictionary, using 25 and 100 tables with 1 hash function. Figure 14 shows the results of these combinations on the four scenarios. We notice the following:


Figure 8: LSH Spherical Simplex Parameter Tuning. (a) Quick tuning: percentage of correct matches vs. time per feature (ms) for 100, 1k, and 10k distractors, comparing exhaust-l2 with lsh-sph-sim-t{1,4}-f{3,5,7}. (b) Results for the selected parameters (lsh-sph-sim-t4-f5 and t4-f7) on the four scenarios: performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


Figure 9: Lsh Spherical Orthoplex Parameter Tuning. (a) Quick tuning: percentage of correct matches vs. time per feature (ms) for 100, 1k, and 10k distractors, comparing exhaust-l2 with lsh-sph-or-t{1,4}-f{3,5,7}. (b) Results for the selected parameters (lsh-sph-or-t4-f5 and t4-f7) on the four scenarios: performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


Figure 10: Hierarchical K-Means Parameter Tuning. (a) Quick tuning: percentage of correct matches vs. time per feature (ms) for 100, 1k, and 10k distractors, comparing exhaust-l2 with hkm-l{5,6,7}-b10. (b) Results for the selected parameters on the four scenarios: performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


Figure 11: Inverted File Parameter Tuning. Rows correspond to dictionaries: akm-10K, akm-100K, akm-1M, and hkm-1M, where the number corresponds to the number of words. The first column depicts the recognition performance before geometric checks, the second column shows the search time through the inverted file, the third column shows the recognition performance after the geometric step, and the fourth column shows the geometric check time. Curves compare ivf-none-l1-l1, ivf-none-l2-l2, ivf-tfidf-l2-cos, ivf-bin-l1-l1, and ivf-bin-l2-l2.


Figure 12: Inverted File results on the four scenarios for the chosen parameters (ivf-none-l1-l1, ivf-tfidf-l2-cos, ivf-bin-l2-l2): performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


• Min-Hash performance is generally worse than that of the Inverted File. This is expected, as it is just an approximation that works best for near-identical images, and not for images with large distortions [9, 11].

• The geometric consistency check is crucial for Min-Hash, as the first (hashing) step alone provides very poor performance.

References

[1] Mohamed Aly, Peter Welinder, Mario Munich, and Pietro Perona. Scaling object recognition: Benchmark of current state of the art techniques. In ICCV Workshop WS-LAVD, 2009.

[2] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.

[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923, 1998.

[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

[5] Andrei Broder, Moses Charikar, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60:630–659, 2000.

[6] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29:8–13, 1997.

[7] A. Z. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences 1997, pages 21–29, 1997.

[8] O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Finding a (thick) needle in a haystack. In CVPR, 2009.

[9] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In CIVR, pages 549–556, 2007.

[10] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.

[11] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In British Machine Vision Conference, 2008.

[12] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002.

[13] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. The Amsterdam library of object images. IJCV, 61:103–112, 2005.

[14] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.

[15] H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In ICCV, September 2009.

[16] Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In MULTIMEDIA, pages 869–876, 2004.

[17] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.

[18] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 2004.

[19] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.

[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.

[21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.


Figure 13: Min-Hash Parameter Tuning. Rows correspond to dictionaries: akm-10K, akm-100K, akm-1M, and hkm-1M, where the number corresponds to the number of words. The first column depicts the recognition performance before geometric checks, the second column shows the search time, the third column shows the recognition performance after the geometric step, and the fourth column shows the geometric check time. Curves compare minhash-t{1,5,25,100}-f{1,2,3}.


Figure 14: Min-Hash results on the four scenarios for the chosen parameters (minhash-t25-f1 and minhash-t100-f1): performance before and after the geometric check, search time, and geometric check time (sec/image) vs. number of images.


[22] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. In ICCV, 2005.

[23] T. Terasawa and Y. Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proceedings of the Workshop on Algorithms and Data Structures, 2007.

[24] Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale partial-duplicate web image search. In CVPR, 2009.

[25] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., (2), 2006.
