
Mining Multiple Queries for Image Retrieval: On-the-fly learning of an Object-specific Mid-level Representation

Basura Fernando and Tinne Tuytelaars
KU Leuven, ESAT-PSI, iMinds, Belgium

[email protected]

Abstract

In this paper we present a new method for object retrieval starting from multiple query images. The use of multiple queries allows for a more expressive formulation of the query object including, e.g., different viewpoints and/or viewing conditions. This, in turn, leads to more diverse and more accurate retrieval results. When no query images are available to the user, they can easily be retrieved from the internet using a standard image search engine.

In particular, we propose a new method based on pattern mining. Using the minimal description length principle, we derive the most suitable set of patterns to describe the query object, with patterns corresponding to local feature configurations. This results in a powerful object-specific mid-level image representation.

The archive can then be searched efficiently for similar images based on this representation, using a combination of two inverted file systems. Since the patterns already encode local spatial information, good results on several standard image retrieval datasets are obtained even without costly re-ranking based on geometric verification.

1. Introduction

The rapid growth of available image data, thanks to photo sharing sites such as Flickr and Picasa, has raised new challenges in image and video retrieval. Whereas traditional audiovisual archives come with carefully curated metadata, allowing easy access based on a predefined thesaurus, user generated content is hardly annotated – apart from a few, often not very informative, tags. The same holds for many older audiovisual archives that only recently have been digitized. This calls for content-based retrieval methods that analyze the images directly rather than relying on metadata. In this context, good results have been reported using bag-of-visual-words based methods [3, 5, 16] following the query by example paradigm. Starting from a single query image, these allow efficient retrieval of similar images from the archive building on an inverted file system, similar to the standard text-based retrieval methods using bag-of-words. The result is a ranked list, with images ranked according to their similarity to the given query image.

Figure 1. Multi-query examples with different viewing conditions for the queries Eiffel Tower and London Bridge.

The question then arises: How to select a good query image? An object may be pictured from different viewpoints or under different viewing conditions (e.g. different lighting conditions) – see Figure 1. Which images are retrieved then depends, to a large extent, on the selected query image.

To some extent, this can and has been alleviated by developing more robust image representations that are invariant to viewpoint changes and viewing conditions. The image retrieval and computer vision communities have made considerable progress in this direction with the help of affine invariant representations, domain adaptation techniques, and ‘tricks’ such as query expansion [3, 5, 6, 18]. Even then, extreme changes in viewing conditions cannot be recovered. Moreover, a single image can only show one aspect of an object (e.g. the front of a car, but not the rear) and may contain a substantial amount of image clutter.

Image retrieval starting from multiple queries (MQIR) takes a different approach to tackle these problems. By specifying multiple query images, the user can provide more information to the system about the specific object he wants to find, without the need to select the best one.

Another interesting application of MQIR is content-based image retrieval based on a textual query. Indeed, by leveraging the power of internet image search engines such as Google Images, a textual query can be used to retrieve a set of images from the internet. These can then serve as queries for the actual MQIR system. Of course, the set of images returned by the internet search engine may contain errors (i.e., images not representing the object of interest). Therefore, for this type of application, it is important that the MQIR system is robust to noisy/non-relevant images.

Given the variety in viewing conditions, different aspects, as well as image clutter, the challenge then is for the system to figure out, purely based on the provided query images: what exactly is the object of interest? Most MQIR methods in the literature (e.g. [2, 12]) do not try to answer this question, but simply perform a search based on one query image at a time and then fuse the results, or fall back to rather simple schemes to merge the features from the different query images (see also section 2 on related work). In contrast, we rely on an unsupervised pattern mining method that builds an object-specific mid-level representation that best models the query images based on the minimal description length principle (MDL) [10].

To this end, we first create local bag-of-words from a small neighbourhood around each key point to exploit local spatial and structural information. Using these local histograms, we mine the query images to discover a set of mid-level patterns that best explain the query images. Our mid-level patterns are compositions of visual words in a specific spatial configuration. They possess local spatial and structural information. We show they are a powerful mid-level representation for image retrieval.

We adapt the recently proposed data mining algorithm KRIMP [25] to discover the visual patterns. KRIMP is based on the minimal description length principle [10]. As such, the standard representation consisting of a typically high-dimensional (e.g. 10^6-D) bag-of-words is effectively shrunk into a compact model consisting of just a few hundred patterns.

Our approach has several advantages: (1) It is an unsupervised approach: it does not require any negative images – as opposed to exemplar SVM [2, 14], Joint-SVM [2] or the method presented in [12]. (2) Our method is robust to noisy images, as it exploits the regularity in the query images using the MDL principle. (3) The resulting KRIMP based representation is compact (only a few hundred patterns). (4) Using our new mid-level representation, relevant images can be retrieved accurately and efficiently, even without a costly geometric verification step.

Our major contributions are as follows: (1) We present a multiple query based image retrieval method that systematically outperforms other MQIR methods even without costly spatial verification. (2) We adapt a minimum description length based pattern mining algorithm to discover local structural patterns for specific object retrieval. To the best of our knowledge, we are the first to use MDL based mining for discovering visual patterns. (3) We present a novel query expansion method based on pattern mining which we call pattern based query expansion (PQE). (4) We introduce a new dataset to evaluate MQIR methods (see Figure 2).

Figure 2. An example query from our new dataset is shown in the first row and retrieved results using our method are shown in the next two rows.

The rest of the paper is organized as follows. First we discuss related work in section 2. Then we present our pattern based multiple query retrieval method in section 3. The experimental evaluation is described in section 4 and the conclusion is given in section 5.

2. Related Work

Single query methods Several methods for large scale image retrieval starting from a single query have been proposed [7, 13, 15, 17, 23]. They almost invariably consist of the following components: bag-of-words creation, indexing, query expansion, and geometric verification.

In query expansion [3, 5, 6] (QE), the original query image is replaced with a more representative set of images constructed based on the top ranked images. This, in effect, turns the single query retrieval problem into a multiple query retrieval one.

Geometric verification is typically used to improve the results by re-ranking the top-K retrieved images. This step is linear in the number of top-K images, and therefore needs to be implemented efficiently, e.g. using LO-RANSAC [4].

In addition to these relatively standard components, other improvements have been proposed as well. For instance, Zhang et al. [28] propose to use geometry preserving visual phrases. This method captures geometric information at the feature construction level. However, it is sensitive to local deformations as the structures it discovers are rigid.

Multiple query methods and textual queries In [27] a query specific feature fusion method is proposed. Even though it is not presented as an MQIR method, it can be extended to perform MQIR. They combine multiple retrieval sets using a graph based link analysis method and k-reciprocal nearest neighbors [19]. In the work of Liu et al. [12], a textual query based image retrieval method is presented that uses labeled positive/negative images downloaded from the internet to train a classifier. To handle cross-domain issues they merge domain specific classifiers as proposed in [21]. However, this is only possible if one has sufficient labelled images from each domain. Finally, Arandjelovic and Zisserman [2] present several extensions of standard bag-of-words to perform MQIR. These are described in section 4.1 and serve as baselines for our experimental results. In contrast with all these methods, to the best of our knowledge we are the first to mine query images to discover object-specific patterns to build mid-level features to retrieve images.

Pattern mining In computer vision, pattern mining has been used before, mostly for image classification [8, 20, 26] and action classification [9]. In all these cases a set of patterns is mined using the arguably outdated A-priori algorithm [1] or LCM [24], using all the images in the dataset. In contrast, we use pattern mining for image retrieval: we mine for patterns specific to an object using a small number of query images, and we do not mine patterns using the entire dataset. To make sure that the obtained patterns generalize to the image archive, we rely on the minimum description length principle (MDL) [10] to provide us with patterns that are representative of the object. None of the above mining-based methods [8, 9, 20, 26] use MDL to obtain visual patterns. We adapt the recently proposed data mining algorithm KRIMP [25]. Typically, KRIMP is used for pattern based classification using a probabilistic model. In contrast, we propose to make use of the discovered patterns directly to build mid-level object specific features. To this end, we use the top-k patterns that best represent the query object, where the best patterns are selected using a crude MDL principle [10].

3. Pattern Based Image Retrieval

Before we explain our overall process (section 3.3), we first introduce some mining specific notations and terminology (section 3.1) and explain how we construct transactions (section 3.2). We explain pattern mining in detail in section 3.4, our image scoring function in section 3.5, the MQBM step in section 3.6 and our query expansion method in section 3.7.

3.1. Notations and terminology

Let 𝐼 be a set of items. In our case, each item is a visual word label (or a label generated for a visual word in a specific configuration – see section 3.2). A transaction 𝑡 is a set of items (𝑡 ∈ 𝑃𝑜𝑤𝑒𝑟𝑆𝑒𝑡(𝐼)), e.g., 𝑡1 = {𝑖1, 𝑖2, 𝑖5, 𝑖9}, with each 𝑖𝑗 a visual word. Let 𝐷 = {𝑡1, 𝑡2, 𝑡3, . . . , 𝑡𝑛} be what is known as a database of transactions that is used to mine patterns from. A pattern is a generalization of a set of transactions and consists of a set of items. We denote a pattern by 𝑥. Roughly speaking, patterns can be considered as subsets of transactions. For example, 𝑥1 = {𝑖1, 𝑖2, 𝑖9} could be a pattern. We say a pattern 𝑥 is matched/mapped to transaction 𝑡 if 𝑥 ⊆ 𝑡. For example, 𝑥1 is matched to 𝑡1. A model 𝐻 is a set of patterns1.
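To make the notation concrete, the following minimal Python sketch (not from the paper; the item labels are made up) shows a transaction, a pattern, and the subset test used for matching:

```python
# Items are visual-word labels; a transaction is the set of items observed
# around one key point, and a pattern is a recurring subset of items.
t1 = {"i1", "i2", "i5", "i9"}   # transaction t1
x1 = {"i1", "i2", "i9"}         # pattern x1

def matches(pattern, transaction):
    """A pattern x is matched/mapped to a transaction t iff x is a subset of t."""
    return pattern <= transaction

print(matches(x1, t1))  # True: x1 is matched to t1
```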

3.2. From visual words to transactions

In our framework, we use Hessian-affine regions extracted with the method proposed in [16], described with the root-SIFT descriptor of [3]. A visual vocabulary is created using K-means clustering, and each SIFT descriptor extracted from a key point is assigned to one of the visual words (hard assignment). We use a local bag-of-words (LBOW) as the representation of a local region surrounding a key point. These LBOWs are constructed as histograms over the visual words in the spatial neighbourhood of a detected key point.

Given a key point location and its affine region, we find the spatially k-nearest key points within the affine region to construct a local bag-of-words. The affine region is further split into 4 quadrants aligned with the local affine frame, yet consistent with the upright assumption. This results in a histogram dimension of five times the visual vocabulary size.

Finally, LBOWs are transformed into transactions by considering each non-zero bin as an item.
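As an illustration only, the sketch below builds one LBOW and its transaction under assumptions the paper does not spell out: the first vocabulary-sized block holds the histogram over the whole neighbourhood, the remaining four blocks hold one histogram per quadrant, and quadrants are taken in an axis-aligned frame around the centre key point. Function and variable names are ours.

```python
import numpy as np

def lbow_transaction(center_idx, keypoints_xy, words, k, vocab_size):
    """Build a local bag-of-words around keypoints_xy[center_idx] from its k
    spatially nearest key points, then return the transaction, i.e. the set
    of non-zero bin indices. words[i] is the visual-word id of key point i."""
    center = keypoints_xy[center_idx]
    dists = np.linalg.norm(keypoints_xy - center, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]          # skip the centre itself
    hist = np.zeros(5 * vocab_size, dtype=np.int32)
    for i in neighbours:
        hist[words[i]] += 1                          # block 0: whole neighbourhood
        dx, dy = keypoints_xy[i] - center
        quadrant = (1 if dx >= 0 else 0) + (2 if dy >= 0 else 0)
        hist[(1 + quadrant) * vocab_size + words[i]] += 1
    return frozenset(int(b) for b in np.nonzero(hist)[0])
```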

3.3. Overall Process

Before describing the pattern mining in detail, we first describe the overall process of our system, starting with the offline process, i.e. indexing the archive. Indeed, to be applicable in a retrieval scenario, it must be possible to retrieve images from the archive in sub-linear time.

Offline Process: For the archive images, we process the data offline. We create local bag-of-words, transform them into transactions and index them using two inverted file systems (IFS). The first inverted file system, 𝐼𝐹𝑆1, indexes the archive transactions such that the key is a visual word label and the corresponding entry consists of a list of transactions containing this visual word label. The second inverted file system, 𝐼𝐹𝑆2, maps each transaction id to the set of corresponding images. Obviously, only those transactions that actually occur in our database images are entered in the inverted file systems2.
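A minimal sketch of this offline indexing, assuming the archive is given as a mapping from image id to its set of transactions (names and data layout are ours, not the paper's):

```python
from collections import defaultdict

def build_inverted_files(archive):
    """archive: dict image_id -> iterable of transactions (frozensets of items).
    Returns IFS1 (item -> transaction ids), IFS2 (transaction id -> image ids)
    and the transaction-id table used to deduplicate identical transactions."""
    ifs1 = defaultdict(set)
    ifs2 = defaultdict(set)
    tid_of = {}
    for image_id, transactions in archive.items():
        for t in transactions:
            tid = tid_of.setdefault(t, len(tid_of))
            ifs2[tid].add(image_id)
            for item in t:
                ifs1[item].add(tid)
    return ifs1, ifs2, tid_of
```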

Online Process: Given a set of query images and the inverted file systems 𝐼𝐹𝑆1 and 𝐼𝐹𝑆2, we retrieve similar images using the algorithm presented in Algorithm 1. First we extract descriptors for all query images; we assign them to visual words and create transactions as explained in section 3.2. Then we find a set of consistent key points (𝐶), i.e. key points that can be found in more than one query image (see section 3.6). Next, we create a database of visual transactions (𝐷) based on the LBOWs describing spatial neighbourhoods around consistent key points. Then we mine for patterns using the MDL principle and obtain a set of patterns 𝐻 that best explains the query image visual transactions. For each pattern 𝑥 ∈ 𝐻, we use 𝐼𝐹𝑆1 to find transaction IDs containing this pattern 𝑥. Using 𝐼𝐹𝑆2 and the retrieved transaction IDs we then find images that contain this visual pattern 𝑥. We repeat this IFS based search for each pattern in the model 𝐻. This way we count how many times each pattern occurs in the database images and construct bag-of-patterns. Comparing the bag-of-patterns allows us to rank the database images. Finally, this ranked list is refined based on query expansion to obtain the final ranking.3

1 In section 1.1 of the suppl. material a mining example is presented.
2 See detailed indexing algorithm in suppl. material (Section 1.2).

Data: Set of query images, 𝐼𝐹𝑆1, 𝐼𝐹𝑆2
Result: Ranked list
1. Extract features and create local transactions (3.2);
2. Find a set of consistent key-points 𝐶 (3.6);
3. Create database 𝐷 = {𝑡1 . . . 𝑡𝑛} only from the transactions centering on key-points of 𝐶 (3.2);
4. 𝐻 = {𝑥1 . . . 𝑥𝑘} ← Pattern-Mining(𝐷) (3.4);
5. for 𝑖 ← 1 to 𝑘 do
   5.1 𝑇𝑖 = {𝑡1 . . . 𝑡𝑚} ← Find archive image transactions containing 𝑥𝑖 (i.e. 𝑥𝑖 ⊆ 𝑡𝑚) using 𝐼𝐹𝑆1;
   5.2 𝑀𝑖 ← Find set of archive images containing transactions 𝑇𝑖 using 𝐼𝐹𝑆2;
end
6. Construct bag-of-patterns from 𝑀𝑖, 𝑖 = 1 . . . 𝑘;
7. Ranked list ← apply PQE (3.7) and obtain scores;

Algorithm 1: Online image retrieval process.
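The inner loop of Algorithm 1 (steps 5-6) can be sketched as follows; this is an illustrative implementation under our own data layout (IFS1 and IFS2 as built in the offline sketch above), not the authors' code:

```python
from collections import Counter, defaultdict

def bag_of_patterns(patterns, ifs1, ifs2):
    """For each mined pattern x, intersect the IFS1 posting lists of its items
    to get the archive transactions containing x, map them to images via IFS2,
    and count pattern occurrences per image (the bag-of-patterns)."""
    bops = defaultdict(Counter)               # image_id -> pattern counts
    for p_idx, x in enumerate(patterns):
        items = iter(x)
        tids = set(ifs1.get(next(items), set()))
        for item in items:
            tids &= ifs1.get(item, set())     # transactions containing all items of x
        for tid in tids:
            for image_id in ifs2[tid]:
                bops[image_id][p_idx] += 1
    return bops
```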

3.4. Pattern mining

Now that we have created a database of transactions 𝐷 based on the query images representing the object of interest, we need to explain the pattern mining techniques used to discover relevant patterns that can serve as an object-specific mid-level representation for better image retrieval. At this point it is worth noting that a single pattern 𝑥 only describes a part of the query object. A set of patterns that together describe the query object is called a model, denoted by 𝐻.

The most widely used pattern mining technique, known as Frequent Itemset Mining (FIM) [1], only discovers patterns that are frequent (as often used in computer vision [9, 20, 26]). With FIM, all closed frequent patterns are used to represent the query object. As a result, a large number of the selected patterns describe the same parts of the object and are, in fact, redundant. This model overly represents the data and, consequently, represents the noise too. Hence it is not the most suitable approach for image retrieval. Instead, what we need is a model that best explains the query image data (i.e., the query object) without redundancy.

3 See retrieval algorithm in supplementary material (Section 1.3) for more details.

The KRIMP algorithm4 [25] provides such a solution. It uses the minimal description length (MDL) principle to discover a set of patterns that together best explain the data. The KRIMP algorithm follows the philosophy: the best models are the ones that compress the data best. It exploits the regularity in the query image data to discover relevant patterns to represent the intended object. This approach finds a balance between the complexity of the model (number of patterns and complexity of the patterns) and the representation of the query data.

Some more terminology Before we explain KRIMP in detail, we introduce a few concepts we will need later on.

A cover function is a function of a model 𝐻 and a transaction 𝑡, 𝑐𝑜𝑣𝑒𝑟 : {𝐻 × 𝑡} → 𝑃𝑜𝑤𝑒𝑟𝑆𝑒𝑡(𝑃𝑜𝑤𝑒𝑟𝑆𝑒𝑡(𝐼)), that indicates which patterns in the model contribute to cover the transaction. Intuitively, one could say it selects the model patterns that are used to encode all items in the transaction. Let 𝑥, 𝑦 be patterns such that 𝑥, 𝑦 ∈ 𝐻 and let 𝑡 be a transaction (𝑡 ∈ 𝐷). A cover function has the following three properties: (1) 𝑥 ∈ 𝑐𝑜𝑣𝑒𝑟(𝐻, 𝑡) =⇒ 𝑥 ∈ 𝐻; (2) 𝑥, 𝑦 ∈ 𝑐𝑜𝑣𝑒𝑟(𝐻, 𝑡) =⇒ (𝑥 = 𝑦) or (𝑥 ∩ 𝑦 = ∅); and (3) 𝑡 = ∪𝑥∈𝑐𝑜𝑣𝑒𝑟(𝐻,𝑡) 𝑥. We refer to [25] for more details on how to compute the cover function.

The usage of a pattern 𝑥 ∈ 𝐻 by the query image based transactional database 𝐷 is computed as follows:

$usage(x \mid D) = |\{\, t \in D : x \in cover(H, t) \,\}|. \qquad (1)$

In words, usage measures how many times the pattern 𝑥 is used to encode the query images.
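As an illustration of these definitions, the sketch below uses a simple greedy cover: patterns are scanned in a fixed order and kept when they fit into the not-yet-covered part of the transaction, with singletons completing the cover. KRIMP's actual cover order is the one defined in [25]; this simplified order is our assumption.

```python
def cover(model, t):
    """Greedy cover: model is an ordered list of patterns (frozensets)."""
    remaining, used = set(t), []
    for x in model:
        if x <= remaining:          # property (2): chosen patterns are disjoint
            used.append(x)
            remaining -= x
    used.extend(frozenset([i]) for i in remaining)   # property (3): cover all items
    return used

def usage(x, model, database):
    """usage(x | D), eq. (1): number of transactions whose cover uses x."""
    return sum(1 for t in database if x in cover(model, t))
```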

KRIMP The KRIMP algorithm then proceeds to select the best model, i.e. the best set of patterns, based on the minimum description length principle, as follows. Given a set of models ℍ, the best model (𝐻∗) is the one that minimizes

$H^{*} = \arg\min_{H \in \mathbb{H}} \; [\, L(H) + L(D \mid H) \,] \qquad (2)$

in which 𝐿(𝐻) is the length of the model in bits and 𝐿(𝐷∣𝐻) is the length of the query image data once it is encoded with the model 𝐻. First we show how to compute the length of the model 𝐿(𝐻):

$L(H) = \sum_{x \in H} \left[ L(x \mid H) + L(x \mid H_{st}) \right] \qquad (3)$

where 𝐻𝑠𝑡 is the standard model consisting of only singleton items. To compute the length of a pattern 𝑥 given the model, we use Shannon entropy:

4 http://www.cs.uu.nl/groups/ADA/krimp/index.php

$L(x \mid H) = -\log\left( P(x \mid D) \right), \qquad (4)$

The quantity 𝑃(𝑥∣𝐷) is computed using the query image information as follows:

$P(x \mid D) = \frac{usage(x \mid D)}{\sum_{y \in H} usage(y \mid D)}. \qquad (5)$

Next, we look at the second term in eq. 2. The length of the entire query based transactional database 𝐷 when encoded by the model 𝐻, 𝐿(𝐷∣𝐻), is the summation of the lengths of all transactions in 𝐷 once encoded by the model 𝐻:

$L(D \mid H) = \sum_{t \in D} L(t \mid H). \qquad (6)$

The length of a transaction given the model 𝐻 and a cover function is then given by

$L(t \mid H) = \sum_{x \in cover(H, t)} L(x \mid H) \qquad (7)$

where 𝐿(𝑥∣𝐻) is computed as before using equation 4. Note that it can be shown that 𝐿(𝑡∣𝐻) is equal to −log(𝑃(𝑡∣𝐷)). Now we can solve the objective function in Equation 2 as described in [25]. By optimizing the above objective function we find the model 𝐻∗ that best explains the image query data 𝐷 with a minimal model complexity. We can use this model directly for image retrieval, following a fully probabilistic approach and ranking the database images based on their probability given the model. Alternatively, we can also consider the set of discovered patterns in the optimal model (𝐻∗) as a new set of mid-level features. Using these patterns we can then construct a histogram (bag-of-patterns) for each image. In practice, we do not use all patterns from 𝐻∗, but only the N patterns with the smallest length 𝐿(𝑥∣𝐻). We found experimentally that the fully probabilistic approach performs slightly worse than the bag-of-patterns scheme, so we opt to use the latter for all our experiments.
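For illustration, the sketch below puts equations (2)-(7) together to score one candidate model; it reuses the cover and usage sketches above. The standard-model term L(x|H_st) of eq. (3) is approximated here by the pattern's cardinality, which is our simplification, not the paper's exact encoding.

```python
import math

def total_encoded_length(model, database):
    """Two-part MDL score of eq. (2): L(H) + L(D|H), with L(x|H) = -log P(x|D)."""
    usages = {x: usage(x, model, database) for x in model}           # eq. (1)
    total = sum(usages.values())
    code_len = {x: -math.log(u / total) for x, u in usages.items() if u > 0}
    L_H = sum(code_len[x] + len(x) for x in code_len)                # eq. (3), approx.
    L_D_given_H = sum(code_len.get(x, 0.0)                           # eqs. (6)-(7)
                      for t in database for x in cover(model, t))
    return L_H + L_D_given_H
```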

Given a set of multiple query images, we use KRIMP to discover the patterns, followed by Algorithm 1 presented in section 3.3 to build bag-of-patterns and to retrieve similar images. Next we explain the weighting scheme and the similarity function we use to rank images.

3.5. Final score/ranking function

After we have constructed the bag-of-patterns, we use tf-idf weighting. Then we 𝐿2 normalize the tf-idf weighted bag-of-patterns. To obtain the final score for each database image, we use the square-root histogram intersection similarity [8]. The final score is obtained by summing the square-root histogram intersection similarity over all query images, i.e. for a database image 𝐼𝑑 we get

$score(I_d) = \sum_{q} \sum_{i} \min\left( \sqrt{w^{i}_{d}}, \sqrt{w^{i}_{q}} \right) \qquad (8)$

where $w^{i}_{d}$ is the tf-idf weighted, 𝐿2 normalized 𝑖-th bin of the histogram (bag-of-patterns) of database image 𝑑, and $w^{i}_{q}$ is the same for the 𝑞-th query image.
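A sketch of this scoring step (names and the exact idf computation are ours; the paper specifies only tf-idf weighting, L2 normalization and the square-root histogram intersection):

```python
import numpy as np

def rank_images(db_bops, query_bops, idf):
    """db_bops: dict image_id -> raw bag-of-patterns vector; query_bops: list of
    raw bag-of-patterns vectors, one per query image; idf: per-pattern weights.
    Implements eq. (8): sum over query images of the square-root histogram
    intersection between tf-idf weighted, L2-normalised histograms."""
    def weighted(v):
        w = v * idf
        n = np.linalg.norm(w)
        return w / n if n > 0 else w

    q_hists = [np.sqrt(weighted(v)) for v in query_bops]
    scores = {}
    for image_id, v in db_bops.items():
        d_hist = np.sqrt(weighted(v))
        scores[image_id] = sum(np.minimum(d_hist, q).sum() for q in q_hists)
    return sorted(scores, key=scores.get, reverse=True)
```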

3.6. Selecting consistent keypoints: Multiple query basic matching (MQBM)

We have now explained most steps of Algorithm 1, except for steps 2 and 7, which we describe in this and the next section.

Before we construct the database of transactions, we select consistent key points. This preprocessing step identifies a set of key points that are consistent across multiple query images. Inconsistent key points, i.e. those found in only a single query image, are filtered out. We call this step multiple query basic matching or MQBM.

This is done efficiently using quantized SIFT features, with a processing time linear in the number of query images. For this step, we create binary bag-of-words: if a visual word appears in the image, we set its bin to one, otherwise to zero. By summing the binary bag-of-words of all query images, we find the visual words that appear in at least two query images. Using such visual words, we can find the corresponding consistent key points. Only transactions centered at consistent key points are selected to construct the database of visual transactions 𝐷. This initial multiple query matching step removes unstable and noisy key points, significantly reduces the size of the transactional database and allows us to explore larger spatial patterns.
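A minimal sketch of MQBM under our own data layout (a 0/1 matrix of visual-word occurrences per query image and a per-key-point word assignment):

```python
import numpy as np

def consistent_visual_words(binary_bows):
    """binary_bows: (num_query_images, vocab_size) 0/1 matrix. A visual word is
    consistent if it appears in at least two query images."""
    return set(np.flatnonzero(binary_bows.sum(axis=0) >= 2))

def consistent_keypoints(keypoint_words, consistent_words):
    """Keep the indices of key points whose visual word is consistent; only the
    transactions centred on these key points enter the database D."""
    return [i for i, w in enumerate(keypoint_words) if w in consistent_words]
```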

3.7. Pattern based query expansion (PQE)

Query expansion is widely used [3, 5] to obtain better query image representations. For instance, the most widely used average query expansion takes the average of the top-k ranked image histograms. Instead, in our novel pattern based query expansion (PQE), we combine the top-k ranked images with the query images to find a better model (a set of patterns). This can be combined with any pattern mining algorithm; the same objective function used for pattern discovery is used for PQE. For example, for FIM, we select the most frequent set of patterns from both the query images and the top-k ranked images. When KRIMP is used, we re-discover patterns using the query and top-k images with the same MDL objective function (equation 2).
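As a sketch (interfaces are ours), PQE simply re-runs the chosen miner on the query transactions augmented with those of the top-k ranked images:

```python
def pattern_query_expansion(query_transactions, ranking, archive_transactions, mine, k=5):
    """ranking: image ids sorted by the initial score; archive_transactions maps
    image id -> its transactions; mine is the pattern miner (e.g. KRIMP or FIM)
    run with the same objective as in the first round."""
    expanded = list(query_transactions)
    for image_id in ranking[:k]:
        expanded.extend(archive_transactions[image_id])
    return mine(expanded)   # the expanded model H used for the final ranking
```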

4. Experiments

To evaluate the proposed multiple query image retrieval system we use two mainstream image retrieval datasets: the Oxford-Buildings-105K [17] dataset and the Paris dataset (6K) [18] with 100K Flickr distractor images (collected from MIRFLICKR-1M [11]), hereafter referred to as the Paris-105K dataset. Additionally, we introduce a new challenging dataset to evaluate the performance of multiple query based image retrieval methods. Most of the existing image retrieval datasets are not designed to evaluate multi-query approaches. We have collected old and new images of buildings, statues, churches, castles and other landmarks in Belgium. We call this dataset the Belgium landmarks dataset. Each query consists of five images of a specific place or object from different time periods, viewing conditions and viewpoints. The objective is to retrieve more images of the same place or object using the five query images given. There are eleven queries in the dataset and 5500 images altogether. An example query and the results we obtain using our method are shown in Figure 2.

We use a visual dictionary of one million visual words created using the K-means algorithm. We follow the same experimental setup as in [2]. For both the Oxford-105K and Paris-105K datasets we use 5 images per query. Altogether there are 11 multiple query sets for each dataset. We also evaluate all methods using query images obtained from Google image search. We use the top 8 Google images returned for a textual query such as "Eiffel Tower Paris" from the Google image search API5. To evaluate the performance we use mean average precision (mAP).

4.1. Comparing several multi-query approaches

In this experiment, by default we use 10 spatial neighbours of a key point to construct the LBOWs, KRIMP based pattern discovery, and PQE for query expansion using the top-5 ranked images (see the detailed analysis in section 4.2 for how we selected these parameters). We use only 300 patterns to represent a query object. In Table 1 we compare our method against several multiple query based methods introduced in [2]. These include: i) Joint AVG.: a joint average query method that takes the average of all query histograms; ii) MQ. AVG.: a method that takes the average of the scores returned by each single image in the query; iii) MQ. MAX.: a method that takes the maximum instead of the mean of the scores returned by each single image in the query; and iv) Joint SVM: a method that learns an SVM over all query images using a fixed pool of negatives. For Oxford-105K, using dataset queries, we report the results for the baselines as reported in [2]. For the other experiments, we used our own implementation, including query expansion. We evaluate the performance with and without the costly spatial verification (SV) step. We did not include the exemplar SVM method of [2], as it did not perform so well in their experiments on Oxford-105K (mAP of 0.846 with SV). For the spatial verification we use a maximum of 200 top ranked images per query image, as in [2].

5 https://developers.google.com/image-search/

Figure 3. Left: mAP when varying the number of Google images in a query. Right: standard deviation in AP.

The KRIMP pattern based method systematically outperforms all other methods. The difference is especially pronounced when not using spatial verification. When Google queries are used, the performance of all methods drops, yet the ordering of the methods remains the same. On our new challenging dataset, the overall performance of all methods drops drastically, but still the KRIMP mid-level pattern based approach performs best. The single query approach performs very poorly on all datasets. This suggests that, for practical applications and given the ease of collecting query images from the web, multiple queries should be used whenever possible.

In Figure 3 we show how the performance varies with the number of top-K Google images used on the Paris-105K dataset. Especially when the number of query images increases, our method outperforms the other methods. Moreover, it also shows less variance than the other methods as the number of images in the query grows.

4.2. Detailed analysis

Next we evaluate some of the design choices we made using the Oxford-105K and Paris-105K datasets. For these experiments, we use a smaller visual dictionary of 200K words to keep the computational time low.

Comparing query expansion methods: First, we compare several query expansion methods that can be used in combination with our pattern based approach. In this experiment, we use the FIM pattern mining method to build the query object model and a maximum of 3 neighbors to create LBOWs/transactions. We compare pattern based query expansion (PQE) with the most commonly used average query expansion (AQE) and the discriminative query expansion (DQE) introduced in [3]. For DQE we use LibLinear6 to train the SVM. For all these methods we use the query images along with the top 5 retrieved images for query expansion (in the case of DQE, 5 query images + 5 top ranked images are used as positives and a fixed set of 200 negatives). Results are shown in Figure 4.

From this result we conclude that for our pattern based approach, pattern based query expansion (PQE) is a better choice compared to DQE or AQE. DQE results are disappointing.

6 http://www.csie.ntu.edu.tw/cjlin/liblinear/



Method        | Oxford-105K                                                      | Paris-105K                          | New Dataset-5K
              | Dataset queries     | Dataset queries (our) | Google queries     | Dataset queries | Google queries    | Dataset queries
              | No SV      SV       | No SV     SV          | No SV     SV       | No SV    SV     | No SV    SV       | No SV    SV
Single query  | 0.622 [2]  0.725 [2]| 0.616     0.713       | 0.602     0.668    | 0.696    0.737  | 0.539    0.550    | 0.481    0.501
Joint AVG.    | 0.886 [2]  0.933 [2]| 0.874     0.927       | 0.743     0.755    | 0.813    0.892  | 0.803    0.821    | 0.738    0.793
MQ. AVG.      | 0.888 [2]  0.937 [2]| 0.876     0.938       | 0.746     0.755    | 0.813    0.892  | 0.769    0.820    | 0.739    0.794
MQ. Max.      | 0.826 [2]  0.929 [2]| 0.823     0.919       | 0.673     0.764    | 0.819    0.889  | 0.782    0.816    | 0.752    0.785
Joint SVM     | 0.886 [2]  0.926 [2]| 0.867     0.918       | 0.750     0.767    | 0.849    0.888  | 0.830    0.839    | 0.675    0.786
Our           | 0.947      0.944    | 0.947     0.944       | 0.767     0.790    | 0.907    0.913  | 0.865    0.869    | 0.791    0.797

Table 1. Comparison of several multiple query approaches with and without spatial verification (SV). For all the methods for which we do not report results directly from [2], we also include query expansion. For our KRIMP-based method we use pattern based query expansion.

Figure 4. Comparison of query expansion methods.

Figure 5. Left: The effect of multiple query basic SIFT matching. Right: Comparison of pattern mining methods: FIM vs. KRIMP.

This may be due to their sensitivity to the selection of negatives. Note that in [6] it is proposed to use at most the top 50 spatially verified results in query expansion, and in [3] low tf-idf scores provide the negative training data for DQE. Instead, we followed the typical query expansion setup. This could be one of the reasons for the low performance of DQE.

Effect of initial basic query matching: In this experiment we show the effect of query side basic SIFT matching (MQBM). We construct LBOWs/transactions using a maximum of 10 nearest neighbors and PQE. Results are shown in Figure 5 (left).

Comparison between FIM and KRIMP: Now we compare FIM based pattern mining with KRIMP based pattern mining, using all of the above improvements such as PQE and MQBM. In this experiment transactions are created using a maximum of 10 nearest neighbors. Results are shown in Figure 5 (right). Clearly, the KRIMP based approach outperforms the FIM based approach. In Figure 6 we evaluate the performance of the KRIMP pattern based method with varying model size (number of patterns). We sort the patterns by length (i.e. 𝐿(𝑥∣𝐻)) and select the top-k shortest patterns. Even with a small number of patterns such as 100, the KRIMP pattern based model outperforms the baselines of Table 1, with 0.91 mAP without spatial verification and 0.928 with spatial verification on Oxford-105K. At 300 patterns, the curves level off.

Figure 6. Performance vs. number of patterns using our approach.

Execution times and complexity analysis: Our MQIR method consists of two major steps: (1) processing the query images (SIFT feature extraction, MQBM, transaction creation, pattern mining), and (2) pattern based image retrieval using the inverted file systems. The execution time of step (1) depends only on the number of query images; the second step depends only on the size of the image archive. The first step takes about 2-3 seconds for 8 query images, while the second step takes less than 0.1 seconds, both measured on a single 2.6 GHz CPU for a 250K image archive. In Figure 7, we show how the image retrieval time varies with the size of the image archive. The first step of our system can easily be parallelized if multiple CPUs/GPUs are available, which is common in modern computers. One of the key aspects of our mid-level features is that they can be indexed for efficient image retrieval. It should be noted that most of the mid-level feature representations in the literature cannot be indexed (e.g. [22]), or at least it is not clear how to do so.

The inverted file systems 𝐼𝐹𝑆1 and 𝐼𝐹𝑆2 take about 1500 MB of RAM to index 105K images.7

5. Discussion and Conclusion

In this paper we present a new method for image retrieval starting from multiple query images. We learn a query object model on-the-fly using a minimal description length principle based pattern mining approach. This way we construct a new mid-level feature representation for the query object. Our mid-level features capture local spatial and structural information, so we do not need to rely on costly geometric verification. We also introduced pattern based query expansion, which is suitable for pattern based image retrieval methods. We demonstrate excellent results on standard image retrieval benchmarks. Compared to other methods, the proposed method shows a steady performance improvement as the number of images in a query increases.

7 See memory analysis in supplementary material (Section 1.5).

Figure 7. Pattern based retrieval time (excluding the processing time of the query images, which is independent of the database size) for 300 patterns, varying the number of images in the image archive.

We introduced a new challenging dataset to evaluate the performance of multi-query based image retrieval methods. On this dataset, even without the costly spatial verification, our method again outperforms all others. We show that single query based approaches perform poorly on all datasets. We claim that using multiple queries is a much better choice for image retrieval whenever it is possible to collect several query images, and in that setting the proposed method seems the most suitable.

Acknowledgements: The authors acknowledge the support of the EC FP7 project AXES and the iMinds Impact project Beeldcanon.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, 1994.
[2] R. Arandjelovic and A. Zisserman. Multiple queries for large scale specific object retrieval. In BMVC, 2012.
[3] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[4] O. Chum, J. Matas, and S. Obdrzalek. Enhancing RANSAC by generalized model optimization. In ACCV, 2004.
[5] O. Chum, A. Mikulik, M. Perdoch, and J. Matas. Total recall II: Query expansion revisited. In CVPR, 2011.
[6] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[7] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, 2008.
[8] B. Fernando, E. Fromont, and T. Tuytelaars. Effective use of frequent itemset mining for image classification. In ECCV, 2012.
[9] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV, 2009.
[10] P. D. Grunwald. The Minimum Description Length Principle. The MIT Press, 2007.
[11] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In ACM MIR, 2008.
[12] Y. Liu, D. Xu, I.-H. Tsang, and J. Luo. Textual query of personal photos facilitated by large-scale web data. PAMI, 33:1022–1036, 2011.
[13] A. Makadia. Feature tracking for wide-baseline image retrieval. In ECCV, 2010.
[14] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[15] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
[16] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In CVPR, 2009.
[17] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[18] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[19] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR, 2011.
[20] T. Quack, V. Ferrari, B. Leibe, and L. Van Gool. Efficient mining of frequent and distinctive feature configurations. In ICCV, 2007.
[21] G. Schweikert, C. Widmer, B. Scholkopf, and G. Ratsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In NIPS, 2008.
[22] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[23] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[24] T. Uno, T. Asai, Y. Uchida, and H. Arimura. LCM: An efficient algorithm for enumerating frequent closed item sets. In FIMI, 2003.
[25] J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23:169–214, 2011.
[26] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010.
[27] S. Zhang, M. Yang, T. Cour, K. Yu, and D. Metaxas. Query specific fusion for image retrieval. In ECCV, 2012.
[28] Y. Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual phrases. In CVPR, 2011.
