Mobile Visual Search
Transcript of Mobile Visual Search
Mobile Visual Search
Oge Marques Florida Atlantic University
Boca Raton, FL - USA
TEWI Kolloquium – 24 Jan 2012
Take-home message
Oge Marques
Mobile Visual Search (MVS) is a fascinating research field with many open challenges and opportunities which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos) using mobile devices.
Outline
• This talk is structured in four parts:
1. Opportunities
2. Basic concepts
3. Technical details
4. Examples and applications
Oge Marques
Part I
Opportunities
Mobile visual search: driving factors
• Age of mobile computing
Oge Marques http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
Mobile visual search: driving factors
• Smartphone market
Oge Marques http://www.idc.com/getdoc.jsp?containerId=prUS23123911
Mobile visual search: driving factors
• Smartphone market
Oge Marques http://www.cellular-news.com/story/48647.php?s=h
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone…
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone…
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Powerful devices
1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset
Oge Marques http://www.apple.com/iphone/specs.html http://www.gsmarena.com/apple_iphone_4s-4212.php
Mobile visual search: driving factors
• Social networks and mobile devices (May 2011)
Oge Marques http://jess3.com/geosocial-universe-2/
Mobile visual search: driving factors
• Social networks and mobile devices
– Motivated users: image taking and image sharing are huge!
Oge Marques http://www.onlinemarketing-trends.com/2011/03/facebook-photo-statistics-and-insights.html
Mobile visual search: driving factors
• Instagram:
– 13 million registered (although not necessarily active) users (in 13 months)
– 7 employees
– Several apps based on it!
Oge Marques http://venturebeat.com/2011/11/18/instagram-13-million-users/
Mobile visual search: driving factors
• Food photo sharing!
Oge Marques http://mashable.com/2011/05/09/foodtography-infographic/
Mobile visual search: driving factors
• Legitimate (or not quite…) needs and use cases
Oge Marques http://www.slideshare.net/dtunkelang/search-by-sight-google-goggles https://twitter.com/#!/courtanee/status/14704916575
Mobile visual search: driving factors
• A natural use case for content-based image retrieval (CBIR) with query by example (QBE), at last!
– The example is right in front of the user!
Oge Marques Girod et al. IEEE Multimedia 2011
IEEE SIGNAL PROCESSING MAGAZINE [62] JULY 2011
• The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as query.
• The mobile client downloads data from the server, and all image matching is performed on the device.
One could also imagine a hybrid of the approaches mentioned above. When the database is small, it can be stored on the phone, and image-retrieval algorithms can be run locally [8]. When the database is large, it has to be placed on a remote server and the retrieval algorithms are run remotely.
In each case, the retrieval framework has to work within stringent memory, computation, power, and bandwidth constraints of the mobile device. The size of the data transmitted over the network needs to be as small as possible to reduce network latency and improve user experience. The server latency has to be low as we scale to large databases. This article reviews the recent advances in content-based image retrieval with a focus on mobile applications. We first review large-scale image retrieval, highlighting recent progress in mobile visual search. As an example, we then present the Stanford Product Search system, a low-latency interactive visual search system. Several sidebars in this article invite the interested reader to dig deeper into the underlying algorithms.
ROBUST MOBILE IMAGE RECOGNITION
Today, the most successful algorithms for content-based image retrieval use an approach that is referred to as bag of features (BoFs) or bag of words (BoWs). The BoW idea is borrowed from text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words. In the database, the document itself can be likewise represented by a bag of salient words, regardless of where these words appear in the text. For images, robust local features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of the retrieval pipeline. However, the variability of features extracted from different images of the same object makes the problem much more challenging.
A typical pipeline for image retrieval is shown in Figure 2. First, the local features are extracted from the query image. The set of image features is used to assess the similarity between query and database images. For mobile applications, individual features must be robust against geometric and photometric distortions encountered when the user takes the query photo from a different viewpoint and with different lighting compared to the corresponding database image.
Next, the query features are quantized [9]–[12]. The partitioning into quantization cells is precomputed for the database, and each quantization cell is associated with a list of database images in which the quantized feature vector appears somewhere. This inverted file circumvents a pairwise comparison of each query feature vector with all the feature vectors in the database and is the key to very fast retrieval. Based on the number of features they have in common with the query image, a short list of potentially similar images is selected from the database.
Finally, a geometric verification (GV) step is applied to the most similar matches in the database. The GV finds a coherent spatial pattern between features of the query image and the candidate database image to ensure that the match is plausible. Example retrieval systems are presented in [9]–[14].
For mobile visual search, there are considerable challenges to provide the users with an interactive experience. Current deployed systems typically transmit an image from the client to the server, which might require tens of seconds. As we scale to large databases, the inverted file index becomes very large, with memory swapping operations slowing down the feature-matching stage. Further, the GV step is computationally expensive and thus increases the response time. We discuss each block of the retrieval pipeline in the following, focusing on how to meet the challenges of mobile visual search.
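The quantization-plus-inverted-file idea above can be sketched in a few lines. This is a toy illustration with hypothetical image names and precomputed visual-word IDs; a real system would quantize high-dimensional local descriptors against a trained codebook first.

```python
from collections import defaultdict

def build_inverted_index(database):
    """database maps image_id -> set of visual-word IDs found in that image."""
    index = defaultdict(list)
    for image_id, words in database.items():
        for w in words:
            index[w].append(image_id)  # inverted file: word -> list of images
    return index

def shortlist(index, query_words, size=2):
    """Rank database images by the number of visual words shared with the query."""
    scores = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, []):
            scores[image_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:size]

# Hypothetical database; the short list would then go to geometric verification.
database = {"cd_cover": {3, 17, 42, 99}, "poster": {5, 17, 63}, "book": {1, 2, 42}}
index = build_inverted_index(database)
print(shortlist(index, {17, 42, 99}))  # "cd_cover" shares three words, ranks first
```

Note that only the query's word IDs are touched, never every database feature vector; this is the property that makes inverted-file lookup fast.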
[FIG1] A snapshot of an outdoor mobile visual search system being used. The system augments the viewfinder with information about the objects it recognizes in the image taken with a camera phone.
[FIG2] A pipeline for image retrieval (Query Image → Feature Extraction → Feature Matching against the Database → Geometric Verification). Local features are extracted from the query image. Feature matching finds a small set of images in the database that have many features in common with the query image. The GV step rejects all matches with feature locations that cannot be plausibly explained by a change in viewing position.
MOBILE IMAGE-RETRIEVAL APPLICATIONS POSE A UNIQUE SET OF CHALLENGES.
Part II
Basic concepts
MVS: technical challenges
• How to ensure low latency (and interactive queries) under constraints such as:
– Network bandwidth
– Computational power
– Battery consumption
• How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc.
• How to handle broad and narrow domains
Oge Marques
MVS: Pipeline for image retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
3 scenarios
Oge Marques Girod et al. IEEE Multimedia 2011
Part III
Technical details
Part III - Outline
• The MVS pipeline in greater detail
• Datasets for MVS research
• MPEG Compact Descriptors for Visual Search (CDVS)
Oge Marques
MVS: descriptor extraction
• Interest point detection
• Feature descriptor computation
Oge Marques Girod et al. IEEE Multimedia 2011
Interest point detection
• Numerous interest-point detectors have been proposed in the literature:
– Harris corners (Harris and Stephens 1988)
– Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004)
– Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002)
– Hessian affine (Mikolajczyk et al. 2005)
– Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006)
– Hessian blobs (Bay, Tuytelaars, and Van Gool 2006)
– etc.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
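As a concrete illustration of the fastest detector in the list, here is a minimal sketch of the FAST segment test (FAST-9 variant): a pixel is a corner if at least 9 contiguous pixels on a radius-3 circle around it are all brighter than center + t or all darker than center − t. The circle offsets are the standard radius-3 Bresenham circle; the threshold and toy image are made up for illustration, and there is no non-maximum suppression.

```python
# Radius-3 Bresenham circle: 16 (row, col) offsets around the candidate pixel.
CIRCLE16 = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
            (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def longest_circular_run(flags):
    """Longest run of True values, treating the sequence as circular."""
    doubled, best, run = flags + flags, 0, 0
    for f in doubled:
        run = run + 1 if f else 0
        best = max(best, run)
    return min(best, len(flags))

def is_fast_corner(img, r, c, t=20, n=9):
    center = img[r][c]
    vals = [img[r + dr][c + dc] for dr, dc in CIRCLE16]
    brighter = [v > center + t for v in vals]
    darker = [v < center - t for v in vals]
    return longest_circular_run(brighter) >= n or longest_circular_run(darker) >= n

# Bright square on a dark background: its corner fires, an edge midpoint does not.
img = [[255 if 2 <= r <= 12 and 2 <= c <= 12 else 0 for c in range(16)]
       for r in range(16)]
print(is_fast_corner(img, 12, 12))  # corner of the square -> True
print(is_fast_corner(img, 12, 7))   # middle of an edge    -> False
```

The per-pixel cost is a handful of comparisons, which is why FAST is so cheap; the price, as noted above, is lower repeatability than DoG-style detectors.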
Interest point detection
• Different tradeoffs in repeatability and complexity:
– SIFT DoG and other affine interest-point detectors are slow to compute but are highly repeatable.
– The SURF interest-point detector provides a significant speedup over DoG interest-point detectors by using box filters and integral images for fast computation.
• However, the box filter approximation causes significant anisotropy, i.e., the matching performance varies with the relative orientation of query and database images.
– The FAST corner detector is extremely fast to compute but offers very low repeatability.
• See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• After interest-point detection, we compute a visual word descriptor on a normalized patch.
• Ideally, descriptors should be:
– robust to small distortions in scale, orientation, and lighting conditions;
– discriminative, i.e., characteristic of an image or a small set of images;
– compact, due to typical mobile computing constraints.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• Examples of feature descriptors in the literature:
– SIFT (Lowe 1999)
– Speeded Up Robust Features (SURF) (Bay et al. 2008)
– Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005)
– Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010)
• See (Winder and Brown CVPR 2007), (Winder, Hua, and Brown CVPR 2009), and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• What about compactness?
– Several attempts in the literature to compress off-the-shelf descriptors did not lead to the best rate-constrained image-retrieval performance.
– Alternative: design a descriptor with compression in mind.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010)
– Based on the distribution of gradients within a patch of pixels
– Histogram-of-gradients (HoG)-based descriptors [e.g., (Lowe 2004), (Bay et al. 2008), (Dalal and Triggs 2005), (Freeman and Roth 1994), and (Winder et al. 2009)] have been shown to be highly discriminative at low bit rates.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
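The gradient-histogram idea behind CHoG-style descriptors can be sketched as follows. The 2×2 spatial binning and 4-bin angle quantizer are toy choices for illustration, not the actual CHoG bin configurations, and the compression stage (quantizing each bin's gradient distribution) is omitted.

```python
import math

def gradient_histograms(patch, spatial=2, angle_bins=4):
    """Per spatial bin, a normalized histogram of gradient orientations."""
    h, w = len(patch), len(patch[0])
    hists = [[0.0] * angle_bins for _ in range(spatial * spatial)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]  # central differences
            dy = patch[y + 1][x] - patch[y - 1][x]
            if dx == 0 and dy == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            a = min(int(angle / (2 * math.pi) * angle_bins), angle_bins - 1)
            s = (spatial * y // h) * spatial + (spatial * x // w)  # spatial bin
            hists[s][a] += math.hypot(dx, dy)  # magnitude-weighted vote
    out = []
    for hist in hists:  # normalize each bin into a gradient distribution
        total = sum(hist) or 1.0
        out.append([v / total for v in hist])
    return out

# Vertical edge: all gradient energy points along +dx, i.e., angle bin 0.
patch = [[0] * 4 + [100] * 4 for _ in range(8)]
descriptor = gradient_histograms(patch)
```

In CHoG it is these per-bin gradient distributions, rather than raw histogram values, that are compressed into the final low-bit-rate descriptor.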
CHoG: Compressed Histogram of Gradients
Oge Marques Chandrasekhar et al. CVPR 09,10 Bernd Girod: Mobile Visual Search
[Figure] CHoG descriptor extraction: patch → gradients (dx, dy) → spatial binning → gradient distributions for each spatial bin → histogram compression → compressed CHoG descriptor bitstream.
Encoding descriptor’s location information
• Location Histogram Coding (LHC)
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
and higher query times, as longer inverted files need to be traversed due to the smaller vocabulary size.
As database size increases, the amount of memory used to index the database features can become very large. Thus, developing a memory-efficient indexing structure is a problem of increasing interest. Chum et al. use a set of compact min-hashes to perform near-duplicate image retrieval [52], [53]. Zhang et al. decompose each image's set of features into coarse and refinement signatures [54]. The refinement signature is subsequently indexed by an LSH. Schemes that take advantage of the structure of the database have been proposed recently in [55]–[57]. These schemes are typically applied to databases where there is a lot of redundancy, e.g., each object is represented by images taken from multiple viewpoints. The size of the inverted index is reduced by using geometry to find matching features across images, and only retaining useful features and discarding irrelevant clutter features.
To support the popular VT-scoring framework, inverted index compression methods for both hard-binned and soft-binned VTs have been developed by us [58], as explained in "Inverted Index Compression." The memory for BoF image signatures can alternatively be reduced using the mini-BoF approach [59]. Very recently, visual word residuals on a small BoF codebook have shown promising retrieval results with low memory usage [60], [61]. The residuals are indexed either with PCA and product quantizers [60] or with LSH [61].
LOCATION HISTOGRAM CODING
LHC is used to compress feature location data efficiently. We note that the interest points in the images are spatially clustered, as shown in Figure S3.
To encode location data, we first generate a two-dimensional (2-D) histogram from the locations of the descriptors (Figure S4). We divide the image into spatial bins and count the number of features within each spatial bin. We compress the binary map, indicating which spatial bins contain features, and a sequence of feature counts, representing the number of features in occupied bins. We encode the binary map using a trained context-based arithmetic coder, with the neighboring bins being used as the context for each spatial bin.
LHC provides two key benefits. First, encoding the locations of a set of N features as a histogram reduces the bit rate by log2(N!) compared to encoding each feature location in sequence [47]. Here, we exploit the fact that the features can be sent in any order: there are N! orderings that represent the same feature set, so fixing the ordering, i.e., using the LHC scheme described earlier, achieves a bit savings of log2(N!). For example, for a set of 750 features, we achieve a rate savings of log2(750!)/750 ≈ 8 b/feature.
Second, with LHC, we exploit the spatial correlation between the locations of different descriptors, as illustrated in Figure S3. Different interest-point detectors result in different coding gains. In our experiments, Hessian Laplace has the highest gain, followed by the SIFT and SURF interest-point detectors. Even if the feature points are uniformly scattered in the image, LHC is still able to exploit the ordering gain, which results in a log2(N!) saving in bits.
In our experiments, we found that quantizing the (x, y) location to four-pixel blocks is sufficient for GV. With a simple fixed-length coding scheme, the rate would be log2(640/4) + log2(640/4) ≈ 14 b/feature for a VGA-size image. Using LHC, we can transmit the same location data with ≈ 5 b/descriptor: a ≈ 12.5× reduction in data compared to a 64-b floating-point representation and a ≈ 2.8× rate reduction compared to fixed-length coding [48].
[FIG S3] Interest-point locations in images tend to cluster spatially.
[FIG S4] We represent the location of the descriptors using a location histogram. The image is first divided into evenly spaced blocks. We enumerate the features within each spatial block by generating a location histogram.
– Rationale: Interest-point locations in images tend to cluster spatially.
Encoding descriptor’s location information
• Method:
1. Generate a 2D histogram from the locations of the descriptors.
• Divide the image into spatial bins and count the number of features within each spatial bin.
2. Compress the binary map, indicating which spatial bins contain features, and a sequence of feature counts, representing the number of features in occupied bins.
3. Encode the binary map using a trained context-based arithmetic coder, with the neighboring bins being used as the context for each spatial bin.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
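The steps above (minus the trained arithmetic coder) can be sketched as follows, together with a check of the article's ≈8 b/feature ordering gain for N = 750. The block size and sample locations are illustrative.

```python
import math
from collections import Counter

def location_histogram(locations, block=4):
    """Quantize (x, y) locations into spatial blocks.

    Returns the per-block feature counts and the binary occupancy map
    (the set of occupied blocks), which an entropy coder would compress.
    """
    counts = Counter((x // block, y // block) for x, y in locations)
    return counts, set(counts)

def ordering_gain_bits_per_feature(n):
    """log2(n!) / n: bits saved per feature by sending locations unordered."""
    return math.lgamma(n + 1) / math.log(2) / n

counts, occupied = location_histogram([(10, 12), (11, 13), (300, 200)])
print(counts[(2, 3)])  # two nearby features fall into the same 4-pixel block
print(ordering_gain_bits_per_feature(750))  # ~8 bits/feature, as in the text
```

The spatial clustering of interest points is exactly what makes the occupancy map and counts cheap to entropy-code.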
• Location Histogram Coding (LHC)
MVS: feature indexing and matching
• Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image.
– The short list may contain false positives as long as the correct match is included.
– Slower pairwise comparisons can be subsequently performed on just the short list of candidates rather than the entire database.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: feature indexing and matching
• Vocabulary Tree (VT)-Based Retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
During a query, scoring the database images can be made fast by using an inverted index associated with the BoF codebook. To generate a much larger codebook, Nister and Stewenius use hierarchical k-means clustering to create a vocabulary tree (VT).2 Additional details about a VT can be found in the "Vocabulary Tree-Based Retrieval" sidebar. Several alternative search techniques, such as locality-sensitive hashing and various improvements in tree-based approaches, have also been developed.8-11
Geometric verification
Geometric verification typically follows the feature-matching step. In this stage, location information of features in query and database images is used to confirm that the feature matches are consistent with a change in viewpoint between the two images. This process is illustrated in Figure 5. The geometric transform between the query and database image is usually estimated using robust regression techniques such as random sample consensus (RANSAC).
Vocabulary Tree-Based Retrieval
A vocabulary tree (VT) for a particular database is constructed by performing hierarchical k-means clustering on a set of training feature descriptors representative of the database. Initially, k large clusters are generated for all the training descriptors. Then, for each large cluster, k-means clustering is applied to the training descriptors assigned to that cluster, to generate k smaller clusters. This recursive division of the descriptor space is repeated until there are enough bins to ensure good classification performance. Figure B1 shows a VT with only two levels, branching factor k = 3, and 3^2 = 9 leaf nodes. In practice, a VT can be much larger, for example, with height 6, branching factor k = 10, and containing 10^6 = 1 million nodes.
The associated inverted index structure maintains two lists for each VT leaf node, as shown in Figure B2. For a leaf node x, there is a sorted array of image identifiers i_x1, …, i_xNx indicating which Nx database images have features that belong to the cluster associated with this node. Similarly, there is a corresponding array of counters c_x1, …, c_xNx indicating how many features in each image fall in the same cluster.
During a query, the VT is traversed for each feature in the query image, finishing at one of the leaf nodes. The corresponding lists of images and frequency counts are subsequently used to compute similarity scores between these images and the query image. By pulling images from all these lists and sorting them according to the scores, we arrive at a subset of database images that is likely to contain a true match to the query image.
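A minimal sketch of this traversal and scoring, using a hand-built two-level tree (k = 3, 9 leaves) over scalar "descriptors" for readability; a real VT would use k-means-trained centroids over high-dimensional descriptors, and a weighted (e.g., TF-IDF) score.

```python
from collections import Counter, defaultdict

# Hand-picked centroids: 3 children of the root, each with 3 leaf children.
LEVEL1 = [0.0, 10.0, 20.0]
LEVEL2 = {0.0: [-2.0, 2.0, 5.0], 10.0: [8.0, 12.0, 16.0], 20.0: [22.0, 26.0, 30.0]}

def quantize(feature):
    """Greedy root-to-leaf descent: pick the nearest child at each level."""
    c1 = min(LEVEL1, key=lambda c: abs(feature - c))
    return min(LEVEL2[c1], key=lambda c: abs(feature - c))  # leaf id

def build_index(database):
    """Inverted index: for each leaf node, which images hit it and how often."""
    index = defaultdict(Counter)
    for image_id, features in database.items():
        for f in features:
            index[quantize(f)][image_id] += 1
    return index

def score(index, query_features):
    """Count leaf nodes shared between the query and each database image."""
    scores = Counter()
    for f in query_features:
        for image_id in index[quantize(f)]:
            scores[image_id] += 1
    return scores.most_common()

db = {"A": [1.9, 8.1, 12.3], "B": [25.7, 30.2, 16.0]}
index = build_index(db)
print(score(index, [2.2, 11.8]))  # image "A" shares two leaves with the query
```

Only 2k distance computations per feature are needed to reach a leaf, instead of comparing against all k^2 leaves; this logarithmic descent is what lets VTs scale to millions of visual words.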
Figure B. (1) Vocabulary tree and (2) inverted index structures.
MVS: geometric verification
• Goal: use location information of features in query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Method: perform pairwise matching of feature descriptors and evaluate geometric consistency of correspondences.
Oge Marques
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incorporating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images, before a full GV is performed on a shorter candidate list.
To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see "Fast Geometric Reranking"). The main advantage of the scheme is that it only requires x, y feature location data and does not use scale
INVERTED INDEX COMPRESSION
For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-b unsigned integer, and each fractional count can be stored in a 32-b float in the inverted index. The memory usage of the entire inverted index is then sum_k N_k × 64 bits, where N_k is the length of the inverted list at the kth leaf node. For a database of 1 million product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index's memory usage exceeds the server's available random access memory (RAM), swapping between main and virtual memory occurs, which significantly slows down all processes.
A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs {i_k1, i_k2, …, i_kNk} is sorted, it is more efficient to store the consecutive ID differences {d_k1 = i_k1, d_k2 = i_k2 − i_k1, …, d_kNk = i_kNk − i_k(Nk−1)} in place of the IDs. This practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual search applications, a scheme that allows ultra-fast decoding is often preferred over AC. The carryover code [64] and recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.
Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
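The first and third ideas (ID-difference coding plus variable-length coding) can be sketched as follows. A simple byte-oriented varint stands in for the arithmetic, carryover, or RBUC codes actually used; the ID list is made up.

```python
def deltas(sorted_ids):
    """d1 = i1, dk = ik - i(k-1): small numbers when IDs are densely packed."""
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def varint_bytes(n):
    """Bytes needed to encode n >= 0 with 7 payload bits per byte."""
    count = 1
    while n >= 128:
        n >>= 7
        count += 1
    return count

ids = [17, 19, 20, 24, 130, 131]  # one leaf's sorted image-ID list
diffs = deltas(ids)               # [17, 2, 1, 4, 106, 1]
compressed = sum(varint_bytes(d) for d in diffs)  # 6 bytes
uncompressed = 4 * len(ids)                       # 24 bytes as 32-b integers
print(compressed, uncompressed)
```

Decoding is a prefix sum over the differences, so lists can be reconstructed sequentially at query time; the word-aligned codes cited above exist precisely to make that decode step fast.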
[FIG S6] (a) Memory usage for inverted index with and without compression. A five times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
[FIG5] An image retrieval pipeline (Query Data → VT → Geometric Reranking → GV → Identify Information) can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green. Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Techniques:
– The geometric transform between the query and database image is usually estimated using robust regression techniques such as:
• Random sample consensus (RANSAC) (Fischler and Bolles 1981)
• Hough transform (Lowe 2004)
– The transformation is often represented by an affine mapping or a homography.
• GV is computationally expensive, which is why it is applied only to the subset of images selected during the feature-matching stage.
Oge Marques Girod et al. IEEE Multimedia 2011
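A minimal RANSAC-style verification sketch. For brevity it fits a similarity transform (rotation + scale + translation, modeled as complex w = a·z + b, exactly determined by two correspondences) rather than the affine or homography models mentioned above, and the correspondences are synthetic.

```python
import random

def fit_similarity(z1, w1, z2, w2):
    """Exact similarity transform w = a*z + b through two correspondences."""
    a = (w2 - w1) / (z2 - z1)
    return a, w1 - a * z1

def ransac_inliers(pairs, iters=200, tol=1.0, seed=0):
    """Keep the largest set of correspondences consistent with one hypothesis."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (z1, w1), (z2, w2) = rng.sample(pairs, 2)  # minimal sample
        if z1 == z2:
            continue
        a, b = fit_similarity(z1, w1, z2, w2)
        inliers = [(z, w) for z, w in pairs if abs(a * z + b - w) < tol]
        if len(inliers) > len(best):
            best = inliers
    return best

# Feature locations as complex x + yj. True transform: 90-degree rotation plus
# a shift, w = 1j*z + 5; two wrong matches play the role of false positives.
good = [(z, 1j * z + 5) for z in [1+1j, 4+2j, 2+5j, 7+3j, 3+6j, 6+6j]]
bad = [(0+0j, 50+50j), (9+9j, -40+0j)]
inliers = ransac_inliers(good + bad)
print(len(inliers))  # 6: the consistent matches survive, the outliers do not
```

A candidate database image is accepted when the inlier count is high; as the slide notes, this is too expensive to run against more than a short list.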
MVS: geometric reranking
• A speed-up step inserted between VT-based feature matching and Geometric Verification.
Oge Marques
IEEE SIGNAL PROCESSING MAGAZINE [69] JULY 2011
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incor-porating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images, before a full GV is performed on a shorter candidate list.
To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see “Fast Geometric Reranking”). The main advantage of the scheme is that it only requires x, y fea-ture location data and does not use scale
INVERTED INDEX COMPRESSION For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-b unsigned integer, and each fractional count can be stored in a 32-b float in the inverted index. The memory usage of the entire inverted index is gk
k51 Nk# 64 bits, where Nk is the length of
the inverted list at the kth leaf node. For a database of 1 mil-lion product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index’s memory usage exceeds the server’s available random access memory (RAM), swap-ping between main and virtual memory occurs, which signifi-cantly slows down all processes.
A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs {i_k1, i_k2, …, i_kN_k} is sorted, it is more efficient to store consecutive ID differences {d_k1 = i_k1, d_k2 = i_k2 − i_k1, …, d_kN_k = i_kN_k − i_k(N_k−1)} in place of the IDs. This
practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual search applications, a scheme that allows ultrafast decoding is often preferred over AC. The carryover code [64] and recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.
Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five-times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
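The first and third steps above (storing sorted ID lists as consecutive differences, then coding the differences with a variable-length code) can be sketched in Python. This is an illustrative toy encoder: it uses a simple 7-bit-per-byte varint in place of the AC, carryover, or RBUC codes discussed in the text, and the function names are ours, not from [58].

```python
def delta_encode(ids):
    """Replace a sorted list of image IDs by consecutive differences:
    d_1 = i_1, d_k = i_k - i_(k-1)."""
    prev, out = 0, []
    for i in ids:
        out.append(i - prev)
        prev = i
    return out

def delta_decode(deltas):
    """Recover the original IDs by a running sum."""
    ids, total = [], 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids

def varint_encode(deltas):
    """Variable-length byte code: 7 data bits per byte, high bit = 'more bytes'.
    Small deltas (the common case in a sorted inverted list) take one byte
    instead of the 4 bytes of a 32-bit integer."""
    buf = bytearray()
    for d in deltas:
        while d >= 0x80:
            buf.append((d & 0x7F) | 0x80)
            d >>= 7
        buf.append(d)
    return bytes(buf)

def varint_decode(buf):
    """Inverse of varint_encode."""
    deltas, d, shift = [], 0, 0
    for b in buf:
        d |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            deltas.append(d)
            d, shift = 0, 0
    return deltas
```

For a toy inverted list [3, 17, 21, 200], the varint-coded deltas occupy 5 bytes instead of the 16 bytes of four 32-bit IDs; codes such as RBUC trade a little of this compression for word-aligned, faster decoding.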
[FIG S6] (a) Memory usage for inverted index with and without compression. A five times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
Query Data → VT → Geometric Reranking → GV → Identification Information
[FIG5] An image retrieval pipeline can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green.
Fast geometric reranking

• The location geometric score is computed as follows: (a) features of two images are matched based on VT quantization; (b) distances between pairs of features within an image are calculated; (c) log-distance ratios of the corresponding pairs (denoted by color) are calculated; and (d) a histogram of log-distance ratios is computed.
• The maximum value of the histogram is the geometric similarity score.
  – A peak in the histogram indicates a similarity transform between the query and database image.
or orientation information as in [11]. As scale and orientation data are not used, they need not be transmitted by the client, which reduces the amount of data transferred. We typically run fast geometric reranking on a large set of candidate database images and reduce the list of images that we run RANSAC on.
Before discussing system performance results, we provide a list of important references for each module in the matching pipeline in Table 2.
SYSTEM PERFORMANCE
What performance can we expect for a mobile visual search system that incorporates all the ideas discussed so far? To answer this question, we have a closer look at the experimental Stanford Product Search System (Figure 6). For evaluation, we use a database of 1 million CD, digital versatile disk (DVD), and book cover images, and a set of 1,000 query images (500 × 500 pixel resolution) [75] exhibiting challenging photometric and geometric distortions, as shown in Figure 7. For
[TABLE 2] SUMMARY OF REFERENCES FOR MODULES IN A MATCHING PIPELINE.
MODULE LIST OF REFERENCES
FEATURE EXTRACTION HARRIS AND STEPHENS [17], LOWE [15], [23], MATAS ET AL. [18], MIKOLAJCZYK ET AL. [16], [22], DALAL AND TRIGGS [41], ROSTEN AND DRUMMOND [19], BAY ET AL. [20], WINDER ET AL. [27], [28], CHANDRASEKHAR ET AL. [25], [26], PHILBIN ET AL. [40]
FEATURE INDEXING AND MATCHING SCHMID AND MOHR [13], LOWE [15], [23], SIVIC AND ZISSERMAN [9], NISTÉR AND STEWÉNIUS [10], CHUM ET AL. [50], [52], [53], YEH ET AL. [51], PHILBIN ET AL. [12], JEGOU ET AL. [11], [59], [60], ZHANG ET AL. [54], CHEN ET AL. [58], PERRONNIN [61], MIKULIK ET AL. [55], TURCOT AND LOWE [56], LI ET AL. [57]
GV FISCHLER AND BOLLES [66], SCHAFFALITZKY AND ZISSERMAN [74], LOWE [15], [23], CHUM ET AL. [53], [70], [71], FERRARI ET AL. [68], JEGOU ET AL. [11], WU ET AL. [69], TSAI ET AL. [73]
FAST GEOMETRIC RERANKING
We have proposed a fast geometric reranking algorithm in [73] that uses x, y locations of features to rerank a short list of candidate images. First, we generate a set of potential feature matches between each query and database image based on VT-quantization results. After generating a set of feature correspondences, we calculate a geometric score between them. The process used to compute the geometric similarity score is illustrated in Figure S7. We find the distance between two features in the query image and the distance between the corresponding matching features in the database image. The ratio of the distances corresponds to the scale difference between the two images. We repeat the ratio calculation for all features in the query image that have matching database features. If there exists a consistent set of ratios (as indicated by a peak in the histogram of distance ratios), it is more likely that the query image and the database image match.
The geometric reranking is fast because we use the VT-quantization results directly to find potential feature matches and use a very simple similarity scoring scheme.
The time required to calculate a geometric similarity score is one to two orders of magnitude less than using RANSAC. Typically, we perform fast geometric reranking on the top 500 images and RANSAC on the top 50 ranked images.
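The scoring procedure described in this box can be sketched as follows. This is a simplified illustration, not the implementation from [73]: the histogram range and bin count are arbitrary choices, and all pairwise distances are used rather than any sampling.

```python
import math

def geometric_score(query_pts, db_pts, num_bins=20, lo=-3.0, hi=3.0):
    """Histogram the log-ratios of pairwise feature distances between a query
    and a database image (features already matched via VT quantization, so
    query_pts[i] corresponds to db_pts[i]). The histogram peak is the score:
    a consistent distance ratio indicates a similarity transform."""
    bins = [0] * num_bins
    n = len(query_pts)
    for i in range(n):
        for j in range(i + 1, n):
            dq = math.dist(query_pts[i], query_pts[j])  # distance in query image
            dd = math.dist(db_pts[i], db_pts[j])        # distance in database image
            if dq == 0 or dd == 0:
                continue
            r = math.log(dd / dq)
            if lo <= r < hi:
                bins[int((r - lo) / (hi - lo) * num_bins)] += 1
    return max(bins)
```

When the database image is, say, a uniformly scaled copy of the query, every pair lands in the same bin and the score equals the number of pairs; inconsistent correspondences scatter across bins and score low.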
[FIG S7] The location geometric score is computed as follows: (a) features of two images are matched based on VT quantization, (b) distances between pairs of features within an image are calculated, (c) log-distance ratios of the corresponding pairs (denoted by color) are calculated, and (d) histogram of log-distance ratios is computed. The maximum value of the histogram is the geometric similarity score. A peak in the histogram indicates a similarity transform between the query and database image.
[FIG6] Stanford Product Search system. Because of the large database, the image-recognition server is placed at a remote location. In most systems [1], [3], [7], the query image is sent to the server and feature extraction is performed. In our system, we show that by performing feature extraction on the phone we can significantly reduce the transmission delay and provide an interactive experience.
[Diagram] Client (Image → Feature Extraction → Feature Compression; Display) ↔ Network (Query Data / Identification Data) ↔ Server (VT Matching → GV)
Girod et al. IEEE MultiMedia 2011
Datasets for MVS research
• Stanford Mobile Visual Search Data Set (http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/)
  – Key characteristics:
    • rigid objects
    • widely varying lighting conditions
    • perspective distortion
    • foreground and background clutter
    • realistic ground-truth reference data
    • query data collected from heterogeneous low- and high-end camera phones
Oge Marques Chandrasekhar et al. ACM MMSys 2011
Stanford Mobile Visual Search (SMVS) Data Set
• Limitations of popular datasets
ZuBuD
Oxford
INRIA
UKY
ImageNet
Figure 2: Limitations of popular data sets in computer vision. The left-most image in each row is the database image, and the other three images are query images. ZuBuD, INRIA and UKY consist of images taken at the same time and location. ImageNet is not suitable for image retrieval applications. The Oxford data set has different façades of the same building labelled with the same name.
Fig. 2 illustrates some images for the word "tiger". Such a data set is useful for testing classification algorithms, but not so much for testing retrieval algorithms.

We summarize the limitations of the different data sets in Tab. 1. To overcome the limitations in these data sets, we propose the Stanford Mobile Visual Search (SMVS) data set.
3. STANFORD MOBILE VISUAL SEARCH DATA SET
We present the SMVS (version 0.9) data set in the hope that it will be useful for a wide range of visual search applications like product recognition, landmark recognition, outdoor augmented reality [26], business card recognition, text recognition, video recognition and TV-on-the-go [5]. We collect data for several different categories: CDs, DVDs, books, software products, landmarks, business cards, text documents, museum paintings and video clips. Sample query and database images are shown in Figure 4. Current and subsequent versions of the data set will be available at [3].

The number of database and query images for the different categories is shown in Tab. 2. We provide a total of 3,300 query images for 1,200 distinct classes across 8 image categories. Typically, a small number of query images (~1,000s) suffices to measure the performance of a retrieval system, as the rest of the database can be padded with "distractor" images. Ideally, we would like to have a large distractor set for each query category. However, it is challenging to collect distractor sets for each category. Instead, we plan to release two distractor sets upon request: one containing Flickr images, and the other containing building images from Navteq. The distractor sets will be available in sets of 1K, 10K, 100K and 1M. Researchers can test scalability using these distractor data sets, or the ones provided in [22, 14]. Next, we discuss how the query and reference database images are collected, and evaluation measures that are particularly relevant for mobile applications.
Reference Database Images.
For product categories (CDs, DVDs and books), the references are clean versions of images obtained from the product websites. For landmarks, the reference images are obtained from data collected by Navteq's vehicle-mounted cameras. For video clips, the reference images are the key frames from the reference video clips. The videos contain diverse content like movie trailers, news reports, and sports. For text documents, we collect (1) reference images from [19], a website that mines the front pages of newspapers from around the world, and (2) research papers. For business cards, the reference image is obtained from a high-quality upright scan of the card. For museum paintings, we collect data from the Cantor Arts Center at Stanford University for different genres: history, portraits, landscapes and modern art. The reference images are obtained from the artists' websites like [23] or other online sources. All reference images are high-quality JPEG-compressed color images. The resolution of reference images varies for each category.
Query Images.
We capture query images with several different camera phones, including some digital cameras. The list of companies and models used is as follows: Apple (iPhone 4), Palm (Pre), Nokia (N95, N97, N900, E63, N5800, N86), Motorola (Droid), Canon (G11) and LG (LG300). For product categories like CDs, DVDs, books, text documents and business cards, we capture the images indoors under widely varying lighting conditions over several days. We include foreground and background clutter that would typically be present in the application, e.g., a picture of a CD might have other CDs in the background. For landmarks, we capture images of buildings in San Francisco; the query images were collected several months after the reference data. For video clips, the query images were taken from laptop, computer and TV screens to include typical specular distortions. Finally, the paintings were captured at the Cantor Arts Center at Stanford University under controlled lighting conditions typical of museums.

The resolution of the query images varies for each camera phone. We provide the original JPEG-compressed high-quality color images obtained from the cameras. We also provide auxiliary information like phone model number and GPS location, where applicable. As noted in Tab. 1, the SMVS query data set has the following key characteristics that are lacking in other data sets: rigid objects, widely varying lighting conditions, perspective distortion, foreground and background clutter, realistic ground-truth reference data, and query images from heterogeneous low- and high-end camera phones.

Data Set   Database (#)  Query (#)  Classes (#)  Rigid  Lighting  Clutter  Perspective  Camera Phone
ZuBuD      1,005         115        200          ✓      ✗         ✓        ✓            ✗
Oxford     5,062         55         17           ✓      ✓         ✓        ✓            ✗
INRIA      1,491         500        500          ✗      ✗         ✓        ✓            ✗
UKY        10,200        2,550      2,550        ✓      ✗         ✗        ✓            ✗
ImageNet   11M           15K        15K          ✗      ✓         ✓        ✓            ✗
SMVS       1,200         3,300      1,200        ✓      ✓         ✓        ✓            ✓

Table 1: Comparison of different data sets. "Classes" refers to the number of distinct objects in the data set. "Rigid" refers to whether or not the objects in the database are rigid. "Lighting" refers to whether or not the query images capture widely varying lighting conditions. "Clutter" refers to whether or not the query images contain foreground/background clutter. "Perspective" refers to whether the data set contains typical perspective distortions. "Camera Phone" refers to whether the images were captured with mobile devices. SMVS is a good data set for mobile visual search applications.
Category         Database  Query
CD               100       400
DVD              100       400
Books            100       400
Video Clips      100       400
Landmarks        500       500
Business Cards   100       400
Text Documents   100       400
Paintings        100       400

Table 2: Number of query and database images in the SMVS data set for the different categories.
Evaluation measures.
A naive retrieval system would match all database images against each query image. Such a brute-force matching scheme provides an upper bound on the performance that can be achieved with the feature matching pipeline. Here, we report results for brute-force pairwise matching for different interest point detectors and descriptors using the ratio test [16] and RANSAC [9]. For RANSAC, we use affine models with a minimum threshold of 10 matches post-RANSAC for declaring a pair of images to be a valid match.
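The ratio test [16] used here can be sketched in a few lines (a toy exhaustive nearest-neighbor version over raw descriptor tuples; production systems would use approximate nearest-neighbor search, and the 0.8 threshold is a common choice rather than a value from this paper):

```python
def ratio_test_matches(query_desc, db_desc, ratio=0.8):
    """Accept a putative match only if the nearest database descriptor is
    significantly closer than the second nearest. Squared Euclidean distances
    are compared, so the threshold becomes ratio**2.
    Returns (query_index, db_index) pairs."""
    matches = []
    for qi, q in enumerate(query_desc):
        # exhaustive search: squared distance to every database descriptor
        dists = sorted((sum((a - b) ** 2 for a, b in zip(q, d)), di)
                       for di, d in enumerate(db_desc))
        if len(dists) >= 2 and dists[0][0] < (ratio ** 2) * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches
```

Ambiguous features (two database descriptors at nearly the same distance) are discarded, which is what makes the surviving correspondences reliable enough for RANSAC.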
In Fig. 3, we report results for 3 state-of-the-art schemes: (1) SIFT Difference-of-Gaussians (DoG) interest point detector and SIFT descriptor (code: [27]), (2) Hessian-affine interest point detector and SIFT descriptor (code: [17]), and (3) Fast Hessian blob interest point detector [2] sped up with integral images, and the recently proposed Compressed Histogram of Gradients (CHoG) descriptor [4]. We report the percentage of images that match, the average number of features, and the average number of features that match post-RANSAC for each category.
First, we note that indoor categories are easier than outdoor categories. E.g., some categories like CDs, DVDs and book covers achieve over 95% accuracy. The most challenging category is landmarks, as the query data were collected several months after the database.

Second, we note that option (1), the SIFT interest point detector and descriptor, performs the best. However, option (1) is computationally complex and is not suitable for implementation on mobile devices.

Third, we note that option (3) comes close to achieving the performance of (1), with worse performance (10–20% drop) for some categories. The performance hit is incurred due to the fast Hessian-based interest point detector, which is not as robust as the DoG interest point detector. One reason for lower robustness is observed in [25]: the fast box-filtering step causes the interest point detection to lose rotation invariance, which affects oriented query images. The CHoG descriptor used in option (3) is a low-bitrate 60-bit descriptor which is shown to perform on par with the 128-dimensional, 1024-bit SIFT descriptor using extensive evaluation in [4]. We note that option (3) is most suitable for implementation on mobile devices, as the fast Hessian interest point detector is an order of magnitude faster than SIFT DoG, and the CHoG descriptors generate an order of magnitude less data than SIFT descriptors for efficient transmission [10].
Finally, we list aspects critical for mobile visual search applications. A good image retrieval system should exhibit the following characteristics when tested on the SMVS data set.
• High Precision-Recall as size of database increases
• Low retrieval latency
Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• Number of query and database images per category
Oge Marques Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• DVD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html
SMVS Data Set: categories and examples
• CD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html
SMVS Data Set: categories and examples
• Museum paintings
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
Other MVS data sets
• Distractor set
  – 1 million images of various resolutions and content collected from Flickr.
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
MPEG Compact Descriptors for Visual Search (CDVS)
• Objectives
  – Define a standard that:
    • enables design of visual search applications
    • minimizes lengths of query requests
    • ensures high matching performance (in terms of reliability and complexity)
    • enables interoperability between search applications and visual databases
    • enables efficient implementation of visual search functionality on mobile devices
• Scope
  – It is envisioned that (as a minimum) the standard will specify:
    • bitstream of descriptors
    • parts of the descriptor extraction process (e.g., key-point detection) needed to ensure interoperability
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS

• Requirements
  – Robustness
    • High matching accuracy shall be achieved at least for images of textured rigid objects, landmarks, and printed documents. The matching accuracy shall be robust to changes in vantage point, camera parameters, and lighting conditions, as well as in the presence of partial occlusions.
  – Sufficiency
    • Descriptors shall be self-contained, in the sense that no other data are necessary for matching.
  – Compactness
    • Shall minimize the length/size of image descriptors.
  – Scalability
    • Shall allow adaptation of descriptor lengths to support the required performance level and database size.
    • Shall enable design of web-scale visual search applications and databases.
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS

• Requirements (cont'd)
  – Image format independence
    • Descriptors shall be independent of the image format.
  – Extraction complexity
    • Shall allow descriptor extraction with low complexity (in terms of memory and computation) to facilitate video-rate implementations.
  – Matching complexity
    • Shall allow matching of descriptors with low complexity (in terms of memory and computation).
    • If decoding of descriptors is required for matching, such decoding shall also be possible with low complexity.
  – Localization
    • Shall support visual search algorithms that identify and localize matching regions of the query image and the database image.
    • Shall support visual search algorithms that provide an estimate of a geometric transformation between matching regions of the query image and the database image.
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS
• Summarized timeline
that among several component technologies for image retrieval, such a standard should focus primarily on defining the format of descriptors and parts of their extraction process (such as interest point detectors) needed to ensure interoperability. Such descriptors must be compact, image-format independent, and sufficient for robust image matching. Hence, the title Compact Descriptors for Visual Search was coined as an interim name for this activity. Requirements and Evaluation Framework documents have subsequently been produced to formulate precise criteria and evaluation methodologies to be used in the selection of technology for this standard.

The Call for Proposals [17] was issued at the 96th MPEG meeting in Geneva, in March 2011, and responses are now expected by November 2011. Table 1 lists milestones to be reached in subsequent development of this standard.

It is envisioned that, when completed, this standard will:
• ensure interoperability of visual search applications and databases,
• enable a high level of performance of implementations conformant to the standard,
• simplify design of descriptor extraction and matching for visual search applications,
• enable hardware support for descriptor extraction and matching in mobile devices, and
• reduce load on wireless networks carrying visual search-related information.

To build full visual-search applications, this standard may be used jointly with other existing standards, such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch.
Conclusions and outlook

Recent years have witnessed remarkable technological progress, making mobile visual search possible today. Robust local image features achieve a high degree of invariance against scale changes and rotation, as well as changes in illumination and other photometric conditions. The BoW approach offers resiliency to partial occlusions and background clutter, and allows design of efficient indexing schemes. The use of compressed image features makes it possible to communicate query requests using only a fraction of the rate needed by JPEG, and further accelerates search by storing a cache of the visual database on the phone.

Nevertheless, many improvements are still possible and much needed. Existing image features are robust to much of the variability between query and database images, but not all. Improvements in complexity and compactness are also critically important for mobile visual-search systems. In mobile augmented-reality applications, annotations of the viewfinder content simply pop up without the user ever pressing a button. Such continuous annotations require video-rate processing on the mobile device. They may also require improvements in indexing structures, retrieval algorithms, and moving more retrieval-related operations to the phone.

Standardization of compact descriptors for visual search, such as the new initiative within MPEG, will undoubtedly provide a further boost to an already exciting area.
Table 1. Timeline for development of the MPEG standard for visual search.

When           Milestone                            Comments
March 2011     Call for Proposals is published      Registration deadline: 11 July 2011; proposals due: 21 November 2011
December 2011  Evaluation of proposals              None
February 2012  1st Working Draft                    First specification and test software model that can be used for subsequent improvements.
July 2012      Committee Draft                      Essentially complete and stabilized specification.
January 2013   Draft International Standard         Complete specification. Only minor editorial changes are allowed after DIS.
July 2013      Final Draft International Standard   Finalized specification, submitted for approval and publication as International Standard.
Girod et al. IEEE MultiMedia 2011
MPEG CDVS
• CDVS: evaluation framework
  – Experimental setup
    • Retrieval experiment: intended to assess performance of proposals in the context of an image retrieval system
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
MPEG CDVS
• CDVS: evaluation framework
  – Experimental setup
    • Pair-wise matching experiments: intended to assess performance of proposals in the context of an application that uses descriptors for the purpose of image matching.
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
[Diagram] Image A and Image B → Extract descriptors → Match → Check accuracy of search results against annotations → Report
MPEG CDVS
• For more info: – https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs
– http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm (Ad hoc groups)
Part IV
Examples and applications
Examples
• Academic
  – Stanford Product Search System
• Commercial
  – Google Goggles
  – Kooaba: Déjà Vu and Paperboy
  – SnapTell
  – oMoby (and the IQ Engines API)
  – pixlinQ
  – Moodstocks
Stanford Product Search (SPS) System
• Local feature based visual search system
• Client-server architecture
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
Stanford Product Search (SPS) System
• Key contributions:
  – Optimized feature extraction implementation
  – CHoG: a low bit-rate compact descriptor (provides up to 20× bit-rate savings over SIFT with comparable image retrieval performance)
– Inverted index compression to enable large-scale image retrieval on the server
– Fast geometric re-ranking
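A back-of-the-envelope check of that bit-rate saving (assumed round figures: a 128-dimensional SIFT descriptor at 8 bits per dimension is 1,024 bits, a CHoG descriptor is about 60 bits, and a query carries on the order of 500 features; the slide's "up to 20×" also reflects location-data coding, which is omitted here):

```python
def query_bits(num_features, bits_per_descriptor):
    """Total descriptor payload of one visual query, ignoring feature
    locations and any packaging overhead."""
    return num_features * bits_per_descriptor

sift_query = query_bits(500, 1024)   # uncompressed SIFT payload: 512,000 bits
chog_query = query_bits(500, 60)     # CHoG payload: 30,000 bits
saving = sift_query / chog_query     # roughly a 17x reduction
```

Over a slow cellular uplink, sending tens of kilobits instead of hundreds is what makes the Send Feature mode interactive.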
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
Stanford Product Search (SPS) System

• Two modes:
  – Send Image mode
  – Send Feature mode
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
including different distances, viewing angles, and lighting conditions, or in the presence of partial occlusions or motion blur.

Mobile image-based retrieval technologies

Most successful algorithms for image-based retrieval today use an approach that is referred to as bag of features (BoF) or bag of words (BoW) [1], [2]. The BoW idea is borrowed from text document retrieval. To find a particular text document, such as a webpage, it's sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the document. For images, robust local features that are characteristic of a particular image take the role of visual words. As with text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the
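The bag-of-words representation described above can be sketched as follows (a toy Python illustration with a hypothetical two-word vocabulary; in practice the vocabulary comes from clustering millions of training descriptors):

```python
from collections import Counter

def bow_histogram(descriptors, vocabulary):
    """Quantize each local descriptor to its nearest visual word and count
    occurrences. Feature positions are deliberately ignored, exactly as in
    text-style bag-of-words retrieval."""
    def nearest_word(desc):
        # index of the closest cluster center (visual word)
        return min(range(len(vocabulary)),
                   key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(desc, vocabulary[w])))
    return Counter(nearest_word(d) for d in descriptors)
```

Two images can then be compared by the similarity of their word histograms alone, which is what makes inverted-index lookups possible.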
Figure 2. Mobile visual search architectures. (a) The mobile device transmits the compressed image, while analysis of the image and retrieval are done entirely on a remote server. (b) The local image features (descriptors) are extracted on a mobile phone and then encoded and transmitted over the network. Such descriptors are then used by the server to perform the search. (c) The mobile device maintains a cache of the database and sends search requests to the remote server only if the object of interest is not found in this cache, further reducing the amount of data sent over the network.
Figure 1. A snapshot of an outdoor mobile visual-search system. The system augments the viewfinder with information about the objects it recognizes in the image taken with a phone camera.
Stanford Product Search System

• Performance evaluation
– Dataset: 1 million CD, DVD, and book cover images + 1,000 query images (500×500) with challenging photometric and geometric distortions
Oge Marques
IEEE SIGNAL PROCESSING MAGAZINE [71] JULY 2011
For the client, we use a Nokia 5800 mobile phone with a 300-MHz central processing unit (CPU). For the recognition server, we use a Linux server with a Xeon E5410 2.33-GHz CPU and 32 GB of RAM. We report results for both 3G and wireless local area network (WLAN) networks. For 3G, experiments are conducted in an AT&T 3G wireless network, averaged over several days, with a total of more than 5,000 transmissions at indoor locations where such an image-based retrieval system would typically be used.

We evaluate two different modes of operation. In send features mode, we process the query image on the phone and transmit compressed query features to the server. In send image mode, we transmit the query image to the server, and all operations are performed on the server.

We discuss results for the three key aspects that are critical for mobile visual search applications: retrieval accuracy, system latency, and power. A recurring theme throughout this section will be the benefits of performing feature extraction on the mobile device compared to performing all processing on a remote server.
RETRIEVAL ACCURACY
It is relatively easy to achieve high precision (low false positives) for mobile visual search applications: by requiring a minimum number of feature matches after RANSAC geometric verification (GV), we can avoid false positives entirely. We set this minimum to 12 matching features, which is high enough to eliminate false positives in practice. We define recall as the percentage of query images correctly retrieved. Our goal then is to maximize recall at a negligibly low false-positive rate.
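A minimal sketch of this accept rule, with synthetic inlier counts standing in for the output of RANSAC GV:

```python
MIN_INLIERS = 12  # matches required after RANSAC GV, per the text

def is_match(inlier_count):
    """Accept a retrieved image only if enough features survive geometric
    verification; this gate is what keeps precision high."""
    return inlier_count >= MIN_INLIERS

# Synthetic inlier counts for five queries whose true match IS in the database.
inliers = [40, 25, 9, 18, 13]
recall = sum(is_match(n) for n in inliers) / len(inliers)  # 4 of 5 pass: 0.8
```

Raising the threshold trades recall for precision; the point of the paper's choice is that 12 inliers already drives false positives to essentially zero.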
Figure 8 compares the recall at precision one for the three schemes: send features (CHoG), send features (SIFT), and send image (JPEG). For the JPEG scheme, the bit rate is varied by changing the compression quality. For the SIFT scheme, we extract the SIFT descriptors on the mobile device and transmit each descriptor uncompressed as 1,024 bits. For the CHoG scheme, we need to transmit only about 60 bits per descriptor across the network. For the SIFT and CHoG schemes, we sweep the recall-bit-rate curve by varying the number of descriptors transmitted. For a given budget of
[FIG7] Example image pairs from the data set. (a) A clean database picture is matched against (b) a real-world picture with various distortions.
[FIG8] Bit-rate comparisons of different schemes (recall (%) vs. bit rate (kB); curves for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data are an order of magnitude smaller compared to the JPEG images or uncompressed SIFT descriptors.
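The order-of-magnitude claim is easy to sanity-check with arithmetic; the 500-descriptor query budget below is an assumed example, not a number from the paper:

```python
n_descriptors = 500                 # assumed query feature budget

sift_bits = n_descriptors * 1024    # uncompressed SIFT: 1,024 bits each
chog_bits = n_descriptors * 60      # CHoG: about 60 bits each

sift_kb = sift_bits / 8 / 1024      # 62.5 kB, comparable to a JPEG query image
chog_kb = chog_bits / 8 / 1024      # about 3.7 kB, in line with the 3-4 kB
                                    # compressed-feature queries measured later

ratio = sift_kb / chog_kb           # roughly 17x: an order of magnitude
```

This is why "send features" only wins once the descriptors are compressed: uncompressed SIFT uploads cost about as much as sending the JPEG itself.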
Girod et al. IEEE MultiMedia 2011
Stanford Product Search System
• Performance evaluation – Recall vs. bit rate
Oge Marques Girod et al. IEEE MultiMedia 2011
by approximately a factor of two. Moreover, transmission of features allows yet another optimization: it's possible to use progressive transmission of image features, and let the server execute searches on a partial set of features, as they arrive.15 Once the server finds a result that has a sufficiently high matching score, it terminates the search and immediately sends the results back. The use of this optimization reduces system latency by another factor of two.

Overall, the SPS system demonstrates that, using the described array of technologies, mobile visual-search systems can achieve high recognition accuracy, scale to realistically large databases, and deliver search results in an acceptable time.
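The early-termination idea can be sketched as follows; the batching, score function, and threshold are all invented for illustration:

```python
def progressive_search(batches, score, threshold):
    """Run the search on a growing partial set of features as batches arrive
    over the network, and stop as soon as the match is confident enough."""
    received = []
    for batch in batches:
        received.extend(batch)
        confidence = score(received)
        if confidence >= threshold:            # early exit: reply immediately
            return confidence, len(received)
    return score(received), len(received)      # had to wait for everything

# Toy score: confidence grows with the number of features matched so far.
score = lambda feats: min(1.0, len(feats) / 50)

confidence, features_used = progressive_search(
    [list(range(20))] * 5, score, threshold=0.9)
# Here the search stops after 3 of 5 batches (60 of 100 features).
```

The latency win comes from overlapping transmission with search: the server never waits for features it turns out not to need.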
Emerging MPEG standard

As we have seen, key component technologies for mobile visual search already exist, and we can choose among several possible architectures to design such a system. We have shown these options at the beginning, in Figure 2. The architecture shown in Figure 2a is the easiest one to implement on a mobile phone, but it requires fast networks such as Wi-Fi to achieve good performance. The architecture shown in Figure 2b reduces network latency and allows fast response over today's 3G networks, but requires descriptors to be extracted on the phone. Many applications might be accelerated further by using a cache of the database on the phone, as exemplified by the architecture shown in Figure 2c.

However, this immediately raises the question of interoperability. How can we enable mobile visual search applications and databases across a broad range of devices and platforms, if the information is exchanged in the form of compressed visual descriptors rather than images? This question was initially posed during the Workshop on Mobile Visual Search, held at Stanford University in December 2009. This discussion led to a formal request by the US delegation to MPEG, suggesting that the potential interest in a standard for visual search applications be explored.16 As a result, an exploratory activity in MPEG was started, which produced a series of documents in the subsequent year describing applications, use cases, objectives, scope, and requirements for a future standard.17

As MPEG exploratory work progressed, it was recognized that the suite of existing MPEG technologies, such as MPEG-7 Visual, does not yet include tools for robust image-based retrieval and that a new standard should therefore be defined. It was further recognized
Figure 7. Comparison of different schemes with regard to classification accuracy and query size (classification accuracy (%) vs. query size (Kbytes); curves for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data is an order of magnitude smaller compared to JPEG images or uncompressed SIFT descriptors.
Figure 8. End-to-end latency for different schemes (response time in seconds, broken down into feature extraction, network transmission, and retrieval, for JPEG (3G), feature (3G), feature progressive (3G), JPEG (WLAN), and feature (WLAN)). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves an approximately fourfold reduction in system latency over a 3G network.
Industry and Standards
Stanford Product Search System
• Performance evaluation – Processing times
Oge Marques
features, the descriptors with the highest Hessian response are transmitted. The descriptors are transmitted in the order imposed by the LHC scheme discussed in "Location Histogram Coding."

First, we observe that a recall of 96% is achieved at the highest bit rate for challenging query images, even with a million images in the database. Second, we observe that the performance of the JPEG scheme deteriorates rapidly at low bit rates, because interest-point detection fails due to JPEG compression artifacts. Third, we note that transmitting uncompressed SIFT data is almost always more expensive than transmitting JPEG-compressed images. Finally, we observe that the amount of data for CHoG descriptors is an order of magnitude smaller than for JPEG images or SIFT descriptors at the same retrieval accuracy.

SYSTEM LATENCY
The system latency can be broken down into three components: processing delay on the client, transmission delay, and processing delay on the server.

CLIENT- AND SERVER-PROCESSING DELAY
We show the time taken for the different operations on the client and server in Table 3. The send features mode requires approximately 1 s for feature extraction on the client. However, this increase in client-processing time is more than compensated by the decrease in transmission latency compared to send image mode, as illustrated in Figures 9 and 10. On the server, using VT matching with a compressed inverted index, we can search through a million-image database in 100 ms. We perform GV on a short list of 50 candidates after fast geometric reranking of the top 500 candidate images. We can achieve <1 s server processing latency while maintaining high recall.
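At its core, the vocabulary-tree (VT) lookup reduces to an inverted index from visual words to database images; a minimal sketch with made-up word IDs and image names:

```python
from collections import defaultdict

def build_inverted_index(db):
    """Map each visual word to the set of images containing it."""
    index = defaultdict(set)
    for image_id, words in db.items():
        for w in words:
            index[w].add(image_id)
    return index

def vt_match(query_words, index):
    """Vote for database images that share visual words with the query,
    returning candidates best-first for geometric reranking and GV."""
    votes = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)

index = build_inverted_index({
    "cd_001":   {1, 5, 9},
    "cd_002":   {2, 5},
    "book_007": {3, 7, 9},
})
candidates = vt_match({5, 9, 3}, index)   # cd_001 and book_007 lead, 2 votes each
```

The real system adds TF-IDF weighting and index compression, but the reason a million-image search fits in 100 ms is visible here: only images sharing at least one visual word with the query are ever touched.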
TRANSMISSION DELAY
The transmission delay depends on the type of network used. In Figure 10, we observe that the data transmission time is insignificant for a WLAN network because of the high bandwidth available. However, the transmission time turns out to be a bottleneck for 3G networks. In Figure 9, we present the experimental results for sending data over a 3G wireless network. We vary query data sizes from that of typical compressed query features (3–4 kB) to typical JPEG query images (50 kB) to learn how query size affects transmission time. The communication time-out was set to 60 s. We conducted the experiment continuously for several days, at three different locations typical of where a user might use the visual search application.

The median and average transmission latencies of our experiments are shown in Figure 9. Sending the compressed query features typically takes 3–4 s. The time required to send the compressed query image is several times longer and varies significantly across locations. However, the transmission delay does not include the cases when the communication
[FIG9] Measured transmission latency (a) and time-out percentage (b) for transmitting queries of different size over a 3G network (transmission latency (s) and communication time-out (%) vs. query data size (kB); curves for Indoor (I), Indoor (II), and Outdoor, average and median). Indoor (I) is tested indoors with poor connectivity. Indoor (II) is tested indoors with good reception. Outdoor is tested outside of buildings.
[TABLE 3] PROCESSING TIMES.

Client-side operations                                     Time (s)
Image capture                                              1–2
Feature extraction and compression (send features mode)    1–1.5

Server-side operations                                     Time (ms)
Feature extraction (send image mode)                       100
VT matching                                                100
Fast geometric reranking (per image)                       0.46
GV (per image)                                             30
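Putting Table 3 together with the measured 3G transmission times gives a rough end-to-end budget for send features mode; the midpoint values below are assumptions chosen from the ranges quoted above:

```python
# Back-of-the-envelope latency for send features mode over 3G.
capture_s = 1.5          # image capture: 1-2 s (midpoint assumed)
extraction_s = 1.25      # on-phone extraction + compression: 1-1.5 s
transmit_3g_s = 3.5      # sending 3-4 kB of features over 3G: 3-4 s
server_s = 1.0           # server-side budget: under 1 s per the text

total_s = capture_s + extraction_s + transmit_3g_s + server_s  # 7.25 s
```

Even with features instead of a JPEG, network transmission remains the largest single term on 3G, which is exactly why the progressive-transmission optimization above is worth the complexity.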
Girod et al. IEEE MultiMedia 2011; Tsai et al. ACM MM 2010
Stanford Product Search System
• Performance evaluation – End-to-end latency
Oge Marques Girod et al. IEEE MultiMedia 2011
Examples of commercial MVS apps

• Google Goggles
– Android and iPhone
– Narrow-domain search and retrieval

Oge Marques http://www.google.com/mobile/goggles
SnapTell

• One of the earliest (ca. 2008) MVS apps for iPhone
– Eventually acquired by Amazon (A9)
• Proprietary technique ("highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)")

Oge Marques http://www.snaptell.com/technology/index.htm
oMoby (and the IQ Engines API)

– iPhone app

Oge Marques http://omoby.com/pages/screenshots.php
oMoby (and the IQ Engines API)

• The IQ Engines API: "vision as a service"

Oge Marques http://www.iqengines.com/applications.php
The IQEngines API demo app
• Screenshots
Oge Marques
The IQEngines API demo app
• XML-formatted response
Oge Marques
Kooaba: Déjà Vu and Paperboy
• “Image recognition in the cloud” platform
Oge Marques http://www.kooaba.com/en/home/developers
Kooaba: Déjà Vu and Paperboy

• Déjà Vu
– Enhanced digital memories / notes / journal
• Paperboy
– News sharing from printed media

Oge Marques http://www.kooaba.com/en/products/dejavu http://www.kooaba.com/en/products/paperboy
pixlinQ

• A "mobile visual search solution that enables you to link users to digital content whenever they take a mobile picture of your printed materials."
– Powered by image recognition from LTU technologies

Oge Marques http://www.pixlinq.com/home
pixlinQ
• Example app (La Redoute)
Oge Marques http://www.youtube.com/watch?v=qUZCFtc42Q4
Moodstocks

• Real-time mobile image recognition that works offline (!)
• API and SDK available

Oge Marques http://www.youtube.com/watch?v=tsxe23b12eU
Moodstocks
• Many successful apps for different platforms
Oge Marques http://www.moodstocks.com/gallery/
Concluding thoughts
Concluding thoughts
• Mobile Visual Search (MVS) is coming of age.
• This is not a fad and it can only grow.
• Still a good research topic
– Many relevant technical challenges
– MPEG efforts have just started
• Infinite creative commercial possibilities
Oge Marques
Side note
• The power of Twitter…
Oge Marques