Mobile Visual Search
Transcript of Mobile Visual Search
Mobile Visual Search
Oge Marques Florida Atlantic University
Boca Raton, FL - USA
TEWI Kolloquium – 24 Jan 2012
Take-home message
Oge Marques
Mobile Visual Search (MVS) is a fascinating research field with many open challenges and opportunities which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos) using mobile devices.
Outline
• This talk is structured in four parts:
1. Opportunities
2. Basic concepts
3. Technical details
4. Examples and applications
Oge Marques
Part I
Opportunities
Mobile visual search: driving factors
• Age of mobile computing
Oge Marques http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
Mobile visual search: driving factors
• Smartphone market
Oge Marques http://www.idc.com/getdoc.jsp?containerId=prUS23123911
Mobile visual search: driving factors
• Smartphone market
Oge Marques http://www.cellular-news.com/story/48647.php?s=h
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone…
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone…
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Powerful devices
1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset
Oge Marques http://www.apple.com/iphone/specs.html http://www.gsmarena.com/apple_iphone_4s-4212.php
Mobile visual search: driving factors
• Social networks and mobile devices (May 2011)
Oge Marques http://jess3.com/geosocial-universe-2/
Mobile visual search: driving factors
• Social networks and mobile devices
– Motivated users: image taking and image sharing are huge!
Oge Marques http://www.onlinemarketing-trends.com/2011/03/facebook-photo-statistics-and-insights.html
Mobile visual search: driving factors
• Instagram:
– 13 million registered (although not necessarily active) users (in 13 months)
– 7 employees
– Several apps based on it!
Oge Marques http://venturebeat.com/2011/11/18/instagram-13-million-users/
Mobile visual search: driving factors
• Food photo sharing!
Oge Marques http://mashable.com/2011/05/09/foodtography-infographic/
Mobile visual search: driving factors
• Legitimate (or not quite…) needs and use cases
Oge Marques http://www.slideshare.net/dtunkelang/search-by-sight-google-goggles https://twitter.com/#!/courtanee/status/14704916575
Mobile visual search: driving factors
• A natural use case for content-based image retrieval (CBIR) with query by example (QBE), at last!
– The example is right in front of the user!
Oge Marques Girod et al. IEEE Multimedia 2011
IEEE SIGNAL PROCESSING MAGAZINE [62] JULY 2011
• The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as query.
• The mobile client downloads data from the server, and all image matching is performed on the device.
One could also imagine a hybrid of the approaches mentioned above. When the database is small, it can be stored on the phone, and image-retrieval algorithms can be run locally [8]. When the database is large, it has to be placed on a remote server and the retrieval algorithms are run remotely.
In each case, the retrieval framework has to work within stringent memory, computation, power, and bandwidth constraints of the mobile device. The size of the data transmitted over the network needs to be as small as possible to reduce network latency and improve user experience. The server latency has to be low as we scale to large databases. This article reviews the recent advances in content-based image retrieval with a focus on mobile applications. We first review large-scale image retrieval, highlighting recent progress in mobile visual search. As an example, we then present the Stanford Product Search system, a low-latency interactive visual search system. Several sidebars in this article invite the interested reader to dig deeper into the underlying algorithms.
ROBUST MOBILE IMAGE RECOGNITION
Today, the most successful algorithms for content-based image retrieval use an approach that is referred to as bag of features (BoFs) or bag of words (BoWs). The BoW idea is borrowed from text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words. In the database, the document itself can be likewise represented by a bag of salient words, regardless of where these words appear in the text. For images, robust local features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of the retrieval pipeline. However, the variability of features extracted from different images of the same object makes the problem much more challenging.
A typical pipeline for image retrieval is shown in Figure 2. First, the local features are extracted from the query image. The set of image features is used to assess the similarity between query and database images. For mobile applications, individual features must be robust against geometric and photometric distortions encountered when the user takes the query photo from a different viewpoint and with different lighting compared to the corresponding database image.
Next, the query features are quantized [9]–[12]. The partitioning into quantization cells is precomputed for the database, and each quantization cell is associated with a list of database images in which the quantized feature vector appears somewhere. This inverted file circumvents a pairwise comparison of each query feature vector with all the feature vectors in the database and is the key to very fast retrieval. Based on the number of features they have in common with the query image, a short list of potentially similar images is selected from the database.
Finally, a geometric verification (GV) step is applied to the most similar matches in the database. The GV finds a coherent spatial pattern between features of the query image and the candidate database image to ensure that the match is plausible. Example retrieval systems are presented in [9]–[14].
For mobile visual search, there are considerable challenges to provide the users with an interactive experience. Current deployed systems typically transmit an image from the client to the server, which might require tens of seconds. As we scale to large databases, the inverted file index becomes very large, with memory swapping operations slowing down the feature-matching stage. Further, the GV step is computationally expensive and thus increases the response time. We discuss each block of the retrieval pipeline in the following, focusing on how to meet the challenges of mobile visual search.
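The quantization-plus-inverted-file idea above can be sketched in a few lines. This is a toy illustration with hypothetical image names and precomputed visual-word IDs; a real system would quantize high-dimensional local descriptors against a trained codebook first.

```python
from collections import defaultdict

def build_inverted_index(database):
    """database maps image_id -> set of visual-word IDs found in that image."""
    index = defaultdict(list)
    for image_id, words in database.items():
        for w in words:
            index[w].append(image_id)  # inverted file: word -> list of images
    return index

def shortlist(index, query_words, size=2):
    """Rank database images by the number of visual words shared with the query."""
    scores = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, []):
            scores[image_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:size]

# Hypothetical database; the short list would then go to geometric verification.
database = {"cd_cover": {3, 17, 42, 99}, "poster": {5, 17, 63}, "book": {1, 2, 42}}
index = build_inverted_index(database)
print(shortlist(index, {17, 42, 99}))  # "cd_cover" shares three words, ranks first
```

Note that only the query's word IDs are touched, never every database feature vector; this is the property that makes inverted-file lookup fast.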
[FIG1] A snapshot of an outdoor mobile visual search system being used. The system augments the viewfinder with information about the objects it recognizes in the image taken with a camera phone.
[FIG2] A pipeline for image retrieval (Query Image → Feature Extraction → Feature Matching against the Database → Geometric Verification). Local features are extracted from the query image. Feature matching finds a small set of images in the database that have many features in common with the query image. The GV step rejects all matches with feature locations that cannot be plausibly explained by a change in viewing position.
MOBILE IMAGE-RETRIEVAL APPLICATIONS POSE A UNIQUE SET OF CHALLENGES.
Part II
Basic concepts
MVS: technical challenges
• How to ensure low latency (and interactive queries) under constraints such as:
– Network bandwidth
– Computational power
– Battery consumption
• How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc.
• How to handle broad and narrow domains
Oge Marques
MVS: Pipeline for image retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
3 scenarios
Oge Marques Girod et al. IEEE Multimedia 2011
Part III
Technical details
Part III - Outline
• The MVS pipeline in greater detail
• Datasets for MVS research
• MPEG Compact Descriptors for Visual Search (CDVS)
Oge Marques
MVS: descriptor extraction
• Interest point detection
• Feature descriptor computation
Oge Marques Girod et al. IEEE Multimedia 2011
Interest point detection
• Numerous interest-point detectors have been proposed in the literature:
– Harris corners (Harris and Stephens 1988)
– Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004)
– Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002)
– Hessian affine (Mikolajczyk et al. 2005)
– Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006)
– Hessian blobs (Bay, Tuytelaars, and Van Gool 2006)
– etc.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
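As a concrete illustration of the fastest detector in the list, here is a minimal sketch of the FAST segment test (FAST-9 variant): a pixel is a corner if at least 9 contiguous pixels on a radius-3 circle around it are all brighter than center + t or all darker than center − t. The circle offsets are the standard radius-3 Bresenham circle; the threshold and toy image are made up for illustration, and there is no non-maximum suppression.

```python
# Radius-3 Bresenham circle: 16 (row, col) offsets around the candidate pixel.
CIRCLE16 = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
            (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def longest_circular_run(flags):
    """Longest run of True values, treating the sequence as circular."""
    doubled, best, run = flags + flags, 0, 0
    for f in doubled:
        run = run + 1 if f else 0
        best = max(best, run)
    return min(best, len(flags))

def is_fast_corner(img, r, c, t=20, n=9):
    center = img[r][c]
    vals = [img[r + dr][c + dc] for dr, dc in CIRCLE16]
    brighter = [v > center + t for v in vals]
    darker = [v < center - t for v in vals]
    return longest_circular_run(brighter) >= n or longest_circular_run(darker) >= n

# Bright square on a dark background: its corner fires, an edge midpoint does not.
img = [[255 if 2 <= r <= 12 and 2 <= c <= 12 else 0 for c in range(16)]
       for r in range(16)]
print(is_fast_corner(img, 12, 12))  # corner of the square -> True
print(is_fast_corner(img, 12, 7))   # middle of an edge    -> False
```

The per-pixel cost is a handful of comparisons, which is why FAST is so cheap; the price, as noted above, is lower repeatability than DoG-style detectors.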
Interest point detection
• Different tradeoffs in repeatability and complexity:
– SIFT DoG and other affine interest-point detectors are slow to compute but are highly repeatable.
– The SURF interest-point detector provides a significant speedup over DoG interest-point detectors by using box filters and integral images for fast computation.
• However, the box filter approximation causes significant anisotropy, i.e., the matching performance varies with the relative orientation of query and database images.
– The FAST corner detector is extremely fast to compute but offers very low repeatability.
• See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• After interest-point detection, we compute a visual word descriptor on a normalized patch.
• Ideally, descriptors should be:
– robust to small distortions in scale, orientation, and lighting conditions;
– discriminative, i.e., characteristic of an image or a small set of images;
– compact, due to typical mobile computing constraints.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• Examples of feature descriptors in the literature:
– SIFT (Lowe 1999)
– Speeded Up Robust Features (SURF) (Bay et al. 2008)
– Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005)
– Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010)
• See (Winder and Brown CVPR 2007), (Winder, Hua, and Brown CVPR 2009), and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• What about compactness?
– Several attempts in the literature to compress off-the-shelf descriptors did not lead to the best rate-constrained image-retrieval performance.
– Alternative: design a descriptor with compression in mind.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010)
– Based on the distribution of gradients within a patch of pixels
– Histogram-of-gradients (HoG)-based descriptors [e.g., (Lowe 2004), (Bay et al. 2008), (Dalal and Triggs 2005), (Freeman and Roth 1994), and (Winder et al. 2009)] have been shown to be highly discriminative at low bit rates.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
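The gradient-histogram idea behind CHoG-style descriptors can be sketched as follows. The 2×2 spatial binning and 4-bin angle quantizer are toy choices for illustration, not the actual CHoG bin configurations, and the compression stage (quantizing each bin's gradient distribution) is omitted.

```python
import math

def gradient_histograms(patch, spatial=2, angle_bins=4):
    """Per spatial bin, a normalized histogram of gradient orientations."""
    h, w = len(patch), len(patch[0])
    hists = [[0.0] * angle_bins for _ in range(spatial * spatial)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]  # central differences
            dy = patch[y + 1][x] - patch[y - 1][x]
            if dx == 0 and dy == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            a = min(int(angle / (2 * math.pi) * angle_bins), angle_bins - 1)
            s = (spatial * y // h) * spatial + (spatial * x // w)  # spatial bin
            hists[s][a] += math.hypot(dx, dy)  # magnitude-weighted vote
    out = []
    for hist in hists:  # normalize each bin into a gradient distribution
        total = sum(hist) or 1.0
        out.append([v / total for v in hist])
    return out

# Vertical edge: all gradient energy points along +dx, i.e., angle bin 0.
patch = [[0] * 4 + [100] * 4 for _ in range(8)]
descriptor = gradient_histograms(patch)
```

In CHoG it is these per-bin gradient distributions, rather than raw histogram values, that are compressed into the final low-bit-rate descriptor.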
CHoG: Compressed Histogram of Gradients
Oge Marques Chandrasekhar et al. CVPR 09,10 Bernd Girod: Mobile Visual Search
[Figure] CHoG descriptor extraction: patch → gradients (dx, dy) → spatial binning → gradient distributions for each spatial bin → histogram compression → compressed CHoG descriptor bitstream.
Encoding descriptor’s location information
• Location Histogram Coding (LHC)
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
and higher query times, as longer inverted files need to be traversed due to the smaller vocabulary size.
As database size increases, the amount of memory used to index the database features can become very large. Thus, developing a memory-efficient indexing structure is a problem of increasing interest. Chum et al. use a set of compact min-hashes to perform near-duplicate image retrieval [52], [53]. Zhang et al. decompose each image's set of features into coarse and refinement signatures [54]. The refinement signature is subsequently indexed by an LSH. Schemes that take advantage of the structure of the database have been proposed recently in [55]–[57]. These schemes are typically applied to databases where there is a lot of redundancy, e.g., each object is represented by images taken from multiple viewpoints. The size of the inverted index is reduced by using geometry to find matching features across images, and only retaining useful features and discarding irrelevant clutter features.
To support the popular VT-scoring framework, inverted index compression methods for both hard-binned and soft-binned VTs have been developed by us [58], as explained in "Inverted Index Compression." The memory for BoF image signatures can alternatively be reduced using the mini-BoF approach [59]. Very recently, visual word residuals on a small BoF codebook have shown promising retrieval results with low memory usage [60], [61]. The residuals are indexed either with PCA and product quantizers [60] or with LSH [61].
LOCATION HISTOGRAM CODING
LHC is used to compress feature location data efficiently. We note that the interest points in the images are spatially clustered, as shown in Figure S3.
To encode location data, we first generate a two-dimensional (2-D) histogram from the locations of the descriptors (Figure S4). We divide the image into spatial bins and count the number of features within each spatial bin. We compress the binary map, indicating which spatial bins contain features, and a sequence of feature counts, representing the number of features in occupied bins. We encode the binary map using a trained context-based arithmetic coder, with the neighboring bins being used as the context for each spatial bin.
LHC provides two key benefits. First, encoding the locations of a set of N features as a histogram reduces the bit rate by log2(N!) compared to encoding each feature location in sequence [47]. Here, we exploit the fact that the features can be sent in any order: there are N! orderings that represent the same feature set, so fixing the ordering, i.e., using the LHC scheme described earlier, achieves a bit savings of log2(N!). For example, for a set of 750 features, we achieve a rate savings of log2(750!)/750 ≈ 8 b/feature.
Second, with LHC, we exploit the spatial correlation between the locations of different descriptors, as illustrated in Figure S3. Different interest-point detectors result in different coding gains. In our experiments, Hessian Laplace has the highest gain, followed by the SIFT and SURF interest-point detectors. Even if the feature points are uniformly scattered in the image, LHC is still able to exploit the ordering gain, which results in a log2(N!) saving in bits.
In our experiments, we found that quantizing the (x, y) location to four-pixel blocks is sufficient for GV. With a simple fixed-length coding scheme, the rate would be log2(640/4) + log2(640/4) ≈ 14 b/feature for a VGA-size image. Using LHC, we can transmit the same location data with ≈ 5 b/descriptor: a ≈ 12.5× reduction in data compared to a 64-b floating-point representation and a ≈ 2.8× rate reduction compared to fixed-length coding [48].
[FIG S3] Interest-point locations in images tend to cluster spatially.
[FIG S4] We represent the location of the descriptors using a location histogram. The image is first divided into evenly spaced blocks. We enumerate the features within each spatial block by generating a location histogram.
– Rationale: Interest-point locations in images tend to cluster spatially.
Encoding descriptor’s location information
• Method:
1. Generate a 2D histogram from the locations of the descriptors.
• Divide the image into spatial bins and count the number of features within each spatial bin.
2. Compress the binary map, indicating which spatial bins contain features, and a sequence of feature counts, representing the number of features in occupied bins.
3. Encode the binary map using a trained context-based arithmetic coder, with the neighboring bins being used as the context for each spatial bin.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
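The steps above (minus the trained arithmetic coder) can be sketched as follows, together with a check of the article's ≈8 b/feature ordering gain for N = 750. The block size and sample locations are illustrative.

```python
import math
from collections import Counter

def location_histogram(locations, block=4):
    """Quantize (x, y) locations into spatial blocks.

    Returns the per-block feature counts and the binary occupancy map
    (the set of occupied blocks), which an entropy coder would compress.
    """
    counts = Counter((x // block, y // block) for x, y in locations)
    return counts, set(counts)

def ordering_gain_bits_per_feature(n):
    """log2(n!) / n: bits saved per feature by sending locations unordered."""
    return math.lgamma(n + 1) / math.log(2) / n

counts, occupied = location_histogram([(10, 12), (11, 13), (300, 200)])
print(counts[(2, 3)])  # two nearby features fall into the same 4-pixel block
print(ordering_gain_bits_per_feature(750))  # ~8 bits/feature, as in the text
```

The spatial clustering of interest points is exactly what makes the occupancy map and counts cheap to entropy-code.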
• Location Histogram Coding (LHC)
MVS: feature indexing and matching
• Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image.
– The short list may contain false positives as long as the correct match is included.
– Slower pairwise comparisons can be subsequently performed on just the short list of candidates rather than the entire database.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: feature indexing and matching
• Vocabulary Tree (VT)-Based Retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
During a query, scoring the database images can be made fast by using an inverted index associated with the BoF codebook. To generate a much larger codebook, Nister and Stewenius use hierarchical k-means clustering to create a vocabulary tree (VT).2 Additional details about a VT can be found in the "Vocabulary Tree-Based Retrieval" sidebar. Several alternative search techniques, such as locality-sensitive hashing and various improvements in tree-based approaches, have also been developed.8-11
Geometric verification
Geometric verification typically follows the feature-matching step. In this stage, location information of features in query and database images is used to confirm that the feature matches are consistent with a change in viewpoint between the two images. This process is illustrated in Figure 5. The geometric transform between the query and database image is usually estimated using robust regression techniques such as random sample consensus (RANSAC).
Vocabulary Tree-Based Retrieval
A vocabulary tree (VT) for a particular database is constructed by performing hierarchical k-means clustering on a set of training feature descriptors representative of the database. Initially, k large clusters are generated for all the training descriptors. Then, for each large cluster, k-means clustering is applied to the training descriptors assigned to that cluster, to generate k smaller clusters. This recursive division of the descriptor space is repeated until there are enough bins to ensure good classification performance. Figure B1 shows a VT with only two levels, branching factor k = 3, and 3^2 = 9 leaf nodes. In practice, a VT can be much larger, for example, with height 6, branching factor k = 10, and containing 10^6 = 1 million nodes.
The associated inverted index structure maintains two lists for each VT leaf node, as shown in Figure B2. For a leaf node x, there is a sorted array of image identifiers i_x1, …, i_xNx indicating which Nx database images have features that belong to the cluster associated with this node. Similarly, there is a corresponding array of counters c_x1, …, c_xNx indicating how many features in each image fall in the same cluster.
During a query, the VT is traversed for each feature in the query image, finishing at one of the leaf nodes. The corresponding lists of images and frequency counts are subsequently used to compute similarity scores between these images and the query image. By pulling images from all these lists and sorting them according to the scores, we arrive at a subset of database images that is likely to contain a true match to the query image.
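A minimal sketch of this traversal and scoring, using a hand-built two-level tree (k = 3, 9 leaves) over scalar "descriptors" for readability; a real VT would use k-means-trained centroids over high-dimensional descriptors, and a weighted (e.g., TF-IDF) score.

```python
from collections import Counter, defaultdict

# Hand-picked centroids: 3 children of the root, each with 3 leaf children.
LEVEL1 = [0.0, 10.0, 20.0]
LEVEL2 = {0.0: [-2.0, 2.0, 5.0], 10.0: [8.0, 12.0, 16.0], 20.0: [22.0, 26.0, 30.0]}

def quantize(feature):
    """Greedy root-to-leaf descent: pick the nearest child at each level."""
    c1 = min(LEVEL1, key=lambda c: abs(feature - c))
    return min(LEVEL2[c1], key=lambda c: abs(feature - c))  # leaf id

def build_index(database):
    """Inverted index: for each leaf node, which images hit it and how often."""
    index = defaultdict(Counter)
    for image_id, features in database.items():
        for f in features:
            index[quantize(f)][image_id] += 1
    return index

def score(index, query_features):
    """Count leaf nodes shared between the query and each database image."""
    scores = Counter()
    for f in query_features:
        for image_id in index[quantize(f)]:
            scores[image_id] += 1
    return scores.most_common()

db = {"A": [1.9, 8.1, 12.3], "B": [25.7, 30.2, 16.0]}
index = build_index(db)
print(score(index, [2.2, 11.8]))  # image "A" shares two leaves with the query
```

Only 2k distance computations per feature are needed to reach a leaf, instead of comparing against all k^2 leaves; this logarithmic descent is what lets VTs scale to millions of visual words.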
Figure B. (1) Vocabulary tree and (2) inverted index structures.
MVS: geometric verification
• Goal: use location information of features in query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Method: perform pairwise matching of feature descriptors and evaluate geometric consistency of correspondences.
Oge Marques
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incorporating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images, before a full GV is performed on a shorter candidate list.
To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see "Fast Geometric Reranking"). The main advantage of the scheme is that it only requires x, y feature location data and does not use scale
INVERTED INDEX COMPRESSION
For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-b unsigned integer, and each fractional count can be stored in a 32-b float in the inverted index. The memory usage of the entire inverted index is then sum_k N_k × 64 bits, where N_k is the length of the inverted list at the kth leaf node. For a database of 1 million product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index's memory usage exceeds the server's available random access memory (RAM), swapping between main and virtual memory occurs, which significantly slows down all processes.
A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs {i_k1, i_k2, …, i_kNk} is sorted, it is more efficient to store the consecutive ID differences {d_k1 = i_k1, d_k2 = i_k2 − i_k1, …, d_kNk = i_kNk − i_k(Nk−1)} in place of the IDs. This practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual search applications, a scheme that allows ultra-fast decoding is often preferred over AC. The carryover code [64] and recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.
Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
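The first and third ideas (ID-difference coding plus variable-length coding) can be sketched as follows. A simple byte-oriented varint stands in for the arithmetic, carryover, or RBUC codes actually used; the ID list is made up.

```python
def deltas(sorted_ids):
    """d1 = i1, dk = ik - i(k-1): small numbers when IDs are densely packed."""
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def varint_bytes(n):
    """Bytes needed to encode n >= 0 with 7 payload bits per byte."""
    count = 1
    while n >= 128:
        n >>= 7
        count += 1
    return count

ids = [17, 19, 20, 24, 130, 131]  # one leaf's sorted image-ID list
diffs = deltas(ids)               # [17, 2, 1, 4, 106, 1]
compressed = sum(varint_bytes(d) for d in diffs)  # 6 bytes
uncompressed = 4 * len(ids)                       # 24 bytes as 32-b integers
print(compressed, uncompressed)
```

Decoding is a prefix sum over the differences, so lists can be reconstructed sequentially at query time; the word-aligned codes cited above exist precisely to make that decode step fast.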
[FIG S6] (a) Memory usage for inverted index with and without compression. A five times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
[FIG5] An image retrieval pipeline (Query Data → VT → Geometric Reranking → GV → Identify Information) can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green. Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Techniques:
– The geometric transform between the query and database image is usually estimated using robust regression techniques such as:
• Random sample consensus (RANSAC) (Fischler and Bolles 1981)
• Hough transform (Lowe 2004)
– The transformation is often represented by an affine mapping or a homography.
• GV is computationally expensive, which is why it is applied only to the subset of images selected during the feature-matching stage.
Oge Marques Girod et al. IEEE Multimedia 2011
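A minimal RANSAC-style verification sketch. For brevity it fits a similarity transform (rotation + scale + translation, modeled as complex w = a·z + b, exactly determined by two correspondences) rather than the affine or homography models mentioned above, and the correspondences are synthetic.

```python
import random

def fit_similarity(z1, w1, z2, w2):
    """Exact similarity transform w = a*z + b through two correspondences."""
    a = (w2 - w1) / (z2 - z1)
    return a, w1 - a * z1

def ransac_inliers(pairs, iters=200, tol=1.0, seed=0):
    """Keep the largest set of correspondences consistent with one hypothesis."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (z1, w1), (z2, w2) = rng.sample(pairs, 2)  # minimal sample
        if z1 == z2:
            continue
        a, b = fit_similarity(z1, w1, z2, w2)
        inliers = [(z, w) for z, w in pairs if abs(a * z + b - w) < tol]
        if len(inliers) > len(best):
            best = inliers
    return best

# Feature locations as complex x + yj. True transform: 90-degree rotation plus
# a shift, w = 1j*z + 5; two wrong matches play the role of false positives.
good = [(z, 1j * z + 5) for z in [1+1j, 4+2j, 2+5j, 7+3j, 3+6j, 6+6j]]
bad = [(0+0j, 50+50j), (9+9j, -40+0j)]
inliers = ransac_inliers(good + bad)
print(len(inliers))  # 6: the consistent matches survive, the outliers do not
```

A candidate database image is accepted when the inlier count is high; as the slide notes, this is too expensive to run against more than a short list.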
MVS: geometric reranking
• A speed-up step inserted between VT-based feature matching and Geometric Verification.
Oge Marques
IEEE SIGNAL PROCESSING MAGAZINE [69] JULY 2011
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incor-porating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images, before a full GV is performed on a shorter candidate list.
To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see “Fast Geometric Reranking”). The main advantage of the scheme is that it only requires x, y fea-ture location data and does not use scale
INVERTED INDEX COMPRESSION For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-b unsigned integer, and each fractional count can be stored in a 32-b float in the inverted index. The memory usage of the entire inverted index is gk
k51 Nk# 64 bits, where Nk is the length of
the inverted list at the kth leaf node. For a database of 1 mil-lion product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index’s memory usage exceeds the server’s available random access memory (RAM), swap-ping between main and virtual memory occurs, which signifi-cantly slows down all processes.
A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs {i_k1, i_k2, …, i_kN_k} is sorted, it is more efficient to store consecutive ID differences {d_k1 = i_k1, d_k2 = i_k2 − i_k1, …, d_kN_k = i_kN_k − i_k(N_k−1)} in place of the IDs. This
practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual search applications, a scheme that allows ultrafast decoding is often preferred over AC. The carryover code [64] and recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.
Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five-times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
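The first and third steps above (storing sorted ID lists as consecutive differences, then coding the differences with a variable-length code) can be sketched in Python. This is an illustrative toy encoder: it uses a simple 7-bit-per-byte varint in place of the AC, carryover, or RBUC codes discussed in the text, and the function names are ours, not from [58].

```python
def delta_encode(ids):
    """Replace a sorted list of image IDs by consecutive differences:
    d_1 = i_1, d_k = i_k - i_(k-1)."""
    prev, out = 0, []
    for i in ids:
        out.append(i - prev)
        prev = i
    return out

def delta_decode(deltas):
    """Recover the original IDs by a running sum."""
    ids, total = [], 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids

def varint_encode(deltas):
    """Variable-length byte code: 7 data bits per byte, high bit = 'more bytes'.
    Small deltas (the common case in a sorted inverted list) take one byte
    instead of the 4 bytes of a 32-bit integer."""
    buf = bytearray()
    for d in deltas:
        while d >= 0x80:
            buf.append((d & 0x7F) | 0x80)
            d >>= 7
        buf.append(d)
    return bytes(buf)

def varint_decode(buf):
    """Inverse of varint_encode."""
    deltas, d, shift = [], 0, 0
    for b in buf:
        d |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            deltas.append(d)
            d, shift = 0, 0
    return deltas
```

For a toy inverted list [3, 17, 21, 200], the varint-coded deltas occupy 5 bytes instead of the 16 bytes of four 32-bit IDs; codes such as RBUC trade a little of this compression for word-aligned, faster decoding.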
[FIG S6] (a) Memory usage for inverted index with and without compression. A five times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
Query Data → VT → Geometric Reranking → GV → Identification Information
[FIG5] An image retrieval pipeline can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green.
Fast geometric reranking

• The location geometric score is computed as follows: (a) features of two images are matched based on VT quantization; (b) distances between pairs of features within an image are calculated; (c) log-distance ratios of the corresponding pairs (denoted by color) are calculated; and (d) a histogram of log-distance ratios is computed.
• The maximum value of the histogram is the geometric similarity score.
  – A peak in the histogram indicates a similarity transform between the query and database image.
or orientation information as in [11]. As scale and orientation data are not used, they need not be transmitted by the client, which reduces the amount of data transferred. We typically run fast geometric reranking on a large set of candidate database images and reduce the list of images that we run RANSAC on.
Before discussing system performance results, we provide a list of important references for each module in the matching pipeline in Table 2.
SYSTEM PERFORMANCE
What performance can we expect for a mobile visual search system that incorporates all the ideas discussed so far? To answer this question, we have a closer look at the experimental Stanford Product Search System (Figure 6). For evaluation, we use a database of 1 million CD, digital versatile disk (DVD), and book cover images, and a set of 1,000 query images (500 × 500 pixel resolution) [75] exhibiting challenging photometric and geometric distortions, as shown in Figure 7. For
[TABLE 2] SUMMARY OF REFERENCES FOR MODULES IN A MATCHING PIPELINE.
MODULE LIST OF REFERENCES
FEATURE EXTRACTION HARRIS AND STEPHENS [17], LOWE [15], [23], MATAS ET AL. [18], MIKOLAJCZYK ET AL. [16], [22], DALAL AND TRIGGS [41], ROSTEN AND DRUMMOND [19], BAY ET AL. [20], WINDER ET AL. [27], [28], CHANDRASEKHAR ET AL. [25], [26], PHILBIN ET AL. [40]
FEATURE INDEXING AND MATCHING SCHMID AND MOHR [13], LOWE [15], [23], SIVIC AND ZISSERMAN [9], NISTÉR AND STEWÉNIUS [10], CHUM ET AL. [50], [52], [53], YEH ET AL. [51], PHILBIN ET AL. [12], JEGOU ET AL. [11], [59], [60], ZHANG ET AL. [54], CHEN ET AL. [58], PERRONNIN [61], MIKULIK ET AL. [55], TURCOT AND LOWE [56], LI ET AL. [57]
GV FISCHLER AND BOLLES [66], SCHAFFALITZKY AND ZISSERMAN [74], LOWE [15], [23], CHUM ET AL. [53], [70], [71], FERRARI ET AL. [68], JEGOU ET AL. [11], WU ET AL. [69], TSAI ET AL. [73]
FAST GEOMETRIC RERANKING
We have proposed a fast geometric reranking algorithm in [73] that uses x, y locations of features to rerank a short list of candidate images. First, we generate a set of potential feature matches between each query and database image based on VT-quantization results. After generating a set of feature correspondences, we calculate a geometric score between them. The process used to compute the geometric similarity score is illustrated in Figure S7. We find the distance between two features in the query image and the distance between the corresponding matching features in the database image. The ratio of the distances corresponds to the scale difference between the two images. We repeat the ratio calculation for all features in the query image that have matching database features. If there exists a consistent set of ratios (as indicated by a peak in the histogram of distance ratios), it is more likely that the query image and the database image match.
The geometric reranking is fast because we use the VT-quantization results directly to find potential feature matches and use a very simple similarity scoring scheme.
The time required to calculate a geometric similarity score is one to two orders of magnitude less than using RANSAC. Typically, we perform fast geometric reranking on the top 500 images and RANSAC on the top 50 ranked images.
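The scoring procedure described in this box can be sketched as follows. This is a simplified illustration, not the implementation from [73]: the histogram range and bin count are arbitrary choices, and all pairwise distances are used rather than any sampling.

```python
import math

def geometric_score(query_pts, db_pts, num_bins=20, lo=-3.0, hi=3.0):
    """Histogram the log-ratios of pairwise feature distances between a query
    and a database image (features already matched via VT quantization, so
    query_pts[i] corresponds to db_pts[i]). The histogram peak is the score:
    a consistent distance ratio indicates a similarity transform."""
    bins = [0] * num_bins
    n = len(query_pts)
    for i in range(n):
        for j in range(i + 1, n):
            dq = math.dist(query_pts[i], query_pts[j])  # distance in query image
            dd = math.dist(db_pts[i], db_pts[j])        # distance in database image
            if dq == 0 or dd == 0:
                continue
            r = math.log(dd / dq)
            if lo <= r < hi:
                bins[int((r - lo) / (hi - lo) * num_bins)] += 1
    return max(bins)
```

When the database image is, say, a uniformly scaled copy of the query, every pair lands in the same bin and the score equals the number of pairs; inconsistent correspondences scatter across bins and score low.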
[FIG S7] The location geometric score is computed as follows: (a) features of two images are matched based on VT quantization, (b) distances between pairs of features within an image are calculated, (c) log-distance ratios of the corresponding pairs (denoted by color) are calculated, and (d) histogram of log-distance ratios is computed. The maximum value of the histogram is the geometric similarity score. A peak in the histogram indicates a similarity transform between the query and database image.
[FIG6] Stanford Product Search system. Because of the large database, the image-recognition server is placed at a remote location. In most systems [1], [3], [7], the query image is sent to the server and feature extraction is performed. In our system, we show that by performing feature extraction on the phone we can significantly reduce the transmission delay and provide an interactive experience.
[Diagram] Client (Image → Feature Extraction → Feature Compression; Display) ↔ Network (Query Data / Identification Data) ↔ Server (VT Matching → GV)
Girod et al. IEEE MultiMedia 2011
Datasets for MVS research
• Stanford Mobile Visual Search Data Set (http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/)
  – Key characteristics:
    • rigid objects
    • widely varying lighting conditions
    • perspective distortion
    • foreground and background clutter
    • realistic ground-truth reference data
    • query data collected from heterogeneous low- and high-end camera phones
Oge Marques Chandrasekhar et al. ACM MMSys 2011
Stanford Mobile Visual Search (SMVS) Data Set
• Limitations of popular datasets
ZuBuD
Oxford
INRIA
UKY
ImageNet
Figure 2: Limitations of popular data sets in computer vision. The left-most image in each row is the database image, and the other three images are query images. ZuBuD, INRIA and UKY consist of images taken at the same time and location. ImageNet is not suitable for image retrieval applications. The Oxford data set has different façades of the same building labelled with the same name.
Fig. 2 illustrates some images for the word "tiger". Such a data set is useful for testing classification algorithms, but not so much for testing retrieval algorithms.

We summarize the limitations of the different data sets in Tab. 1. To overcome the limitations in these data sets, we propose the Stanford Mobile Visual Search (SMVS) data set.
3. STANFORD MOBILE VISUAL SEARCH DATA SET
We present the SMVS (version 0.9) data set in the hope that it will be useful for a wide range of visual search applications like product recognition, landmark recognition, outdoor augmented reality [26], business card recognition, text recognition, video recognition and TV-on-the-go [5]. We collect data for several different categories: CDs, DVDs, books, software products, landmarks, business cards, text documents, museum paintings and video clips. Sample query and database images are shown in Figure 4. Current and subsequent versions of the data set will be available at [3].

The number of database and query images for the different categories is shown in Tab. 2. We provide a total of 3,300 query images for 1,200 distinct classes across 8 image categories. Typically, a small number of query images (~1,000s) suffices to measure the performance of a retrieval system, as the rest of the database can be padded with "distractor" images. Ideally, we would like to have a large distractor set for each query category. However, it is challenging to collect distractor sets for each category. Instead, we plan to release two distractor sets upon request: one containing Flickr images, and the other containing building images from Navteq. The distractor sets will be available in sets of 1K, 10K, 100K and 1M. Researchers can test scalability using these distractor data sets, or the ones provided in [22, 14]. Next, we discuss how the query and reference database images are collected, and evaluation measures that are particularly relevant for mobile applications.
Reference Database Images.
For product categories (CDs, DVDs and books), the references are clean versions of images obtained from the product websites. For landmarks, the reference images are obtained from data collected by Navteq's vehicle-mounted cameras. For video clips, the reference images are the key frames from the reference video clips. The videos contain diverse content like movie trailers, news reports, and sports. For text documents, we collect (1) reference images from [19], a website that mines the front pages of newspapers from around the world, and (2) research papers. For business cards, the reference image is obtained from a high-quality upright scan of the card. For museum paintings, we collect data from the Cantor Arts Center at Stanford University for different genres: history, portraits, landscapes and modern art. The reference images are obtained from the artists' websites like [23] or other online sources. All reference images are high-quality JPEG-compressed color images. The resolution of reference images varies for each category.
Query Images.
We capture query images with several different camera phones, including some digital cameras. The list of companies and models used is as follows: Apple (iPhone 4), Palm (Pre), Nokia (N95, N97, N900, E63, N5800, N86), Motorola (Droid), Canon (G11) and LG (LG300). For product categories like CDs, DVDs, books, text documents and business cards, we capture the images indoors under widely varying lighting conditions over several days. We include foreground and background clutter that would typically be present in the application, e.g., a picture of a CD might have other CDs in the background. For landmarks, we capture images of buildings in San Francisco; the query images were collected several months after the reference data. For video clips, the query images were taken from laptop, computer and TV screens to include typical specular distortions. Finally, the paintings were captured at the Cantor Arts Center at Stanford University under controlled lighting conditions typical of museums.

The resolution of the query images varies for each camera phone. We provide the original JPEG-compressed high-quality color images obtained from the cameras. We also provide auxiliary information like phone model number and GPS location, where applicable. As noted in Tab. 1, the SMVS query data set has the following key characteristics that are lacking in other data sets: rigid objects, widely varying lighting conditions, perspective distortion, foreground and background clutter, realistic ground-truth reference data, and query images from heterogeneous low- and high-end camera phones.

Data Set   Database (#)  Query (#)  Classes (#)  Rigid  Lighting  Clutter  Perspective  Camera Phone
ZuBuD      1,005         115        200          ✓      ✗         ✓        ✓            ✗
Oxford     5,062         55         17           ✓      ✓         ✓        ✓            ✗
INRIA      1,491         500        500          ✗      ✗         ✓        ✓            ✗
UKY        10,200        2,550      2,550        ✓      ✗         ✗        ✓            ✗
ImageNet   11M           15K        15K          ✗      ✓         ✓        ✓            ✗
SMVS       1,200         3,300      1,200        ✓      ✓         ✓        ✓            ✓

Table 1: Comparison of different data sets. "Classes" refers to the number of distinct objects in the data set. "Rigid" refers to whether or not the objects in the database are rigid. "Lighting" refers to whether or not the query images capture widely varying lighting conditions. "Clutter" refers to whether or not the query images contain foreground/background clutter. "Perspective" refers to whether the data set contains typical perspective distortions. "Camera Phone" refers to whether the images were captured with mobile devices. SMVS is a good data set for mobile visual search applications.
Category         Database  Query
CD               100       400
DVD              100       400
Books            100       400
Video Clips      100       400
Landmarks        500       500
Business Cards   100       400
Text Documents   100       400
Paintings        100       400

Table 2: Number of query and database images in the SMVS data set for the different categories.
Evaluation measures.
A naive retrieval system would match all database images against each query image. Such a brute-force matching scheme provides an upper bound on the performance that can be achieved with the feature matching pipeline. Here, we report results for brute-force pairwise matching for different interest point detectors and descriptors using the ratio test [16] and RANSAC [9]. For RANSAC, we use affine models with a minimum threshold of 10 matches post-RANSAC for declaring a pair of images to be a valid match.
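The ratio test [16] used here can be sketched in a few lines (a toy exhaustive nearest-neighbor version over raw descriptor tuples; production systems would use approximate nearest-neighbor search, and the 0.8 threshold is a common choice rather than a value from this paper):

```python
def ratio_test_matches(query_desc, db_desc, ratio=0.8):
    """Accept a putative match only if the nearest database descriptor is
    significantly closer than the second nearest. Squared Euclidean distances
    are compared, so the threshold becomes ratio**2.
    Returns (query_index, db_index) pairs."""
    matches = []
    for qi, q in enumerate(query_desc):
        # exhaustive search: squared distance to every database descriptor
        dists = sorted((sum((a - b) ** 2 for a, b in zip(q, d)), di)
                       for di, d in enumerate(db_desc))
        if len(dists) >= 2 and dists[0][0] < (ratio ** 2) * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches
```

Ambiguous features (two database descriptors at nearly the same distance) are discarded, which is what makes the surviving correspondences reliable enough for RANSAC.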
In Fig. 3, we report results for 3 state-of-the-art schemes: (1) SIFT Difference-of-Gaussians (DoG) interest point detector and SIFT descriptor (code: [27]), (2) Hessian-affine interest point detector and SIFT descriptor (code: [17]), and (3) Fast Hessian blob interest point detector [2] sped up with integral images, and the recently proposed Compressed Histogram of Gradients (CHoG) descriptor [4]. We report the percentage of images that match, the average number of features, and the average number of features that match post-RANSAC for each category.
First, we note that indoor categories are easier than outdoor categories. E.g., some categories like CDs, DVDs and book covers achieve over 95% accuracy. The most challenging category is landmarks, as the query data were collected several months after the database.

Second, we note that option (1), the SIFT interest point detector and descriptor, performs the best. However, option (1) is computationally complex and is not suitable for implementation on mobile devices.

Third, we note that option (3) comes close to achieving the performance of (1), with worse performance (10–20% drop) for some categories. The performance hit is incurred due to the fast Hessian-based interest point detector, which is not as robust as the DoG interest point detector. One reason for lower robustness is observed in [25]: the fast box-filtering step causes the interest point detection to lose rotation invariance, which affects oriented query images. The CHoG descriptor used in option (3) is a low-bitrate 60-bit descriptor which is shown to perform on par with the 128-dimensional, 1024-bit SIFT descriptor using extensive evaluation in [4]. We note that option (3) is most suitable for implementation on mobile devices, as the fast Hessian interest point detector is an order of magnitude faster than SIFT DoG, and the CHoG descriptors generate an order of magnitude less data than SIFT descriptors for efficient transmission [10].
Finally, we list aspects critical for mobile visual search applications. A good image retrieval system should exhibit the following characteristics when tested on the SMVS data set.
• High Precision-Recall as size of database increases
• Low retrieval latency
Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• Number of query and database images per category
Oge Marques Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• DVD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html
SMVS Data Set: categories and examples
• CD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html
SMVS Data Set: categories and examples
• Museum paintings
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
Other MVS data sets
• Distractor set
  – 1 million images of various resolutions and content collected from Flickr.
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
MPEG Compact Descriptors for Visual Search (CDVS)
• Objectives
  – Define a standard that:
    • enables design of visual search applications
    • minimizes lengths of query requests
    • ensures high matching performance (in terms of reliability and complexity)
    • enables interoperability between search applications and visual databases
    • enables efficient implementation of visual search functionality on mobile devices
• Scope
  – It is envisioned that (as a minimum) the standard will specify:
    • bitstream of descriptors
    • parts of the descriptor extraction process (e.g., key-point detection) needed to ensure interoperability
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS

• Requirements
  – Robustness
    • High matching accuracy shall be achieved at least for images of textured rigid objects, landmarks, and printed documents. The matching accuracy shall be robust to changes in vantage point, camera parameters, and lighting conditions, as well as in the presence of partial occlusions.
  – Sufficiency
    • Descriptors shall be self-contained, in the sense that no other data are necessary for matching.
  – Compactness
    • Shall minimize the length/size of image descriptors.
  – Scalability
    • Shall allow adaptation of descriptor lengths to support the required performance level and database size.
    • Shall enable design of web-scale visual search applications and databases.
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS

• Requirements (cont'd)
  – Image format independence
    • Descriptors shall be independent of the image format.
  – Extraction complexity
    • Shall allow descriptor extraction with low complexity (in terms of memory and computation) to facilitate video-rate implementations.
  – Matching complexity
    • Shall allow matching of descriptors with low complexity (in terms of memory and computation).
    • If decoding of descriptors is required for matching, such decoding shall also be possible with low complexity.
  – Localization
    • Shall support visual search algorithms that identify and localize matching regions of the query image and the database image.
    • Shall support visual search algorithms that provide an estimate of a geometric transformation between matching regions of the query image and the database image.
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS
• Summarized timeline
that among several component technologies for image retrieval, such a standard should focus primarily on defining the format of descriptors and parts of their extraction process (such as interest point detectors) needed to ensure interoperability. Such descriptors must be compact, image-format independent, and sufficient for robust image matching. Hence, the title Compact Descriptors for Visual Search was coined as an interim name for this activity. Requirements and Evaluation Framework documents have subsequently been produced to formulate precise criteria and evaluation methodologies to be used in the selection of technology for this standard.

The Call for Proposals [17] was issued at the 96th MPEG meeting in Geneva, in March 2011, and responses are now expected by November 2011. Table 1 lists milestones to be reached in subsequent development of this standard.

It is envisioned that, when completed, this standard will:
• ensure interoperability of visual search applications and databases,
• enable a high level of performance of implementations conformant to the standard,
• simplify design of descriptor extraction and matching for visual search applications,
• enable hardware support for descriptor extraction and matching in mobile devices, and
• reduce load on wireless networks carrying visual search-related information.

To build full visual-search applications, this standard may be used jointly with other existing standards, such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch.
Conclusions and outlook

Recent years have witnessed remarkable technological progress, making mobile visual search possible today. Robust local image features achieve a high degree of invariance against scale changes and rotation, as well as changes in illumination and other photometric conditions. The BoW approach offers resiliency to partial occlusions and background clutter, and allows design of efficient indexing schemes. The use of compressed image features makes it possible to communicate query requests using only a fraction of the rate needed by JPEG, and further accelerates search by storing a cache of the visual database on the phone.

Nevertheless, many improvements are still possible and much needed. Existing image features are robust to much of the variability between query and database images, but not all. Improvements in complexity and compactness are also critically important for mobile visual-search systems. In mobile augmented-reality applications, annotations of the viewfinder content simply pop up without the user ever pressing a button. Such continuous annotations require video-rate processing on the mobile device. They may also require improvements in indexing structures, retrieval algorithms, and moving more retrieval-related operations to the phone.

Standardization of compact descriptors for visual search, such as the new initiative within MPEG, will undoubtedly provide a further boost to an already exciting area.
Table 1. Timeline for development of the MPEG standard for visual search.

When           Milestone                            Comments
March 2011     Call for Proposals is published      Registration deadline: 11 July 2011; proposals due: 21 November 2011
December 2011  Evaluation of proposals              None
February 2012  1st Working Draft                    First specification and test software model that can be used for subsequent improvements.
July 2012      Committee Draft                      Essentially complete and stabilized specification.
January 2013   Draft International Standard         Complete specification. Only minor editorial changes are allowed after DIS.
July 2013      Final Draft International Standard   Finalized specification, submitted for approval and publication as International Standard.
Girod et al. IEEE MultiMedia 2011
MPEG CDVS
• CDVS: evaluation framework
  – Experimental setup
    • Retrieval experiment: intended to assess performance of proposals in the context of an image retrieval system
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
MPEG CDVS
• CDVS: evaluation framework
  – Experimental setup
    • Pair-wise matching experiments: intended to assess performance of proposals in the context of an application that uses descriptors for the purpose of image matching.
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 -‐ July 2011, Torino, IT
[Diagram] Image A and Image B → Extract descriptors → Match → Check accuracy of search results against annotations → Report
MPEG CDVS
• For more info: – https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs
– http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm (Ad hoc groups)
Part IV
Examples and applications
Examples
• Academic
  – Stanford Product Search System
• Commercial
  – Google Goggles
  – Kooaba: Déjà Vu and Paperboy
  – SnapTell
  – oMoby (and the IQ Engines API)
  – pixlinQ
  – Moodstocks
Stanford Product Search (SPS) System
• Local feature based visual search system
• Client-server architecture
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
Stanford Product Search (SPS) System
• Key contributions:
  – Optimized feature extraction implementation
  – CHoG: a low bit-rate compact descriptor (provides up to 20× bit-rate savings over SIFT with comparable image retrieval performance)
– Inverted index compression to enable large-scale image retrieval on the server
– Fast geometric re-ranking
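A back-of-the-envelope check of that bit-rate saving (assumed round figures: a 128-dimensional SIFT descriptor at 8 bits per dimension is 1,024 bits, a CHoG descriptor is about 60 bits, and a query carries on the order of 500 features; the slide's "up to 20×" also reflects location-data coding, which is omitted here):

```python
def query_bits(num_features, bits_per_descriptor):
    """Total descriptor payload of one visual query, ignoring feature
    locations and any packaging overhead."""
    return num_features * bits_per_descriptor

sift_query = query_bits(500, 1024)   # uncompressed SIFT payload: 512,000 bits
chog_query = query_bits(500, 60)     # CHoG payload: 30,000 bits
saving = sift_query / chog_query     # roughly a 17x reduction
```

Over a slow cellular uplink, sending tens of kilobits instead of hundreds is what makes the Send Feature mode interactive.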
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
Stanford Product Search (SPS) System

• Two modes:
  – Send Image mode
  – Send Feature mode
Oge Marques Girod et al. IEEE MultiMedia 2011 Tsai et al. ACM MM 2010
including different distances, viewing angles, and lighting conditions, or in the presence of partial occlusions or motion blur.

Mobile image-based retrieval technologies

Most successful algorithms for image-based retrieval today use an approach that is referred to as bag of features (BoF) or bag of words (BoW) [1], [2]. The BoW idea is borrowed from text document retrieval. To find a particular text document, such as a webpage, it's sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the document. For images, robust local features that are characteristic of a particular image take the role of visual words. As with text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the
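The bag-of-words representation described above can be sketched as follows (a toy Python illustration with a hypothetical two-word vocabulary; in practice the vocabulary comes from clustering millions of training descriptors):

```python
from collections import Counter

def bow_histogram(descriptors, vocabulary):
    """Quantize each local descriptor to its nearest visual word and count
    occurrences. Feature positions are deliberately ignored, exactly as in
    text-style bag-of-words retrieval."""
    def nearest_word(desc):
        # index of the closest cluster center (visual word)
        return min(range(len(vocabulary)),
                   key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(desc, vocabulary[w])))
    return Counter(nearest_word(d) for d in descriptors)
```

Two images can then be compared by the similarity of their word histograms alone, which is what makes inverted-index lookups possible.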
Figure 2. Mobile visual search architectures. (a) The mobile device transmits the compressed image, while analysis of the image and retrieval are done entirely on a remote server. (b) The local image features (descriptors) are extracted on a mobile phone and then encoded and transmitted over the network. Such descriptors are then used by the server to perform the search. (c) The mobile device maintains a cache of the database and sends search requests to the remote server only if the object of interest is not found in this cache, further reducing the amount of data sent over the network.
Figure 1. A snapshot of an outdoor mobile visual-search system. The system augments the viewfinder with information about the objects it recognizes in the image taken with a phone camera.
Stanford Product Search System

• Performance evaluation
– Dataset: 1 million CD, DVD, and book cover images + 1,000 query images (500×500) with challenging photometric and geometric distortions
Oge Marques
IEEE SIGNAL PROCESSING MAGAZINE [71] JULY 2011
For the client, we use a Nokia 5800 mobile phone with a 300-MHz central processing unit (CPU). For the recognition server, we use a Linux server with a Xeon E5410 2.33-GHz CPU and 32 GB of RAM. We report results for both 3G and wireless local area network (WLAN) networks. For 3G, experiments are conducted in an AT&T 3G wireless network, averaged over several days, with a total of more than 5,000 transmissions at indoor locations where such an image-based retrieval system would typically be used.

We evaluate two different modes of operation. In send features mode, we process the query image on the phone and transmit compressed query features to the server. In send image mode, we transmit the query image to the server, and all operations are performed on the server.

We discuss results for the three key aspects that are critical for mobile visual search applications: retrieval accuracy, system latency, and power. A recurring theme throughout this section will be the benefits of performing feature extraction on the mobile device compared to performing all processing on a remote server.
RETRIEVAL ACCURACY
It is relatively easy to achieve high precision (low false positives) for mobile visual search applications: by requiring a minimum number of feature matches after RANSAC geometric verification (GV), we can avoid false positives entirely. We set this minimum to 12 matching features, which is high enough to eliminate false positives in practice. We define recall as the percentage of query images correctly retrieved. Our goal then is to maximize recall at a negligibly low false-positive rate.
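A minimal sketch of this accept rule, with synthetic inlier counts standing in for the output of RANSAC GV:

```python
MIN_INLIERS = 12  # matches required after RANSAC GV, per the text

def is_match(inlier_count):
    """Accept a retrieved image only if enough features survive geometric
    verification; this gate is what keeps precision high."""
    return inlier_count >= MIN_INLIERS

# Synthetic inlier counts for five queries whose true match IS in the database.
inliers = [40, 25, 9, 18, 13]
recall = sum(is_match(n) for n in inliers) / len(inliers)  # 4 of 5 pass: 0.8
```

Raising the threshold trades recall for precision; the point of the paper's choice is that 12 inliers already drives false positives to essentially zero.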
Figure 8 compares the recall at precision one for the three schemes: send features (CHoG), send features (SIFT), and send image (JPEG). For the JPEG scheme, the bit rate is varied by changing the compression quality. For the SIFT scheme, we extract the SIFT descriptors on the mobile device and transmit each descriptor uncompressed as 1,024 bits. For the CHoG scheme, we need to transmit only about 60 bits per descriptor across the network. For the SIFT and CHoG schemes, we sweep the recall-bit-rate curve by varying the number of descriptors transmitted. For a given budget of
[FIG7] Example image pairs from the data set. (a) A clean database picture is matched against (b) a real-world picture with various distortions.
[FIG8] Bit-rate comparisons of different schemes (recall (%) vs. bit rate (kB); curves for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data are an order of magnitude smaller compared to the JPEG images or uncompressed SIFT descriptors.
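The order-of-magnitude claim is easy to sanity-check with arithmetic; the 500-descriptor query budget below is an assumed example, not a number from the paper:

```python
n_descriptors = 500                 # assumed query feature budget

sift_bits = n_descriptors * 1024    # uncompressed SIFT: 1,024 bits each
chog_bits = n_descriptors * 60      # CHoG: about 60 bits each

sift_kb = sift_bits / 8 / 1024      # 62.5 kB, comparable to a JPEG query image
chog_kb = chog_bits / 8 / 1024      # about 3.7 kB, in line with the 3-4 kB
                                    # compressed-feature queries measured later

ratio = sift_kb / chog_kb           # roughly 17x: an order of magnitude
```

This is why "send features" only wins once the descriptors are compressed: uncompressed SIFT uploads cost about as much as sending the JPEG itself.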
Girod et al. IEEE MultiMedia 2011
Stanford Product Search System
• Performance evaluation – Recall vs. bit rate
Oge Marques Girod et al. IEEE MultiMedia 2011
by approximately a factor of two. Moreover, transmission of features allows yet another optimization: it's possible to use progressive transmission of image features, and let the server execute searches on a partial set of features, as they arrive.15 Once the server finds a result that has a sufficiently high matching score, it terminates the search and immediately sends the results back. The use of this optimization reduces system latency by another factor of two.

Overall, the SPS system demonstrates that, using the described array of technologies, mobile visual-search systems can achieve high recognition accuracy, scale to realistically large databases, and deliver search results in an acceptable time.
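The early-termination idea can be sketched as follows; the batching, score function, and threshold are all invented for illustration:

```python
def progressive_search(batches, score, threshold):
    """Run the search on a growing partial set of features as batches arrive
    over the network, and stop as soon as the match is confident enough."""
    received = []
    for batch in batches:
        received.extend(batch)
        confidence = score(received)
        if confidence >= threshold:            # early exit: reply immediately
            return confidence, len(received)
    return score(received), len(received)      # had to wait for everything

# Toy score: confidence grows with the number of features matched so far.
score = lambda feats: min(1.0, len(feats) / 50)

confidence, features_used = progressive_search(
    [list(range(20))] * 5, score, threshold=0.9)
# Here the search stops after 3 of 5 batches (60 of 100 features).
```

The latency win comes from overlapping transmission with search: the server never waits for features it turns out not to need.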
Emerging MPEG standard

As we have seen, key component technologies for mobile visual search already exist, and we can choose among several possible architectures to design such a system. We have shown these options at the beginning, in Figure 2. The architecture shown in Figure 2a is the easiest one to implement on a mobile phone, but it requires fast networks such as Wi-Fi to achieve good performance. The architecture shown in Figure 2b reduces network latency and allows fast response over today's 3G networks, but requires descriptors to be extracted on the phone. Many applications might be accelerated further by using a cache of the database on the phone, as exemplified by the architecture shown in Figure 2c.

However, this immediately raises the question of interoperability. How can we enable mobile visual search applications and databases across a broad range of devices and platforms, if the information is exchanged in the form of compressed visual descriptors rather than images? This question was initially posed during the Workshop on Mobile Visual Search, held at Stanford University in December 2009. This discussion led to a formal request by the US delegation to MPEG, suggesting that the potential interest in a standard for visual search applications be explored.16 As a result, an exploratory activity in MPEG was started, which produced a series of documents in the subsequent year describing applications, use cases, objectives, scope, and requirements for a future standard.17

As MPEG exploratory work progressed, it was recognized that the suite of existing MPEG technologies, such as MPEG-7 Visual, does not yet include tools for robust image-based retrieval and that a new standard should therefore be defined. It was further recognized
Figure 7. Comparison of different schemes with regard to classification accuracy and query size (classification accuracy (%) vs. query size (Kbytes); curves for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data is an order of magnitude smaller compared to JPEG images or uncompressed SIFT descriptors.
Figure 8. End-to-end latency for different schemes (response time in seconds, broken down into feature extraction, network transmission, and retrieval, for JPEG (3G), feature (3G), feature progressive (3G), JPEG (WLAN), and feature (WLAN)). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves an approximately fourfold reduction in system latency over a 3G network.
Industry and Standards
Stanford Product Search System
• Performance evaluation – Processing times
Oge Marques
features, the descriptors with the highest Hessian response are transmitted. The descriptors are transmitted in the order imposed by the LHC scheme discussed in "Location Histogram Coding."

First, we observe that a recall of 96% is achieved at the highest bit rate for challenging query images, even with a million images in the database. Second, we observe that the performance of the JPEG scheme deteriorates rapidly at low bit rates, because interest-point detection fails due to JPEG compression artifacts. Third, we note that transmitting uncompressed SIFT data is almost always more expensive than transmitting JPEG-compressed images. Finally, we observe that the amount of data for CHoG descriptors is an order of magnitude smaller than for JPEG images or SIFT descriptors at the same retrieval accuracy.

SYSTEM LATENCY
The system latency can be broken down into three components: processing delay on the client, transmission delay, and processing delay on the server.

CLIENT- AND SERVER-PROCESSING DELAY
We show the time taken for the different operations on the client and server in Table 3. The send features mode requires approximately 1 s for feature extraction on the client. However, this increase in client-processing time is more than compensated by the decrease in transmission latency compared to send image mode, as illustrated in Figures 9 and 10. On the server, using VT matching with a compressed inverted index, we can search through a million-image database in 100 ms. We perform GV on a short list of 50 candidates after fast geometric reranking of the top 500 candidate images. We can achieve <1 s server processing latency while maintaining high recall.
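At its core, the vocabulary-tree (VT) lookup reduces to an inverted index from visual words to database images; a minimal sketch with made-up word IDs and image names:

```python
from collections import defaultdict

def build_inverted_index(db):
    """Map each visual word to the set of images containing it."""
    index = defaultdict(set)
    for image_id, words in db.items():
        for w in words:
            index[w].add(image_id)
    return index

def vt_match(query_words, index):
    """Vote for database images that share visual words with the query,
    returning candidates best-first for geometric reranking and GV."""
    votes = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)

index = build_inverted_index({
    "cd_001":   {1, 5, 9},
    "cd_002":   {2, 5},
    "book_007": {3, 7, 9},
})
candidates = vt_match({5, 9, 3}, index)   # cd_001 and book_007 lead, 2 votes each
```

The real system adds TF-IDF weighting and index compression, but the reason a million-image search fits in 100 ms is visible here: only images sharing at least one visual word with the query are ever touched.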
TRANSMISSION DELAY
The transmission delay depends on the type of network used. In Figure 10, we observe that the data transmission time is insignificant for a WLAN network because of the high bandwidth available. However, the transmission time turns out to be a bottleneck for 3G networks. In Figure 9, we present the experimental results for sending data over a 3G wireless network. We vary query data sizes from that of typical compressed query features (3–4 kB) to typical JPEG query images (50 kB) to learn how query size affects transmission time. The communication time-out was set to 60 s. We conducted the experiment continuously for several days, at three different locations typical of where a user might use the visual search application.

The median and average transmission latencies of our experiments are shown in Figure 9. Sending the compressed query features typically takes 3–4 s. The time required to send the compressed query image is several times longer and varies significantly across locations. However, the transmission delay does not include the cases when the communication
[FIG9] Measured transmission latency (a) and time-out percentage (b) for transmitting queries of different size over a 3G network (transmission latency (s) and communication time-out (%) vs. query data size (kB); curves for Indoor (I), Indoor (II), and Outdoor, average and median). Indoor (I) is tested indoors with poor connectivity. Indoor (II) is tested indoors with good reception. Outdoor is tested outside of buildings.
[TABLE 3] PROCESSING TIMES.

Client-side operations                                     Time (s)
Image capture                                              1–2
Feature extraction and compression (send features mode)    1–1.5

Server-side operations                                     Time (ms)
Feature extraction (send image mode)                       100
VT matching                                                100
Fast geometric reranking (per image)                       0.46
GV (per image)                                             30
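Putting Table 3 together with the measured 3G transmission times gives a rough end-to-end budget for send features mode; the midpoint values below are assumptions chosen from the ranges quoted above:

```python
# Back-of-the-envelope latency for send features mode over 3G.
capture_s = 1.5          # image capture: 1-2 s (midpoint assumed)
extraction_s = 1.25      # on-phone extraction + compression: 1-1.5 s
transmit_3g_s = 3.5      # sending 3-4 kB of features over 3G: 3-4 s
server_s = 1.0           # server-side budget: under 1 s per the text

total_s = capture_s + extraction_s + transmit_3g_s + server_s  # 7.25 s
```

Even with features instead of a JPEG, network transmission remains the largest single term on 3G, which is exactly why the progressive-transmission optimization above is worth the complexity.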
Girod et al. IEEE MultiMedia 2011; Tsai et al. ACM MM 2010
Stanford Product Search System
• Performance evaluation – End-to-end latency
Oge Marques Girod et al. IEEE MultiMedia 2011
Examples of commercial MVS apps

• Google Goggles
– Android and iPhone
– Narrow-domain search and retrieval

Oge Marques http://www.google.com/mobile/goggles
SnapTell

• One of the earliest (ca. 2008) MVS apps for iPhone
– Eventually acquired by Amazon (A9)
• Proprietary technique ("highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)")

Oge Marques http://www.snaptell.com/technology/index.htm
oMoby (and the IQ Engines API)

– iPhone app

Oge Marques http://omoby.com/pages/screenshots.php
oMoby (and the IQ Engines API)

• The IQ Engines API: "vision as a service"

Oge Marques http://www.iqengines.com/applications.php
The IQEngines API demo app
• Screenshots
Oge Marques
The IQEngines API demo app
• XML-formatted response
Oge Marques
Kooaba: Déjà Vu and Paperboy
• “Image recognition in the cloud” platform
Oge Marques http://www.kooaba.com/en/home/developers
Kooaba: Déjà Vu and Paperboy

• Déjà Vu
– Enhanced digital memories / notes / journal
• Paperboy
– News sharing from printed media

Oge Marques http://www.kooaba.com/en/products/dejavu http://www.kooaba.com/en/products/paperboy
pixlinQ

• A "mobile visual search solution that enables you to link users to digital content whenever they take a mobile picture of your printed materials."
– Powered by image recognition from LTU technologies

Oge Marques http://www.pixlinq.com/home
pixlinQ
• Example app (La Redoute)
Oge Marques http://www.youtube.com/watch?v=qUZCFtc42Q4
Moodstocks

• Real-time mobile image recognition that works offline (!)
• API and SDK available

Oge Marques http://www.youtube.com/watch?v=tsxe23b12eU
Moodstocks
• Many successful apps for different platforms
Oge Marques http://www.moodstocks.com/gallery/
Concluding thoughts
Concluding thoughts
• Mobile Visual Search (MVS) is coming of age.
• This is not a fad and it can only grow.
• Still a good research topic
– Many relevant technical challenges
– MPEG efforts have just started
• Infinite creative commercial possibilities
Oge Marques
Side note
• The power of Twitter…
Oge Marques