Seminar Report on Augmented Reality With Visual Search


BUNDELKHAND INSTITUTE OF ENGINEERING & TECHNOLOGY JHANSI (U.P)

SESSION: 2012-2013
A Seminar Report On
Augmented Reality with Visual Search

.::: UNDER THE GUIDANCE OF:::.

HOD
Head of Department
INFORMATION TECHNOLOGY

.:: SUBMITTED BY::.

Name (Roll No.)
B.TECH - 3rd Year, 6th Semester
DEPARTMENT OF INFORMATION TECHNOLOGY
BUNDELKHAND INSTITUTE OF ENGINEERING & TECHNOLOGY, JHANSI (U.P)

CERTIFICATE

This is to certify that the seminar titled AUGMENTED REALITY WITH VISUAL SEARCH has been successfully delivered by name (B.Tech. 6th semester, Information Technology) in partial fulfillment of the B.Tech degree in Information Technology from Bundelkhand Institute of Engineering & Technology during the academic year 2012-13.

HEAD OF DEPARTMENT
Dr.
Department of Information Technology, B.I.E.T. Jhansi

ACKNOWLEDGEMENT

I feel great pleasure in expressing my deep sense of gratitude and heartiest respect to Dr. Yashpal Singh, H.O.D., Information Technology, Bundelkhand Institute of Engineering and Technology, Jhansi, for his persevering guidance and inspiration throughout the preparation of this seminar. I am also thankful to my teachers for their guidance and help.

I gratefully acknowledge the blessing, useful guidance and help that I have received.

Name (Roll No.)
B.Tech 3rd Year, Information Technology

ABSTRACT

Augmented reality is a direct or indirect view of a physical, real-world environment whose elements are augmented by computer-generated sensory input such as sound, video, graphics, or GPS data. With the help of advanced AR technology, information about the user's surrounding real world becomes interactive and can be digitally manipulated. The visual capture capability of mobile devices can be used to link the real world and the digital world. Mobile phones have evolved into powerful image and video processing devices, equipped with high-resolution cameras, color displays, and hardware-accelerated graphics. They are also equipped with GPS and connected to broadband wireless networks. All of this enables a new class of applications which use the camera phone to initiate search queries about objects in visual proximity to the user. Such applications can be used, e.g., for identifying products, comparison shopping, and finding information about movies, CDs, buildings, shops, real estate, print media, or artworks.

This system can be implemented with one of two architectures: an integrated system that runs solely on the phone, and a networked system that recognizes a submitted image on a server. Although the networked system offers a larger database capacity, we argue for the integrated system because it enables real-time recognition on the device with smooth user interaction. We have implemented a highly efficient feature extraction and matching algorithm targeting resource-constrained mobile devices. The advantage of the system is the complete integrated solution on the phone, including language-independent feature extraction and an efficient database lookup, which provides an instant response.

Project Glass is a research and development program to develop an augmented reality head-mounted display. It is a newly unveiled concept headgear that would superimpose graphics on your view of the world. It has a small transparent device just over the right eye which serves as a means of displaying information in an overlay manner.

TABLE OF CONTENTS

1. INTRODUCTION
2. AUGMENTED REALITY
   2.1 HISTORICAL OVERVIEW
3. AUGMENTED REALITY WITH VISUAL SEARCH
   3.1 CHALLENGING ISSUES
   3.2 POTENTIAL SOLUTIONS
4. IMAGE RECOGNITION FOR AUGMENTED REALITY
   4.1 IMAGE RETRIEVAL PIPELINE
5. FEATURE EXTRACTION
   5.1 INTEREST POINT DETECTION
   5.2 FEATURE DESCRIPTOR COMPUTATION
      5.2.1 CHoG: A LOW BITRATE DESCRIPTOR
      5.2.2 LOCATION HISTOGRAM CODING
6. FEATURE INDEXING AND MATCHING
   6.1 VOCABULARY TREE AND INVERTED INDEX
   6.2 INVERTED INDEX COMPRESSION
7. GEOMETRIC VERIFICATION
   7.1 FAST GEOMETRIC RE-RANKING
8. SYSTEM PERFORMANCE
   8.1 RETRIEVAL ACCURACY
   8.2 SYSTEM LATENCY
   8.3 TRANSMISSION DELAY
   8.4 END-TO-END LATENCY
   8.5 ENERGY CONSUMPTION
9. PROJECT GLASS
10. FUTURE CHALLENGES
11. CONCLUSION
12. REFERENCES

1. Introduction

As computers increase in power and decrease in size, new mobile, wearable, and pervasive computing applications are rapidly becoming feasible, providing people access to online resources always and everywhere. This new flexibility makes possible new kinds of applications that exploit the person's surrounding context. Augmented reality (AR) presents a particularly powerful user interface to context-aware computing environments. AR systems integrate virtual information into a person's physical environment so that he or she perceives that information as existing in the surroundings. Augmented reality systems with visual search provide this service without constraining the individual's whereabouts to a specially equipped area. Ideally, they work virtually anywhere, adding a palpable layer of information to any environment whenever desired. By doing so, they hold the potential to revolutionize the way in which information is presented to people. Computer-presented material is directly integrated with the real world surrounding the freely roaming person, who can interact with it to display related information, to pose and resolve queries, and to collaborate with other people. The world becomes the user interface.

Mobile phones have evolved into powerful image and video processing devices equipped with high-resolution cameras, color displays, and hardware-accelerated graphics. They are also increasingly equipped with a global positioning system and connected to broadband wireless networks. All of this enables a new class of applications, augmented reality with visual search, that use the camera phone to initiate search queries about objects in visual proximity to the user. Such applications can be used, e.g., for identifying products, comparison shopping, finding information about movies, compact disks (CDs), real estate, print media, or artworks. First deployments of such systems include Google Goggles, Nokia Point and Find, Kooaba, Ricoh iCandy, and Amazon Snaptell. Mobile image-retrieval applications pose a unique set of challenges. What part of the processing should be performed on the mobile client, and what part is better carried out at the server? On the one hand, transmitting a Joint Photographic Experts Group (JPEG) image could take a few seconds over a slow wireless link. On the other hand, extraction of salient image features is now possible on mobile devices within seconds. There are several possible client-server architectures:

1. The mobile client transmits a query image to the server. The image-retrieval algorithms run entirely on the server, including analysis of the query image.

2. The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as the query.

3. The mobile client downloads data from the server, and all image matching is performed on the device.

One could also imagine a hybrid of the approaches mentioned above. When the database is small, it can be stored on the phone and the image-retrieval algorithms can be run locally. When the database is large, it has to be placed on a remote server and the retrieval algorithms run remotely. In each case, the retrieval framework has to work within the stringent memory, computation, power, and bandwidth constraints of the mobile device. The size of the data transmitted over the network needs to be as small as possible to reduce network latency and improve the user experience. The server latency has to be low as we scale to large databases.

2. Augmented Reality

Augmented reality is related to the concept of virtual reality (VR). VR attempts to create an artificial world that a person can experience and explore interactively, predominantly through his or her sense of vision, but also via audio, tactile, and other forms of feedback. AR also brings about an interactive experience, but aims to supplement the real world rather than creating an entirely artificial environment. The physical objects in the individual's surroundings become the backdrop and target items for computer-generated annotations. Different researchers subscribe to narrower or wider definitions of exactly what constitutes AR. While the research community largely agrees on most of the elements of AR systems, helped along by the exchange and discussions at several international conferences in the field, there are still small differences in opinion and nomenclature.

We will define an AR system as one that combines real and computer-generated information in a real environment, interactively and in real time, and aligns virtual objects with physical ones. At the same time, AR is a subfield of the broader concept of mixed reality (MR) which also includes simulations predominantly taking place in the virtual domain and not in the real world. Mobile AR applies this concept in truly mobile settings; that is, away from the carefully conditioned environments of research laboratories and special-purpose work areas.

2.1 Historical Overview

While the term augmented reality was coined in the early 1990s, the first fully functional AR system dates back to the late 1960s, when Ivan Sutherland and colleagues (1968) built a mechanically tracked 3D see-through head-worn display, through which the wearer could see computer-generated information mixed with physical objects, such as signs on a laboratory wall. For the next few decades much research was done on getting computers to generate graphical information, and the emerging field of interactive computer graphics began to flourish. Photorealistic computer-generated images became an area of research in the late 1970s, and progress in tracking technology furthered the hopes of creating the ultimate simulation machine. The field of augmented reality began to emerge. It was not until the early 1990s, with research at the Boeing Corporation, that the notion of overlaying computer graphics on top of the real world received its current name. Caudell and Mizell (1992) worked at Boeing on simplifying the process of conveying wiring instructions for aircraft assembly to construction workers, and they referred to their proposed solution of overlaying computer-presented material on top of the real world as augmented reality. Even though this application was conceived with the goal of mobility in mind, true mobile graphical AR was out of reach for the available technology until a few years later.

Figure 2.1 Traditional AR restaurant guide. (a) User with MARS backpack, looking at a restaurant. (b) Annotated view of restaurant, imaged through the head-worn display.

3. Augmented Reality with Visual Search

Augmented reality with visual search is also known as mobile augmented reality. Revisiting our definition of AR, we can identify the components needed for a mobile augmented reality system (MARS).

Computational Platform - A computational platform that can generate and manage the virtual material to be layered on top of the physical environment, process the tracker information, and control the AR display(s).

Displays - Displays to present the virtual material in the context of the physical world. In the case of augmenting the visual sense, these can be head-worn displays, mobile hand-held displays, or displays integrated into the physical world.

Registration - Registration must also be addressed: aligning the virtual elements with the physical objects they annotate. For visual and auditory registration, this can be done by tracking the position and orientation of the user's head and relating that measurement to a model of the environment, and/or by making the computer see and potentially interpret the environment by means of cameras and computer vision.

Wearable Input and Interaction Technology - Wearable input and interaction technologies enable a mobile person to work with the augmented world (e.g., to make selections or access and visualize databases containing relevant material) and to further augment the world around them.

Wireless Networking - Wireless networking is needed to communicate with other people and computers while on the move. Dynamic and flexible mobile AR will rely on up-to-the-second information that cannot possibly be stored on the computing device before application run-time.

Data Storage and Access Technology - If a MARS is to provide information about a roaming individual's current environment, it needs to get the data about that environment from somewhere. Data repositories must provide information suited to the roaming individual's current context.

3.1 Challenging Issues in Augmented Reality with Visual Search

Mobile devices differ from general computing environments in several aspects. The design of a mobile image search system must take into account the following inherent challenges and limitations of mobile devices:

Low CPU Processing Power - Modern mobile embedded CPUs are designed with much more than pure speed in mind. Priority is often given to factors which address the requirements of a mobile operating environment, such as low heat dissipation, minimal power consumption, and small form factor. Although technologically advanced, mobile CPUs are still not fast enough to perform computationally intensive image-processing operations such as feature extraction. Graphics processing units (GPUs), which are built into most mobile devices, can help to speed up processing via parallel computing, but most feature extraction algorithms are designed to be executed sequentially and cannot fully utilize GPU capabilities.

Less Memory Capacity - Mobile devices have less memory capacity than desktop systems. Smartphones such as the top-tier Google Nexus One come with 512 MB of built-in RAM. While the Nexus One has one of the largest memory capacities currently available, memory limitations become a serious issue when extracting features for an image search. This is because feature extraction often requires large sets of intermediate data to be stored in memory, since analysis is performed sequentially. For example, SURF, a popular feature extraction algorithm, generates results by analyzing data in a lock-step fashion where data generated in previous stages is referenced by the current stage and by future stages as well. Furthermore, the total memory usage of each stage grows linearly with the size of the original image. For moderate- to high-resolution images, this process could easily exhaust memory resources.

Small Screen Size - Modern high-end smartphones boast displays which measure slightly less than four inches diagonally. However, this is still much smaller than a common desktop or laptop screen. Smaller screens greatly limit the amount of information that can be presented to a user at any given time. This creates a much greater requirement for an efficient, effective display of search results and also increases the need for higher search accuracy.

Limited Connectivity - Wi-Fi is a built-in feature of most mobile devices. However, Wi-Fi is still only available at sparse locations, even in most urban areas. For the majority of their network connectivity, mobile devices must rely on a combination of mobile broadband networks such as 3G, 3.5G, and 4G. These networks provide acceptable access speeds, but can become a design limitation when a large amount of data must be transferred in real time. Moreover, mobile broadband networks are limited in their availability outside of large cities.

To summarize, the hard constraints imposed on mobile device platforms distinguish them from conventional computing platforms and create new challenges for applications that work within their realm of limitations. However, despite their shortcomings, mobile devices possess inherent characteristics that have the potential to increase the accuracy and efficiency of image search.

3.2 Potential Solutions

Some of the challenges that face mobile image search can be addressed by applying solutions devised for problems in related areas.

CPU - Mobile systems-on-chip (SoCs) often come with embedded graphics processing unit (GPU) cores in addition to the CPU. GPUs allow large quantities of instructions to be executed in parallel. While originally intended for rendering 2D and 3D graphics, GPUs have been at the core of a branch of study known as general-purpose computation on graphics processing units (GPGPU). GPGPU technology extends the programmability of GPUs to enable non-graphics applications with high parallelizability to run more efficiently than on a CPU. In the context of mobile image search, where sequential feature extraction algorithms are often used, GPGPU technology can allow feature extraction algorithms to be broken up into smaller subtasks and executed in parallel. Efforts have been made to improve the parallelization of feature extraction in recent years; in one line of work, a number of stages in the SIFT algorithm were parallelized to run on consumer desktop GPUs, decreasing runtime by a factor of 10. To fully utilize the GPU, new feature extraction algorithms must be devised with the aim of being executed concurrently.

Memory - Conservative use of memory in feature extraction algorithms is another area in which mobile search benefits from other studies. In one reported effort, the SURF algorithm was ported to mobile phones for use in an augmented reality experiment. To limit memory usage, only the smaller of the original image and the integral image is kept in memory, and conversions from one to the other are performed as needed, resulting in a large reduction in memory usage. Other engineering approaches include scaling down the original image to a smaller resolution before performing feature extraction. Smaller images require much less memory to analyze, but at the cost of fewer detected features. Another approach is to keep the original size of the image but introduce an additional step in which the user crops the section of the image that contains the object of interest. This reduces the image dimensions while preserving the features that are most relevant to the search. A further proposed approach is to divide an image into smaller sub-images and analyze each sub-image sequentially before merging the results as a final step. This method can be used when an algorithm must produce large amounts of intermediate data during execution: after each sub-image is analyzed, its intermediate data can be freed and reused for processing the subsequent sub-image.

Screen/Interface - Touch screens provide an interface that allows users to express their intentions more freely and intuitively. However, the smaller screen size greatly limits the number of result images that can be displayed on the screen at any given time. Improvements in search accuracy can minimize the number of results that must be returned to the user before query-relevant content is produced. Another possibility is to perform post-search pruning on a set of search results based on attributes that can be computed on the server side: only the most relevant content is returned, by examining the search context and the user's interests. This makes efficient use of the limited screen space and enhances the search experience.

Network - Networking challenges in mobile image search can be overcome in several ways which address the different instances in which a mobile search application makes use of its network. First, there is the transmission of the extracted feature vectors. In this step, the features obtained from an image are sent to a search server, which compares them with stored features extracted from a large image database. This step is characterized by a large set of data that must be sent to the search server. A typical image of a landmark produces hundreds of SURF features, and each feature is expressed by a descriptor vector holding 64 floating-point numbers. By converting the floating-point numbers to bytes, the size of each feature vector is reduced, resulting in significantly fewer bytes transferred over the network (illustrated in the sketch that follows). The next major network usage is when image results are returned to the user. In this phase, the returned images must be transferred and displayed for the user to choose from. This challenge can be met by sending only the most relevant images back to the user. To improve the search relevance, a multimodal query scheme and a dynamic, post-search pruning method can be used. Moreover, pre-scaling images to produce small preview images can further reduce the payload size when transferring search results to the mobile device.
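As a rough illustration of the float-to-byte conversion mentioned above, the short Python sketch below quantizes 64-dimensional floating-point descriptors to single bytes before transmission; the descriptor count, the value range, and the use of NumPy are assumptions for illustration, not details taken from this report.

    import numpy as np

    def quantize_descriptors(descriptors, lo=-1.0, hi=1.0):
        # Map float descriptor entries in [lo, hi] to uint8 (assumed value range);
        # a 64-dim float32 descriptor shrinks from 256 bytes to 64 bytes.
        clipped = np.clip(descriptors, lo, hi)
        scaled = (clipped - lo) / (hi - lo) * 255.0
        return np.round(scaled).astype(np.uint8)

    # Example: 300 features of 64 float32 values each (typical SURF dimensionality)
    feats = np.random.uniform(-1.0, 1.0, size=(300, 64)).astype(np.float32)
    packed = quantize_descriptors(feats)
    print(feats.nbytes, "bytes as float32 ->", packed.nbytes, "bytes as uint8")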

4. Image Recognition for Augmented Reality

The most successful algorithms for content-based image retrieval use an approach that is referred to as bag of features (BoF) or bag of words (BoW). The BoW idea is borrowed from text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the text. For images, robust local features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of the retrieval pipeline. However, the variability of features extracted from different images of the same object makes the problem much more challenging.

4.1 Image Retrieval Pipeline

The typical image retrieval pipeline is as follows:

Figure 4.1: A Pipeline for image retrieval.

1. First, local features are extracted from the query image. The set of image features is used to assess the similarity between query and database images. For mobile applications, individual features must be robust against the geometric and photometric distortions encountered when the user takes the query photo from a different viewpoint and with different lighting compared to the corresponding database image.

2. Next, the query features are quantized. The partitioning into quantization cells is precomputed for the database, and each quantization cell is associated with a list of database images in which the quantized feature vector appears somewhere. This inverted file circumvents a pairwise comparison of each query feature vector with all the feature vectors in the database and is the key to very fast retrieval. Based on the number of features they have in common with the query image, a shortlist of potentially similar images is selected from the database.

3. Finally, a geometric verification (GV) step is applied to the most similar matches in the database. GV finds a coherent spatial pattern between features of the query image and the candidate database image to ensure that the match is plausible.
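The quantization and inverted-index lookup at the heart of this pipeline can be sketched in a few lines of Python on toy data; the flat codebook standing in for a vocabulary tree, the random 2-D descriptors, and the simple accumulation of IDF-weighted counts are illustrative simplifications, not the actual system described in the following sections.

    import numpy as np

    def quantize(descriptor, codebook):
        # Nearest visual word; a flat codebook stands in for the vocabulary tree
        return int(np.argmin(np.linalg.norm(codebook - descriptor, axis=1)))

    def score_database(query_descriptors, codebook, inverted_index, idf, num_images):
        # Inverted-index lookup: only images sharing visual words with the query get scored
        scores = np.zeros(num_images)
        for d in query_descriptors:
            word = quantize(d, codebook)
            for image_id, count in inverted_index.get(word, []):
                scores[image_id] += idf[word] * count
        return scores

    # Toy data: 4 database "images", each a bag of 2-D descriptors
    rng = np.random.default_rng(0)
    codebook = rng.random((8, 2))                        # 8 visual words
    database = [rng.random((20, 2)) for _ in range(4)]

    inverted_index, idf = {}, np.ones(len(codebook))     # flat IDF weights for simplicity
    for image_id, descs in enumerate(database):
        words, counts = np.unique([quantize(d, codebook) for d in descs], return_counts=True)
        for w, c in zip(words, counts):
            inverted_index.setdefault(int(w), []).append((image_id, int(c)))

    query = database[2] + 0.01 * rng.standard_normal((20, 2))   # noisy copy of image 2
    scores = score_database(query, codebook, inverted_index, idf, len(database))
    print("shortlist:", np.argsort(-scores))             # geometric verification would re-check these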

5. Feature Extraction

Feature extraction consists of the following steps.

5.1 Interest Point Detection

Feature extraction typically starts by finding salient interest points in the image. For robust image matching, we desire interest points to be repeatable under perspective transformations (or, at least, scale changes, rotation, and translation) and real-world lighting variations. An example of interest point detection is illustrated in Figure 5.1. To achieve scale invariance, interest points are typically computed at multiple scales using an image pyramid. To achieve rotation invariance, the patch around each interest point is canonically oriented in the direction of the dominant gradient. Illumination changes are compensated by normalizing the mean and standard deviation of the gray values within each patch. Numerous interest-point detectors have been proposed in the literature; some of them are:

Corner Detectors - Corners are among the first low-level features used for image analysis and, in particular, tracking. Building on Moravec's work, Harris and Stephens developed the algorithm that became known as the Harris Corner Detector. They derive a corner score from the second-order moment matrix of image gradients, which also forms the basis for the detectors proposed by Förstner (1994) and Shi and Tomasi (1994). Mikolajczyk and Schmid (2001) proposed an approach to make the Harris detector scale invariant. Other intensity-based corner detectors include the algorithms of Beaudet (1978), which uses the determinant of the Hessian matrix, and Kitchen and Rosenfeld (1982), which measures the change of direction in the local gradient field.

Blob Detectors - Instead of trying to detect corners, one may use local extrema of the responses of certain filters as interest points. In particular, many approaches aim at approximating the Laplacian of a Gaussian, which, given an appropriate normalization, provides a scale-invariant blob measure. Lowe (1999, 2004) proposed to select the local extrema of an image filtered with differences of Gaussians, which are separable and hence faster to compute than the Laplacian. The Fast Hessian detector (Bay et al. 2008) is based on efficient-to-compute approximations to the Hessian matrix at different scales. Agrawal et al. (2008) proposed to approximate the Laplacian even further, down to bi-level octagons and boxes; using slanted integral images, the result can be computed very efficiently despite a fine scale quantization.

SIFT - The original SIFT descriptor (Lowe 1999, 2004) was computed from the image intensities around interesting locations in the image domain, referred to as interest points or, alternatively, key points. These interest points are obtained from scale-space extrema of differences of Gaussians (DoG) within a difference-of-Gaussians pyramid, as originally proposed by Burt and Adelson (1983) and by Crowley and Stern (1984). A Gaussian pyramid is constructed from the input image by repeated smoothing and subsampling, and a difference-of-Gaussians pyramid is computed from the differences between adjacent levels in the Gaussian pyramid. Interest points are then obtained from the points at which the difference-of-Gaussians values attain extrema with respect to both the spatial coordinates in the image domain and the scale level in the pyramid.

Figure 5.1: Interest Point Detection
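As a minimal example of DoG-based interest point detection, the following Python snippet uses OpenCV's SIFT implementation as a stand-in for the detectors surveyed above; the library choice and the file name are assumptions, since the report does not prescribe a particular toolkit.

    import cv2

    # Load a query photo in grayscale (the file name is illustrative only)
    image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

    # DoG scale-space extrema + canonical orientation, as in Lowe's SIFT
    sift = cv2.SIFT_create(nfeatures=500)
    keypoints, descriptors = sift.detectAndCompute(image, None)

    for kp in keypoints[:5]:
        # Each interest point carries a sub-pixel location, a scale and an orientation
        print("x=%.1f y=%.1f scale=%.2f angle=%.1f" % (kp.pt[0], kp.pt[1], kp.size, kp.angle))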

5.2 Feature Descriptor Computation

After interest point detection, we compute a visual word descriptor on the normalized patch. We would like descriptors to be robust to small distortions in scale, orientation, and lighting conditions. We also require descriptors to be discriminative, i.e., characteristic of an image or a small set of images. Descriptors that occur in almost every image (the equivalent of the word "and" in text documents) would not be useful for retrieval. Since Lowe's paper in 1999, the highly discriminative SIFT descriptor has remained the most popular descriptor in computer vision. Other examples of feature descriptors are the Gradient Location and Orientation Histogram (GLOH) by Mikolajczyk and Schmid, Speeded Up Robust Features (SURF) by Bay et al., and our own Compressed Histogram of Gradients (CHoG); Winder and Brown, and Mikolajczyk et al., evaluate the performance of different descriptors. As a 128-dimensional descriptor, SIFT is conventionally stored as 1024 bits (8 bits/dimension). Alas, the size of the SIFT descriptor data from an image is typically larger than the size of the JPEG-compressed image itself. Several compression schemes have been proposed to reduce the bit rate of SIFT descriptors. In our recent work, we survey different SIFT compression schemes. They can be broadly categorized into schemes based on hashing, transform coding, and vector quantization. We note that hashing schemes like Locality Sensitive Hashing (LSH), Similarity Sensitive Coding (SSC), or Spectral Hashing (SH) do not perform well at low bitrates. Conventional transform coding schemes based on Principal Component Analysis (PCA) do not work well due to the highly non-Gaussian statistics of the SIFT descriptor. Vector quantization schemes based on the Product Quantizer or a Tree-Structured Vector Quantizer are complex and require storage of large codebooks on the mobile device. We came to realize that simply compressing an off-the-shelf descriptor does not lead to the best rate-constrained image retrieval performance. One can do better by designing a descriptor with compression in mind. Of course, such a descriptor still has to be robust and highly discriminative. Ideally, it would permit descriptor comparisons in the compressed domain for speedy feature matching. To meet all these requirements simultaneously, we designed the Compressed Histogram of Gradients (CHoG) descriptor. The CHoG descriptor is designed to work well at low bitrates. CHoG achieves the performance of 1024-bit SIFT at less than 60 bits/descriptor. Since CHoG descriptor data are an order of magnitude smaller than SIFT or JPEG-compressed images, they can be transmitted much faster over slow wireless links. A small descriptor also helps if the database is stored on the mobile device: the smaller the descriptor, the more features can be stored in limited memory.

Figure 5.2: Feature Descriptor Computation

5.2.1 CHoG: A Low Bitrate Descriptor

CHoG builds upon the principles of HoG descriptors with the goal of being highly discriminative at low bitrates. CHoG descriptors are computed as follows. The patch is divided into spatial bins, which provides robustness to interest point localization error. We divide the patch around each interest point into soft log-polar spatial bins using DAISY configurations. The log-polar configuration is more effective than the square grid configuration used in SIFT. The joint (dx, dy) gradient histogram in each spatial bin is captured directly into the descriptor. CHoG histogram binning exploits the skew in gradient statistics that is observed for patches extracted around interest points. CHoG retains the information in each spatial bin as a distribution. This allows the use of more effective distance measures like KL divergence and, more importantly, allows us to apply quantization and compression schemes that work well for distributions to produce compact descriptors. Typically, 9 to 13 spatial bins and 3 to 9 gradient bins are chosen, resulting in 27- to 117-dimensional descriptors. For compressing the descriptor, we quantize the gradient histogram in each spatial bin individually.

Figure 5.3: The joint (dx, dy) gradient distribution (a) over a large number of cells and (b) its contour plot. The greater variance in the y axis results from aligning the patches along the most dominant gradient after interest-point detection. The quantization bin constellations (c) VQ-3, (d) VQ-5, (e) VQ-7, and (f) VQ-9 and their associated Voronoi cells are also shown.
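The core idea of a gradient-histogram descriptor can be sketched as follows in Python/NumPy. This is a simplified illustration only: it uses a square grid of spatial bins and uniform (dx, dy) histogram bins, whereas the actual CHoG descriptor uses soft log-polar DAISY-style spatial bins, trained gradient-bin constellations, and a subsequent quantization stage.

    import numpy as np

    def joint_gradient_descriptor(patch, grid=3, gradient_bins=3):
        # Toy descriptor: a joint (dx, dy) gradient histogram per spatial bin of the patch,
        # each kept as a normalized distribution and concatenated (3x3 grid with 3x3
        # gradient bins -> 81 dimensions; CHoG instead uses soft log-polar bins).
        dy, dx = np.gradient(patch.astype(float))
        h, w = patch.shape
        parts = []
        for i in range(grid):
            for j in range(grid):
                ys = slice(i * h // grid, (i + 1) * h // grid)
                xs = slice(j * w // grid, (j + 1) * w // grid)
                hist, _, _ = np.histogram2d(dx[ys, xs].ravel(), dy[ys, xs].ravel(),
                                            bins=gradient_bins, range=[[-1, 1], [-1, 1]])
                parts.append(hist.ravel() / max(hist.sum(), 1.0))
        return np.concatenate(parts)

    patch = np.random.rand(32, 32)                  # stand-in for a canonically oriented patch
    print(joint_gradient_descriptor(patch).shape)   # (81,)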

Each interest point has a location, scale, and orientation associated with it. Interest point locations are needed in the geometric verification step to validate potential candidate matches. The location of each interest point is typically stored as two numbers: the x and y coordinates in the image at sub-pixel accuracy. In a floating-point representation, each feature location would require 64 bits, 32 bits each for x and y. This is comparable in size to the CHoG descriptor itself. We have developed a novel histogram coding scheme to encode the x, y coordinates of feature descriptors. With location histogram coding, we can reduce the location data by an order of magnitude compared to the floating-point representation, without loss in matching accuracy.

5.2.2 Location Histogram Coding

Location histogram coding is used to compress feature location data efficiently. We note that the interest points in images are spatially clustered. To encode their locations, we first generate a 2-D histogram from the locations of the descriptors. Location histogram coding provides two key benefits. First, encoding the locations of a set of N features as a histogram reduces the bit rate by approximately log2(N!) bits compared to encoding each feature location in sequence. This gain arises because the ordering information (N! unique orderings) is discarded when a histogram is computed. Second, we exploit the spatial correlation between the locations of different descriptors. We divide the image into spatial bins and count the number of features within each spatial bin. We compress the binary map indicating which spatial bins contain features, and a sequence of feature counts representing the number of features in the occupied bins. We encode the binary map using a trained context-based arithmetic coder, with neighboring bins being used as the context for each spatial bin. Using location histogram coding, we can transmit each location with about 5 bits/descriptor with little loss in matching accuracy - a 12.5× reduction in data.

Figure 5.4: Location Histogram Coding
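A small NumPy sketch of the histogram side of this scheme is shown below: feature (x, y) locations are reduced to a binary occupancy map over spatial bins plus counts for the occupied bins. The block size of 8 pixels is an assumption, and the context-based arithmetic coding stage is omitted.

    import numpy as np

    def location_histogram(xy, image_w, image_h, block=8):
        # Summarize feature locations as a binary occupancy map over spatial bins plus the
        # counts of features in the occupied bins; the ordering of individual features is
        # discarded, which is where the ~log2(N!) bit saving comes from. Entropy coding of
        # the map and counts (context-based arithmetic coding in the report) is not shown.
        cols = int(np.ceil(image_w / block))
        rows = int(np.ceil(image_h / block))
        counts_2d = np.zeros((rows, cols), dtype=np.int32)
        bx = np.clip((xy[:, 0] // block).astype(int), 0, cols - 1)
        by = np.clip((xy[:, 1] // block).astype(int), 0, rows - 1)
        np.add.at(counts_2d, (by, bx), 1)
        occupancy = counts_2d > 0
        return occupancy, counts_2d[occupancy]   # counts of occupied bins, in scan order

    xy = np.random.rand(300, 2) * [640, 480]     # 300 feature locations in a 640x480 image
    occupancy, counts = location_histogram(xy, 640, 480)
    print(occupancy.shape, int(occupancy.sum()), "occupied bins,", int(counts.sum()), "features")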

6. Feature Indexing and Matching

For a large database of images, comparing the query image against every database image using pairwise feature matching is infeasible. A database with millions of images might contain billions of features. A linear scan through the database would be too time-consuming for interactive mobile visual search applications. Instead, we must use a data structure that can quickly return a shortlist of the database candidates most likely to match the query image. The shortlist may contain false positives, as long as the correct match is included. Slower pairwise comparisons can subsequently be performed on just the shortlist of candidates rather than the entire database. Many data structures have been proposed for efficiently indexing all the local features in a large image database. Lowe proposes approximate nearest neighbor (ANN) search of SIFT descriptors with a best-bin-first strategy. One of the most popular methods is Sivic and Zisserman's Bag-of-Features (BoF) approach. The BoF codebook is trained by k-means clustering of many training descriptors. During a query, scoring the database images can be made fast by using an inverted file index associated with the BoF codebook. To generate a much larger codebook, Nister and Stewenius utilize hierarchical k-means clustering to create a Vocabulary Tree (VT). Alternatively, Philbin et al. use randomized k-d trees to partition the feature descriptor space. Subsequent improvements in tree-based quantization and ANN search include greedy N-best paths, query expansion, efficient updates over time, soft binning, and Hamming embedding. As database size increases, the amount of memory used to index the database features can become very large. Thus, developing a memory-efficient indexing structure is a problem of increasing interest. Chum et al. use a set of compact min-hashes to perform near-duplicate image retrieval. Zhang et al. decompose each image's set of features into a coarse signature and a refinement signature. The refinement signature is subsequently indexed by a locality sensitive hash (LSH). To support the popular VT scoring framework, we have developed inverted index compression methods for both hard-binned and soft-binned VTs, as explained in Section 6.2. The memory for BoF image signatures can alternatively be reduced using the mini-BoF approach. Very recently, visual word residuals on a small BoF codebook have shown promising retrieval results with low memory usage. The residuals are indexed either with PCA and product quantizers or with LSH.

6.1 Vocabulary Tree and Inverted Index

A Vocabulary Tree (VT) with an inverted index can be used to quickly compare a query image against a large database. If the VT has L levels excluding the root node and each interior node has C children, then a fully balanced VT contains K = C^L leaf nodes. Figure 6.1 shows a VT with L = 2, C = 3, and K = 9. The VT for a particular database is constructed by performing hierarchical k-means clustering on a set of training feature descriptors representative of the database. Initially, C large clusters are generated from all the training descriptors by ordinary k-means with an appropriate distance function such as the L2 norm or symmetric KL divergence. Then, for each large cluster, k-means clustering is applied to the training descriptors assigned to that cluster, to generate C smaller clusters.
This recursive division of the descriptor space is repeated until there are enough bins to ensure good classification performance. Typically, L = 6 and C = 10 are selected, in which case the VT has K = 10^6 leaf nodes.

Figure 6.1: (a) Construction of a Vocabulary Tree by hierarchical k-means clustering of training feature descriptors. (b) Vocabulary Tree and the associated inverted index.

The inverted index associated with the VT maintains two lists per leaf node. For node k, there is a sorted array of image IDs {ik1, ik2, ..., ikNk} indicating which Nk database images have visited that node. Similarly, there is a corresponding array of counts {ck1, ck2, ..., ckNk} indicating the frequency of visits. During a query, a database of N total images can be quickly scored by traversing only the nodes visited by the query descriptors. Let s(i) be the similarity score for the ith database image. Initially, prior to visiting any node, s(i) is set to 0. Suppose node k is visited by the query descriptors a total of qk times. Then, all the images in the inverted list {ik1, ..., ikNk} for node k have their scores incremented according to

    s(ikj) <- s(ikj) + (wk * qk / Nq) * (wk * ckj / Nikj),

where wk is an inverse document frequency (IDF) weight used to penalize often-visited nodes, Nikj is a normalization factor for database image ikj, and Nq is a normalization factor for the query image. Scores for images at the other nodes visited by the query image are updated similarly. The database images attaining the highest scores s(i) are judged to be the best matching candidates and are kept in a shortlist for further verification. Soft binning can be used to mitigate the effect of quantization errors for a large VT, since some descriptors lie very close to the boundary between two bins. When soft binning is employed, the visit counts are no longer integers but rather fractional values: for each feature descriptor, the m nearest leaf nodes in the VT are assigned fractional counts of the form

    ci = exp(-di^2 / (2*sigma^2)) / sum_{j=1..m} exp(-dj^2 / (2*sigma^2)),

where di is the distance between the ith closest leaf node and the feature descriptor, and sigma is appropriately chosen to maximize classification accuracy.

6.2 Inverted Index Compression

For a database containing one million images and a VT that uses soft binning, each image ID can be stored as a 32-bit unsigned integer and each fractional count as a 32-bit float in the inverted index. The memory usage of the entire inverted index is then sum_{k=1..K} Nk * 64 bits, where Nk is the length of the inverted list at the kth leaf node. For a database of one million product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index's memory usage exceeds the server's available random access memory (RAM), swapping between main and virtual memory occurs, which significantly slows down all processes.

Figure 6.2: (a) Memory usage for the inverted index with and without compression. A 5× saving in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression.

A compressed inverted index can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs {ik1, ik2, ..., ikNk} is sorted, it is more efficient to store the consecutive ID differences dk1 = ik1, dk2 = ik2 - ik1, ..., dkNk = ikNk - ik(Nk-1) in place of the IDs. This practice is also commonly used in text retrieval.
Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate-efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC). Since keeping the decoding delay low is very important for interactive mobile visual search applications, a scheme that allows ultra-fast decoding is often preferred over AC. The carryover code [50] and the recursive bottom-up complete (RBUC) code have been shown to be at least 10× faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speed-ups by enforcing word-aligned memory accesses. Figure 6.2(a) compares the memory usage of the inverted index with and without compression, using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This 5× reduction leads to a substantial speed-up in server-side processing, as shown in Figure 6.2(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
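The first two compression steps can be sketched for a single inverted list as follows; the representative count levels are illustrative stand-ins for trained Lloyd-Max values, and the final variable-length (e.g., RBUC) coding stage is omitted.

    import numpy as np

    def compress_inverted_list(image_ids, counts, levels=(0.25, 0.5, 1.0)):
        # Delta-encode the sorted image IDs (d1 = i1, dk = ik - ik-1) and quantize the
        # fractional soft-binning counts to a few representative levels (illustrative
        # stand-ins for trained Lloyd-Max values). The small integers produced here are
        # what a variable-length code such as RBUC would then pack; that stage is omitted.
        deltas = np.diff(image_ids, prepend=0)
        levels = np.asarray(levels)
        count_indices = np.argmin(np.abs(counts[:, None] - levels[None, :]), axis=1)
        return deltas, count_indices

    ids = np.array([12, 57, 58, 400, 1023])          # sorted IDs for one leaf node
    cnts = np.array([1.0, 0.3, 0.6, 1.0, 0.2])       # fractional visit counts
    deltas, qcounts = compress_inverted_list(ids, cnts)
    print(deltas)     # [  12   45    1  342  623] -- small values, cheap to entropy-code
    print(qcounts)    # [2 0 1 2 0] -- indices into the representative levels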

7. Geometric Verification

Geometric Verification (GV) typically follows the feature matching step. In this stage, we use the location information of query and database features to confirm that the feature matches are consistent with a change in viewpoint between the two images. We perform pairwise matching of feature descriptors and evaluate the geometric consistency of the correspondences, as shown in Figure 7.1. The geometric transform between query and database image is estimated using robust regression techniques like RANSAC [52] or the Hough transform [13]. The transformation can be represented by the fundamental matrix, which incorporates 3-D geometry, or by simpler homography or affine models. Geometric verification tends to be computationally expensive, which limits the list of candidate images to a small number.

Figure 7.1: Geometric Verification

A number of groups have investigated different ways to speed up the GV process. Chum et al. investigate how to optimize steps to speed up RANSAC. Jegou et al. use weak geometric consistency checks based on feature orientation information. Some authors have also proposed to incorporate geometric information into the VT matching step.

Figure 7.2: An image retrieval pipeline can be greatly sped up by incorporating a geometric re-ranking stage.

To speed up geometric verification, one can add a geometric re-ranking step before the RANSAC GV step, as illustrated in Figure 7.2. We propose a re-ranking step that incorporates geometric information directly into the fast index lookup stage and use it to re-order the list of top matching images. The main advantage of the scheme is that it only requires x, y feature location data and does not use scale or orientation information. As scale and orientation data are not used, they need not be transmitted by the client, which reduces the amount of data transferred. We typically run fast geometric re-ranking on a large set of candidate database images and thereby reduce the list of images on which we run RANSAC.

7.1 Fast Geometric Re-ranking

We have proposed a fast geometric re-ranking algorithm that uses the x, y locations of features to re-rank a shortlist of candidate images. First, we generate a set of potential feature matches between each query and database image based on the VT quantization results. After generating a set of feature correspondences, we calculate a geometric score between them. The process used to compute the geometric similarity score is illustrated in Figure 7.3. The location geometric score is computed as follows: (a) features of two images are matched based on VT quantization, (b) distances between pairs of features within an image are calculated, (c) log distance ratios of the corresponding pairs (denoted by color) are calculated, and (d) a histogram of the log distance ratios is computed. The maximum value of the histogram is the geometric similarity score. A peak in the histogram indicates a similarity transform between the query and database image. We find the distance between two features in the query image and the distance between the corresponding matching features in the database image. The ratio of the distances corresponds to the scale difference between the two images. We repeat the ratio calculation for features in the query image that have matching database features.
If there exists a consistent set of ratios (as indicated by a peak in the histogram of distance ratios), it is more likely that the query image and the database image match. The geometric re-ranking is fast because we use the vocabulary tree quantization results directly to find potential feature matches and use a very simple similarity scoring scheme. The time required to calculate a geometric similarity score is 1-2 orders of magnitude less than that of RANSAC.
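A compact NumPy sketch of this location-only score is given below; the histogram bin count and log-ratio range are assumptions, and the feature correspondences are assumed to have been produced already by vocabulary tree quantization.

    import numpy as np

    def geometric_rerank_score(query_xy, db_xy, bins=20, log_range=(-2.0, 2.0)):
        # query_xy, db_xy: (N, 2) arrays; row i of each is one feature correspondence
        # obtained from vocabulary tree quantization. A consistent scale change between
        # the two images makes all log distance ratios fall into the same histogram bin.
        n = len(query_xy)
        log_ratios = []
        for a in range(n):
            for b in range(a + 1, n):
                dq = np.linalg.norm(query_xy[a] - query_xy[b])
                dd = np.linalg.norm(db_xy[a] - db_xy[b])
                if dq > 0 and dd > 0:
                    log_ratios.append(np.log(dq / dd))
        hist, _ = np.histogram(log_ratios, bins=bins, range=log_range)
        return int(hist.max())       # peak height = geometric similarity score

    # Toy check: the database image is a 2x scaled, shifted copy of the query layout
    query = np.random.rand(30, 2) * 100
    database = 2.0 * query + 5.0
    print(geometric_rerank_score(query, database))   # high peak -> likely match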

8. System Performance

What performance can we expect from a mobile visual search system that incorporates all the ideas discussed so far? To answer this question, we take a closer look at the experimental Stanford Product Search system. For evaluation, we use a database of one million CD, DVD, and book cover images, and a set of 1000 query images (500×500 pixel resolution) exhibiting challenging photometric and geometric distortions. For the client, we use a Nokia 5800 mobile phone with a 300 MHz CPU. For the recognition server, we use a Linux server with a Xeon E5410 2.33 GHz CPU and 32 GB of RAM. We report results for both 3G and WLAN networks. For 3G, experiments are conducted in an AT&T 3G wireless network, averaged over several days, with a total of more than 5000 transmissions at indoor locations where such an image-based retrieval system would typically be used. We evaluate two different modes of operation. In Send Features mode, we process the query image on the phone and transmit compressed query features to the server. In Send Image mode, we transmit the query image to the server and all operations are performed on the server. We discuss results for three key aspects that are critical for mobile visual search applications: retrieval accuracy, system latency, and power. A recurring theme throughout this section is the benefit of performing feature extraction on the mobile device compared to performing all processing on a remote server.

8.1 Retrieval Accuracy

It is relatively easy to achieve high precision (low false positives) for mobile visual search applications. By requiring a minimum number of feature matches after RANSAC geometric verification, we can avoid false positives entirely. We define Recall as the percentage of query images correctly retrieved; our goal is then to maximize Recall at a negligibly low false positive rate. We compare three schemes: Send Features (CHoG), Send Features (SIFT), and Send Image (JPEG). For the JPEG scheme, the bitrate is varied by changing the quality of compression. For the SIFT scheme, we extract SIFT descriptors on the mobile device and transmit each descriptor uncompressed as 1024 bits. For the CHoG scheme, we need to transmit only about 60 bits per descriptor across the network. For the SIFT and CHoG schemes, we sweep the Recall-bitrate curve by varying the number of descriptors transmitted. First, we observe that a Recall of 96% is achieved at the highest bitrate for challenging query images, even with a million images in the database. Second, we observe that the performance of the JPEG scheme rapidly deteriorates at low bitrates, as interest point detection fails due to JPEG compression artifacts. Third, we note that transmitting uncompressed SIFT data is almost always more expensive than transmitting JPEG-compressed images. Finally, we observe that the amount of data for CHoG descriptors is an order of magnitude smaller than for JPEG images or SIFT descriptors at the same retrieval accuracy.

Figure 8.1: Bitrate comparisons of different schemes. CHoG descriptor data are an order of magnitude smaller than JPEG images or uncompressed SIFT descriptors.
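To make the bitrate gap concrete, the sketch below estimates query payload sizes and 3G transmission times from the per-descriptor figures quoted above; the descriptor count per query, the JPEG file size, and the uplink throughput are assumed values for illustration only, not measurements from the Stanford system.

    # Rough query payload and 3G transmission-time estimates (illustrative assumptions:
    # 500 descriptors per query image, a ~50 KB JPEG, and a ~100 kbps 3G uplink).
    NUM_DESCRIPTORS = 500
    SIFT_BITS_PER_DESC = 1024     # uncompressed SIFT, as quoted above
    CHOG_BITS_PER_DESC = 60       # compressed CHoG, as quoted above
    JPEG_BYTES = 50 * 1024        # assumed JPEG query image size
    UPLINK_KBPS = 100             # assumed 3G uplink throughput

    schemes = {
        "Send Image (JPEG)":    JPEG_BYTES * 8,
        "Send Features (SIFT)": NUM_DESCRIPTORS * SIFT_BITS_PER_DESC,
        "Send Features (CHoG)": NUM_DESCRIPTORS * CHOG_BITS_PER_DESC,
    }
    for name, bits in schemes.items():
        seconds = bits / (UPLINK_KBPS * 1000.0)
        print("%-22s %6.1f KB  ~%4.1f s over 3G" % (name, bits / 8 / 1024.0, seconds))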

8.2 System Latency

The system latency can be broken down into three components: processing delay on the client, transmission delay, and processing delay on the server.

Client and Server Processing Delay - Table II shows the time required for the different operations on the client and server. The Send Features mode requires about 1 second for feature extraction on the client. However, this increase in client processing time is more than compensated by the decrease in transmission latency compared to Send Image. On the server, using VT matching with a compressed inverted index, we can search through a million-image database in about 100 milliseconds. We perform GV on a shortlist of 10 candidates after fast geometric re-ranking of the top 500 candidate images. We can achieve