Image based location recognition using foreground features


Computer vision project: location recognition using foreground features

Andres Perez <[email protected]>

August 13, 2013

1 Problem formulation

This project deals with the problem of location recognition: given an image depicting a location in an urban environment, it is compared against a reference database of images following a content-based image retrieval approach, where the scene location is determined from the retrieval results. The basic system addressing this problem is shown in figure 1. It consists of a content analysis part (feature extraction and selection over the image database), a context analysis part (bringing additional information, such as coarse location readings, into the recognition process), and a classification module (using offline-trained classifiers on the detected features to classify unknown images online). Here the context information is disregarded.

Figure 1: Location recognition system overview

The project aims at reproducing one of the pipeline variants presented in [1], where the features corresponding to the surroundings (or the background) are discarded, to test whether they help recognition or are just a distraction. The reference baseline implementation involves:

• SIFT as detector/descriptor.

• A vocabulary tree for storing the image database descriptors, together with an inverted file system for bag-of-words indexing.


• RANSAC for the estimation of an affine transformation between query and database images as the geometric verification algorithm for re-ranking the candidate list.

The following questions pose the problem: which factors influence the recognition performance? How convenient is applying geometric verification? And, do features from the surroundings help recognition?

2 Solution approach

From the theoretical side, the solution approach is settled with the definition of the pipeline variant shown in figure 2. I focus on the foreground feature selection scheme and the geometric verification step, on which we deviate from the authors' proposal and go for a more robust model, as will be further explained in section 2.4.

Figure 2: Location recognition pipeline

The solution approach reduces to dealing with two issues:

1. Choose and implement a feature selection scheme: I chose a scheme based on the spatial distribution of features; using detected vanishing points it roughly distinguishes the main planes, and hence background and foreground features. The algorithm is further detailed in section 2.2.

2. Software component selection and overall system design: see section 4.1 for details.

In the next section we go through each of the pipeline steps and spend a few words explaining how they were addressed.

2.1 Features extraction

Local features are computed using the OpenCV implementation of the SIFT detector/descriptor pair. Another possibility was to use the keypoint data bundled with the dataset, but it was not suitable since it doesn't provide scale and orientation information1. The computed keypoints for each image are saved in a separate plain text file using the following format:

1Keypoints were detected using an affine detector, hence no single scale factor exists, while the orientation information is simply missing.


• It starts with two integers giving the total number of keypoints and the size of the descriptor vector for each keypoint (by default 128).

• Then each keypoint is specified by four floating point numbers giving the subpixel row and column location, scale, and orientation (in radians from −π to π).

• Then the descriptor vector for each keypoint is given as a list of integers in the range [0, 255].
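As a minimal sketch, writing this format could look as follows; the `Keypoint` struct and the `writeKeypoints` name are illustrative assumptions, not the project's actual code:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical keypoint record matching the plain-text format described above.
struct Keypoint {
    float row, col, scale, orientation;  // orientation in radians, in [-pi, pi]
    std::vector<int> descriptor;         // descriptor values in [0, 255]
};

// Write keypoints in the described format: a header line with the keypoint
// count and descriptor size, then per keypoint one location/scale/orientation
// line followed by the descriptor values.
void writeKeypoints(FILE* f, const std::vector<Keypoint>& kps, int descSize = 128) {
    std::fprintf(f, "%zu %d\n", kps.size(), descSize);
    for (const Keypoint& kp : kps) {
        std::fprintf(f, "%f %f %f %f\n", kp.row, kp.col, kp.scale, kp.orientation);
        for (int v : kp.descriptor) std::fprintf(f, "%d ", v);
        std::fprintf(f, "\n");
    }
}
```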

2.2 Feature selection

Accounting for background feature removal, or more generally feature selection, the authors of the reference baseline implementation mention the use of geometric models without further clues. This step is an important one, as the aim of this project was to evaluate the recognition performance when using only foreground features. My approach, summarized in algorithm 1, is simple and its results are quite rough; it is grounded on the identification of areas which are more likely to be part of a building facade.

Algorithm 1 Automatic detection of features belonging to a building

(i) Find long and straight lines on the image

(ii) Estimate sets of vanishing points, and all the lines complying with them, using the found lines; results are shown in figure 3

(iii) For each inlier set compute the variance of the orientation of the lines belonging to it and choose the set with the lowest variance

(iv) Compute the convex hull for the selected inlier set of lines

The output of my algorithm is a polygon, as shown in figure 4, that is further used as a mask for selecting building features. The polygon's covered area can be controlled by setting the minimum length L of the found vanishing lines: allowing smaller lines will result in a bigger polygon, which might include features not belonging to a building, while filtering out smaller lines and allowing only the bigger ones will result in the exclusion of true building features. An empirically determined value for L of 0.05 times the length of the image diagonal is used.
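Step (iv) of algorithm 1, the convex hull computation, can be sketched with Andrew's monotone chain algorithm; the following standalone version (names are illustrative, not taken from the project's code) would operate on the endpoints of the selected lines:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Point = std::pair<double, double>;  // (x, y)

// Cross product of vectors (o->a) and (o->b); positive for a left turn.
static double cross(const Point& o, const Point& a, const Point& b) {
    return (a.first - o.first) * (b.second - o.second)
         - (a.second - o.second) * (b.first - o.first);
}

// Andrew's monotone chain: returns hull vertices in counter-clockwise order.
std::vector<Point> convexHull(std::vector<Point> pts) {
    std::sort(pts.begin(), pts.end());
    pts.erase(std::unique(pts.begin(), pts.end()), pts.end());
    if (pts.size() < 3) return pts;
    std::vector<Point> hull(2 * pts.size());
    size_t k = 0;
    for (size_t i = 0; i < pts.size(); ++i) {                 // lower hull
        while (k >= 2 && cross(hull[k - 2], hull[k - 1], pts[i]) <= 0) --k;
        hull[k++] = pts[i];
    }
    for (size_t i = pts.size() - 1, t = k + 1; i-- > 0; ) {   // upper hull
        while (k >= t && cross(hull[k - 2], hull[k - 1], pts[i]) <= 0) --k;
        hull[k++] = pts[i];
    }
    hull.resize(k - 1);
    return hull;
}
```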


Figure 3: Detected vanishing lines classified according to their corresponding estimated vanishing point

Figure 4: Outcome of the foreground features detection algorithm


This algorithm is not very reliable, since the detected vanishing lines correspond to lines fitted through edge pixels, i.e. points where the intensity changes abruptly, something that in practice may occur for different types of surrounding objects like poles or traffic lines. However, this is not a big issue, because the feature selection scheme is applied only to the query images, whose features, with the exception of a few images as seen in figure 5, belong mainly to buildings.

Figure 5: Query images overlaid with their corresponding keypoints

2.3 Features quantization

Following the baseline implementation, I also used a vocabulary tree [3] for the quantization of visual features into visual words. In broad terms, a vocabulary tree is the result of applying hierarchical clustering, typically k-means, over the whole set of database image descriptors, followed by a scheme for scoring a query image with respect to the database images; typically tf-idf (term frequency-inverse document frequency) is used, as suggested by the authors. Several implementations of this quantization scheme can be found on the web:

• vocsearch2: code for vocabulary tree based image search developed as part of a publication, authored by the Department of Computer Science at ETH Zürich.

2F. Fraundorfer and C. Wu: "vocsearch" http://people.inf.ethz.ch/fraundof/page2.html

Algorithm 2 Automatic estimation of a homography between two images using RANSAC

(i) Computation of interest points.

(ii) Computation of putative correspondences based on a proximity and similarity criterion.

(iii) Estimation of a homography using RANSAC, where the number of samples is determined adaptively.

(iv) Iterative re-estimation of the homography by minimizing a cost function, aiming at maximizing the number of inliers.

• vocabulary_tree3: package, part of the Visual Simultaneous Localization And Mapping (VSLAM) stack from the Robot Operating System (ROS). It provides a generic BSD-licensed implementation ready to compile using CMake.

• VocabTree24: it was born as skeleton code for a project from a computer science course on Computer Vision at Cornell University; lately the owner made it publicly available. It is a well structured codebase divided into different applications, each ready to compile using makefiles.

My vote was for VocabTree2, since it is one of the most complete and architecturally robust implementations, albeit a bit coupled with the 128-dimensional SIFT descriptor. Using this visual appearance classification and scoring scheme, a list of candidates ordered by similarity is produced; in a further step a voting scheme is applied to choose the winner landmark, based on the ground truth information of the retrieved candidates.
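The tf-idf scoring idea can be illustrated with a small standalone sketch. This is a simplified flat bag-of-words version, not the hierarchical scoring a real vocabulary tree performs, and the names here are assumptions for illustration:

```cpp
#include <cmath>
#include <map>
#include <vector>

// Each image is assumed to be already quantized into a list of visual-word
// ids (the leaves of the vocabulary tree).
using Histogram = std::map<int, double>;  // word id -> weight

// idf weight for a word: log(N / n_w), with N images in the database and
// n_w of them containing the word. Common words are down-weighted.
double idfWeight(int nImages, int nImagesWithWord) {
    return std::log(double(nImages) / double(nImagesWithWord));
}

// Build an L1-normalized tf-idf vector for one image given per-word idf weights.
Histogram tfidf(const std::vector<int>& words, const std::map<int, double>& idf) {
    Histogram h;
    for (int w : words) h[w] += 1.0;  // term frequencies
    double norm = 0.0;
    for (auto& [w, tf] : h) {
        auto it = idf.find(w);
        tf *= (it != idf.end() ? it->second : 0.0);
        norm += tf;
    }
    if (norm > 0) for (auto& [w, tf] : h) tf /= norm;
    return h;
}
```

Query and database vectors built this way can then be compared (e.g. by L1 distance) to produce the similarity-ordered candidate list.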

2.4 Geometric verification

The next step after feature extraction, selection and quantization is applying geometric verification, aiming at re-ranking the retrieved images in the candidate list such that not only the most visually similar but also the most geometrically similar images are placed on top. The process uses the RANSAC algorithm to compute a geometric transformation in the projective space between the keypoints of the query image and those of each candidate image in the list of top n visually similar images, resulting in the classification of the candidate keypoints as inliers or outliers complying with the estimated homography. The list of top n candidates is then re-ranked based on the number of inliers.
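The adaptive determination of the number of RANSAC samples mentioned in algorithm 2 follows the standard formula from [2], N = log(1 - p) / log(1 - w^s), recomputed whenever a larger inlier set is found. A minimal sketch, with the function name and iteration cap being illustrative assumptions:

```cpp
#include <algorithm>
#include <cmath>

// Adaptive number of RANSAC iterations.
// p: desired confidence of drawing at least one all-inlier sample,
// w: current inlier ratio, s: minimal sample size
// (s = 4 point correspondences for a homography).
int ransacIterations(double p, double w, int s, int maxIter = 10000) {
    double denom = std::log(1.0 - std::pow(w, s));
    if (denom >= 0.0) return maxIter;  // w^s ~ 0: fall back to the cap
    return std::min(maxIter, (int)std::ceil(std::log(1.0 - p) / denom));
}
```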

The basic algorithm for robust model estimation explained in [2], and here shown in algorithm 2, is the same regardless of the estimated geometric transformation. The applied transformation might be an affine one, as proposed in the reference baseline implementation, which is a non-singular linear transformation followed by a translation

H_A = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} A & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}

3Patrick Mihelich: "vocabulary_tree" http://www.ros.org/wiki/vocabulary_tree
4Noah Snavely: "VocabTree2" https://github.com/snavely/VocabTree2


or a homography, which is a general non-singular linear transformation of homogeneous coordinates

H_P = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ v_1 & v_2 & v \end{pmatrix} = \begin{pmatrix} A & \mathbf{t} \\ \mathbf{v}^T & v \end{pmatrix}

capable of describing changes in perspective, something that occurs in practice for urban environment photos and cannot be captured by an affine transformation. This, combined with the fact that OpenCV provides a method for robust homography estimation, motivates its use. Accounting for degrees of freedom, estimating a homography rather than an affinity is more expensive, the former having 8 (9 matrix entries defined up to scale) in comparison with 6 for the affinity.

In this step I concentrated on the generation of putative correspondences, since the descriptor matching using either brute force or approximate nearest neighbors, as proposed by the OpenCV features framework, was found to perform poorly. Hence I opted for an algorithm using an all-vs-all strategy for the comparison of query and candidate image keypoint neighborhoods, making it the most expensive step in the pipeline. It discards unreliable correspondences by applying a proximity and a similarity threshold, excluding first those which are very far from one another and then those which are not very similar.

Using an adequate proximity threshold helps improve the quality of the resulting set of putative matches: a small proximity threshold will discard most of the features, while a bigger one will include more keypoint pairs and therefore increase the execution time of the algorithm, but might improve the quality of the resulting set of putative matches. In practice I am using an adaptive threshold of two times the minimum keypoint distance, with a lower bound of 50 pixels (both empirically determined).
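The adaptive proximity threshold just described can be sketched as follows; this is a brute-force standalone version with illustrative names, matching the all-vs-all spirit of the matching step:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Pt { float x, y; };  // keypoint image location

// Adaptive proximity threshold for putative matching: twice the minimum
// distance between any query/candidate keypoint pair, with an empirical
// lower bound of 50 pixels (both values taken from the text above).
float adaptiveProximityThreshold(const std::vector<Pt>& query,
                                 const std::vector<Pt>& cand) {
    float minDist = std::numeric_limits<float>::max();
    for (const Pt& q : query)
        for (const Pt& c : cand)
            minDist = std::min(minDist, std::hypot(q.x - c.x, q.y - c.y));
    return std::max(2.0f * minDist, 50.0f);
}
```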

As for similarity, there are two parameters to review: 1) the similarity criterion, maybe the most critical point, since a very complex measure involves several matrix operations and would negatively impact the overall execution time of the putative matching algorithm, whereas a very loose one will yield poor results and eventually a wrongly estimated projective transformation; and 2) the neighborhood size, which depends mainly on the scale at which the keypoints under evaluation were found; typically it is a window of size 21×21. Some popular similarity measures are:

• Sum of absolute di�erences.

• Zero mean sum of absolute di�erences.

• Sum of squared di�erences.

• Normalized sum of squared di�erences.

• Cross correlation.

• Normalized cross correlation.

Each of them has a varying complexity; in this case I selected the normalized cross correlation which, albeit quite complex, is capable of dealing with an affine mapping of intensity values (i.e. I ↦ αI + β, scaling plus offset), a desirable feature given the characteristics of the dataset.
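A minimal standalone sketch of normalized cross correlation over two flattened patches, showing the invariance property mentioned above (names are illustrative):

```cpp
#include <cmath>
#include <vector>

// Normalized cross correlation between two equally sized patches. Subtracting
// the mean cancels the offset beta, and dividing by the standard deviations
// cancels the scale alpha, making the score invariant to I -> alpha*I + beta.
double ncc(const std::vector<double>& p, const std::vector<double>& q) {
    const size_t n = p.size();
    double mp = 0, mq = 0;
    for (size_t i = 0; i < n; ++i) { mp += p[i]; mq += q[i]; }
    mp /= n; mq /= n;
    double num = 0, dp = 0, dq = 0;
    for (size_t i = 0; i < n; ++i) {
        num += (p[i] - mp) * (q[i] - mq);
        dp  += (p[i] - mp) * (p[i] - mp);
        dq  += (q[i] - mq) * (q[i] - mq);
    }
    return num / std::sqrt(dp * dq);
}
```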


Figure 6: Ranked results of the visual ranking process

2.5 Performance evaluation

For the purpose of evaluating the recognition results, I plotted precision and recall curves for a varying number of candidates (1 to n, with n = 50 in this case), as shown in the reference paper [1], where they observe the frequency of occurrence of the correct building being among the top n candidates. The inputs to the performance evaluation are the files with the ground truth landmark labels for the query and database images, and the ranked list of candidates for each query image; the output is a ranked list of candidates as shown in figure 6.

The retrieved images are the top n candidates in the candidate list, hence:

• the true positives are the candidate images in the n-length ranked list and labeled with the correct landmark label,

• while the false positives are the wrongly labeled ones.

Whereas the non-retrieved images are those not in the top n candidate list, and following the same idea as before:

• the false negatives are the candidate images not in the n-length ranked list and labeled with the correct landmark label,

• whereas the true negatives are the wrongly labeled ones.

As a consequence of having the most visually similar images ranked at the top, it is expected to see an increase in recall as the number of candidates increases, up to some point where the results start degrading. The same reasoning applies to precision, which is expected to decrease as the number of candidates increases, since the more candidates are considered, the less relevant the retrieved images are.
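These definitions can be sketched as a small standalone function computing precision and recall at n; the names are illustrative, not the PerfEval code:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct PR { double precision, recall; };

// Precision and recall among the top n candidates, following the definitions
// above: a candidate is relevant when its landmark label matches the query's.
PR precisionRecallAtN(const std::vector<std::string>& rankedLabels,
                      const std::string& queryLabel, size_t n) {
    size_t tp = 0, fn = 0;
    for (size_t i = 0; i < rankedLabels.size(); ++i)
        if (rankedLabels[i] == queryLabel)
            (i < n ? tp : fn)++;  // relevant and retrieved vs. missed
    size_t retrieved = std::min(n, rankedLabels.size());
    double precision = retrieved ? double(tp) / retrieved : 0.0;
    double recall = (tp + fn) ? double(tp) / double(tp + fn) : 0.0;
    return {precision, recall};
}
```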

3 Dataset

To test the problem hypothesis, I am running the reproduced pipeline over the popular Oxford Buildings dataset introduced in [4]. It consists of 5062 high resolution (1024×768) images, the result of executing 17 different queries over the Flickr database and depicting 11 different landmarks. The database includes several distracting images, making the dataset ideal for testing detector/descriptor performance but not so much as a reference for location recognition, as will be further explained in section 4. This dataset exhibits high scale variation, occlusion, and extreme viewpoint changes; in addition, each database image might be associated with more than one landmark. All these characteristics strengthen the idea that it is not very suitable as a reference set.


4 Experimental results

4.1 Implementation

The system is implemented mainly in C++ on top of OpenCV, in combination with some third party libraries, also in C++, and some scripts and functions on top of Matlab/Octave and Bash. The main system, codenamed LocationRecognition5, is a multi-file C++ program whose compile process is managed with makefiles. It has six main components, as shown in figure 7, namely:

• Common: holds generic functionality that eases some recurrent tasks such as string handling and file management.

• DataLib: holds functionality for working with keypoints and descriptors.

• FeatureExtraction: provides functionality for detecting and describing keypoints and writing them in a suitable format, as needed by the feature quantization library.

• ListBuild: holds functions for the generation of the lists taken as input by the feature quantization library; they hold paths to query and database images as well as ground truth information.

• GeomVerify: holds the functions necessary for re-ranking candidate lists, as explained in section 2.4.

• PerfEval: implements the logic necessary to produce the matrix of occurrences which is the input to the performance evaluation framework explained in section 2.5.

Figure 7: C++ implementation overview

An important part of the system is implemented as Octave/Matlab scripts and functions, supported by Bash scripts for simple file and image management tasks; they are summarized in table 1.

5Source Code Repository - https://github.com/gantzer89/LocationRecognition


compute_performance_rates.m (Octave/Matlab): Function that computes the average precision and recall rates over all query images using the occurrence matrix.

display_queries.m (Octave/Matlab): Function plotting in an approximate square grid a set of JPEG images read from a folder.

eval_performance.m (Octave/Matlab): Script holding function calls for evaluating the pipeline over a few experiments, as described in section 4.2.

eval_performance_featsel_alg.m (Octave/Matlab): Script holding function calls for evaluating the performance of the feature selection algorithm.

masker.m (Matlab): Function that implements the feature selection algorithm.

mask_reader.m (Octave/Matlab): Function serving to show feature selection results by overlaying query masks over query images.

plot_performance_rates.m (Octave/Matlab): Function for plotting the average precision and recall rates.

script.linux.sh (Bash): Script showing the commands implementing the whole pipeline.

thumbnailer.sh (Bash): Script creating thumbnails for the images in a folder.

Table 1: Summary of supporting functions and scripts

A big part of the system relies heavily on third party software; a summary is found in table 2. The selection of the vocabulary tree library, or feature quantization library as named earlier, was a critical point, since it established the overall system architecture as well as the scheme for exchanging information, which is based on plain text files. An interesting exercise would have been to migrate the code from these pieces of software to a single C++ implementation, which would serve the computer vision community, though this is a heavy task on its own.

Long Straight Lines Finder (Matlab/Octave): Finds long straight line segments in an image, used in step (i) of algorithm 1.

Vanishing Point Detector (Matlab): Estimates vanishing points and the lines complying with them, used in step (ii) of algorithm 1.

VocabTree2 (C++): Vocabulary tree implementation used for feature quantization and candidate ranking.

Table 2: Summary of third party libraries

Finally, it is worth mentioning that this system works only on the Linux platform, except for the Vanishing Point Detector subsystem, which runs only on Matlab, as does the masker function. In order to run the system one may refer to script.linux.sh; ideally, only by changing the environment variables it is possible to run it on a different system. Some system tweaks are to be expected, since this implementation was neither tested on other platforms nor developed with widespread use in mind, but rather as an academic work for proving the effect of surrounding feature removal.


4.2 Results and discussion

In this section we evaluate the pipeline over the dataset using the following combinations:

• Visual: uses only visual ranking.

• Geometric fix: uses visual ranking followed by geometric verification using a RANSAC re-projection threshold of 3 pixels, a proximity threshold of 100 pixels, and a similarity threshold of 0.5.

• Geometric auto: same as before, but instead of a fixed proximity threshold it uses an adaptive one, as explained in section 2.4.

• Geometric loose: uses visual ranking followed by geometric verification using loose thresholds, that is, a RANSAC re-projection threshold of 10 pixels, a similarity threshold of 0.8, and an adaptive proximity threshold.

• Masked: uses visual ranking preceded by feature selection on the query images.

We then compare pairs of these combinations in order to answer the questions posed in the beginning. By looking at the results of Visual vs. Geometric auto, shown in figure 8, we assess how convenient it is to apply geometric verification; we observe that up to 20 candidates Geometric auto outperforms Visual, though the margin of improvement is not very significant. This is not surprising, since the resulting projective transformations are rather poor given the small quantity of found putative matches (a factor that we identified as critical for homography estimation), due to the wide baseline between images and the multiplicity of locations depicted in a single image. Applying geometric verification is useful as long as the reference image database doesn't exhibit extreme viewpoint and scale changes, hence enforcing a small baseline, and covers the query area as uniformly as possible, leading to better recognition results.

To further review the effects of geometric verification we compare Geometric auto against Geometric fix; results are shown in figure 9, with Visual included as well for reference purposes. Though obvious, it is noteworthy that Geometric auto outperforms Geometric fix by a large margin, showing that using a fixed proximity threshold hurts performance since it ignores the scale variation.

Accounting for the effect of the RANSAC re-projection threshold and the similarity threshold, we compare Geometric loose against Geometric auto, where both use an adaptive proximity threshold but vary the similarity and re-projection thresholds. We observe that both have similar recall, with a small variation margin, but Geometric loose outperforms Geometric auto by nearly 20% up to 20 candidates. This evidences that using a smaller similarity threshold produces several false positives, and that using a higher re-projection threshold tackles the problem of the wide baseline by allowing, to some extent, correspondences which are far from one another.

To assess whether features from the surroundings help recognition, Visual was compared against Masked. The latter doesn't involve any geometric verification step, in order to account only for the effect of feature selection based on foreground features. As pointed out in [1], performance is hurt when surrounding features are excluded; the experiment results shown in figure 10 confirm the findings.

To discard the possibility that this is due to the failure of the feature selection algorithm, we look at the maximum precision achieved by two query images, see figure 11, to which the feature selection algorithm was applied. The results shown in table 3 show that surrounding features effectively help recognition to some extent.

Dealing with the Oxford Buildings dataset as the reference image database allows us to derive some lessons that apply in general to any recognition system:


Figure 8: Precision and recall curves for pre and post geometric verification


Figure 9: Precision and recall curves for the combinations Geometric auto and Geometric fix


Figure 10: Precision and recall curves for the combinations Visual and Masked


Figure 11: Feature selection algorithm results. Left: radcliffe_camera_000519, an example of successful removal of most distracting background features. Right: christ_church_000179, an example of failed background feature removal, where long straight lines belonging to the floor messed up the result.

Visual / Masked

christ_church_000179: 92% / 6%
radcliffe_camera_000519: 96% / 16%

Table 3: Maximum precision rates for two query images before and after feature selection

• Though discarding surrounding features is discouraged, applying a feature selection scheme is not.

• Shadowing severely affects visual appearance matching, since it introduces virtual artifacts of high intensity change; histogram equalization may help but is rather far from optimal.

• Geometric verification is helpful in cases where geometric properties are repeatable, starting from a set of features which are visually similar (the putative matches).

• Geometric verification is a computationally heavy task that sometimes positions itself as a bottleneck and often doesn't provide a significant improvement.

As said before, homography estimation depends highly on the quantity and quality of the found putative matches. Some identified factors that influence putative match detection are:

• Extreme viewpoint changes between query and database images: when two images depicting the same scene are taken from locations very far from one another, the result is high perspective distortion, which in turn decreases the repeatability of visual features.

• Extreme scale changes between query and database images: if the difference between the distances from the scene object to the cameras of the query and database images is very large, then the same level of detail cannot be captured, hence visual features are located at different scales, something that none of the considered similarity measures is able to capture; it is rather a matter of adapting the window size.

• Shadowing artifacts in either the query or database image: visual features detected on images where shadow is cast on neighboring buildings are not reliable and only disturb the recognition and putative match search processes.

• Comparing images taken at different times of the day: the same scene depicted by images taken at different times of the day (day or night) results in very different illumination conditions, altering the visual feature signatures and hence the putative match search process.

References

[1] Georges Baatz, Kevin Köser, David Chen, Radek Grzeszczuk, and Marc Pollefeys. Leveraging 3D city models for rotation invariant place-of-interest recognition. International Journal of Computer Vision, 96:315-334, 2012.

[2] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[3] David Nistér and Henrik Stewénius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161-2168, 2006.

[4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1-8, 2007.
