Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware

Nico Cornelis
K.U.Leuven
Leuven, Belgium
[email protected]

Luc Van Gool
K.U.Leuven / ETH Zurich
Leuven, Belgium / Zurich, Switzerland
[email protected]

Abstract

Ever since the introduction of freely programmable hardware components into modern graphics hardware, graphics processing units (GPUs) have become increasingly popular for general purpose computations. Especially when applied to computer vision algorithms where a Single set of Instructions has to be executed on Multiple Data (SIMD), GPU-based algorithms can provide a major increase in processing speed compared to their CPU counterparts. This paper presents methods that take full advantage of modern graphics card hardware for real-time scale invariant feature detection and matching. The focus lies on the extraction of feature locations and the generation of feature descriptors from natural images. The generation of these feature vectors is based on the Speeded Up Robust Features (SURF) method [1] due to its high stability against rotation, scale and changes in lighting conditions of the processed images. With the presented methods, feature detection and matching can be performed at framerates exceeding 100 frames per second for 640 × 480 images. The remaining time can then be spent on fast matching against large feature databases on the GPU while the CPU can be used for other tasks.

Keywords: GPU, SURF, feature extraction, feature matching

1. Introduction

Since the arrival of dedicated graphics hardware, its development has mainly been driven by the game industry. Throughout the years, however, these GPUs transformed from a piece of hardware with fixed function pipelines into a compact, programmable mini-computer with multiple processor cores and a large amount of dedicated memory. With an instruction set comparable to that of a CPU, modern day graphics boards provide a new platform for developers, allowing them to write new algorithms and also port existing CPU algorithms to the GPU, which are then able to run at very high speeds. Algorithms developed using the GPU as a programming platform are commonly referred to as GPGPU (General Purpose Computations on GPU) algorithms [8]. Recent developments in graphics hardware also provide GPGPU programmers with a C-like programming environment for GPUs, enabling them to write even more efficient code, provided that a small number of guidelines are followed. One such programming language is called CUDA, which stands for Compute Unified Device Architecture [5].

This paper shows how scale invariant feature detection and matching can be performed using the GPU as development platform. More specifically, we show how carefully chosen data layouts lead to a fast calculation of SURF feature locations and descriptors, even when compared to existing GPU implementations [11, 3]. Additionally, fast feature matching is performed by taking advantage of the CUBLAS library, which provides fast GPU implementations of commonly used basic linear algebra routines on the latest generation of GeForce graphics cards. This approach has been developed for GeForce 8 series graphics boards and above.

2. Related work

One commonly addressed problem in computer vision is the extraction of stable feature points from natural images. These feature points can subsequently be used for correspondence matching in applications such as wide-baseline stereo and object recognition. Because there can be a significant change in viewpoint position and orientation between images taken from a common scene, the extracted feature points need to be scale and rotation invariant, and preferably also invariant to changes in lighting conditions, in order to be considered as stable features. The most popular algorithm developed towards this goal, the Scale Invariant Feature Transform (SIFT) proposed by Lowe [10], has already been applied successfully in a variety of applications.

Shortly after the introduction of SIFT, a different kind of feature extraction and descriptor calculation was proposed, called SURF. While the feature localization stage of SURF has been tuned towards performance and parallelism by using a fast approximation of the Hessian, its feature descriptor stage has been tuned towards high recognition rates and proved to be a valid alternative to the SIFT descriptor. Therefore, the method proposed in this paper uses the SURF descriptor as its feature vector of choice.

3. SURF overview

The SURF method consists of multiple stages to obtain relevant feature points. The individual stages are:

1. Construction of an Integral Image, introduced by Viola and Jones [12], for fast box filtering with very few memory accesses.

2. Search for candidate feature points by the creation of a Hessian-based scale-space pyramid (SURF detector). Fast filtering is performed by approximating the Hessian as a combination of box filters.

3. Further filtering and reduction of the obtained candidate points by applying a non-maximum suppression stage in order to extract stable points with high contrast. To each remaining point, its position and scale are assigned.

4. Assignment of an orientation to each feature by finding a characteristic direction.

5. Feature vector calculation (SURF descriptor) based on the characteristic direction to provide rotation invariance.

6. Feature vector normalization for invariance to changes in lighting conditions.

For feature localization on a single scale, a variety of feature detectors can be used, such as the Laplacian of Gaussians (LoG), the Difference of Gaussians (DoG) as used in SIFT, and the determinant of the Hessian used in SURF. In order to provide scale invariance, however, these filters have to be applied at different scales, corresponding to different values of sigma for the Gaussian filter kernels. As the filter sizes become large, there is little difference in output values between neighbouring pixels. Therefore the computational complexity can be reduced by the creation of a scale-space pyramid where each level of the pyramid, referred to as an octave, consists of N different scales. The resolution of each octave is half that of the previous octave in both the x and y directions.

In order to localize interest points in the image and over scales, non-maximum suppression (NMS) in a 3 × 3 × 3 neighbourhood is applied to the scale-space pyramid. The extrema of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown and Lowe [2].
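To make the 3 × 3 × 3 test concrete, the following sketch checks whether a response is a strict maximum over its 26 neighbours in image space and scale. It is a simplified scalar CPU-style routine, not the vectorized GPU code of Section 4.3; the flat array layout and the function name are assumptions made for illustration.

```cpp
// Minimal sketch: true if the det-of-Hessian response H[s][y][x] is a strict
// maximum over its 3x3x3 neighbourhood (scale, y, x). Assumes the caller
// only queries interior positions of a single octave.
bool isLocalMaximum(const float* H, int width, int height, int numScales,
                    int x, int y, int s)
{
    const float centre = H[(s * height + y) * width + x];
    for (int ds = -1; ds <= 1; ++ds)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
            {
                if (ds == 0 && dy == 0 && dx == 0) continue;  // skip the centre
                const float v = H[((s + ds) * height + (y + dy)) * width + (x + dx)];
                if (v >= centre) return false;                // a neighbour is at least as large
            }
    return true;
}
```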

The next step consists of fixing a reproducible orientation based on information from circular regions around the interest points. For that purpose, the Haar-wavelet responses in x and y direction are first calculated in a circular neighbourhood around the interest point. The size of the circular region and the sampling steps are chosen proportional to the scale at which the feature point was detected. Once the wavelet responses are calculated and weighted with a Gaussian centered at the interest point, the responses are represented as vectors in a space with the horizontal response strength along the abscissa and the vertical response strength along the ordinate. The dominant orientation is subsequently estimated by calculating the sum of all responses within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within this window are summed. The two summed responses then yield a new vector. The longest such vector lends its orientation to the interest point.

Finally, a descriptor is generated for each feature. This step consists of constructing a square region centered around the interest point and oriented along the orientation selected in the previous step. This region is split up regularly into 4 × 4 square sub-regions. For each of these sub-regions, 4 characteristic values are computed at 5 × 5 regularly spaced sample points. The wavelet responses dx and dy computed at these sample points are first weighted with a Gaussian centered at the interest point and summed up over each sub-region to form a first set of entries to the feature vector. In order to bring in information about the polarity of the intensity changes, the sums of the absolute values of the responses, |dx| and |dy|, are also extracted. Hence, each sub-region has a four-dimensional descriptor vector v = (Σdx, Σdy, Σ|dx|, Σ|dy|) for its underlying intensity structure, resulting in a descriptor vector of length 64 over all 4 × 4 sub-regions. The wavelet responses are invariant to a bias in illumination (offset). Invariance to contrast (a scale factor) is achieved by turning the descriptor into a unit vector.
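The descriptor assembly can be summarized by a plain reference sketch. It assumes the orientation-aligned, Gaussian-weighted wavelet responses have already been sampled into 20 × 20 arrays (a layout chosen here purely for illustration) and accumulates the 4 × 4 × 4 = 64 entries, followed by unit-length normalization.

```cpp
#include <cmath>

// Reference sketch: accumulate the 64-dimensional SURF descriptor from a
// 20x20 grid of orientation-aligned, Gaussian-weighted wavelet responses
// dx[j][i], dy[j][i] (this layout is assumed purely for illustration).
void buildDescriptor(const float dx[20][20], const float dy[20][20], float desc[64])
{
    for (int sy = 0; sy < 4; ++sy)                 // 4 x 4 sub-regions
        for (int sx = 0; sx < 4; ++sx)
        {
            float sumDx = 0.0f, sumDy = 0.0f, sumAbsDx = 0.0f, sumAbsDy = 0.0f;
            for (int j = 0; j < 5; ++j)            // 5 x 5 samples per sub-region
                for (int i = 0; i < 5; ++i)
                {
                    const float vx = dx[sy * 5 + j][sx * 5 + i];
                    const float vy = dy[sy * 5 + j][sx * 5 + i];
                    sumDx += vx;                 sumDy += vy;
                    sumAbsDx += std::fabs(vx);   sumAbsDy += std::fabs(vy);
                }
            float* v = desc + 4 * (sy * 4 + sx);   // v = (sum dx, sum dy, sum|dx|, sum|dy|)
            v[0] = sumDx; v[1] = sumDy; v[2] = sumAbsDx; v[3] = sumAbsDy;
        }

    float norm2 = 0.0f;                            // turn into a unit vector
    for (int k = 0; k < 64; ++k) norm2 += desc[k] * desc[k];
    const float inv = 1.0f / std::sqrt(norm2 + 1e-12f);
    for (int k = 0; k < 64; ++k) desc[k] *= inv;
}
```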

4. GPU-implementation

Writing GPGPU applications is by no means a trivial task, even in the presence of a reference CPU implementation. One has to make careful choices regarding the data layout in graphics card memory to ensure that the application can take full advantage of the graphics card capabilities, especially in terms of parallelism. The following sections show how the data layout proposed in this paper allows for parallelism at both image-level and scale-level at various stages of the algorithm.


4.1. Data Layout

The methods proposed in this paper make extensive use of texture mipmaps. A texture mipmap differs from a regular texture in the sense that it also provides additional storage for subsampled versions of the full resolution (base) texture of size w × h. Apart from the base level, the higher mipmap levels are generated by applying a 2 × 2 box filter on the previous mipmap level. Texture mipmaps generated this way are used to reduce aliasing artifacts when displaying high resolution textures in low resolution viewports. Because of the ability of the GPU to write into textures rather than treating them as a read-only part of memory, and also because the data in the mipmap levels is not constrained to be subsampled versions of the base texture, these texture mipmaps can provide us with the data storage we need. Similar to the storage requirements for the Hessian calculations, the mipmap size at level i corresponds to max(1, ⌊w/2^i⌋) × max(1, ⌊h/2^i⌋). As such, a texture mipmap proves to be a suitable choice for the scale-space pyramid where each octave corresponds to a mipmap level.
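A small helper makes the level-size rule explicit; the function name is hypothetical and used only for this sketch.

```cpp
// Sketch: dimensions of mipmap level i for a base texture of size w x h,
// i.e. max(1, floor(w / 2^i)) x max(1, floor(h / 2^i)).
void mipmapLevelSize(int w, int h, int level, int* levelW, int* levelH)
{
    *levelW = (w >> level) > 0 ? (w >> level) : 1;
    *levelH = (h >> level) > 0 ? (h >> level) : 1;
}
```

For a 640 × 480 base texture this yields 320 × 240 at level 1, 160 × 120 at level 2, and so on, matching the per-octave halving of resolution described in Section 3.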

4.2. Scale-Space Pyramid Creation

Although the pyramid creation proposed for SURF in [1] allows for fast filtering and parallelism at scale-level, it is not suited for an efficient GPU implementation. This is caused by the fact that, although only a small number of memory accesses have to be performed, there is a large stride between the memory locations of those accesses at large scales, leading to poor performance of the texture caches. Furthermore, the calculation of this fast Hessian at different scales requires texture accesses at different positions within the integral image. Because texture caching works efficiently on localized data, we prefer a computation of the exact Hessian as a combination of separable filters to an approximation using Integral Images. Similar to the first stages of SIFT, a two-pass approach is used where Gaussian filtered versions of the base texture are created in the first pass. The second pass consists of a pixel-wise calculation of the Hessian matrix.

4.2.1 Gaussian Pyramid Creation

As noted in [3], GPUs do not only have parallel processing ability on a per-pixel basis, parallelized by the number of fragment processors on the GPU; there is also parallelism in the computational stages of the GPU, obtained by calculating the four color channels simultaneously as a vector. To take advantage of these vector abilities of GPUs, the authors of [3] modified the gray-level input image in a preprocessing step where the gray image data was rearranged into a four-channel RGBA image. Each pixel in the RGBA image represented a 2 × 2 pixel block of the gray-level image, thus reducing the image area by a factor of 4. Although this approach clearly allows for more efficient parallelization at image-space level, this paper proposes an alternative data layout which also provides parallelism at scale-space level.

Figure 1. Top left: Gaussian blurred images, stored in multiple textures. Top right: all octaves and scales mapped into a single texture mipmap. Bottom: Gaussian filter kernels g(t) for N = 4; the red, green, blue and alpha channels hold scales 0, 1, 2 and 3 respectively.

For each octave, N Gaussian blurred versions G of the base texture I need to be created according to the following separable filter implementations:

$$G'_s(x, y) = \sum_{t=-n_s}^{n_s} I(x - t,\, y)\, g_s(t) \quad (1)$$

$$G_s(x, y) = \sum_{t=-n_s}^{n_s} G'_s(x,\, y - t)\, g_s(t) \quad (2)$$

where $g_s$ represents the discretized Gaussian kernel of size $2n_s + 1$ at scale $s \in [0, N-1]$. By defining

$$g(t) = (g_0(t), g_1(t), \ldots, g_{N-1}(t))^T \quad (3)$$

the above equations can be rewritten as

$$G'(x, y) = \sum_{t=-n}^{n} I(x - t,\, y) * g(t) \quad (4)$$

$$G(x, y) = \sum_{t=-n}^{n} G'(x,\, y - t) * g(t) \quad (5)$$

where $*$ represents an elementwise multiplication operator and $n = \max(n_0, n_1, \ldots, n_{N-1})$. Figure 1 illustrates the data layout used for N = 4 and n = 5. In practice, n = 9.


The separable Gaussian filters are implemented as a two-pass operation. The first pass performs Gaussian filtering in the x-direction by fetching scalar values from the gray-level base texture I and multiplying them with the four-component Gaussian kernel values g. As the values of g are independent of the pixel position, they can be hard-coded into the fragment shader. The results of the first pass are written to a four-channel intermediate texture G′. Similarly, the second pass reads from G′ and writes the result of the 19 × 19 Gaussian kernel to the four-component texture G. As such, the four Gaussian filtered versions are calculated simultaneously. Note that, compared to the method proposed in [3], superfluous computations are performed at lower scales, caused by the zero-valued entries in the Gaussian kernels. This disadvantage, however, is overcompensated by a reduced number of context and fragment shader switches as well as the elimination of a preprocessing step.

Apart from the first octave, where the first pass takes samples from the base texture I, the first pass for the remaining octaves samples the alpha-component of G at the previous octave in order to reduce aliasing effects.
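The same vectorized idea can be sketched in CUDA instead of a fragment program: each thread produces one output pixel of the intermediate texture G′ and accumulates all four scales at once in a float4. The constant-memory kernel array, the border clamping and the fixed radius of 9 are assumptions of this sketch; the paper's actual implementation runs as an OpenGL fragment shader.

```cuda
#include <cuda_runtime.h>

#define RADIUS 9                        // n = 9 as stated above
__constant__ float4 g[2 * RADIUS + 1];  // per-tap kernel weights for the four scales

// Sketch of the first (horizontal) pass: one thread per output pixel,
// filtering the scalar base image I into the four-channel intermediate Gp.
__global__ void gaussianPassX(const float* I, float4* Gp, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int t = -RADIUS; t <= RADIUS; ++t)
    {
        int xs = min(max(x - t, 0), w - 1);   // clamp at the image border
        float v = I[y * w + xs];
        float4 k = g[t + RADIUS];
        acc.x += v * k.x;  acc.y += v * k.y;  // all four scales accumulated
        acc.z += v * k.z;  acc.w += v * k.w;  // simultaneously, as in eq. (4)
    }
    Gp[y * w + x] = acc;
}
```

The second (vertical) pass would read from Gp and apply the same kernel along y, producing the four-channel texture G of eq. (5).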

4.2.2 Hessian Pyramid Creation

Once the four Gaussian filtered images have been calculated for an octave, with one scale per color channel, they can be used for calculating the determinant of the Hessian:

$$\det H(x, y) = \begin{vmatrix} \dfrac{\partial^2 G(x,y)}{\partial x^2} & \dfrac{\partial^2 G(x,y)}{\partial x \partial y} \\[1ex] \dfrac{\partial^2 G(x,y)}{\partial x \partial y} & \dfrac{\partial^2 G(x,y)}{\partial y^2} \end{vmatrix} \quad (6)$$

Because the discretized second order derivative kernels are independent of the scale, the Hessian values are once again calculated for the four scales simultaneously. Note that the methods mentioned above are also applicable to a GPU implementation of the SIFT algorithm.

4.3. Keypoint Filtering

Figure 2. Top: Fast extraction of sparse feature locations from large 2D arrays on the GPU. Bottom: Expanding the algorithm to extract feature data at multiple scales from a matrix of encoded 32-bit values.

Once the Hessian values have been computed, Non Maximum Suppression is applied in a 3 × 3 × 3 neighbourhood for all scales. First of all, in order to reduce the amount of data that has to be passed on to the next stage, we exploit the fact that no more than one extremum can be found in a 2 × 2 neighbourhood in image space. Therefore the results of the NMS stage can be written out to a texture mipmap of half the size of H in both x and y directions. Because the first and last scale lack one border scale to perform 3 × 3 × 3 NMS filtering, the two last scales of the previous octave are resampled so that there are once again 4 detectable scales per octave. Without going into further detail, we would like to point out that the data layout, which encodes 4 scales in a single four-component pixel, once more proves to be a good choice in terms of efficiency. The extremum in a 3 × 3 neighbourhood in image space can be computed for multiple scales simultaneously by performing the comparisons on vectors rather than scalars. For each pixel, the result is encoded in a 4-byte value as follows:

    X-offset   Y-offset   Extremum   Octave
    00001100   00000101   00001101   00000011

where the last byte encodes the octave that is currently being processed (up to 256 possible octaves). The third byte encodes whether an extremum has been detected, with the least significant bit corresponding to scale 0. As such, this encoding allows for up to 8 detectable scales. The first two bytes encode the offset in x and y direction, needed because 2 × 2 pixels are being processed simultaneously. Note that, although a 16-bit encoding would suffice for N = 4, we have chosen a 32-bit encoding for further efficient processing in CUDA.
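A direct way to pack and unpack this value is shown below; only the field layout is fixed by the description above, so the ordering of the bytes within the 32-bit word (x-offset in the lowest byte, octave in the highest) is an assumption of this sketch.

```cpp
#include <cstdint>

// Sketch of the 32-bit keypoint encoding described above.
static inline uint32_t encodeKeypoint(uint32_t xOffset, uint32_t yOffset,
                                      uint32_t scaleMask, uint32_t octave)
{
    return  (xOffset   & 0xFFu)
         | ((yOffset   & 0xFFu) << 8)
         | ((scaleMask & 0xFFu) << 16)    // bit s set => extremum detected at scale s
         | ((octave    & 0xFFu) << 24);
}

static inline void decodeKeypoint(uint32_t code, uint32_t* xOffset, uint32_t* yOffset,
                                  uint32_t* scaleMask, uint32_t* octave)
{
    *xOffset   =  code        & 0xFFu;
    *yOffset   = (code >> 8)  & 0xFFu;
    *scaleMask = (code >> 16) & 0xFFu;
    *octave    = (code >> 24) & 0xFFu;
}
```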

Once the pixels have been encoded, we need to extract the extrema locations from the encoded images. Usually, this is done by transferring the data from the GPU to system memory, followed by scanning the downloaded data on the CPU for extrema locations. Downloading large amounts of data to system memory, however, is slow as it has to be transferred over the PCI bus. Therefore, we present a method that only downloads the detected extrema locations to system memory.

In 3D application programming interfaces such as OpenGL and DirectX, a fragment produced by a single execution of a fragment program is constrained to the pixel location where it is to be drawn on screen, even when using multiple render targets. The Nvidia CUDA language, however, allows data to be scattered at arbitrary locations in memory. One only needs to ensure that different threads do not write to the same location in memory to avoid undefined results. Therefore, the extrema extraction stage is implemented as a two-pass algorithm where the first pass assigns a unique index to each extremum, followed by a second pass which scatters the extrema locations into a continuous array. For clarity, we will show how the CUDA language can be used to extract pixel locations from a sparse single-component m × n matrix. This process is illustrated in the top half of Figure 2.

In the first pass, n simultaneously running threads process a single column each and have an integer valued counter initialized at 0. While iterating over the m rows, the counter value is incremented by one if an extremum is detected. This results in a 1 × n array where each value corresponds to the number of detected extrema in its corresponding column. Subsequently, this array is converted into another array Ne of equal size where each value is the sum of all previous values, with the first value initialized at 0. This conversion step can also be performed efficiently on the GPU [4]. In the second pass, the n threads once again process a single column each, but now their counter value is initialized at its corresponding value in Ne. This way, whenever an extremum is encountered for the second time, the counter value provides a unique index for the extremum and its location can be scattered into a continuous output array. Afterwards, this output array is downloaded to system memory to inform the CPU of the number of features and their respective locations. Note that this approach avoids the need to download entire images to system memory as well as scanning these downloaded images for extrema locations on the CPU, while the large number of simultaneously running threads compensates for the fact that the matrix needs to be traversed a second time.
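For the single-component case, the two passes can be sketched as the following CUDA kernels. The extremum predicate, the uint2 output format and the kernel names are assumptions of this sketch; the exclusive prefix sum that turns the per-column counts into the start indices Ne is left to a scan primitive such as the one in [4].

```cuda
// Sketch of the two-pass extraction for a single-component m x n matrix M.
// One thread per column; isExtremum(v) stands for whatever predicate marks
// a detected extremum.
__device__ bool isExtremum(float v) { return v != 0.0f; }

// Pass 1: count the extrema in each column.
__global__ void countPerColumn(const float* M, int m, int n, unsigned int* counts)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    unsigned int c = 0;
    for (int row = 0; row < m; ++row)
        if (isExtremum(M[row * n + col])) ++c;
    counts[col] = c;
}

// (An exclusive prefix sum over 'counts' produces the start indices Ne.)

// Pass 2: scatter each extremum location to its unique output slot.
__global__ void scatterLocations(const float* M, int m, int n,
                                 const unsigned int* Ne, uint2* outLocations)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    unsigned int idx = Ne[col];                 // counter starts at Ne[col]
    for (int row = 0; row < m; ++row)
        if (isExtremum(M[row * n + col]))
            outLocations[idx++] = make_uint2(col, row);
}
```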

Expanding this algorithm to extract feature data at multiple scales from a 2D matrix of encoded 32-bit values is straightforward. Instead of incrementing the counter value by one, it is incremented by the number of extrema Np in the encoded pixel p at position (x, y), where Np corresponds to the number of ones in the third byte. In the second pass, the feature data of the detected extrema is scattered to Np contiguous 4-tuples in the output array starting at the index stored in the counter, as shown in the bottom half of Figure 2. Note that, besides the 2D feature location in pixel units of the original image, we also write out the scale and octave for each feature. This additional information facilitates selecting the necessary mipmap levels and color channels to perform feature location interpolation.

Figure 3. Tiling of square neighbourhood regions at extracted feature locations.

4.4. Feature Location Interpolation

For features detected at higher octaves, feature location interpolation is needed in order to get more accurate localizations. Possible interpolation schemes range from a simple parabolic interpolation to the interpolation scheme proposed by Brown and Lowe [2]. Because they require a very limited number of numerical operations and only have to be performed on a small amount of data, the choice of interpolation scheme does not have a significant impact on global performance. Since all values are stored in a texture mipmap, where each mipmap level corresponds to an octave and each color channel corresponds to a scale, all information needed to perform interpolation is present within a single texture mipmap. This means that interpolation for all features can be performed in a single pass while only a single texture mipmap needs to be bound to the fragment program. The ability to process all features in a single pass eliminates many texture and framebuffer switches.
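The simple parabolic option can be written down in a few lines: given the responses at an extremum and its two neighbours along one axis, the vertex of the fitted parabola gives the sub-sample offset. This is a generic refinement sketch, not the Brown and Lowe scheme.

```cpp
// Sketch of parabolic (quadratic) refinement along one axis: given the
// responses at the extremum and its two neighbours, return the sub-sample
// offset of the parabola's vertex, which lies in (-0.5, 0.5).
static inline float parabolicOffset(float prev, float centre, float next)
{
    const float denom = prev - 2.0f * centre + next;
    if (denom == 0.0f) return 0.0f;        // flat neighbourhood, no refinement
    return 0.5f * (prev - next) / denom;
}
```

Applying this independently along x, y and scale, and scaling the spatial offsets by the octave's sampling step, yields a refined location in the coordinates of the original image.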

4.5. Feature Descriptors

In order to avoid loss of speed by reallocating data containers, the data structures should be initialized once at initialization time so that they can be reused. As the number of extracted features per image can vary, the data containers used to store feature information are initialized at a fixed maximum size. This section illustrates how orientation assignment and feature descriptor calculation are performed for a maximum allowable number of 4096 features.

4.5.1 Orientation Assignment

Gradient information in the neighbourhood of the feature locations is stored in 16 × 16 pixel-blocks which are tiled into a texture of size (64 × 16) × (64 × 16), so that each feature corresponds to one of the 4096 pixel-blocks, as shown in Figure 3. The neighbourhood area covered in the original image is chosen proportionally to the scale at which the feature has been detected. As the original image is loaded onto the graphics card in a texture mipmap, we can let OpenGL perform linear interpolation in both image-space and mipmap-space automatically to avoid aliasing for features detected at large scales. The actual gradient computation for a feature neighbourhood stores a 64-bit value for each pixel in the 16 × 16 pixel-block, corresponding to two 32-bit floating point values representing the gradients in x- and y-direction. These gradient values are also weighted by a Gaussian centered at the feature location. Once these Gaussian filtered gradients have been computed, the results are transferred into a pixel buffer which serves as the data link between OpenGL and the CUDA context in which further operations will take place.

Within CUDA, each pixel-block is processed by a multiprocessor running 256 threads in parallel, where each thread transfers a pixel from the 16 × 16 pixel-block into fast shared memory. In order to ensure efficient memory transfers, one has to take the warp-size of the multiprocessors into account. The warp-size is the number of threads that can be combined into a single SIMD instruction (32 for Geforce8). As the width of a pixel-block corresponds to half the warp-size, the requirement imposed by CUDA for coalescing the memory accesses into a single contiguous aligned memory access is met. Once this transfer has taken place, the transferred gradient array is sorted according to its phase values by using a parallel bitonic sort algorithm, and the dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within this window are summed to yield a new vector. The longest such vector lends its orientation to the interest point.
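A sequential sketch of the sliding-window search clarifies the computation; the GPU version operates on the bitonic-sorted array in shared memory, whereas the version below simply sorts with the standard library. The Response structure and function name are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch: dominant orientation from Haar-wavelet responses. Responses are
// sorted by phase, a pi/3-wide window slides over them, and the window whose
// summed (dx, dy) vector is longest defines the orientation.
struct Response { float angle, dx, dy; };

float dominantOrientation(std::vector<Response> r)
{
    const float kPi = 3.14159265f;
    const float window = kPi / 3.0f;

    std::sort(r.begin(), r.end(),
              [](const Response& a, const Response& b) { return a.angle < b.angle; });

    float bestLen2 = -1.0f, bestAngle = 0.0f;
    for (size_t i = 0; i < r.size(); ++i)              // window starts at each response
    {
        float sumX = 0.0f, sumY = 0.0f;
        for (size_t j = 0; j < r.size(); ++j)
        {
            float d = r[j].angle - r[i].angle;
            if (d < 0.0f) d += 2.0f * kPi;              // wrap around the circle
            if (d <= window) { sumX += r[j].dx; sumY += r[j].dy; }
        }
        const float len2 = sumX * sumX + sumY * sumY;
        if (len2 > bestLen2) { bestLen2 = len2; bestAngle = std::atan2(sumY, sumX); }
    }
    return bestAngle;
}
```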

4.5.2 Aligned Feature Descriptors

The feature descriptor generation stage is similar to the initial step in the orientation assignment. The feature neighbourhoods are tiled and multiplied with a Gaussian in the same way, but are now aligned to their assigned orientation. As required by the SURF descriptor, the gradient values v = (dx, dy, |dx|, |dy|) are now stored in 128-bit pixels. In this step, the results are written to a texture mipmap with a base size of (64 × 16) × (64 × 16). By making use of hardware accelerated mipmapping, the values v′ = (Σdx, Σdy, Σ|dx|, Σ|dy|) for mipmap level 2 are automatically generated as a 4 × 4 boxfiltering of the base level. As shown in Figure 4, this mipmap level is of size (64 × 4) × (64 × 4) and each feature now corresponds to a 4 × 4 pixel-block, resulting in a descriptor of size 64.

Figure 4. Feature vector generation and normalization.

Although the resulting descriptor incorporates spatial information about the feature, it still needs to be normalized to provide invariance against changing lighting conditions. In order to avoid numerical inaccuracies when normalizing a vector of 64 floating point values, a multi-phase algorithm is needed. To this end, a texture mipmap with a base size of (64 × 4) × (64 × 4) is allocated and is used to draw the squared values of v′. Once again, hardware accelerated mipmapping is used to generate mipmap level 2. This mipmap level is of size 64 × 64 and the sum of the four values in each pixel now corresponds to the squared norm of its corresponding feature vector. In a final step, the feature vectors are normalized according to these values. Because hardware accelerated mipmapping can be seen as a repetitive 2 × 2 boxfiltering, this is a multi-phase approach where each hierarchical level calculates the sum of four lower-level values, thereby reducing numerical inaccuracies.
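The numerical idea behind the multi-phase summation can be illustrated with a small CPU-side sketch: the 64 squared entries are reduced in log2(64) = 6 pairwise phases, the software analogue of the repetitive 2 × 2 box filtering performed by the mipmap hardware, before the final normalization. The epsilon guard is an addition of this sketch.

```cpp
#include <cmath>

// Sketch of the multi-phase normalization: hierarchical pairwise summation
// of the squared descriptor entries, then scaling to unit length.
void normalizeDescriptor(float desc[64])
{
    float partial[64];
    for (int i = 0; i < 64; ++i) partial[i] = desc[i] * desc[i];

    for (int len = 64; len > 1; len /= 2)          // 6 pairwise reduction phases
        for (int i = 0; i < len / 2; ++i)
            partial[i] = partial[2 * i] + partial[2 * i + 1];

    const float inv = 1.0f / std::sqrt(partial[0] + 1e-12f);
    for (int i = 0; i < 64; ++i) desc[i] *= inv;
}
```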

5. Matching

Feature descriptor matching can be performed using different criteria, such as the Sum of Squared Differences (SSD) or the dot product (DOT). Finding the best match for a feature vector f in a database of feature vectors D is done by locating the feature d in D for which the SSD is minimal. Because the feature vectors have been normalized, this corresponds to finding the feature vector d for which the dot product with f is maximal:

$$\mathrm{SSD} = \sum_{i=0}^{63} (f_i - d_i)^2 \quad (7)$$

$$= \sum_{i=0}^{63} f_i^2 + \sum_{i=0}^{63} d_i^2 - 2 \sum_{i=0}^{63} f_i \, d_i \quad (8)$$

$$= 2 - 2 \cdot \mathrm{DOT} \quad (9)$$


If the feature descriptors are to remain on the graphics card, the tiled feature descriptor array can be converted to a 2D matrix where each tile is mapped onto a row of the 2D matrix. Similar to the transformation in the orientation assignment pass, this step can be performed efficiently using CUDA. Next, a correlation matrix C is computed as C = D · F^T, where each entry C_ij contains the dot product of the feature vectors d_i and f_j. Because the correlation matrix C is calculated as a basic matrix multiplication, it can be performed efficiently on the GPU by making use of the CUBLAS library, which provides optimized versions of basic linear algebra operations on the GPU. Finally, in order to extract the best match for each of the features in F, a CUDA algorithm is used where each thread processes a column of C and returns the row index of the best match. When matching m features against a database of n features using descriptors of size d, calculating the correlation matrix C has a computational complexity of O(m × n × d), whereas the extraction of the best matches has a computational complexity of O(m × n).
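A sketch of this matching step is given below. It uses the modern cuBLAS v2 interface, which postdates the paper, and assumes D (n × 64) and F (m × 64) are row-major arrays of unit-length descriptors already resident on the device. Because cuBLAS works in column-major order, the call is arranged so that C is produced as an n × m column-major matrix whose columns (one per query) are contiguous for the argmax kernel.

```cuda
#include <cublas_v2.h>

// One thread per query: find the database row with the largest dot product.
__global__ void bestMatchPerQuery(const float* C, int n, int m, int* bestIndex)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= m) return;
    const float* col = C + (size_t)q * n;      // column q of C is contiguous
    int best = 0;
    float bestVal = col[0];
    for (int i = 1; i < n; ++i)
        if (col[i] > bestVal) { bestVal = col[i]; best = i; }
    bestIndex[q] = best;
}

// Sketch: C (n x m, column-major) = D (n x 64, row-major) * F^T, then argmax.
void matchDescriptors(cublasHandle_t handle, const float* dD, const float* dF,
                      float* dC, int* dBest, int n, int m)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, m, 64, &alpha, dD, 64, dF, 64, &beta, dC, n);

    int threads = 256;
    int blocks = (m + threads - 1) / threads;
    bestMatchPerQuery<<<blocks, threads>>>(dC, n, m, dBest);
}
```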

6. Results

The proposed method has been tested using the datasets and the evaluation tool provided by Mikolajczyk [9]. The matching results for two image sequences along with the repeatability scores for the four most challenging image sequences are shown in Figure 6. As expected, the repeatability scores of the GPU implementation and the original SURF implementation are very similar. Timing results of the proposed method and the ones developed by Sinha et al. [11] and Heymann et al. [3] are listed in Table 1. The tests were performed on images with a resolution of 640 × 480. For the matching test, a feature database of 3000 features is matched against another database of 3000 features, similar to [3]. More detailed timing results for the proposed algorithm are shown in Figure 5. For an image of size 640 × 480, the algorithm is able to extract 805 feature locations and descriptors at 103 frames per second on a desktop PC and 35 frames per second on a mid-end workstation laptop.

As previously developed GPU implementations already achieved near real-time framerates, one might fail to see the practical value of an even faster implementation. Not only does the method proposed in this paper allow for additional feature matching required by real-time applications such as object recognition, it can also be used to speed up offline applications that need to process vast amounts of data. For example, given an image of an object, one might want to retrieve alternate images containing this object by matching against a collection of images selected from large image databases [7, 6] based on image tag information. In both cases, the proposed methods can be used for fast feature extraction and matching against large feature databases. Additionally, in order to achieve higher recognition rates, the increase in processing speed can be used to extend the feature descriptors by using more and/or larger subregions.

Figure 5. Detailed timing results for a desktop PC (GF8800GTX) and a mid-end workstation laptop (QFX1600M).

                      Sinha      Heymann    This paper
    GPU model         7900GTX    QFX3400    8800GTX
    Octaves           4          4          5
    Scales            3          3          4
    Extraction time   100 ms     58 ms      9.7 ms
    Matching time     N/A        500 ms     32 ms

Table 1. Performance comparison.

7. Conclusion

In this paper, we have shown how the SURF algorithm can be accelerated significantly using programmable graphics hardware, even when compared to existing GPU implementations. By making extensive use of texture mipmapping and the CUDA programming language, feature extraction and descriptor generation can be performed at very high rates. As a result, the SURF algorithm can be applied to image sequences with 640 × 480 pixels at framerates exceeding 100 frames per second. Furthermore, utilizing the CUBLAS library also resulted in fast feature matching against large databases.


References

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. European Conference on Computer Vision, pages 404–417, 2006.
[2] M. Brown and D. Lowe. Invariant features from interest point groups. British Machine Vision Conference, Cardiff, Wales, pages 656–665, 2002.
[3] S. Heymann, K. Muller, A. Smolic, B. Frohlich, and T. Wiegand. SIFT Implementation and Optimization for General-Purpose GPU, 2007.
[4] http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf
[5] http://developer.nvidia.com/object/cuda.html
[6] http://images.google.com/
[7] http://www.flickr.com
[8] http://www.gpgpu.org
[9] http://www.robots.ox.ac.uk/~vgg/research/affine/
[10] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[11] S. Sinha, J. Frahm, and M. Pollefeys. GPU-based Video Feature Tracking and Matching. EDGE 2006, Workshop on Edge Computing Using New Commodity Architectures, 2006.
[12] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proc. CVPR, 1:511–518, 2001.

Figure 6. Results indicating invariance to rotation and scaling (Bark, Boat) and affine transformations (Graffiti, Wall). Top: example of matching results. Middle: number of extracted features for each data set. Bottom: repeatability scores.