
Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram

Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, John Hsu
Willow Garage

68 Willow Rd., Menlo Park, CA 94025, USA
{rusu, bradski, thibaux, hsu}@willowgarage.com

Abstract— We present the Viewpoint Feature Histogram (VFH), a descriptor for 3D point cloud data that encodes geometry and viewpoint. We demonstrate experimentally on a set of 60 objects captured with stereo cameras that VFH can be used as a distinctive signature, allowing simultaneous recognition of the object and its pose. The pose is accurate enough for robot manipulation, and the computational cost is low enough for real time operation. VFH was designed to be robust to large surface noise and missing depth information in order to work reliably on stereo data.

I. INTRODUCTION

As part of a long term goal to develop reliable capabilities in the area of perception for mobile manipulation, we address a table top manipulation task involving objects that can be manipulated by one robot hand. Our robot is shown in Fig. 1. In order to manipulate an object, the robot must reliably identify it, as well as its 6 degree-of-freedom (6DOF) pose. This paper proposes a method to identify both at the same time, reliably and at high speed.

We make the following assumptions.

• Objects are rigid and relatively Lambertian. They can be shiny, but not reflective or transparent.

• Objects are in light clutter. They can be easily segmented in 3D and can be grabbed by the robot hand without obstruction.

• The item of interest can be grabbed directly, so it is not occluded.

• Items can be grasped even given an approximate pose. The gripper on our robot can open to 9cm and each grip is 2.5cm wide, which allows an 8.5cm wide object to be grasped when the pose is off by +/- 10 degrees.

Despite these assumptions our problem has several properties that make the task difficult.

• The objects need not contain texture.

• Our dataset includes objects of very similar shapes, for example many slight variations of typical wine glasses.

• To be usable, the recognition accuracy must be very high, typically much higher than, say, for image retrieval tasks, since false positives have very high costs and so must be kept extremely rare.

• To interact usefully with humans, recognition cannot take more than a fraction of a second. This puts constraints on computation, but more importantly this precludes the use of accurate but slow 3D acquisition using lasers. Instead we rely on stereo data, which suffers from higher noise and missing data.

Fig. 1. A PR2 robot from Willow Garage, showing its grippers and stereo cameras.

Our focus is perception for mobile manipulation. Working on a mobile versus a stationary robot means that we can't depend on instrumenting the external world with active vision systems or special lighting, but we can put such devices on the robot. In our case, we use projected texture 1 to yield dense stereo depth maps at 30Hz. We also cannot ensure environmental conditions. We may move from a sunlit room to a dim hallway into a room with no light at all. The projected texture gives us a fair amount of resilience to local lighting conditions as well.

1 Not structured light; this is random texture.


Although this paper focuses on 3D depth features, 2D imagery is clearly important, for example for shiny and transparent objects, or to distinguish items based on texture such as telling apart a Coke can from a Diet Coke can. In our case, the textured light alternates with no light to allow for 2D imagery aligned with the texture based dense depth; however, adding 2D visual features will be studied in future work. Here, we look for an effective purely 3D feature.

Our philosophy is that one should use or design a recognition algorithm that fits one's engineering needs such as scalability, training speed, incremental training needs, and so on, and then find features that make the recognition performance of that architecture meet one's specifications. For reasons of online training, and because of large memory availability, we choose fast approximate K-Nearest Neighbors (K-NN) implemented in the FLANN library [1] as our recognition architecture. The key contribution of this paper is then the design of a new, computationally efficient 3D feature that yields object recognition and 6DOF pose.

The structure of this paper is as follows: Related work is described in Section II. Next, we give a brief description of our system architecture in Section III. We discuss our surface normal and segmentation algorithm in Section IV, followed by a discussion of the Viewpoint Feature Histogram in Section V. Experimental setup and resulting computational and recognition performance are described in Section VI. Conclusions and future work are discussed in Section VII.

II. RELATED WORK

The problem that we are trying to solve requires global (3D object level) classification based on estimated features. This has been under investigation for a long time in various research fields, such as computer graphics, robotics, and pattern matching; see [2]–[4] for comprehensive reviews. We address the most relevant work below.

Some of the widely used 3D point feature extraction approaches include: spherical harmonic invariants [5], spin images [6], curvature maps [7], or more recently, Point Feature Histograms (PFH) [8], and conformal factors [9]. Spherical harmonic invariants and spin images have been successfully used for the problem of object recognition for densely sampled datasets, though their performance seems to degrade for noisier and sparser datasets [4]. Our stereo data is noisier and sparser than typical line scan data, which motivated the use of our new features. Conformal factors are based on conformal geometry, which is invariant to isometric transformations, and thus obtain good results on databases of watertight models. Their main drawback is that they can only be applied to manifold meshes, which can be problematic in stereo. Curvature maps and PFH descriptors have been studied in the context of local shape comparisons for data registration. A side study [10] applied the PFH descriptors to the problem of surface classification into 3D geometric primitives, although only for data acquired using precise laser sensors. A different point fingerprint representation using the projections of geodesic circles onto the tangent plane at a point pi was proposed in [11] for the problem of surface registration. As the authors note, geodesic distances are more sensitive to surface sampling noise, and thus are unsuitable for real sensed data without a priori smoothing and reconstruction. A decomposition of objects into parts learned using spin images is presented in [12] for the problem of vehicle identification.

Methods relying on global features include descriptors such as Extended Gaussian Images (EGI) [13], eigenshapes [14], or shape distributions [15]. The latter samples statistics of the entire object and represents them as distributions of shape properties; however, they do not take into account how the features are distributed over the surface of the object. Eigenshapes show promising results, but they have limits on their discrimination ability since important higher order variances are discarded. EGIs describe objects based on the unit normal sphere, but have problems handling arbitrarily curved objects.

The work in [16] makes use of spin-image signatures and normal-based signatures to achieve classification rates over 90% with synthetic and CAD model datasets. The datasets used, however, are very different from the ones acquired using noisy 640 × 480 stereo cameras such as the ones used in our work. In addition, the authors do not provide timing information on the estimation and matching parts, which is critical for applications such as ours. A system for fully automatic 3D model-based object recognition and segmentation is presented in [17], with good recognition rates of over 95% for a database of 55 objects. Unfortunately, the computational performance of the proposed method is not suitable for real-time use, as the authors report the segmentation of an object model in a cluttered scene to take around 2 minutes. Moreover, the objects in the database are scanned using a high resolution Minolta scanner and their geometric shapes are very different. As shown in Section VI, the objects used in our experiments are much more similar in terms of geometry, so such a registration-based method would fail. In [18], the authors propose a system for recognizing 3D objects in photographs. The techniques presented can only be applied in the presence of texture information, and require a cumbersome generation of models in an offline step, which makes them unsuitable for our work.

As previously presented, our requirements are real-time object recognition and pose identification from noisy real-world datasets acquired using projective texture stereo cameras. Our 3D object classification is based on an extension of the recently proposed Fast Point Feature Histogram (FPFH) descriptors [8], which record the relative angular directions of surface normals with respect to one another. The FPFH performs well in classification applications and is robust to noise, but it is invariant to viewpoint.

This paper proposes a novel descriptor that encodes the viewpoint information and has two parts: (1) an extended FPFH descriptor that reduces the computational complexity from the O(k·n) of FPFH to O(n), where n is the number of points in the point cloud and k is the number of points used in each local neighborhood; (2) a new signature that encodes important statistics between the viewpoint and the surface normals on the object.


We call this new feature the Viewpoint Feature Histogram (VFH), as detailed below.

III. ARCHITECTURE

Our system architecture employs the following processing steps (a minimal sketch of the resulting loop follows the list):

• Synchronized, calibrated and epipolar aligned left and right images of the scene are acquired.

• A dense depth map is computed from the stereo pair.

• Surface normals in the scene are calculated.

• Planes are identified and segmented out, and the remaining point clouds from non-planar objects are clustered in Euclidean space.

• The Viewpoint Feature Histogram (VFH) is calculated over large enough objects (here, objects having at least 100 points).

  – If there are multiple objects in a scene, they are processed front to back relative to the camera.

  – Occluded point clouds with less than 75% of the number of points of the frontal objects are noted but not identified.

• Fast approximate K-NN is used to classify the object and its view.
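For concreteness, below is a minimal Python sketch of this processing loop. The stage functions compute_vfh and classify are injected placeholders rather than our implementation; only the 100-point minimum and the 75% occlusion threshold come from the steps above.

from dataclasses import dataclass
import numpy as np

MIN_OBJECT_POINTS = 100   # VFH is computed only for clusters this large
OCCLUSION_RATIO = 0.75    # smaller clusters are noted but not identified

@dataclass
class Cluster:
    points: np.ndarray    # (N, 3) array of one segmented object's points
    normals: np.ndarray   # (N, 3) array of unit surface normals

def recognize_clusters(clusters, viewpoint, compute_vfh, classify):
    # Process clusters front to back relative to the camera.
    clusters = sorted(
        clusters,
        key=lambda c: np.linalg.norm(c.points.mean(axis=0) - viewpoint))
    results, frontal = [], None
    for c in clusters:
        n = len(c.points)
        if n < MIN_OBJECT_POINTS:             # too small for a stable VFH
            continue
        if frontal is None:
            frontal = n                       # frontal object's point count
        elif n < OCCLUSION_RATIO * frontal:   # likely occluded: note only
            results.append((c, None))
            continue
        results.append((c, classify(compute_vfh(c.points, c.normals, viewpoint))))
    return results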

Some steps from the early processing pipeline are shown in Figure 2. Shown left to right, top to bottom in that figure are: a moderately complex scene with many different vertical and horizontal surfaces, the resulting depth map, the estimated surface normals, and the objects segmented from the planar surfaces in the scene.

Fig. 2. Early processing steps row wise, top to bottom: a scene, its depth map, surface normals and segmentation into planes and outlier objects.

For computing 3D depth maps, we use 640x480 stereo with textured light. The texture flashes on only very briefly as the cameras take a picture, resulting in lights that look dim to the human eye but bright to the camera. Texture flashes only every other frame so that raw imagery without texture can be gathered alternating with densely textured scenes. The stereo has a 38 degree field of view and is designed for close in manipulation tasks, thus the objects that we deal with are from 0.5 to 1.5 meters away. The stereo algorithm that we use was developed in [19] and uses the implementation in the OpenCV library [20] as described in detail in [21], running at 30Hz.
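As a point of reference, dense block-matching disparity of the kind implemented in OpenCV [20] can be obtained as in the sketch below; the disparity range and block size are illustrative values, not the parameters of our system.

import cv2

# Rectified left/right images of the textured scene (file names are examples).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
# compute() returns fixed-point disparities with 4 fractional bits.
disparity = stereo.compute(left, right).astype("float32") / 16.0

# Depth then follows from z = f * B / d, with focal length f and
# stereo baseline B taken from the calibration.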

IV. SURFACE NORMALS AND 3D SEGMENTATION

We employ segmentation prior to the actual feature estimation because in robotic manipulation scenarios we are only interested in certain precise parts of the environment, and thus computational resources can be saved by tackling only those parts. Here, we are looking to manipulate reachable objects that lie on horizontal surfaces. Therefore, our segmentation scheme proceeds by extracting these horizontal surfaces first.

Fig. 3. From left to right: raw point cloud dataset, planar and cluster segmentation, more complex segmentation.

Compared to our previous work [22], we have improved the planar segmentation algorithms by incorporating surface normals into the sample selection and model estimation steps. We also took care to carefully build SSE aligned data structures in memory for any computationally expensive operation. By rejecting candidates which do not support our constraints, our system can segment data at about 7Hz, including normal estimation, on a regular Core2Duo laptop using a single core. To get frame rate performance (realtime), we use a voxelized data structure over the input point cloud and downsample with a leaf size of 0.5cm. The surface normals are therefore estimated only for the downsampled result, but using the information in the original point cloud. The planar components are extracted using a RMSAC (Randomized MSAC) method that takes into account weighted averages of distances to the model together with the angle of the surface normals. We then select candidate table planes using a heuristic combining the number of inliers which support the planar model as well as their proximity to the camera viewpoint. This approach emphasizes the part of the space where the robot manipulators can reach and grasp the objects.
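A minimal numpy sketch of the downsampling and normal estimation steps is given below; it mirrors the 0.5cm leaf size mentioned above, while the neighborhood handling is simplified relative to our SSE-optimized implementation.

import numpy as np

def voxel_downsample(points, leaf=0.005):
    # Replace all points falling into one 0.5cm voxel by their centroid.
    keys = np.floor(points / leaf).astype(np.int64)
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

def estimate_normal(neighbors, viewpoint):
    # PCA normal of a local neighborhood taken from the original cloud,
    # flipped to point toward the camera viewpoint.
    center = neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(neighbors - center, full_matrices=False)
    normal = vt[-1]                  # direction of least variance
    if np.dot(normal, viewpoint - center) < 0:
        normal = -normal
    return normal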

The segmentation of object candidates supported by the table surface is performed by looking at points whose projection falls inside the bounding 2D polygon for the table, and applying single-link clustering. The result of these processing steps is a set of Euclidean point clusters. This works to reliably segment objects that are separated by about half their minimum radius from each other. An example can be seen in Figure 3.
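The clustering step itself reduces to a flood fill over a fixed-radius neighborhood graph, as in this sketch (the 2cm tolerance is an illustrative value, not a tuned parameter):

import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points, tol=0.02):
    # Single-link clustering: points closer than tol join the same cluster.
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    cluster = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        frontier = [seed]
        labels[seed] = cluster
        while frontier:
            idx = frontier.pop()
            for j in tree.query_ball_point(points[idx], tol):
                if labels[j] == -1:
                    labels[j] = cluster
                    frontier.append(j)
        cluster += 1
    return labels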

To resolve further ambiguities with respect to the chosen candidate clusters, such as objects stacked on other planar objects (such as books), we treat each additional horizontal planar structure on top of the table candidates as a table itself and repeat the segmentation step (see results in Figure 3).

We emphasize that this segmentation step is of extreme importance for our application, because it allows our methods to achieve favorable computational performance by extracting only the regions of interest in a scene (i.e., objects that are to be manipulated, located on horizontal surfaces). In cases where our "light clutter" assumption does not hold and the geometric Euclidean clustering is prone to failure, a more sophisticated segmentation scheme based on texture properties could be implemented.

V. VIEWPOINT FEATURE HISTOGRAM

In order to accurately and robustly classify points with respect to their underlying surface, we borrow ideas from the recently proposed Point Feature Histogram (PFH) [10]. The PFH is a histogram that collects the pairwise pan, tilt and yaw angles between every pair of normals on a surface patch (see Figure 4). In detail, for a pair of 3D points 〈pi, pj〉 and their estimated surface normals 〈ni, nj〉, the set of normal angular deviations can be estimated as:

α = v · nj
φ = u · (pj − pi) / d          (1)
θ = arctan(w · nj , u · nj)

where d = ‖pj − pi‖ is the Euclidean distance between the two points,

and u, v, w represent a Darboux frame coordinate system chosen at pi. Then, the Point Feature Histogram at a patch of points P = {pi}, i = 1···n, captures all the sets of 〈α, φ, θ〉 between all pairs 〈pi, pj〉 from P, and bins the results in a histogram. The bottom left part of Figure 4 presents the selection of the Darboux frame and a graphical representation of the three angular features.
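The computation of one angle triplet is small enough to state directly; the sketch below follows Eq. (1) and the frame convention drawn in Figure 4 (the cross product order in the Darboux frame varies across papers, so treat it as one consistent choice):

import numpy as np

def pair_angles(p_i, n_i, p_j, n_j):
    # Darboux frame at p_i, following Figure 4: u = n_i, v = (p_j - p_i) x u.
    d = p_j - p_i
    dist = np.linalg.norm(d)
    u = n_i
    v = np.cross(d, u)
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    # The three angular features of Eq. (1).
    alpha = np.dot(v, n_j)
    phi = np.dot(u, d) / dist
    theta = np.arctan2(np.dot(w, n_j), np.dot(u, n_j))
    return alpha, phi, theta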

Because all possible pairs of points are considered, the computational complexity of a PFH is O(n²) in the number of surface normals n. In order to make a more efficient algorithm, the Fast Point Feature Histogram [8] was developed. The FPFH measures the same angular features as PFH, but estimates the sets of values only between every point and its k nearest neighbors, followed by a reweighting of the resultant histogram of a point with the neighboring histograms, thus reducing the computational complexity to O(k·n).
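Concretely, the reweighting step has the form given in [8]: the final histogram of a point p is its own simplified histogram (SPFH, computed only against its k neighbors) plus a distance-weighted average of its neighbors' simplified histograms,

FPFH(p) = SPFH(p) + (1/k) · Σi=1..k (1/ωi) · SPFH(pi)

where ωi is the distance between p and its neighbor pi.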

Our past work [22] has shown that a global descriptor (GFPFH) can be constructed from the classification results of many local FPFH features, and used on a wide range of confusable objects (20 different types of glasses, bowls, mugs) in 500 scenes, achieving 96.69% on object class recognition. However, the categorized objects were only split into 4 distinct classes, which leaves the scaling problem open. Moreover, the GFPFH is susceptible to the errors of the local classification results, and is more cumbersome to estimate.

In any case, for manipulation, we require that the robot not only identify objects, but also recognize their 6DOF poses for grasping. FPFH is invariant both to object scale (distance) and object pose, and so cannot achieve the latter task.

In this work, we decided to leverage the strong recognition results of FPFH, but to add in viewpoint variance while retaining invariance to scale, since the dense stereo depth map gives us scale/distance directly. Our contribution to the problem of object recognition and pose identification is to extend the FPFH to be estimated for the entire object cluster (as seen in Figure 4), and to compute additional statistics between the viewpoint direction and the normals estimated at each point. To do this, we used the key idea of mixing the viewpoint direction directly into the relative normal angle calculation in the FPFH. Figure 6 presents this idea with the new feature consisting of two parts: (1) a viewpoint direction component (see Figure 5) and (2) a surface shape component comprised of an extended FPFH (see Figure 4).

The viewpoint component is computed by collecting a histogram of the angles that the viewpoint direction makes with each normal. Note, we do not mean the view angle to each normal, as this would not be scale invariant, but instead we mean the angle between the central viewpoint direction translated to each normal. The second component measures the relative pan, tilt and yaw angles as described in [8], [10], but now measured between the viewpoint direction at the central point and each of the normals on the surface. We call the new assembled feature the Viewpoint Feature Histogram (VFH). Figure 6 presents the resultant assembled VFH for a random object.

Fig. 5. The Viewpoint Feature Histogram is created from the extended Fast Point Feature Histogram as seen in Figure 4, together with the statistics of the relative angles between each surface normal and the central viewpoint direction.

The computational complexity of VFH is O(n). In our experiments, we divided the viewpoint angles into 128 bins and the α, φ and θ angles into 45 bins each, for a total of 263 dimensions. The estimation of a VFH takes about 0.3ms on average on a 2.23GHz single core of a Core2Duo machine using optimized SSE instructions.
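A complete, if simplified, numpy sketch of the descriptor follows. The bin counts match the ones above (128 viewpoint bins plus 45 bins each for α, φ and θ, i.e., 263 dimensions); details such as using the mean normal as the central normal, the binning ranges, and the final normalization are our own assumptions for illustration.

import numpy as np

VP_BINS, ANGLE_BINS = 128, 45        # 128 + 3 * 45 = 263 dimensions

def compute_vfh(points, normals, viewpoint):
    c = points.mean(axis=0)                        # object centroid
    n_c = normals.mean(axis=0)                     # central normal (assumption)
    n_c /= np.linalg.norm(n_c)
    view_dir = viewpoint - c
    view_dir /= np.linalg.norm(view_dir)

    # Viewpoint component: angle between the central viewpoint direction,
    # translated to each point, and that point's normal (scale invariant).
    cos_vp = normals @ view_dir
    vp_hist, _ = np.histogram(cos_vp, bins=VP_BINS, range=(-1.0, 1.0))

    # Extended FPFH component: one (alpha, phi, theta) triplet per point,
    # measured against the centroid frame of Figure 4, in a single O(n) pass.
    d = points - c
    dist = np.maximum(np.linalg.norm(d, axis=1), 1e-9)
    d_unit = d / dist[:, None]
    u = n_c
    v = np.cross(d_unit, u)
    v /= np.maximum(np.linalg.norm(v, axis=1), 1e-9)[:, None]
    w = np.cross(u, v)
    alpha = np.einsum('ij,ij->i', v, normals)
    phi = d_unit @ u
    theta = np.arctan2(np.einsum('ij,ij->i', w, normals), normals @ u)

    hists = [np.histogram(x, bins=ANGLE_BINS, range=r)[0]
             for x, r in ((alpha, (-1.0, 1.0)), (phi, (-1.0, 1.0)),
                          (theta, (-np.pi, np.pi)))]
    vfh = np.concatenate([vp_hist] + hists).astype(float)
    return vfh / max(vfh.sum(), 1.0)               # normalized histogram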


Fig. 4. The extended Fast Point Feature Histogram collects the statistics of the relative angles between the surface normals at each point and the surface normal at the centroid of the object. The bottom left part of the figure shows the Darboux frame at the centroid (u = nc, v = (p5 − c) × u, w = u × v) and the three angular features for an example pair of points.

Fig. 6. An example of the resultant Viewpoint Feature Histogram for one of the objects used. Note the two concatenated components: the viewpoint component followed by the extended FPFH component.

VI. VALIDATION AND EXPERIMENTAL RESULTS

To evaluate our proposed descriptor and system architecture, we collected a large dataset consisting of over 60 IKEA kitchenware objects, as shown in Figure 8. These objects consisted of many kinds each of: wine glasses, tumblers, drinking glasses, mugs, bowls, and a couple of boxes. In each of these categories, many of the objects were distinguished only by subtle variations in shape, as can be seen for example in the confusions in Figure 10. We captured over 54000 scenes of these objects by spinning them on a turn table through 180° 2 at each of 2 offsets, on a platform tilted at 0, 8, 16, 22 and 30 degrees. Each 180° rotation was captured with about 90 images (roughly 90 views × 2 offsets × 5 tilt angles for each of the 60 objects, which accounts for the 54000 scenes). The turn table is shown in Fig. 7. We additionally worked with a subset of 20 objects in 500 lightly cluttered scenes with varying arrangements of horizontal and vertical surfaces, using the same data set provided in [22].

2We didn’t go 360 degrees so that we could keep the calibration box inview

Fig. 7. The turn table used to collect views of objects with known orientation.

No pose information was available for this second dataset, so we ran experiments on it for object recognition only.

The complete source code used to generate our experimental results, together with both object databases, is available under a BSD open source license in our ROS repository at Willow Garage 3. We are currently taking steps towards creating a web page with complete tutorials on how to fully replicate the experiments presented herein.

Both the objects in the [22] dataset as well as the ones we acquired constitute valid examples of objects of daily use that our robot needs to be able to reliably identify and manipulate. While 60 objects is far from the number of objects the robot eventually needs to be able to recognize, it may be enough if we assume that the robot knows what context (kitchen table, workbench, coffee table) it is in, so that it needs only discriminate among a small context dependent set of objects.

3 http://ros.org


Fig. 8. The complete set of IKEA objects used for the purpose of our experiments. All transparent glasses have been painted white to obtain 3D information during the acquisition process.

TABLE I
RESULTS FOR OBJECT RECOGNITION AND POSE DETECTION OVER 54000 SCENES PLUS 500 LIGHTLY CLUTTERED SCENES.

Method   Object Recognition   Pose Estimation
VFH      98.52%               98.52%
Spin     75.3%                61.2%


The geometric variations between objects are subtle, and the data acquired is noisy due to the stereo sensor characteristics, yet the perception system has to work well enough to differentiate between, say, glasses that look similar but serve different purposes (e.g., a wine glass versus a brandy glass).

As presented in Section II, the performance of the 3D descriptors proposed in the literature degrades on noisier datasets. One of the most popular 3D descriptors to date used on datasets acquired using sensing devices similar to ours (e.g., with similar noise characteristics) is the spin image [6]. To validate the VFH feature we thus compare it to the spin image, by running the same experiments multiple times.

For the reasons given in Section I, we base our recognition architecture on fast approximate K-Nearest Neighbors (KNN) searches using kd-trees [1]. The construction of the tree and the search of the nearest neighbors places an equal weight on each histogram bin in the VFH and spin image features.
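With the descriptors computed, the matching stage is a plain nearest-neighbor query; the sketch below uses scipy's kd-tree as a stand-in for the approximate FLANN index [1], with equal weight on every bin as noted above.

import numpy as np
from scipy.spatial import cKDTree

class VFHClassifier:
    def __init__(self, train_vfh, train_meta):
        # train_vfh: (M, 263) training descriptors; train_meta: a list of
        # (object_id, view_id) pairs aligned with the rows of train_vfh.
        self.tree = cKDTree(train_vfh)
        self.meta = train_meta

    def classify(self, vfh, k=1):
        # Return the (object, view) labels of the k nearest training views;
        # the matched view id directly implies the object's pose.
        _, idx = self.tree.query(vfh, k=k)
        return [self.meta[i] for i in np.atleast_1d(idx)]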

Figure 11 shows time-stop, sequentially aggregated examples of the training set. Figure 12 shows example recognition results for VFH. And finally, Figure 10 gives some idea of the performance differences between VFH and spin images. The object recognition rates over the lightly cluttered dataset were 98.1% for VFH and 73.2% for spin images. The overall recognition rates for VFH and spin images are shown in Table I, where VFH handily outperforms spin images for both object recognition and pose.

Fig. 9. Data training performed in simulation. The figure presents a snapshot of the simulation with a water bottle from the object model database and the corresponding stereo point cloud output.

VII. CONCLUSIONS AND FUTURE WORK

In this paper we presented a novel 3D feature descriptor, the Viewpoint Feature Histogram (VFH), useful for object recognition and 6DOF pose identification for applications where a priori segmentation is possible. The high recognition performance and fast computational properties demonstrated the superiority of VFH over spin images on a large scale dataset consisting of over 54000 scenes with over 60 objects. Compared to other similar initiatives, our architecture works well with noisy data acquired using standard stereo cameras in real-time, and can detect subtle variations in the geometry of objects. Moreover, we presented an integrated approach for both recognition and 6DOF pose identification for untextured objects, the latter being of extreme importance for mobile manipulation and grasping applications.


Fig. 10. VFH consistently outperforms spin images for both recognition and for pose. The bottom of the figure presents an example result of VFH run on a mug. The bottom left corner is the learned model, and the matches go from best to worst from left to right across the bottom, followed by left to right across the top. The top part of the figure presents the results obtained using a spin image. For VFH, 3 of 5 object recognition and 3 of 5 pose results are correct. For spin images, 2 of 5 object recognition results are correct and 0 of 5 pose results are correct.

Fig. 11. Sequence examples of object training with the calibration box on the outside.

An automatic training pipeline can be integrated with our 3D simulator based on Gazebo [23], as depicted in Figure 9, where the stereo point cloud is generated from perfectly rectified camera images.

We are currently working on making both the fully annotated database of objects together with the source code of VFH available to the research community as open source. The preliminary results of our efforts can already be checked from the trunk of our Willow Garage ROS repository, but we are taking steps towards generating a set of tutorials on how to replicate and extend the experiments presented in this paper.

REFERENCES

[1] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," VISAPP, 2009.

[2] J. W. Tangelder and R. C. Veltkamp, "A Survey of Content Based 3D Shape Retrieval Methods," in SMI '04: Proceedings of the Shape Modeling International, 2004, pp. 145–156.

[3] A. K. Jain and C. Dorai, "3D object recognition: Representation and matching," Statistics and Computing, vol. 10, no. 2, pp. 167–182, 2000.

[4] A. D. Bimbo and P. Pala, "Content-based retrieval of 3D models," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 20–43, 2006.

[5] G. Burel and H. Henocq, "Three-dimensional invariants and their application to object recognition," Signal Process., vol. 45, no. 1, pp. 1–22, 1995.

[6] A. Johnson and M. Hebert, "Using spin images for efficient object recognition in cluttered 3D scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, May 1999.

[7] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature Maps for Local Shape Comparison," in SMI '05: Proceedings of the International Conference on Shape Modeling and Applications, 2005, pp. 246–255.

[8] R. B. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D Registration," in ICRA, 2009.

[9] M. Ben-Chen and C. Gotsman, "Characterizing shape using conformal factors," in Eurographics Workshop on 3D Object Retrieval, 2008.

[10] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz, "Learning Informative Point Classes for the Acquisition of Object Model Maps," in Proceedings of the 10th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2008.

[11] Y. Sun and M. A. Abidi, "Surface matching by 3D point's fingerprint," in Proc. IEEE Int'l Conf. on Computer Vision, vol. II, 2001, pp. 263–269.

[12] D. Huber, A. Kapuria, R. R. Donamukkala, and M. Hebert, "Parts-based 3D object classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 04), June 2004.

[13] B. K. P. Horn, "Extended Gaussian Images," Proceedings of the IEEE, vol. 72, pp. 1671–1686, 1984.

[14] R. J. Campbell and P. J. Flynn, "Eigenshapes for 3D object recognition in range data," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999, pp. 505–510.

[15] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Transactions on Graphics, vol. 21, pp. 807–832, 2002.

[16] X. Li and I. Guskov, "3D object recognition from range images using pyramid matching," in ICCV, 2007, pp. 1–6.

[17] A. S. Mian, M. Bennamoun, and R. Owens, "Three-dimensional model-based object recognition and segmentation in cluttered scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 1584–1601, 2006.

[18] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, "3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints," International Journal of Computer Vision, vol. 66, 2006.

[19] K. Konolige, "Small vision systems: hardware and implementation," in Eighth International Symposium on Robotics Research, 1997, pp. 111–116.

[20] "OpenCV, Open source Computer Vision library," http://opencv.willowgarage.com/wiki/, 2009.

[21] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., 2008, pp. 415–453.

[22] R. B. Rusu, A. Holzbach, M. Beetz, and G. Bradski, "Detecting and segmenting objects for mobile manipulation," in ICCV S3DV workshop, 2009.

[23] N. Koenig and A. Howard, "Design and use paradigms for Gazebo, an open-source multi-robot simulator," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, Sep 2004, pp. 2149–2154.


Fig. 12. VFH retrieval results: the model pose and object is at lower left in each frame. The matches, in order of best to worst, go from left to right starting at bottom left, followed by top right in each frame. We see that the box and the mug (top and bottom) match perfectly, while a glass (middle) has 3 correct matches, followed by the 4th match having the wrong view and the 5th match selecting the wrong object.