
3D Object Recognition via Multi-View Inspection in Unknown Environments

Jamie Westell and Parvaneh Saeedi
School of Engineering Science
Simon Fraser University
Burnaby, Canada

Abstract—This paper presents a system for object recognition and localization within unknown indoor environments. The system includes a GUI design through which the user may describe an object of interest by means of color, size, and shape. A novel coarse-to-fine identification mechanism that incorporates multiple views of an object is then used to locate the described object within an unknown environment. The system includes a training stage in which representative information is extracted from database images. A stereo vision system, mounted on an indoor robot platform (Fig. 1), is used to retrieve the 3D location of potential match candidates in the scene and to inspect possible matches from three distinct viewpoints. Experimental evaluation is performed for indoor environments and promising results are shown for the application of this system.

I. INTRODUCTION

Employing mobile robots to locate and retrieve specific objects in unknown environments is a challenging task within the fields of robotics and computer vision. The applications of such systems are vast, ranging from assistive robots that help the elderly or disabled to remotely controlled robots working in inaccessible or hazardous environments. Robotic 3D object recognition requires several key technologies, including robotic path planning and collision avoidance as well as visual object recognition, to be integrated into one system. A complete system is capable of receiving a command from a user and proceeding to identify and locate specific objects in the unknown environment.

This paper presents a novel system that identifies and locates instances of objects in an unknown environment. In this system the user can describe the target object by means of shape, size, or color descriptions. Upon entering an environment and capturing several scene images from different viewpoints, a novel object recognition algorithm is used to locate possible matches. A stereo imaging device is used to resolve the 3D coordinates of these matches so that the robot can subsequently inspect each region from three distinct viewpoints to identify positive matches within the environment.

There are several benefits to using a mobile robot to identify and locate objects within an environment. Partially occluded objects may be successfully matched by moving the robot to different viewpoints, thus removing the initial occlusion. Perhaps the most significant benefit is the ability to move around and inspect from several viewpoints. Generating matches from 360° of viewpoints increases the confidence of object recognition and localization results. This is a very attractive property because most 3D objects exhibit different characteristics when viewed from different viewpoints. Another interesting advantage is the estimation of the size, location, and shape of the object for other potential system components, such as a grasper to be used for object retrieval.

Fig. 1. PeopleBot with mounted BumbleBeeXB3 Imaging Device.

The two main hardware components of this system are the mobile robot and the stereo imaging system. The mobile robot is the PeopleBot model from MobileRobots Inc. The height of the PeopleBot (112 cm) provides the ability to look above most tables and counter tops. It has two motorized wheels with one caster, as well as SONAR and infrared sensors for use in obstacle avoidance tasks. The robot control commands are communicated from a base station via a wireless LAN connection.

The stereo imaging system used is the BumbleBeeXB3 stereo vision system from Point Grey Research Inc. Although the imaging system is trinocular, only two cameras are used at a time. Operation is interchangeable between a 12 cm baseline and a 24 cm baseline. The focal length of each camera is 6 mm, with a 50° field of view. The camera is capable of capturing color images at 1280 × 960 resolution.

The object database includes 114 objects. For each object, 36 images are taken around the object with an angular separation of 10°. These images are used within a training stage in which information is extracted from the images to be used in the object recognition algorithm.

II. PREVIOUS WORK

In previous years, still-image object recognition has been very popular in the field of computer vision and has enjoyed much success ([1], [2], [3]). The SIFT-based technique uses local invariant keypoints found in a model image and searches for matches to these keypoints in a scene image [2]. This method has been shown to be invariant to scale as well as to some affine transformations; however, plain objects with few characteristic features become more difficult to recognize.

The use of color in object recognition has also been seen in previous research ([4], [5]). Color offers distinctive cues for object recognition; however, a major struggle in using color to recognize objects is the problem of color constancy. Although the human eye can perceive a single color under several illuminations as the same color, computer vision requires separate algorithms to achieve such constancy. Several techniques may be used to overcome some color constancy issues ([6], [7]).

Recently, there has been a move toward the use of mobile robots to navigate environments and capture images at several locations in order to recognize objects within scenes more accurately. Helping to spur the development of the field are two robot competitions which require the use of object recognition algorithms: the Semantic Robot Vision Challenge (SRVC) [8] and RoboCup@Home [9]. These competitions require participants to design a platform capable of navigating an unknown environment and searching for specific objects within that environment.

In the system proposed by Ekvall et al. [10], a SLAM algorithm is integrated with an object recognition scheme for a mobile robot platform. The robot first plans a path using a map previously constructed by a SLAM technique and moves to a location within the map where the target object is known to be. It then searches for the target object based on visual cues. Similarly, Forssen et al. [11] developed a system in which a robot navigates through a simple environment and takes pictures of different objects within the room, attempting to recognize each one. The latter system does not rely on any previous knowledge of the environment; however, it must search until a certain viewpoint of a given object is captured and successfully matched to the database image.

III. TARGET OBJECT SELECTION

To locate target objects within an unknown environment, information must be known about the target object. It is this information which is used to search the environment and to locate possible regions of interest.

The first objective of the graphical user interface (GUI) in this work is to provide an interface where the user can enter the description of the desired object. Here the user may describe the target object by size, color(s), and shape. The second objective of the GUI is to use such descriptions to automatically select the target object from the object database. Once the target object has been selected, the subsequent multi-view inspection may be carried out in order to identify and locate any instances of the object within the environment.

Several filters are combined to automatically select the target object from the object database.

Fig. 2. Graphical User Interface. The user may enter a description of an object and the object of interest is returned. Once confirmed, the command is sent to the robot to locate and identify the object in the environment.

1. Size: The size filter is incorporated based on the size of the largest object in the database. Five different classes were defined, including very small, small, medium, large, and very large. The lower and upper boundaries were found experimentally using all objects in the database.

2. Color: The color filter is implemented in the CIE L*a*b* color space. In this filter a number of base colors are defined through a manual process. For each object a compressed color histogram is generated, which is later pruned to remove bins with small populations (relative to the object's size).

3. Shape: Two groups of shape filters, generic (1) and geometric (2), are utilized in this work as follows:

1-a) Histogram of Oriented Gradients (HOG): HOG utilizes the distribution of edge directions to describe the shape [12]. In the training stage the HOGs of all database images are computed and stored. In the testing stage, the most relevant HOG is loaded and the distances between it and the database HOGs are computed.

1-b) Chain Code Histogram (CCH): CCH [13] estimates the outer contour of the object using Moore-Neighbor boundary tracing and approximates it by a polygonal representation using the sequence of steps in 8 directions.

1-c) Hu Invariant Moments: Hu invariant moments are calculated by taking 4 linear combinations of central moments [14]. For each object, Hu moments are computed and saved during the training phase. During testing, the most relevant vector is used to detect the best match.

2-a) Circularity is determined by the compactness ratio (CR = P²/4πA) of the object contour at each viewing angle. This measure returns the number of angles or viewpoints at which the CR of the object is within a small range around 1.

2-b) Rectangularity is measured by inspecting the object from 3 different viewpoints: 0°, 50°, and 90°. A 4-sided polygon is fitted to the object contours at the 0° and 90° viewpoints, while a six-sided polygon is fitted to the object at 50°. These polygons are used to classify the object as rectangular in shape or not.

2-c) The cylindrical measure is assessed by the area variability. For objects with cylindrical profiles this value is constant through various viewing angles.

For each class, the mean values/vectors of the class are selected as the representative. The χ² [15] and Euclidean distances are used for measuring the similarity between the class representative and the target object candidates.
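
As a small illustration of the geometric measure and the distance metrics above (hypothetical helper functions, not the authors' code), the compactness ratio CR = P²/4πA can be evaluated per viewing angle, and a candidate descriptor can be compared with a class representative using the χ² or Euclidean distance.

```python
import numpy as np

def compactness_ratio(perimeter, area):
    """CR = P^2 / (4*pi*A); values close to 1 indicate a near-circular contour."""
    return np.asarray(perimeter, dtype=float) ** 2 / (4.0 * np.pi * np.asarray(area, dtype=float))

def chi_square_distance(v1: np.ndarray, v2: np.ndarray, eps: float = 1e-10) -> float:
    """Chi-square distance between two descriptor vectors or histograms."""
    return float(0.5 * np.sum((v1 - v2) ** 2 / (v1 + v2 + eps)))

def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.linalg.norm(v1 - v2))

# Circularity: count the viewing angles at which CR stays within a small range around 1.
perimeters = np.array([31.4, 32.0, 40.5])   # hypothetical contour perimeters, one per view
areas = np.array([78.5, 80.1, 95.0])        # hypothetical contour areas, one per view
circular_views = int(np.sum(np.abs(compactness_ratio(perimeters, areas) - 1.0) < 0.1))
```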

Once the object is identified using the above measures and filters, the database images of the object are extracted. In the case where more than one object is found, the user is presented with the matching results to select the desired object. A screenshot of the graphical user interface is shown in Fig. 2.

IV. OBJECT RECOGNITION

Once the target object has been described by the user and the correct object from the object database has been chosen, an object recognition algorithm is required to locate the target object in a scene image. The following sections outline the training data extraction and object identification stages of the object recognition algorithm.

A. Training Data Extraction

To locate a target object within an unknown environment, information must be known about the target object. It is this information which is used to search the environment and to locate possible regions of interest. The information extracted from a database object must be representative of the object and also unique to the specific object. The information must also be robust to a variety of viewing conditions such as lighting and viewpoint. In this system, the color composition of the database images is used to build a unique representation of each object. As color constancy is a concern in the use of color for object recognition, the effect of lighting conditions is limited by using only the hue and saturation coordinates in the HSV color space. As well, the white-patch retinex algorithm is applied, as in [6], to further reduce the effect of illumination changes.
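
A minimal sketch of this preprocessing, assuming OpenCV-style BGR input (the function name and scaling choices are illustrative assumptions, not the authors' code); the white-patch retinex step is approximated by scaling each channel by its maximum, and only the H and S channels are kept.

```python
import cv2
import numpy as np

def preprocess_to_hs(image_bgr: np.ndarray) -> np.ndarray:
    """White-patch style normalization followed by conversion to hue/saturation only."""
    img = image_bgr.astype(np.float32)
    # White-patch retinex approximation: scale each channel so its brightest value maps to 255.
    for c in range(3):
        peak = img[:, :, c].max()
        if peak > 0:
            img[:, :, c] *= 255.0 / peak
    hsv = cv2.cvtColor(img.astype(np.uint8), cv2.COLOR_BGR2HSV)
    # Keep only H and S (scaled to [0, 1]); V is discarded to limit lighting effects.
    h = hsv[:, :, 0].astype(np.float32) / 179.0   # OpenCV stores hue in [0, 179] for uint8 images
    s = hsv[:, :, 1].astype(np.float32) / 255.0
    return np.dstack([h, s])
```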

In each of the 36 database images of an object, square image patches containing only the object and no background are extracted. The image patches are extracted at a range of locations and scales within the object boundaries. For each of the extracted image patches, a normalized 2D hue vs. saturation (H-S) histogram is constructed. The bin dimensions have been set to 0.1 × 0.1, resulting in a 100-bin histogram. A similar structure was used by Perez et al. [16], where the H-S histogram was used to track located objects across frames in a video sequence.
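
A minimal sketch of the patch histogram construction (a hypothetical helper, not the authors' code): with H and S scaled to [0, 1] and a bin width of 0.1, each patch yields a normalized 10 × 10 (100-bin) histogram.

```python
import numpy as np

def hs_histogram(patch_hs: np.ndarray, bins: int = 10) -> np.ndarray:
    """Normalized 2D hue-saturation histogram of an image patch.

    patch_hs: (height, width, 2) array with H and S already scaled to [0, 1]
    (e.g. the output of preprocess_to_hs above). Returns a flattened
    100-element vector built from 10 x 10 bins of width 0.1.
    """
    h = patch_hs[:, :, 0].ravel()
    s = patch_hs[:, :, 1].ravel()
    hist, _, _ = np.histogram2d(h, s, bins=bins, range=[[0.0, 1.0], [0.0, 1.0]])
    hist = hist.ravel()
    total = hist.sum()
    return hist / total if total > 0 else hist
```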

Performing the extraction process over all 36 images of a given object within the database yields thousands of 100-bin histogram vectors. To obtain a compact representation of these extracted image patches, the vectors are clustered into 100 representative histograms using k-means clustering. The resulting 100 × 100 matrix provides a unique representation of each object and can be used to locate instances of the database objects within a scene image. Fig. 3 illustrates this process of data extraction.

Fig. 3. Training Data Extraction. Top: example image patch extracted from a single database image. Bottom left: image patch. Bottom right: normalized 2D H-S histogram with 100 bins.
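
The clustering step could look like the sketch below (an illustration assuming scikit-learn is available; not the authors' implementation): the thousands of patch histograms collected over the 36 views are reduced to 100 representative histograms, giving the 100 × 100 object model.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_object_model(patch_histograms: np.ndarray, n_clusters: int = 100) -> np.ndarray:
    """Cluster patch histograms (N x 100) into 100 representative histograms (100 x 100)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(patch_histograms)
    centers = km.cluster_centers_
    # Renormalize each cluster center so it remains a valid (unit-sum) histogram.
    return centers / centers.sum(axis=1, keepdims=True)
```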

The training data extraction process is completed in a preprocessing stage for each object in the database. The total time to extract data from all 114 objects was approximately 2 hours 15 minutes. The resulting 100 × 100 H-S histogram representation is stored in an XML file, which is easily retrieved for the object recognition process in real time.

B. Object Identification

To identify an object within a scene image, the object identification stage produces a confidence map based on matching information from the database to the scene. As in the offline training data extraction stage, image patches are extracted from a range of scales and positions within the image. For each image patch extracted, the 2D H-S histogram is generated and compared with each of the 100 H-S histograms representing the target object.

The H-S histograms encapsulate the distinct color composition of an object. Not only do they represent the colors in the object, but they also represent the color transitions. For example, the H-S histogram of the image patch extracted in Fig. 3 represents the adjacency of two prominent colors within the object. Matching the histograms extracted from the database images to those extracted from scene images provides a likelihood that an object with identical color composition and color adjacency is present in the image. Clearly, objects which have very similar color composition and color adjacency may not be distinguishable with this H-S structure. This is a very challenging problem which few object recognition algorithms are able to solve.

As this process of generating histograms for image patches at a range of locations and scales can be very time consuming, a speed-up technique was used to decrease the processing time for each image capture. Just as integral images were used in [1] to quickly find the sum of pixel values in a given image patch, integral histograms are used here to quickly find the histogram of a given image patch.


Fig. 4. Object Recognition. Left: scene image containing the target object. Right: confidence map generated from the object recognition process.

In an integral histogram, each pixel location is represented by a 2D H-S histogram of the portion of the image which is above and to the left of the pixel location. The generation of this integral histogram is performed only once, and the histogram of any image patch within the image can then be obtained by a series of three matrix additions and subtractions. In particular, with the coordinates of the four corners of the image patch given, the histogram of that patch can be evaluated by (1), where $H_{x,y}$ is the integral histogram matrix at location $(x, y)$.

$H_{x_2,y_2} - H_{x_2,y_1} - H_{x_1,y_2} + H_{x_1,y_1}$    (1)

The resulting histogram is then normalized, and the correct H-S histogram of the given image patch is obtained.
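
The integral-histogram lookup can be sketched as follows (indexing conventions and the bin-index input are assumptions made for illustration): each location stores the cumulative H-S histogram of everything above and to its left, so the histogram of any patch follows from Eq. (1) with three additions/subtractions.

```python
import numpy as np

def integral_histogram(bin_indices: np.ndarray, n_bins: int = 100) -> np.ndarray:
    """bin_indices: (H, W) array giving each pixel's H-S bin index in [0, n_bins).

    Returns an (H+1, W+1, n_bins) cumulative volume; entry [y, x] is the
    histogram of all pixels above and to the left of (x, y).
    """
    height, width = bin_indices.shape
    per_pixel = np.zeros((height, width, n_bins), dtype=np.float32)
    rows, cols = np.indices(bin_indices.shape)
    per_pixel[rows, cols, bin_indices] = 1.0
    integral = np.zeros((height + 1, width + 1, n_bins), dtype=np.float32)
    integral[1:, 1:, :] = per_pixel.cumsum(axis=0).cumsum(axis=1)
    return integral

def patch_histogram(integral: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> np.ndarray:
    """Eq. (1): H(x2,y2) - H(x2,y1) - H(x1,y2) + H(x1,y1), followed by normalization."""
    hist = integral[y2, x2] - integral[y1, x2] - integral[y2, x1] + integral[y1, x1]
    total = hist.sum()
    return hist / total if total > 0 else hist
```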

After extracting the H-S histogram of an image patch from the scene image, it is compared with the H-S histograms from the target object. A simple histogram intersection (HI) of the 100-bin histograms is calculated, and if the HI score is greater than 75%, the corresponding image patch location in a separate confidence map is incremented. The threshold value of 75% was found from empirical testing to provide accurate results under some variance in lighting conditions.

Clearly, if the target object is located in the scene image, the region containing the target object will contain many image patches with HI scores greater than 75%. This leads to multiple increments within the corresponding region of the confidence map. Fig. 4 shows a simple scene image containing the object from Fig. 3, as well as the corresponding confidence map generated from this object identification process.

After this initial analysis, regions are identified in the confidence map where possible matches may exist. These regions are then analyzed to obtain two metrics: average confidence (AC) and matching ratio (MR). The AC is calculated as the average pixel value in the confidence region. This reflects how many image patches were found as matches within this region. The MR is calculated as the ratio of histograms which were matched out of the 100 histograms representing the target object. A low MR score reflects few image patches from the database images matching the scene image. The analysis of these two scores provides a meaningful likelihood of whether the target object is present within a given region.
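
Illustratively, the histogram intersection test and the two region scores could be computed as below (variable names and the region representation are assumptions): an intersection above 0.75 increments the confidence map, AC is the mean confidence inside a candidate region, and MR is the percentage of the 100 model histograms that matched at least one scene patch.

```python
import numpy as np

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Intersection of two normalized histograms; 1.0 means identical."""
    return float(np.minimum(h1, h2).sum())

def score_region(confidence_map: np.ndarray, region_mask: np.ndarray,
                 matched_model_ids: set, n_model_hists: int = 100):
    """Average confidence (AC) over the region and matching ratio (MR) in percent."""
    ac = float(confidence_map[region_mask].mean())
    mr = 100.0 * len(matched_model_ids) / n_model_hists
    return ac, mr
```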

V. MULTI-VIEW INSPECTION

To identify and localize the target object within an unknown environment, the object recognition algorithm from the previous section is combined with the functionality of the robot platform. In an initial scan of the environment, regions of interest are found and the relative (x, y) coordinates of the regions are obtained through the use of a stereo vision system. Each region is then approached and analyzed from three different viewpoints. If each of the viewpoints shows a confident match to the target object, a positive match is concluded. The environment scan and object inspection steps are presented in more detail in the following sections.

Fig. 5. Room Scan Procedure. Five images (at −90°, −45°, 0°, +45°, and +90° relative to the initial heading) are captured, spanning a 230° field of view in front of the robot.

A. Environment Scan

Upon entering a room, the robot captures images spanning a 230° field of view. The robot first turns 90° counter-clockwise and captures an image. The robot then iteratively rotates 45° clockwise and captures images until it is facing +90°. This yields five images captured for analysis, as seen in Fig. 5. With the field of view of the camera being 50°, these five images cover a total span of 230°.

The analysis of each captured image generates a confidence map for that viewpoint. From the confidence maps, a list of all regions of interest is sorted by the AC value. Associated attributes of each region, such as 3D location and region size, are also stored. Regions with AC or MR scores below a given threshold are ignored.

To proceed with object inspection for each of these regions, the 3D coordinates of each region must be resolved. Upon starting up the robot, a coordinate system is created with the robot located at (0, 0) with a heading of 0°. To calculate the distance to a region of interest from the origin, the image coordinates of the region within each confidence map are provided to the stereo imaging system. The average distance to all points within the region is then returned and associated with the given region. To calculate the angular offset from the 0° heading to each region of interest, the robot heading at each image capture is combined with the image coordinates of the region. Based on these measurements and the 6 mm focal length of the camera, the angular heading toward each region of interest is obtained.

Also, combining the robot heading for each image captured with the pixel location of each region in the confidence maps yields an angular offset from the origin for each region. With these distance and heading offsets, the (x, y) coordinates may be resolved using the polar-to-Cartesian transformation

$(x, y) = (d \sin(\theta),\ d \cos(\theta)),$    (2)


Fig. 6. Experimental Multi-View Inspection. The Purex liquid detergent container being inspected from three viewpoints separated by 60°. Left: −60° viewpoint. Middle: 0° viewpoint. Right: +60° viewpoint.

where $d$ is the distance to the region and $\theta$ is the angular heading to the region.
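
A sketch of the localization step (parameter names are assumptions, and the pixel-to-angle mapping is simplified to a linear approximation of the 50° field of view rather than the exact focal-length model): the region's bearing combines the robot heading at capture time with its horizontal offset in the image, and Eq. (2) converts the stereo distance and bearing to (x, y).

```python
import math

def region_xy(distance_m: float, robot_heading_deg: float,
              pixel_x: float, image_width: int = 1280, fov_deg: float = 50.0):
    """Resolve the (x, y) coordinates of a region of interest in the robot's start frame."""
    # Horizontal angular offset of the region within the image (linear approximation).
    offset_deg = (pixel_x - image_width / 2.0) / image_width * fov_deg
    theta = math.radians(robot_heading_deg + offset_deg)
    # Eq. (2): polar-to-Cartesian, with theta measured from the 0-degree start heading.
    return distance_m * math.sin(theta), distance_m * math.cos(theta)
```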

B. Object Inspection

During the object inspection stage, each of the regions in the list of regions of interest generated by the room scan is inspected. Based on the (x, y) coordinates of the region, three locations are calculated which are near the region of interest and provide three angular viewpoints of the object separated by 60°. During an object inspection stage, it is unlikely that viewpoints spanning 360° around an object can be obtained if it is placed on a table or counter; therefore, the angular separation of 60° was found to be large enough to obtain distinct viewpoints while remaining physically achievable in a typical room environment. The mobile robot then moves to each of these locations, turns to face the region of interest, and captures an image. The object recognition process is then carried out again, producing AC and MR scores for each viewpoint. If, for each viewpoint, the AC and MR are above their thresholds, the inspected region is considered to be a positive match. These thresholds have been determined by cross-verification, as described in Sec. VI. Fig. 6 shows experimental results for the object inspection stage.
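
One plausible way to lay out the three inspection poses is sketched below (not the authors' exact geometry; the standoff distance is an assumed parameter): the robot stands at a fixed distance from the region and views it from bearings separated by 60°.

```python
import math

def inspection_poses(region_x: float, region_y: float, standoff_m: float = 1.0):
    """Three (x, y, heading_deg) poses viewing the region from -60, 0 and +60 degrees."""
    bearing = math.atan2(region_x, region_y)  # bearing from the start pose toward the region
    poses = []
    for delta_deg in (-60.0, 0.0, 60.0):
        away = bearing + math.pi + math.radians(delta_deg)  # direction from region out to viewpoint
        px = region_x + standoff_m * math.sin(away)
        py = region_y + standoff_m * math.cos(away)
        heading_deg = math.degrees(away + math.pi)          # turn back to face the region
        poses.append((px, py, heading_deg))
    return poses
```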

Sensor data from the SONAR and infrared sensors equipped onboard the mobile robot are used to ensure that the robot does not collide with any obstacles. If an object is impeding the path of the robot, the robot attempts to avoid the object by reversing a small distance, turning a small amount, and continuing to move toward the goal position. If the calculated viewpoint destination is unreachable after four successive obstacle avoidance attempts, the robot moves to the reachable location nearest to the destination point and turns to face the region of interest. The object recognition stage is then carried out from this location.

VI. EXPERIMENTS

To test the accuracy of the object recognition algorithm outlined in Section IV, cross-verification within the database images was performed. To test results in real-world environments, several experiments were carried out in which the entire system was examined. Simple scenes containing only one of the objects from the object database were tested, as well as more challenging scenes containing multiple objects.

A. Database Cross-Verification

Database cross-verification is used to ensure that the H-S histogram representation generated in the data extraction stage yields a unique and effective representation for each object. Two tests were carried out. The first test searches only the 0° viewpoint images within the database for each object independently. As there are 114 objects in the database, 114² results were generated. Of these, 114 of the scenarios should yield positive matches (scenarios in which an object is being searched for in its own database image), while the rest should yield negative matches (e.g., a basketball being searched for in an image of a detergent container).

Fig. 7. ROC curves for database cross-verification. Single-view and multi-view methods are compared.

The second experiment included the +60° and −60° viewpoint images from the database in the analysis. As in the object inspection stage of this system, a match was considered positive only if all three viewpoints showed positive results.

To illustrate the results of these experiments, an ROC curve was generated plotting the false positive rate vs. the true positive rate. The thresholds for the AC and MR values were iterated from 0 to 100 independently. This yielded 100² data points for each experiment. Ideally, a data point is achieved in the upper left corner (0% false positives and 100% true positives). The resulting ROC curves for both experiments are seen in Fig. 7.

A common measure of the accuracy of an ROC curve is the total area under the curve. Ideally, the area under an ROC curve is 1, while any area less than or equal to 0.5 represents results which are no better than random. Fig. 7 shows the curves generated from the two experiments. The areas under the curves for the single-view and multi-view cases are 0.963 and 0.979, respectively. From these values it can be concluded that the H-S histogram representation provides a unique representation for almost all objects in the database. Also, the use of multi-view inspection clearly increases the accuracy of the algorithm, as seen by the increase in ROC area.
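
The threshold sweep behind Fig. 7 can be reproduced in outline as follows (a sketch; the trial data layout is an assumption): every (AC, MR) threshold pair from 0 to 100 yields one false-positive-rate/true-positive-rate point, and the area under the resulting curve is estimated with the trapezoid rule.

```python
import numpy as np

def roc_points(scores, labels):
    """scores: array-like of (AC, MR) per trial; labels: True where a match should be found."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    fpr, tpr = [], []
    for t_ac in range(0, 101):
        for t_mr in range(0, 101):
            predicted = (scores[:, 0] >= t_ac) & (scores[:, 1] >= t_mr)
            tpr.append(np.sum(predicted & labels) / max(labels.sum(), 1))
            fpr.append(np.sum(predicted & ~labels) / max((~labels).sum(), 1))
    return np.array(fpr), np.array(tpr)

def roc_area(fpr, tpr):
    """Approximate area under the ROC: sort points by FPR and apply the trapezoid rule."""
    order = np.argsort(fpr)
    f, t = fpr[order], tpr[order]
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
```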

B. Single Object Scene Experiments

To further evaluate the visual search and object inspection algorithms, a test scenario was constructed in a laboratory environment. An object from the database was placed within the 180° span in front of the robot. The task of the robot was then to identify and localize the target object in the scene. This test scenario was carried out for seven target objects.

Each target object was searched for in seven different scenes, each containing one of the target objects. Similar to the database cross-verification test, 49 experiments were carried out. Ideal results are composed of seven true positives and 42 true negatives. For each of the experiments there is a target object and a scene object. If the target object is the object in the scene, a positive match should be generated. The 0° viewpoint database images for the seven objects used in these experiments are shown in Fig. 8.

For each test scenario, AC and MR values are generated. If no object was found in the initial room scan, these values were both set to zero. In some cases in which the target object was not present in the scene, a region of interest was still detected; upon inspection of that region, AC and MR values were generated. Finally, the AC and MR values are compared to thresholds in order to conclude whether a positive match is present. The AC and MR values for these experiments are presented in Table I. The top row of this table shows the target objects and the first column shows the scene in which they are being searched. Fig. 9 displays a single-object scene test case in which the basketball is correctly located while a match for the Tide bottle is not found.

The AC and MR thresholds of 50 and 36, respectively, provide the best results. With these values, 7/7 true positives were detected with only 2/42 false positives. The two cases in which an object was identified incorrectly were when the box of crackers was the target object in the basketball scene and when the Tide bottle was the target object in the baking soda scene. In these two situations, the scene object had a color composition similar to the target object.

C. Multi-Object Scene Experiments

Further testing was performed on scenes containing several objects. For these experiments, six objects were placed in a laboratory setting. Two of these objects were identical, to test the capability of the system to detect multiple objects. Fig. 10 shows an image of the multi-object scene.

Fig. 10. Multi-Object Test Scenario. Six objects are placed in a laboratory setting, including two identical objects. From left to right: Skittles box, Purex bottle, Tide bottle, Purex bottle, basketball, Lysol container.

In this multi-object scene, each of the five distinct objects was searched for. The multi-view inspection algorithm presented in Sec. V was carried out in each case. The average AC and MR scores over the three inspection viewpoints captured during the inspection stages for each experiment are presented in Table II.

TABLE II
MULTI-OBJECT SCENE EXPERIMENTAL RESULTS (AC/MR; columns: target object, rows: object found in the scene)

Found     | Ball       Tide       Purex      Skittles   Lysol
Ball      | 76.0/58.5  36.0/10.0  0/0        39.3/14.1  0/0
Tide      | 46.2/44.4  54.2/92.7  0/0        0/0        0/0
Purex 1   | 0/0        0/0        60.8/36.3  0/0        0/0
Purex 2   | 0/0        0/0        72.8/52.7  0/0        0/0
Skittles  | 0/0        0/0        0/0        52.4/45    0/0
Lysol     | 0/0        0/0        0/0        0/0        0/0

Based on the AC/MR thresholds of 50/36 found from the single-object experiments, the positive matches in Table II are shown in bold. Here, 5/6 true positives are found with zero false positives. In the case of the Lysol container, a combination of lighting conditions and low saturation values within the object created a challenging scenario for the H-S histogram matching technique. Fig. 11 shows four examples of the multi-object experiments that were carried out.

VII. PERFORMANCE ISSUES

One key performance issue is the time taken to locate an object. The environment scan stage, in which the 230° span in front of the robot is searched, took an average of 37 seconds. Subsequent inspections, in which the robot navigated to three locations and analyzed a single region of interest, took an average of 1 minute 9 seconds to complete. These times may be reduced by further optimization or by using a reduced image resolution.

VIII. CONCLUDING REMARKS

In this paper, a system for object recognition and localization within an unknown environment is presented. The system begins with a semantic description of an object provided by the user and ends with a robotic inspection of matching objects within a room scene. A stereo imaging system, combined with the functionality of a mobile robot, is utilized to navigate the unknown environment while an object recognition algorithm identifies potential matches to the target object. Images of potential matches are analyzed from three distinct viewpoints to further test for a match. The results show that this system may be used to successfully identify and locate objects within an unknown environment.

ACKNOWLEDGMENTS

The authors would like to acknowledge with gratitude NSERC Canada for support through the NSERC Discovery Grant program. The authors would also like to thank Zachary Blair for his work on the robot's path planning, and Hadi Hadizadeh and Kyron Winkelmeyer for their contributions to the design and implementation of the GUI.


Fig. 8. Test Scenario Objects. 0° viewpoint database images for each of the seven test objects. Left to right: basketball, Twizzlers box, Skittles box, Tide detergent, Purex detergent, baking soda, and cracker box.

Fig. 9. Single Object Inspection. Top row: scene images. Second row: the confidence map when searching for the ball (AC = 58.69, MR = 71.16; verdict: match). Third row: the match results for close-up inspection of the match candidate. Last row: the confidence map when searching for Tide in the same scene (AC = 34.3, MR = 9; verdict: no match); the low AC and MR values rule out the possibility of a true match.

TABLE I
SINGLE-OBJECT SCENE EXPERIMENTAL RESULTS (AC/MR; columns: target object searched for, rows: object present in the scene)

Scene Object   | Basketball  Tide       Purex    Skittles   Twizzlers  Crackers   Baking Soda
Basketball     | 58.7/71.2   34.3/9     0/0      0/0        82.9/5     53.2/54.0  74.2/17.3
Tide           | 42.0/16     54.2/89.7  0/0      0/0        50.3/12    32.9/23.4  74.4/22.7
Purex          | 15.4/12     12.7/4     63.9/50  0/0        0/0        0/0        0/0
Skittles       | 0/0         17.6/3     0/0      52.1/43    0/0        0/0        0/0
Twizzlers      | 49.7/47     45.3/65    0/0      16.1/2     80.4/83    44.7/74.8  42.3/21.6
Crackers       | 46.4/37.7   37.9/6.3   0/0      69.9/18.8  0/0        70.9/82.3  64.0/7.0
Baking Soda    | 36.3/32.3   63.2/52.3  0/0      0/0        45.2/53.5  32.9/12.1  78.3/45.0

REFERENCES

[1] P. Viola and M. Jones, "Robust real-time object detection," in International Journal of Computer Vision, 2001.

[2] D. Lowe, "Object recognition from local scale-invariant features," in International Conference on Computer Vision, 1999.

[3] K. Mikolajczyk and C. Schmid, "An affine invariant interest point detector," in Proc. European Conf. Computer Vision, 2002.

[4] M. Swain and D. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.

[5] A. Diplaros, T. Gevers, and I. Patras, "Color-shape context for object recognition," in IEEE Workshop on Color and Photometric Methods in Computer Vision, 2004.

[6] B. V. Funt, K. Barnard, and L. Martin, "Is machine colour constancy good enough?" in Proceedings of the 5th European Conference on Computer Vision, 1998.

[7] B. Funt and G. Finlayson, "Color constant color indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.

[8] Semantic Robot Vision Challenge. [Online]. Available: http://www.semantic-robot-vision-challenge.org/

[9] RoboCup@Home. [Online]. Available: http://www.robocupathome.org/

[10] S. Ekvall, P. Jensfelt, and D. Kragic, "Integrating active mobile robot object recognition and SLAM in natural environments," in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, 2006.

[11] P.-E. Forssen, D. Meger, K. Lai, S. Helmer, J. Little, and D. Lowe, "Informed visual search: Combining attention and object recognition," in IEEE International Conference on Robotics and Automation, 2008.

[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR'05, 2005.

[13] F. Mokhtarian and A. Mackworth, "Scale-based description and recognition of planar curves and two-dimensional objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.

[14] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, 1962.

[15] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[16] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Proc. ECCV, 2002.


Fig. 11. Test Scenarios: Four test results are displayed, one per quadrant. In each quadrant the top image displays the query object. The second row depicts 3 of the five scene scans. The third row highlights the confidence map; the brighter areas are the locations of potential matches. The fourth and fifth rows highlight the detection results for potential match candidates, annotated with their AC/MR scores and verdicts (TP/TN); in one case the second inspection is not performed due to the small MR.
