
Pattern Recognition Letters 33 (2012) 199–217

Contents lists available at SciVerse ScienceDirect

Pattern Recognition Letters

journal homepage: www.elsevier.com/locate/patrec

2D shape representation and similarity measurement for 3D recognition problems:An experimental analysis

Elizabeth González a,*, Antonio Adán b, Vicente Feliú a

a E.T.S. Ingenieros Industriales, University of Castilla La Mancha, Avda. Camilo José Cela s/n, Ciudad Real 13071, Spain
b E. Superior de Informática, University of Castilla La Mancha, Paseo de la Universidad 4, Ciudad Real 13071, Spain


Article history: Received 19 July 2010. Available online 8 October 2011. Communicated by G. Borgefors.

Keywords: 3D object recognition; Shape representation; Similarity measures; Shape recognition

0167-8655/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2011.09.033

* Corresponding author. E-mail address: [email protected] (E. González).

One of the most usual strategies for tackling the 3D object recognition problem consists of representing the objects by their appearance. 3D recognition can therefore be converted into a 2D shape recognition matter. This paper is focused on carrying out an in-depth qualitative and quantitative analysis with regard to the performance of 2D shape recognition methods when they are used to solve 3D object recognition problems. Well-known shape descriptors (contour and region) and 2D similarity measurements (deterministic and stochastic) are thus combined to evaluate a wide range of solutions. In order to quantify the efficiency of each approach we propose three parameters: Hard Recognition Rate (Hr), Weak Recognition Rate (Wr) and Ambiguous Recognition Rate (Ar). These parameters therefore open the evaluation to active recognition methods which deal with uncertainty. Up to 42 combined methods have been tested on two different experimental platforms using public database models. A detailed report of the results and a discussion, including detailed remarks and recommendations, are presented at the end of the paper.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Three-dimensional object recognition is the process of finding an object in a scene. This task implies determining the object's identity and/or its pose (position and orientation) with regard to a particular reference frame. For instance, in object manipulation with robots, the pose of the object must be extracted through an accurate estimation of the translation and rotation parameters with regard to the robot coordinate system.

In the field of three-dimensional object recognition using a monocular sensor, two main streams appear: view-based (or appearance-based) approaches and structural (or primitive-based) approaches. Since primitive-based approaches yield a low performance when unexpected changes occur in the scene, view-based methods have become a popular representation scheme owing to their robustness to noise, photometric effects, blurred vision and changing illumination. The main advantage of this approach (view-based methods) is that the image of the query object can be directly compared with a set of stored images in a database, which is efficient and robust to variations in the scene. Indeed, the 3D problem has led to a 2D shape recognition question in which multiple views associated with the object from different points of view have to be handled. Each view in the database is thus associated with a particular viewpoint that corresponds with the current camera position (position and orientation). From here on, we shall use the term 'shape' to refer to the appearance of the object from a specific viewpoint in a 2D context, and 'object' as a general word to describe something in a 3D environment. The 3D object pose estimation will signify geometric transformations between the camera position in the scene and the viewpoint from which the object is viewed in the database, whereas shape pose estimation will concern rotation, translation and scale in a 2D context.

Meanwhile, when a single view is taken to recognize an object, the principal problem is that one 2D image frequently provides insufficient information with which to identify the object and correctly estimate its pose. Uncertainty and ambiguity problems frequently arise in such cases owing to the fact that no depth information is available. Different objects might therefore seem to be quite similar from different viewpoints, which affects the robustness of the 3D recognition system. In active recognition systems this handicap is addressed by moving the camera to different positions and processing several captures of the object until the uncertainty is resolved. Classical active recognition systems are made up of three main stages: the shape recognition algorithm, which concerns shape identification and shape pose estimation in a 2D context; the fusion stage, in which the hypotheses obtained from each sensor position are combined; and the next-best-view planning stage, in which the optimal next sensor positions are computed (González et al., 2008). The last two stages are used to improve the active recognition efficiency, thus reducing hypothesis uncertainty.



Since the view-based strategy converts 3D object recognition into a 2D shape recognition problem, an enormous number of approaches concerning how to represent 2D shapes and how to measure similarities between shapes can be found in the literature (Bustos et al., 2005). However, to the best of our knowledge, no comparative study of different 2D shape recognition algorithms adapted to view-based 3D recognition systems has yet been reported. In order to provide a solution to this issue, the goal of this paper is simply to carry out an in-depth qualitative and quantitative analysis with regard to the performance of 2D shape recognition methods when they are used to solve 3D object recognition problems.

The paper is structured as follows. In Section 2 we tackle the requirements of 2D shape representation models, and compare different representation, identification and shape pose estimation methods to be implemented in 3D applications. Several unclear questions concerning the performance of shape recognition systems in 3D recognition environments are also discussed. Section 3 presents the statement of the experimental tests developed in Sections 4 and 5. Sections 4 and 5 are focused on evaluating the recognition performance in systems using two platforms. Finally, in Section 6 we present a discussion of the experimental results along with our conclusions.

2. 2D Shape recognition for 3D environments

2.1. 2D shape representation methods

2D shape representation is carried out through the use of two principal descriptors: contour descriptors and region descriptors. Models based on contours are more popular than those based on regions. Contour-based methods necessitate the extraction of boundary information which, in some cases, might not be available. Region-based methods are more robust to noise and do not necessarily rely on shape boundary information, but do not, however, extract the features of a shape. The desirable characteristics of a shape representation model running in a 3D object recognition system are:

• Generality and sensitivity: The representation must be able to describe any shape and to preserve both global and local information.
• Efficiency: The representation should be as simple as possible and should be obtained without excessive computational cost.
• Stability: Small changes in object position should generate small representation model changes.
• Invariance: The representation must be invariant to geometric, affine or projective transformations.
• Robustness: Robust in cases of noise and occlusions in the image.

Table 1 shows an overview of various 2D shape representation methods and their characteristics when applied to 3D recognition.

No matter what application is developed, all shape representation models share a common problem: object appearance can change drastically as the viewpoint changes owing to the transformation of the perspective. Most authors calculate changes in the viewpoint's orientation through affine transformations. This may be an appropriate solution when the object is far away from the camera, since slight shape distortions arise when the camera moves. Appearance-based object representations have different paths to achieve affine invariant properties: through the use of shape-normalization procedures, by using invariant shape similarity measures or by defining invariant shape descriptors.

A shape normalization method is an elegant pre-processing technique that transforms the distorted input shape into its corresponding normalized shape so that it is invariant under translation, scaling, skew, and rotation. Apart from the affine transformation parameters, to which the normalization is invariant, no other information is discarded. In fact, the normalization process consists of establishing an affine (linear) transformation that does not alter its original shape. A generalized normalization process with which to determine invariants is provided in (Rothe et al., 1996), and image normalization is tackled in (Chen, 1993).
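The idea of moment-based shape normalization can be illustrated with a small sketch. The snippet below whitens a 2D point set using its second-order moments: after centering and covariance whitening, any prior affine distortion is removed up to a residual rotation. This is a standard textbook normalization, not the specific method of Rothe et al.; the function name and the sheared-square example are our own.

```python
import numpy as np

def affine_normalize(points):
    # Whiten a 2D point set: zero mean and identity covariance, so any
    # prior affine distortion is cancelled up to an unknown rotation.
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)          # translation invariance
    cov = centered.T @ centered / len(pts)     # second-order central moments
    # inverse square root of the covariance removes scale and skew
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return centered @ inv_sqrt.T

# a sheared unit square; after normalization its covariance is the identity
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
shear = np.array([[1.0, 0.7], [0.0, 1.0]])
norm = affine_normalize(square @ shear.T)
print(np.round(norm.T @ norm / len(norm), 6))  # ~ identity matrix
```

Because the whitened shape always has identity covariance, two affinely distorted versions of the same shape map to the same normalized point set up to rotation, which a rotation-invariant descriptor can then absorb.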

The second choice, concerning invariant similarity measurements, usually provides results of a low accuracy (Hagedoorn and Veltkamp, 1999; Srisuk et al., 2003) or high computational costs (Fu et al., 2008). An approach with a similarity metric that is invariant to rotation, translation and scaling has been proposed in (Arkin et al., 1991). This work is based on turning functions (Volotao et al., 2008) and is applicable only to polygonal shapes.

Finally, many authors use invariant shape descriptors (Flusser, 2006; Tangelder and Veltkamp, 2008). Object recognition using normalized Fourier Descriptors and neural networks has been presented in (Cohen and Wang, 1994), while genetic algorithms for affine-invariant shape recognition have been proposed in (Tsang, 2003). Techniques based on the local features of curves include: grayscale local invariants based on the automatic detection of points of interest (Schmid and Mohr, 1997), affine invariants based on convex hulls for image registration (Yang and Cohen, 1999), and local deformation invariants for contour recognition based on implicit polynomials (Rivlin and Weiss, 1997). Other approaches match two given contours by evaluating the affine parameters that maximize their similarity measure. This optimization is based on boundary moments (Huang and Cohen, 1994) or Fourier descriptors (Persoon and Fu, 1977). They do not usually yield good results since, in most cases, part of the original curve is lost.

2.2. Similarity measures

We shall refer to 'identification' as the process in which a query shape is classified in a database. This can be achieved through matching techniques or classification methods. Matching techniques have principally been developed for object recognition under several distortion conditions. Among the universe of classification techniques, we are interested in those that use similarity measures. This kind of method is used in applications such as recognition in large image databases, where an image in the dataset "looks" very similar to the query image according to a particular pre-defined criterion. This criterion could be established by means of stochastic or deterministic methods. The selection of the correct hypothesis for deterministic methods uses a distance metric (Euclidean distance, Lp norm, Hausdorff's distance, etc.), whereas the stochastic methods are based on probabilistic or statistical methods (Bayesian networks, learning methods, etc.).
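Deterministic identification of this kind reduces to a nearest-neighbour search over feature vectors under some distance. A minimal sketch of three of the deterministic measures discussed here (Euclidean, city block and cosine of the angle) follows; the function names and the toy three-dimensional descriptors are illustrative, not taken from the paper.

```python
import numpy as np

def euclidean(a, b):
    # L2 distance between two feature vectors
    return float(np.linalg.norm(a - b))

def city_block(a, b):
    # L1 (Manhattan) distance
    return float(np.abs(a - b).sum())

def cosine_similarity(a, b):
    # cosine of the angle between the vectors; 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_view(query, database):
    # deterministic identification: pick the stored descriptor closest
    # to the query descriptor under the Euclidean distance
    dists = [euclidean(query, d) for d in database]
    return int(np.argmin(dists))

q = np.array([1.0, 0.0, 2.0])
db = [np.array([0.9, 0.1, 2.1]), np.array([5.0, 5.0, 5.0])]
print(nearest_view(q, db))  # the first stored view is the closest
```

Stochastic methods replace the argmin over a fixed metric with a learned or probabilistic decision rule (e.g. an SVM or a Bayesian classifier), which is why the two families are compared separately throughout the paper.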

Most papers in the literature argue that similarity measures should have the following properties:

• Invariance: This is necessary when the shape descriptors are not invariant to transformations or to the starting point.
• Robustness and sensitivity: Poor sensitivity leads to inadequate discrimination capability. Changes in the shape owing to noise must not affect the measure value.
• Efficiency: The similarity values can be efficiently computed and compared.

Table 2 presents the most meaningful similarity measures, and shows their properties.

2.3. Shape pose estimation

Shape pose estimation determines the scale, translation and rotation parameters of the shape in a canonical reference system.


Table 2. Properties of several similarity measures.

Method | Type | Properties
Euclidean distance (Shapiro and Stockman, 2001) | Deterministic | Efficiency
Hausdorff distance (Huttenlocher et al., 1993) | Deterministic | Invariance: rotation, translation, scale
Reflection distance (Veltkamp, 2001) | Deterministic | Invariance: rotation, translation, scale, skew. Robustness
City-block distance (Veltkamp, 2001) | Deterministic | Invariance: rotation, translation, scale, skew. Efficiency
Cosine of the Angle (Garcia, 2006) | Deterministic | Efficiency
Bhattacharyya distance (Ye et al., 2003) | Stochastic | Robustness
Mahalanobis (Mahalanobis, 1936) | Stochastic | Invariance: scale
Statistical distance (Ye et al., 2003) | Stochastic | Robustness
Bayesian (Mahalanobis, 1936) | Stochastic | Invariance: scale
k-NN (Shakhnarovich et al., 2006) | Stochastic | Robustness
SVM (Burges, 1999) | Stochastic | Robustness

[Fig. 1. General scheme for a shape recognition system: a Shape Representation stage (contour or region descriptors) feeds both Shape Recognition, via Similarity Measures (deterministic, stochastic or learning based), and Shape Pose Estimation (geometric).]

Table 1. 2D shape representation methods and their principal characteristics.

Method | Type | Characteristics
Simple descriptors (area, circularity, eccentricity, ...) (Sarfraz and Ridha, 2007) | Contour | Efficiency. Invariance: rotation, translation, reflection
Shape signature (Giannarou and Stathaki, 2007) | Contour | Generality. Efficiency. Stability. Invariance: rotation, translation, scale
Boundary moments (Chen, 1993) | Contour | Generality. Efficiency. Invariance: rotation, translation, scale, reflection
Curvature scale space (Mokhtarian and Bober, 2003) | Contour | Generality. Efficiency. Robustness: noise. Stability. Invariance: rotation, translation, scale, reflection
Fourier Descriptors (Zahn and Roskies, 1972) | Contour | Generality. Efficiency. Robustness: noise. Stability. Invariance: rotation, translation, scale, reflection
Differential invariants (Cole et al., 1991) | Contour | Generality. Efficiency. Invariance: translation, scale, reflection
Integral Invariant (Manay et al., 2006) | Contour | Generality. Efficiency. Robustness: noise. Stability. Invariance: translation, scale, reflection
Shape context (Belongie et al., 2000) | Contour | Generality. Efficiency. Robustness: noise. Stability. Invariance: rotation, translation, scale, reflection
Hu moment (Hu, 1962) | Region | Generality. Robustness: noise. Invariance: rotation, translation, scale, reflection
Zernike moments (Khotanzad and Hong, 1990) | Region | Generality. Stability. Invariance: rotation, translation, scale, reflection
Generic Fourier Descriptors (Smach et al., 2008) | Region | Generality. Stability. Invariance: rotation, translation, scale, reflection
Complex moments (Flusser, 2000) | Region | Generality. Robustness: noise. Stability. Invariance: rotation, translation, scale, reflection
SIFT descriptor (Lowe, 1999) | Region | Generality. Robustness: noise. Stability. Invariance: translation, scale, reflection


Three pose methodologies can be used:

• Model based methods. The features of the representation model are used to estimate the pose. Examples of this strategy are geometric moments and Fourier Descriptors.
• Learning based methods. The system is first trained by using a large set of images of the shape in different poses. A query shape is then identified and posed with regard to the training set (see Shapiro and Stockman, 2001).
• Geometric algorithms. These are based on calculating the shape pose through geometric transformations between corresponding boundary points. One example in this field is ICP (Iterative Closest Point) based techniques.

In this report we have chosen model based methods to estimate the shape pose, since the performance of these methods tends to be high and they do not require the large training sets needed by learning methods. Models based on geometric transformations usually present low accuracy or a very high computational cost. Pose estimation based on shape representation models therefore offers the best trade-off between cost and accuracy.
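As an illustration of the model based route, the sketch below recovers a 2D pose (translation, scale, orientation) directly from geometric moments of the boundary points: the centroid gives translation, the RMS radius gives scale, and the principal axis of the second central moments gives orientation. This is a minimal generic sketch of the idea, not the exact estimator used in the paper.

```python
import numpy as np

def moment_pose(points):
    # Estimate 2D shape pose from geometric moments of the boundary points.
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)                         # translation (m10/m00, m01/m00)
    centered = pts - centroid
    scale = np.sqrt((centered ** 2).sum(axis=1).mean()) # RMS radius as scale
    # orientation of the principal axis from second central moments
    mu20 = (centered[:, 0] ** 2).mean()
    mu02 = (centered[:, 1] ** 2).mean()
    mu11 = (centered[:, 0] * centered[:, 1]).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return centroid, scale, theta

# translated square: the centroid directly reports the translation
pts = np.array([[0, 0], [4, 0], [4, 4], [0, 4]], dtype=float) + [3, 2]
c, s, th = moment_pose(pts)
print(np.round(c, 3))  # centroid near (5., 4.)
```

Note the usual caveats: the orientation from second moments is ambiguous by 180 degrees, and it is undefined for rotationally symmetric shapes such as the square above, which is exactly why descriptor-specific pose estimators (e.g. Fourier phase based) are often preferred in practice.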

2.4. Motivation: why a comparative study?

Fig. 1 shows a general chart for a shape recognition system taken from the main modules and models discussed in the previous sections. In practice, most of the aforementioned shape representation methods can be combined with the similarity measure and pose estimation methods. However, the key question is: what is the optimal combination for a 3D recognition system?

Although most of the published shape recognition approaches show a good performance, it is not possible to decide which is the most appropriate since they have been tested on different databases, and probably under different conditions. Only a few references compare various methods under the same conditions and with the same databases. Table 3 presents references in which two or more 2D recognition methods were compared by always using the same database. The columns in this table correspond with: the shape representation method, the similarity measure and the database type. Note that none of the authors has either dealt with all the options presented in Table 3 or considered the pose estimation parameter.

The principal issue here is that questions such as: what are the advantages/drawbacks of shape recognition systems using deterministic measures versus stochastic measures? or what results are expected from a 3D recognition system that uses shape descriptors based on contours or on regions? have not yet been dealt with in the literature, as is the case of the type of data (real or synthetic images) used in the recognition method and its influence on the final result. The first aim of the document presented here is therefore to answer the aforementioned questions and to help future researchers to select robust and feasible shape recognition frameworks, specifically in the 3D recognition field.

The second point to consider is how to measure the performance of a 2D recognition system when it will be used for a 3D system. Traditional precision/recall analysis does not provide sufficient information to allow decisions about future strategies in a 3D recognition system to be made. For example, an ambiguity measurement in the 2D outputs would help the researcher to select evidence methods based on ambiguity to reduce uncertainty. In this paper, we consider other evaluation methods focused on analyzing shape recognition output properties, such as the level of ambiguity among hypotheses.

3. Statement of the experimental tests

It is not easy to make an experimental comparison between different recognition methods since each one is tested under different conditions and with different databases. Moreover, the amount of detail in each technique makes it impossible to reproduce the experiments in exactly the same way.

Table 3. Papers that compare different shape recognition systems by means of an experimental analysis.

Reference | Shape representation | Similarity measures | Dataset
Moreels and Perona (2007) | Contour: shape context. Region: steerable filters, PCA-SIFT, differential invariants, SIFT | Distance ratio (deterministic) | Real
Smach et al. (2008) | Region: motion descriptor, Zernike moments | SVM (stochastic) | Real
Skerl et al. (2004) | Spine images | Energy of the histogram, mutual information, normalised mutual information, entropy, correlation coefficient, correlation ratio, point similarity measure, a modified point similarity measure, Woods criterion (stochastic) | Real
Chen et al. (2003) | Multiple view descriptor, shape 3D descriptor, 3D harmonics, lightfield descriptors | L1 distance (deterministic) | Synthetic
Yadav et al. (2007) | Contour: Fourier Descriptors, Wavelet-Fourier Descriptors. Region: Generic Fourier Descriptors | Euclidean distance (deterministic) | Real
Mattern and Denzler (2004) | Region: PCA | Mixtures of probabilistic principal component analysis, kernel principal component analysis, nearest neighbour classification (stochastic) | Real
Mikolajczyk and Schmid (2005) | Contour: shape context. Region: steerable filters, PCA-SIFT, differential invariants, spin images, SIFT, complex filters, moment invariants, cross-correlation | Euclidean distance (deterministic); Mahalanobis distance (stochastic) | Real

3.1. Recognition methods set (RMS)

We have carried out a comparative study using a set of 2D shape representation models combined with a set of different similarity measures. From here on, we shall denominate all these methods as the recognition method set (RMS). Of course, the selection of the chosen similarity measures and 2D representation models was not performed randomly. We chose only a few of the universe of methods that are available (some of which are referenced in Tables 1–3) after evaluating four aspects: (1) significance: the method had to be highly referenced by other authors in important events and journals; (2) reproducibility: it had to be possible for us to reproduce the method from the original paper; (3) performance: there had to be a report proving a good performance of the method in the shape recognition field; (4) completeness: the overall set of methods had to cover a wide spectrum of shape recognition strategies (as shown in Fig. 1). After considering the four criteria, we reached a consensus and selected a subset of representative techniques.

For all the images in a dataset, we first extracted the image shape descriptors proposed in Table 4. For each shape descriptor, we experimentally found the lowest feature vector length that provided sufficient discriminatory power for our datasets. We used the following length of feature vectors for each shape descriptor:

3.1.1. Contour-based

In order to measure the similarities between two contour-based shapes using the methods from Table 4, it is necessary for the feature vectors to have the same length. It is possible to satisfy this constraint by applying a contour regularization process. For each shape descriptor, we experimentally found the lowest number of items (length) for each feature vector that provided sufficient discriminatory power for our datasets.

Table 4. 2D shape representation and similarity measures chosen for comparison.

2D shape representation:
• Fourier Descriptors (FD)
• Boundary Moments (HM)
• Integral Invariants (II)
• Shape Context (SC)
• Zernike Moments (ZM)
• Generic Fourier Descriptors (GFD)
• Complex Moments (CM)

Similarity measures:
• Euclidean (ED) (Shapiro and Stockman, 2001)
• City Block (CB) (Veltkamp, 2001)
• Cosine of the Angle (C) (Garcia, 2006)
• Mahalanobis (MD) (Mahalanobis, 1936)
• Bhattacharyya (BD) (Ye et al., 2003)
• Support Vector Machine (SVM) (Burges, 1999)

• Fourier Descriptors (64): The contour has been regularized to 64 elements. After the regularization process, the Fourier descriptors were computed following the same process as in (Zahn and Roskies, 1972).
• Boundary Moments (7): The canonical 7 Boundary (Chen's) moments were computed as in (Chen, 1993). In this case it was not necessary to develop the contour regularization.
• Integral Invariants (64): The contour was normalized to 64 points. The integral invariant was then computed as in (Manay et al., 2006) with radius r = 0.15.
• Shape Context (7680): The contour was regularized to 128 points. The shape context descriptor was then computed as in (Belongie et al., 2000) with 12 angular bins and 5 radial bins.
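The contour regularization step mentioned above (resampling a boundary to a fixed number of points) and a magnitude-based Fourier descriptor can be sketched as follows. This is one common FD variant (drop the DC term, take magnitudes, normalize by the first harmonic), not necessarily the exact Zahn and Roskies formulation; function names are ours.

```python
import numpy as np

def resample_contour(contour, n=64):
    # Regularize a closed contour to n points equally spaced by arc length,
    # so that feature vectors of different shapes are comparable.
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])                    # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    targets = np.linspace(0.0, s[-1], n, endpoint=False)
    x = np.interp(targets, s, closed[:, 0])
    y = np.interp(targets, s, closed[:, 1])
    return np.stack([x, y], axis=1)

def fourier_descriptors(contour):
    # Magnitude Fourier descriptors of the complex boundary signal.
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0            # drop the DC term: translation invariance
    mags = np.abs(coeffs)      # drop the phase: rotation/start-point invariance
    return mags[2:] / mags[1]  # normalize by the first harmonic: scale invariance

fd1 = fourier_descriptors(resample_contour([(0, 0), (4, 0), (4, 4), (0, 4)], 64))
fd2 = fourier_descriptors(resample_contour([(10, 10), (18, 10), (18, 18), (10, 18)], 64))
print(np.allclose(fd1, fd2))  # True: same shape at a different scale and position
```

The two squares differ in position and size but produce the same descriptor, which is the invariance property the similarity measures of Table 4 rely on.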

3.1.2. Region-based

• Zernike Moments (121): We used the first 10 orders of Zernike moments, which were computed as in (Khotanzad and Hong, 1990).
• Generic Fourier Descriptors (36): For efficient shape description, only a small number of GFD features are selected for shape representation. In our implementation there are 36 GFD features, reflecting four radial frequencies and nine angular frequencies (Smach et al., 2008).
• Complex Moments (11): The first 11 Complex Moments were taken (Flusser, 2000).
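To make the "four radial by nine angular frequencies" choice concrete, here is a simplified GFD-style region feature: resample the binary region on a polar grid centred at its centroid, apply a 2D FFT and keep the low-frequency magnitudes. This is our simplified reading of the GFD idea for illustration, not a reference implementation of the cited method; grid sizes and the function name are assumptions.

```python
import numpy as np

def generic_fourier_descriptor(image, m=4, n=9):
    # GFD-style feature: polar resampling of the region + 2D FFT magnitudes
    # of the first m radial and n angular frequencies, DC-normalized.
    ys, xs = np.nonzero(image)
    cy, cx = ys.mean(), xs.mean()                       # region centroid
    r_max = np.sqrt(((ys - cy) ** 2 + (xs - cx) ** 2).max())
    radii = np.linspace(0, r_max, 32)
    angles = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    # nearest-neighbour polar resampling of the binary region
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    px = np.clip(np.round(cx + rr * np.cos(aa)).astype(int), 0, image.shape[1] - 1)
    py = np.clip(np.round(cy + rr * np.sin(aa)).astype(int), 0, image.shape[0] - 1)
    polar = image[py, px].astype(float)
    spectrum = np.abs(np.fft.fft2(polar))
    feats = spectrum[:m, :n].ravel()                    # m * n low frequencies
    return feats / (feats[0] if feats[0] != 0 else 1.0) # normalize by the DC term

# a filled disk as a toy binary region
disk = np.zeros((64, 64))
yy, xx = np.mgrid[:64, :64]
disk[(yy - 32) ** 2 + (xx - 32) ** 2 <= 15 ** 2] = 1.0
print(generic_fourier_descriptor(disk).shape)  # (36,)
```

Because an image-plane rotation becomes a circular shift along the angular axis of the polar grid, taking FFT magnitudes makes the 36-element feature approximately rotation invariant, which matches the invariance claimed for GFD in Table 1.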

From here on, the number of features defined in this section will be denoted as a canonical number.

3.2. Experimental platforms

We have implemented an object recognition system that is capable of running with two different experimental platforms. Platform 1 uses the well known Amsterdam Library of Object Images (ALOI) benchmark (Geusebroek et al., 2005) and Platform 2 uses a 3D Synthetic Library (3DSL) (Sarfraz and Ridha, 2007) database with 3D synthetic models in the dataset but with real images as sample images.

Platform 1, based on the ALOI benchmark database, is an appropriate dataset with which to test the performance of a shape recognition system. The use of a benchmark dataset to validate the performance of any shape recognition system represents an important advantage: the experimental results obtained can be validated by other researchers (using the same database). However, the ALOI dataset provides an incomplete object representation and does not include the varying conditions of a real scene. This dataset could therefore be used to test shape recognition systems in controlled environments (e.g. industrial applications).

We have, therefore, also developed another test using a 3D synthetic library (3DSL) which allows us to take any view of an object (Platform 2). In the 3DSL library, the geometrical models of the objects have been built in advance in our lab by using a VIVID 910 Minolta laser scanner sensor. Given a 3D model, we can define a set of homogeneous viewpoints over the nodes of a tessellated sphere and extract the depth image of the object from each view. We thus have a representation model which contains the complete information about the appearance of the object. This experimental platform is valid to test the performance of 3D recognition systems in non-controlled environments in which the test images are captured from a camera in a real scenario.

On both experimental platforms, the object contour is obtained after carrying out a thresholding process on the query image. In order to work under the same conditions in both databases and to achieve a better performance with the 3DSL database, the shapes have been normalized against skew deformations. In the case of contour-based approaches (i.e. Fourier Descriptors (FD), Boundary Moments (HM) and Shape Context (SC)), Avrithis et al.'s method has been implemented (Avrithis et al., 2001). For Integral Invariant (II) descriptors, shift invariance was applied in order to achieve invariance to the starting point. However, for region-based descriptors (i.e. Zernike Moments (ZM), Generic Fourier Descriptors (GFD) and Complex Moments (CM)), the skew invariance was obtained by using the Shen algorithm (Shen and Ip, 1997).
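For illustration, the starting-point (shift) invariance mentioned for the Integral Invariant descriptors can be obtained from the DFT magnitude of the contour signature, since a change of starting point is a circular shift of the sampled signature. This is a hedged sketch, not the exact implementation used in the paper:

```python
import numpy as np

def shift_invariant(signature):
    """Map a 1D contour signature to a descriptor independent of the
    starting point: the DFT magnitude is unchanged by any circular
    shift of the input sequence."""
    return np.abs(np.fft.fft(np.asarray(signature, dtype=float)))

sig = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
shifted = np.roll(sig, 2)   # same contour, different starting point
```

Both `sig` and `shifted` yield identical descriptors under this mapping.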

3.3. Evaluation parameters

In order to evaluate the recognition results, we have defined three parameters: Hard Recognition Rate (Hr), Weak Recognition Rate (Wr) and Ambiguous Recognition Rate (Ar). The general idea here is that the proposed parameters can be used to study the uncertainty problem in a shape recognition process so that we can always decide the best strategy with which to solve this uncertainty. Whereas parameter Hr simply provides a strict one-to-one recognition measure, Wr and Ar include softer alternative recognition measures. Parameters Hr, Wr and Ar are defined in the following paragraphs.

1. Hard Recognition Rate (Hr) is:

Hr = (n_e / N) × 100,    (1)

where n_e is the number of samples correctly identified and N is the total number of tests. A sample is correctly identified if the true view is selected in the database. In such a case we label the solution as a hard solution.

2. Weak Recognition Rate (Wr) is:

Wr = ((n_e + n_p) / N) × 100,    d(v_p, v_GT) < ε, ε > 0,    (2)

in which we have added n_p, the number of weak solutions. We have a weak solution if the angle between the recognized and the true views is below a certain angular value ε > 0. The Weak Recognition Rate parameter (Wr) should be used to implement strategies based on finding solutions which are close to the true solution.

3. Ambiguous Recognition Rate (Ar) is:

Ar = ((n_e + n_a) / N) × 100,    (3)

in which we consider n_a, the number of wrong solutions owing to ambiguity. The explanation of this term is as follows.

The idea is that although the shape is not specifically recognized within a view dataset, which implies a theoretical recognition failure in terms of parameter Hr, it could be correctly classified into a group of similar shapes. If we generate the clustering of the original view dataset into a set of clusters, each containing similar shapes, and take into account the fact that the query view is


204 E. González et al. / Pattern Recognition Letters 33 (2012) 199–217

classified in the correct cluster, a softer recognition measure can be evaluated. The term n_a is therefore added in Eq. (3). In summary, n_a is the number of views wrongly recognized but that have been correctly classified in the set of clusters of similar views.

The power of parameter Ar becomes apparent inside the active recognition framework, specifically when the ambiguity problem arises in the recognition process. That is, if the shape is classified in the right cluster, the failure now somehow becomes a success. These cases, which frequently appear in 2D shape recognition, are considered in the definition of parameter Ar. The clustering process was developed by using QT-clustering with a shape signature descriptor (Belongie et al., 2000) as a characteristic vector.
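The QT-clustering step can be sketched as follows. This is an illustrative quality-threshold implementation over generic feature vectors (the paper uses shape-signature descriptors as the characteristic vectors); the diameter threshold is an assumed parameter:

```python
import numpy as np

def qt_cluster(vectors, diameter):
    """Quality-Threshold clustering sketch: repeatedly extract the
    largest candidate cluster whose diameter (maximum pairwise
    distance) does not exceed `diameter`."""
    pts = {i: np.asarray(v, float) for i, v in enumerate(vectors)}
    clusters = []
    while pts:
        best = []
        for seed in pts:
            cand, diam = [seed], 0.0
            rest = set(pts) - {seed}
            while rest:
                # distance each remaining point would add to the cluster
                add = {j: max(np.linalg.norm(pts[j] - pts[k])
                              for k in cand) for j in rest}
                j = min(add, key=add.get)
                if max(diam, add[j]) > diameter:
                    break
                cand.append(j)
                diam = max(diam, add[j])
                rest.remove(j)
            if len(cand) > len(best):
                best = cand
        clusters.append(sorted(best))
        for j in best:
            del pts[j]
    return clusters
```

For example, five 1D descriptors split into two groups under a diameter of 0.5.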

Let us now provide a better motivation of the parameters presented above. The goal of the proposed parameters is to study the uncertainty problem and then decide the best strategy with which to solve the 3D recognition problem by following an active method. For example, if Hr yields low values, it is clear that other views of the object must be collected to reduce the uncertainty, but if Wr yields high scores, the use of viewpoints close to the current view could be sufficient to decrease the uncertainty. Parameter Ar scores the shape recognition capability to classify one view into a set of similar views, so that high values signify that strategies based on the use of discriminative viewpoints could be effective. Active recognition systems based on the performance of parameter Ar must concentrate the recognition effort on only a small number of hypotheses (clusters), the next best view being that which minimizes the uncertainty between all the hypotheses. Furthermore, Ar is an indicator of the number of viewpoints (sensor positions) required to reduce uncertainty. Object recognition systems with a low Ar average imply the use of more viewpoints owing to the low effectiveness of the shape recognition system.
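The three rates can be computed from per-trial outcomes. The following is a hedged sketch; the record field names (`correct`, `angle_err`, `same_cluster`) and the default threshold are illustrative choices, not the paper's code:

```python
def recognition_rates(trials, eps_deg=10.0):
    """Compute Hr, Wr and Ar (in %) from a list of per-trial records.
    Each record is a dict with:
      'correct'      : True if the exact database view was selected
      'angle_err'    : angle (deg) between recognized and true views
      'same_cluster' : True if the selected view falls in the true
                       view's cluster of similar shapes
    eps_deg plays the role of the angular threshold epsilon in Eq. (2)."""
    N = len(trials)
    ne = sum(t['correct'] for t in trials)                                # hard solutions
    np_ = sum((not t['correct']) and t['angle_err'] < eps_deg
              for t in trials)                                            # weak solutions
    na = sum((not t['correct']) and t['same_cluster'] for t in trials)    # ambiguous solutions
    Hr = 100.0 * ne / N
    Wr = 100.0 * (ne + np_) / N
    Ar = 100.0 * (ne + na) / N
    return Hr, Wr, Ar
```

For four trials with one hard hit, one near miss and one same-cluster miss, this yields Hr = 25%, Wr = 50%, Ar = 50%.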

4. Recognition using Platform 1

4.1. Platform setup

The ALOI-VIEW collection consists of 1000 objects recorded under various imaging circumstances. More specifically, the viewing angle, illumination angle, and illumination color are systematically varied for each object. In our experiment, we have used a collection of objects which have been imaged from viewing angles at 5° intervals. Fig. 2 shows an example of an object represented from 72 viewpoints. RMS has been tested on 12 objects (see Fig. 3). Note that the objects selected are very dissimilar free shapes.

Fig. 2. Object example from the ALOI-VIEW collection viewed from 72 different viewpoints. Source: Geusebroek et al., 2005.

In order to develop a learning process, we generate a set of images obtained from the prototype in the ALOI-VIEW collection for each object and viewpoint. By adding different noise levels and applying affine transformations to the original image we attempt to simulate external circumstances (illumination deficiency, lens blur, . . .) in the scene. The main objective of this step is to accomplish the learning and simulate the efficiency of RMS in real scenes. The training set for one object and view is therefore composed of 15 images as follows: the prototype image, two views close to the prototype, six images obtained after applying affine transformations to the prototype and six images with different Gaussian noise levels injected into the prototype image. Of these, six images are chosen for the training process and the others are used as test images. Fig. 4 shows the training image set, while Fig. 5 illustrates the effect that the noise produces on the contour after the image preprocessing.
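The synthetic part of this augmentation (affine variants plus Gaussian-noise variants) can be sketched with plain numpy. This is an assumption-laden illustration, not the paper's pipeline: the two close-view images are omitted (they require new captures), and the transformation ranges and σ values are invented for the example:

```python
import numpy as np

def affine_warp(img, A, t):
    """Nearest-neighbour affine warp via inverse mapping: each output
    pixel samples the input at A^{-1} (p - t)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=0).astype(float)
    src = np.linalg.inv(A) @ (pts - np.asarray(t, float)[:, None])
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out_flat = out.ravel()
    out_flat[ok] = img[sy[ok], sx[ok]]
    return out

def training_set(prototype, rng):
    """Prototype + 6 affine variants + 6 Gaussian-noise variants
    (13 images; the two close views would come from new captures)."""
    imgs = [prototype]
    for _ in range(6):                       # affine variants
        ang = rng.uniform(-0.2, 0.2)         # small rotation (rad)
        A = np.array([[np.cos(ang), -np.sin(ang)],
                      [np.sin(ang),  np.cos(ang)]]) * rng.uniform(0.9, 1.1)
        imgs.append(affine_warp(prototype, A, rng.uniform(-2, 2, 2)))
    for sigma in (0.005, 0.01, 0.02, 0.03, 0.04, 0.05):  # noise variants
        imgs.append(prototype + rng.normal(0.0, sigma, prototype.shape))
    return imgs
```

An identity warp leaves the image unchanged, which gives a quick sanity check on the inverse mapping.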

After completing the learning phase, we ran RMS over 72 views with three images each. The three images per view are generated as follows: the original image captured by the camera after applying an affine transformation, and two images with added Gaussian noise (σ1 = 0.005 and σ2 = 0.05). Fig. 6 shows several examples of the images generated. A total of 2592 (3 × 72 × 12 objects) trials were thus carried out, whose objective was to compare the different combinations of shape representation and shape identification algorithms in order to measure computational complexity and robustness in each situation.

4.2. Analysis of Hr, Wr and Ar parameters

Fig. 7 allows us to provide an overview of the results and to make a comparison of the different combinations of shape representations and similarity measures. Each column shows the recognition rates (in percentages) for each recognition parameter evaluated.

Our conclusions, which are on the one hand split into contour and region based shape representations and are on the other hand split into deterministic and stochastic similarity measures, are presented below.

4.2.1. Contour-based shape representation methods

• Fourier Descriptors: Their robustness to noise is notable. In general, they perform best with deterministic measures and the Support Vector Machine. Most identification errors arise as a result of their invariance to reflection, and the Hard Recognition Rate (Hr) and Weak Recognition Rate (Wr) parameters have similar values. They are a good choice for 3D recognition systems dealing with ambiguous hypotheses owing to the good Ambiguous Recognition Rate (Ar) parameter results.
• Boundary Moments: Although this descriptor is very popular in 2D shape representation references, its performance in 3D recognition systems is lower than expected.
• Integral Invariants: Although in previous experiments the results when using test images without rotations (even with noise) are excellent, in 3D recognition systems (in which robustness to geometrical transformations is mandatory) the results using deterministic measures yield very poor scores. Only the Wr values are acceptable with stochastic measures. These low rates occur as a result of the imprecision at the starting point. This is the method's weakest point.
• Shape Context: This is more robust than the Integral Invariants method, and the improvements that appear in the case of the Hr, Wr and Ar parameters are very similar to those of Fourier Descriptors.


Fig. 4. The training set for one object and view is: (a) prototype image and two views close to the ground truth, (b) six images obtained after applying affine transformations to the prototype and (c) six images with different Gaussian noise levels injected into the prototype image.

Fig. 5. Noise effects on the binary images after the image preprocessing.

Fig. 6. Examples of test images.

Fig. 3. Objects used in the experimental tests.



Fig. 7. Comparative analysis of combinations of different shape descriptors and various similarity measures using the ALOI dataset. (a) Hard Recognition Rate (Hr). (b) Weak Recognition Rate (Wr). (c) Ambiguous Recognition Rate (Ar). The graphic rows are related to the noise level injected into the images. Legend: FD, HM, II, SC, ZM, GFD, CM.




Fig. 8. Processing time (in seconds) needed to obtain the representation model.


Fig. 9. Processing time needed for similarity measures. Plots (a) and (b) show the time rates according to the number of descriptors in the shape representation model. (c) Time rates according to the number of views in the dataset for shape models using 128 descriptors.


4.2.2. Region-based shape representation methods

• Zernike Moments: These are very sensitive to noise. Their performance is particularly low with City Block distance.
• Generic Fourier Descriptors: Better results than Zernike Moments. They work quite well, especially when using Mahalanobis distance, but in general, their performance is not really notable.
• Complex Moments: These yield the best results even in added noise cases. Their best performance is achieved by using stochastic measures.

4.2.3. Deterministic similarity measures

• Euclidean distance: Its effectiveness depends on the descriptor's robustness to noise. The combination with Fourier Descriptors and Complex Moments provides good rates, especially for Ar values.
• City Block: This gives similar results to Euclidean distance. Nevertheless, the recognition rates fall when it is applied with region-based methods, except in the case of Complex Moments.
• Cosine of the Angle: This provides similar results to those of Euclidean distance.

4.2.4. Stochastic similarity measures

• Mahalanobis distance: Poor recognition results (Hr and Wr) when using Mahalanobis distance, although meaningful improvements appear in the case of Ar.
• Bhattacharyya: Similar results to the case of Mahalanobis distance. In general, it gives lower overall rates than the other methods.
• Support Vector Machine: This method achieves the highest rates and could be used in recognition systems that use the Wr and Ar evaluation parameters.

4.3. Computational costs

One of the most important parameters used to evaluate the quality of a 3D recognition system is that of computational cost. Figs. 8 and 9 show the execution time (in seconds) for representation models and similarity measures. The test has been run on a 1.3 GHz Pentium IV computer. Fig. 8 specifically shows the processing time needed to obtain the shape representation model, while the plots in Fig. 9(a)–(c) correspond to the processing time needed to calculate similarity measures when considering two parameters: the number of descriptors used by the shape representation model (Fig. 9(a) and (b)) and the number of views in the dataset (Fig. 9(c)). In the last case, the number of descriptors is 128.

Note that, as is clearly shown in Fig. 9(a), deterministic measures, with the exception of the Cosine of the Angle (C), show a better performance than stochastic measures. Fig. 9(b) allows us to see the exponential behavior of the Mahalanobis distance function (MD) as the number of descriptors in the shape representation model increases.

After analyzing the time rates according to the number of views in the dataset in Fig. 9(c), we can state that SVM and MD are clearly more costly than the deterministic measures, the BD measure is inadequate for large datasets, and all the deterministic measures (Euclidean distance, City Block, Cosine of the Angle) have similar computational costs.
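To make the cost gap concrete: per comparison, the Euclidean distance is O(n) in the number n of descriptors, whereas the Mahalanobis distance requires a matrix–vector product with the inverse covariance, O(n²), which matches the steep growth observed for MD. An illustrative sketch (not the paper's implementation):

```python
import numpy as np

def euclidean(a, b):
    # O(n): one pass over the descriptor vector.
    return float(np.linalg.norm(a - b))

def mahalanobis(a, b, cov_inv):
    # O(n^2): matrix-vector product with the inverse covariance.
    d = a - b
    return float(np.sqrt(d @ cov_inv @ d))
```

With an identity covariance, the Mahalanobis distance reduces exactly to the Euclidean distance, which is a convenient sanity check.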

4.4. Pose estimation accuracy

Another aspect which must be taken into consideration is that of pose estimation accuracy. We have used the quadratic mean error (QME) between the silhouette extracted from the shape sample (S′) and the silhouette corresponding to the shape identified in the dataset (S) after applying a transformation (T) related to the pose parameters (rotation, translation, scale) computed


Table 5. Average of the quadratic mean error (Av(QME)) and execution time in pose estimation.

       Av (QME)   Time (s)
FD      0.1377     0.230
HM      1.4671     0.298
SC      1.92668    0.330
II     13.2210     0.375
ZM      2.3306     0.863
GFD     2.4992     0.605
CM      2.4374     1.079


Fig. 10. Object identification rates (Ir).


according to the representation model for Fourier Descriptors, Boundary Moments, Shape Context, Zernike Moments, Generic Fourier Descriptors and Complex Moments. In the case of the Integral Invariant method, the pose estimation is obtained through the contour normalization process parameters. Silhouettes are represented by Q couples of coordinates (x_q, y_q), 1 ≤ q ≤ Q. Thus, let (x′_q, y′_q) be the coordinates of S′ and (x̄_q, ȳ_q) be the coordinates of the silhouette S̄ obtained with S̄ = T · S. Then

QME = (1/Q) Σ_{q=1}^{Q} [ (x̄_q − x′_q)² + (ȳ_q − y′_q)² ].    (4)

Fig. 11. Samples of the 3DSL synthetic collection.

Fig. 12. Image depth extraction. (a) 3D object and the arrow corresponding to a selected viewpoint. (b) Object model viewed from the selected viewpoint. (c) Image depth.

Table 5 shows the average of the quadratic mean error (Av(QME)) and the execution times required to compute the pose for the different shape representation models. Of all the methods, the Fourier Descriptors yield the best accuracy and pose estimation time. The contour normalization process with regard to the starting point is very inaccurate, and it is for this reason that the average of the quadratic mean error (QME) is so high for Integral Invariant descriptors.
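The pose transformation T (rotation, translation, scale) and the QME of Eq. (4) can be sketched directly in numpy. This is an illustrative rendering of the definitions above, not the authors' code:

```python
import numpy as np

def apply_pose(S, angle, t, s):
    """Apply a 2D pose T (rotation `angle` in radians, translation
    `t`, scale `s`) to silhouette points S of shape (Q, 2)."""
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return s * S @ R.T + np.asarray(t, float)

def qme(S_bar, S_prime):
    """Quadratic mean error of Eq. (4) between two silhouettes
    sampled with the same number of points Q."""
    d = S_bar - S_prime
    return float(np.mean(np.sum(d * d, axis=1)))
```

If the estimated pose is exact, the transformed database silhouette coincides with the sample and the QME is zero; a residual misalignment shows up directly as the mean squared point-to-point distance.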

4.5. Object identification rates

So far, the experimental discussion in the previous sections has been focused on calculating the pose of the object in the scene. It is clear that estimating the pose of one object is a more restrictive task than recognizing the object within an object database. Obtaining the right pose through a 2D projection of the object may, in some cases, be a difficult and unsolved problem. For example, the pose estimation might be wrong as a result of symmetries or



Fig. 13. Training images from a node of the tessellated sphere. The first row shows images around a sphere node, the second row shows the images with Gaussian, salt and pepper and sparkle noise.


illumination effects. In order to compare the identification and pose results in our framework we have carried out a simple test. Fig. 10 shows the object identification rate (Ir) achieved during the tests developed on Platform 1. In this case, we count the number of times that the RMS system successfully classifies the object.

Upon comparing the results of Ir with the corresponding three parameters Hr, Wr and Ar, which are shown in Fig. 7, we can appreciate notable differences between object identification and object pose results. We realized that the RMS system identifies the right object but that it sometimes chooses the wrong view. The successful score variation for Hr, in other words |Hr − Ir|, is between 7% and 34%, with Hr < Ir in all cases. The score variation for parameter Wr when compared with Ir is between 2% and 25%, in which Wr < Ir is also the case. As regards parameter Ar, which considers similar views belonging to different objects, it is not surprising that the successful score variation has the contrary sign in this case. The range is thus between 3% and 32%, but now, Ar > Ir.

Fig. 14. Experimental setup. The experimental platform uses a Staübli robot with a webcam on the end-effector. The robotic vision system captures images around the object. The robot positions correspond to nodes of an imaginary tessellated sphere centered in the scene.

Fig. 15. Images corresponding to various tests.

5. Recognition tests using Platform 2

5.1. Platform setup

The objects belonging to the 3DSL dataset have been built in our lab. To do this, a high accuracy three-dimensional mesh model of each object was obtained in advance by means of a laser scanner sensor. Fig. 11 presents a selection of objects from the 3DSL database. Note that the database is composed of both free and polyhedral shapes and even includes some similar objects. For instance, it would appear to be quite difficult to distinguish between objects 6 and 7.

As was previously mentioned, the set of tests that we have carried out on the 3DSL dataset is focused on evaluating the performance of RMS in non-controlled scenarios. In this case, the test image is taken from a camera in a real scene without special lighting requirements. In this section we shall focus on presenting the results for parameters Hr, Wr and Ar. Since the processing time values are similar to those of the ALOI, no additional data and comments regarding computational cost are provided.

The 3DSL dataset is built by viewing the synthetic models from a set of homogeneous viewpoints and subsequently extracting the




Fig. 16. Comparative analysis of combinations of different shape descriptors and various similarity measures using the 3DSL dataset. (a) Hard Recognition Rate (Hr). (b) Weak Recognition Rate (Wr). (c) Ambiguous Recognition Rate (Ar). The graphic rows are related to the noise level injected into the images. Legend: FD, HM, II, SC, ZM, GFD, CM.




Fig. 17. Comparative analysis of combinations of different shape descriptors and various similarity measures using the 3DSL dataset. (a) Hard Recognition Rate (Hr). (b) Weak Recognition Rate (Wr). (c) Ambiguous Recognition Rate (Ar). The first row corresponds to the dataset with 80 views and the second row corresponds to 320 views (Dataset 3). Legend: FD, HM, II, SC, ZM, GFD, CM.



Fig. 19. Tendency of parameter Hr versus the relative number of features with regard to the canonical number.


corresponding silhouettes from the projected images. These viewpoints are set by the lines from the vertexes of a tessellated sphere with 80 nodes to the object's centroid. Fig. 12 illustrates an object model inside the tessellated sphere, the projected image of the model and the depth image from a specific viewpoint. In order to obtain the true object pose in the real scene (rotation with regard to the object in the database), a set of images around the object is captured beforehand by the camera on board the robot. Each image is then manually associated with its corresponding depth image in the database.
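The paper places viewpoints on the nodes of a tessellated sphere. As a simple stand-in, roughly homogeneous viewpoints on the unit sphere can be generated with a golden-angle (Fibonacci) spiral; this is a hedged alternative sketch, not the sphere tessellation actually used:

```python
import numpy as np

def fibonacci_sphere(n=80):
    """Near-uniform points on the unit sphere via a golden-angle
    spiral: evenly spaced heights z, azimuths advanced by the golden
    angle. A simple substitute for a tessellated-sphere node set."""
    i = np.arange(n)
    golden = np.pi * (3.0 - np.sqrt(5.0))   # golden angle in radians
    z = 1.0 - 2.0 * (i + 0.5) / n           # evenly spaced heights
    r = np.sqrt(1.0 - z * z)                # radius of each latitude ring
    theta = golden * i
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
```

Each row is a unit vector; viewing directions are the lines from these points to the object's centroid.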

The same training and testing procedure as that used in the ALOI dataset case is now followed. Thus, for each object, we generate 15 training images from each canonical view, which is defined by a node on the tessellated sphere. In this case, there are six training images corresponding to viewpoints around the node, which signifies "moving" the camera slightly but always maintaining the model's centroid in the optic axis. The other training images are obtained by adding Gaussian, salt and pepper and sparkle noise to the canonical image. Fig. 13 shows several training images of the dinosaur. The recognition test is accomplished by capturing one image from a camera located on the end-effector of a robot (Fig. 14). We have taken a total of 86 images for this test. Some samples of this test data appear in Fig. 15.

5.2. Analysis of Hr, Wr and Ar parameters

As in Fig. 7, Fig. 16 shows a collection of plots including the recognition results for the original image (in the first row) and for the image with added Gaussian noise (second row). After comparing the 3DSL results with the ALOI results, we can conclude that the average recognition rate decreases by up to 20% for the 3DSL dataset. The Hard Recognition Rate and Weak Recognition Rate values prove to have the worst performance in the majority of the methods. Only the Ambiguous Recognition Rate parameter presents acceptable rates, particularly for SVM. The reason for this is that we are using a test image in real conditions, which implies introducing uncontrolled noise and variations in the input image. For instance, there are small variations in the true viewpoint with regard to the theoretical viewpoint in the database. On the other hand, the image is processed to obtain the contour and it is therefore slightly modified as a result of the image processing. All these factors make the query image appear to be deformed with regard to the best hypothesis in the database. The experimental test proves that although we have attempted to train the system by simulating possible shape deformations, the system behavior during the test is relatively hard to emulate using synthetic images.

Our conclusions, which are split into contour and region based shape representations and deterministic and stochastic similarity measures, are presented below.

Fig. 18. (a) Recognition mean relative variation for meshes of 80 and 320 nodes for different similarity measures. (b) Recognition mean relative variation for different 2D shape representation models.

5.2.1. Contour-based shape representation methods

• Fourier Descriptors: Even when their performance decreases by approximately 20% with regard to their performance in the ALOI dataset, the recognition parameter rates generally show better results than the other contour descriptors in the case of the Ambiguous Recognition parameter, and maintain their robustness to noise. However, they are not recommended with Weak Recognition.
• Boundary Moments: Low performance in all cases.
• Integral Invariants: More sensitive to noise than Fourier Descriptors in the case of deterministic measures.
• Shape Context: Its good performance is notable when combined with stochastic similarity measures, which maintain high rates in the case of the Ambiguous Recognition parameter.

5.2.2. Region-based shape representation methods

• Zernike Moments: Their results are below 40% with regard to the ALOI database, showing high sensitivity to small shape variations.
• Generic Fourier Descriptors: The improvement is not really notable.
• Complex Moments: As in the case of the ALOI dataset, they yield the best results even in the case of added noise. Their best performance is achieved using stochastic measures.

5.2.3. Deterministic similarity measures

• Euclidean distance: Its performance decreases considerably (up to 10% for the Ambiguous Recognition parameter), and it loses the good qualities shown in the case of the ALOI dataset.
• City Block: This gives similar results to those of Euclidean distance, showing a bad performance.
• Cosine of the Angle: We see similar results to those of Euclidean distance.




Fig. 20. Comparative analysis of combinations of different shape descriptors and various similarity measures using two different 3DSL datasets. (a) Hard Recognition Rate (Hr). (b) Weak Recognition Rate (Wr). (c) Ambiguous Recognition Rate (Ar). Dataset 1 (upper row) includes dissimilar objects (objects 1, 3, 8, 14, 18); Dataset 2 (lower row) has objects with some similarity (objects 2, 6, 7, 9, 12). Legend: FD, HM, II, SC, ZM, GFD, CM.



Fig. 21. (a) Recognition score mean variation between Databases 1 and 2 for different similarity measures. (b) Recognition score mean variation for different shape representation models.


5.2.4. Stochastic similarity measures

• Mahalanobis distance: Poor recognition results (Hr and Wr) when taking Mahalanobis distance. As regards Ar, it shows similar rates for all descriptors.
• Bhattacharyya: Similar results to those of Mahalanobis distance.
• Support Vector Machine: As in the case of the ALOI dataset, the best rates are achieved with this method when combined with Complex Moments, Shape Context or Fourier Descriptors.

5.3. Sensitivity to the number of views of the 3D representation model

The last tests developed on this experimental platform are related to the shape recognition performance when the number of views used to represent the object is increased. As was mentioned previously, in the aforementioned analysis the objects are represented by 80 nodes. In this subsection we have therefore attempted to evaluate the effects on a shape recognition system when the objects are represented by views captured from a tessellated sphere with 320 nodes. We have repeated the tests developed in Section 3.2, but using Dataset 3.

Fig. 17 compares the results for meshes with 80 and 320 nodes, showing the recognition rates for parameters Hr, Wr and Ar. Note that, in summary, the recognition rates show slight variations in parameters Wr and Ar, but that Hr is approximately 15% lower with regard to the dataset with objects represented by 320 nodes, owing to the uncertainty added by high similarities in the views closer to the correct solution.

Fig. 18 again studies the relative variation of the recognitionscore for models of 80 and 320 nodes depending on the similaritymeasure and the kind of descriptors. Eq. (5) represents the recog-nition mean relative variation for each parameter P and similaritymeasure S, denoted as DRðP; SÞ;Rsrm;i being the mean of the recog-nition score considering all the shape representation models formeshes with i nodes. Eq. (6) likewise formalizes the recognitionmean relative variation for each parameter P and shape represen-tation model M, denoted as DQðP;MÞ;Q sm;i being the mean of therecognition score considering all the similarity measures formeshes with i nodes.

\[
\Delta R(P,S) = \frac{R_{srm,80}(P,S) - R_{srm,320}(P,S)}{R_{srm,80}(P,S)},
\qquad P \in \{Hr, Wr, Ar\},\ S \in \{ED, CB, C, MD, BD, SVM\},
\tag{5}
\]

\[
\Delta Q(P,M) = \frac{Q_{sm,80}(P,M) - Q_{sm,320}(P,M)}{Q_{sm,80}(P,M)},
\qquad P \in \{Hr, Wr, Ar\},\ M \in \{FD, HM, II, SC, ZM, GFD, CM\}.
\tag{6}
\]
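Eqs. (5) and (6) are straightforward to evaluate once the mean recognition scores are tabulated. A minimal sketch follows, assuming the score tables live in nested dicts keyed first by parameter and then by similarity measure (a hypothetical layout, not taken from the paper):

```python
# Mean relative variation of Eqs. (5)-(6): (R_80 - R_320) / R_80 for each
# (parameter, similarity measure) pair.
PARAMS = ("Hr", "Wr", "Ar")

def relative_variation(r80, r320):
    """r80[p][s] and r320[p][s] hold the mean recognition score for
    parameter p and similarity measure s over all shape representation
    models, for meshes with 80 and 320 nodes respectively."""
    return {(p, s): (r80[p][s] - r320[p][s]) / r80[p][s]
            for p in PARAMS for s in r80[p]}

# toy score tables (illustrative values only, not the paper's results)
r80  = {"Hr": {"ED": 0.80}, "Wr": {"ED": 0.90}, "Ar": {"ED": 0.95}}
r320 = {"Hr": {"ED": 0.65}, "Wr": {"ED": 0.88}, "Ar": {"ED": 0.93}}
dv = relative_variation(r80, r320)   # dv[("Hr", "ED")] is about 0.1875
```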

Part (a) corresponds to ΔR(P, S) for different parameters and similarity measures. Note that only parameter Hr is really sensitive to the number of nodes on the model, with variation rates of up to 39% (MD case). Parameters Wr and Ar hardly vary the recognition percentages: 2% and 3%, respectively. It is interesting to note that, contrary to the previous section, Ar's variation is slightly higher than that of Wr. However, we can state that the alteration in the number of nodes in the representation model does not significantly affect recognition parameters Wr and Ar. The same can be said for part (b) of Fig. 18, in which ΔQ(P, M) is represented. In this case, most of the variation percentages for Wr and Ar are below 2% and 4%, whereas parameter Hr rises up to 62%. Descriptors HM (61.9%) and II (40.3%) again appear to be the most sensitive, whereas descriptor CM (19.8%) seems to be the most robust.

5.4. Sensitivity to the number of features

Another point to consider in this analysis is the relationship between the number of features in the feature vector and the Hr, Wr and Ar results. Since each particular representation method uses a feature vector of a different dimension, we have attempted to unify this analysis by showing the average recognition rates when we decrease or increase the number of features with regard to the canonical number defined in Section 3.1. Thus, in Fig. 19, for each shape representation and for parameter Hr, we provide the average rate over all the similarity measures versus the number of features relative to the canonical number imposed in Section 3.1. In this case, we show only the plots for the Hr parameter because the tendency is the same for the Wr and Ar parameters.

The results from Fig. 19 prove that the canonical number of features, which corresponds to the value 100%, is optimal. Taking a smaller number of features (which would correspond to values below 100%), the average of Hr clearly decreases, whereas more features only maintain or decrease the Hr results.

5.5. Comparison of results for different object datasets

In addition to the analysis developed using the 18 objects from Fig. 11, we have also experimented with two reduced datasets (5 objects) in order to discover the effect on the Hr, Wr and Ar parameters when a different database is used.

Dataset 1 includes dissimilar objects (objects 1, 3, 8, 14, 18), while Dataset 2 contains objects with some degree of similarity (objects 2, 9, 6, 7, 12). Experiments for these two datasets have been developed in the same manner as that explained in subsection 3.2, although only the objects related to the dataset in the test were used, and noise was not added to the input images.

E. González et al. / Pattern Recognition Letters 33 (2012) 199–217

Fig. 20 shows the experimental results for these two datasets. In this case, the intention of the analysis is to compare the shape recognition performance in a dataset with objects that share a similar appearance (Dataset 2) and a dataset with dissimilar objects (Dataset 1). Moreover, the aforementioned studies (Fig. 16) and those developed in this subsection have allowed us to discover the "sensitivity" of the shape recognition system to variations in the objects in the dataset.

A comparison of the plots in Fig. 20 with the plots in Fig. 16 shows an increment in the Hr and Wr parameter rates in Dataset 1. However, the Ar parameter maintains similar rates. In the case of Dataset 2, the Ar parameter also maintains similar rates, but the Hr and Wr parameter rates decrease in most cases.

This variation in rates for the Hr and Wr parameters proves their dependence on the shape appearance in the dataset. If the objects are very different from each other, the shape recognition performance is very high and is affected only when shape descriptors are invariant to reflection. On the contrary, datasets with symmetrical objects that have a similar appearance to each other achieve acceptable rates for the Ar parameter only.

We shall now provide a more in-depth analysis, showing the relative variation of the recognition score for Datasets 1 and 2 depending on the similarity measure and the kind of descriptors. In order to make this document clearer, we shall maintain the nomenclature introduced in Section 4.8. Eq. (7) represents the recognition mean relative variation for each parameter P and similarity measure S, denoted as ΔR(P, S), with R_{srm,i} being the mean of the recognition score considering all the shape representation models for the ith database. Eq. (8) likewise formalizes the recognition mean relative variation for each parameter P and shape representation model M, denoted as ΔQ(P, M), with Q_{sm,i} being the mean of the recognition score considering all the similarity measures for the ith database. Fig. 21(a) and (b) presents the results for both cases, ΔR(P, S) and ΔQ(P, M).

\[
\Delta R(P,S) = \frac{R_{srm,1}(P,S) - R_{srm,2}(P,S)}{R_{srm,1}(P,S)},
\qquad P \in \{Hr, Wr, Ar\},\ S \in \{ED, CB, C, MD, BD, SVM\},
\tag{7}
\]

\[
\Delta Q(P,M) = \frac{Q_{sm,1}(P,M) - Q_{sm,2}(P,M)}{Q_{sm,1}(P,M)},
\qquad P \in \{Hr, Wr, Ar\},\ M \in \{FD, HM, II, SC, ZM, GFD, CM\}.
\tag{8}
\]

In general, we can state that parameter Hr is the most sensitive to dataset changes, since the variation is lower for parameters Wr and Ar. Part (a) clearly shows how the MD and BD similarity measures achieve the maximum global variation, particularly in Hr (62% and 58%) and Wr (40% and 41%). For parameter Ar, lower and similar results, below 12%, are yielded for all similarity measures. In part (b) we can see that the recognition score variation greatly depends on the descriptors for parameter Hr (particularly high variations of 53% and 69% for HM and II). As regards parameter Wr, methods SC and CM (21% and 25%) are the least sensitive. Finally, all the shape representation methods provide a similar variation percentage of around 10–12% for parameter Ar.

6. Final discussion and conclusions

This paper presents a qualitative and quantitative study of the performance of a set of representative 2D shape recognition strategies when they are used as the pillar of 3D recognition solutions. In order to implement different recognition approaches, we have combined several of the most important 2D shape descriptors together with a set of deterministic and stochastic similarity measurements. Up to 42 combinations have been considered. The entire method set has been denominated the RMS (Recognition Method Set).

The evaluation of the RMS is also focused on providing a quantitative criterion with which to develop strategies for uncertainty reduction in active recognition systems. The RMS is evaluated by means of three proposed parameters called Hard Recognition Rate (Hr), Weak Recognition Rate (Wr) and Ambiguous Recognition Rate (Ar). The evaluation of the RMS has also been extended to other parameters, such as computational cost and pose estimation accuracy.

Two different experimental platforms have been used during the tests to simulate the performance of the RMS in controlled (Platform 1) and uncontrolled (Platform 2) environments. The experimental results show that the performance of the shape recognition system varies according to the features of the experimental platform.

After analyzing the results of this experimental study we can present the following summary of conclusions, which is divided into two sections.

(A) General remarks concerning shape representation models and similarity measures, no matter which experimental platform the 3D recognition system is performed on.

– In general, appearance-based methods with a single view show acceptable recognition rates, which decrease when external noise is added to the original images. Ambiguities caused by similarities between the appearance of different objects, symmetry factors and shape deformations owing to segmentation processes, illumination changes, etc., increase the uncertainty of 3D recognition systems in real contexts.

– Euclidean distance and cosine of the angle distances yield very similar results in all cases.

– Shape Context, Fourier Moments and Complex Moments show good rates using SVM.

– Although the Complex Moments show the best results, their execution time is very high.

– Since the 3D object recognition system might not require the accurate estimation of the object pose, Shape Context descriptors show good performance during the identification process for parameter Wr, along with low execution times.

– For large datasets, the Bhattacharyya similarity measure (BD) presents high computational costs. Meanwhile, the Mahalanobis measure (MD) is recommended only for shape representation models using few descriptors, owing to its exponential time behavior when the number of descriptors in the shape representation model is increased.

– SVMs have the lowest computational costs for large datasets, but depend on the number of descriptors in the shape representation model.

– Fourier Descriptors are a good choice because, apart from their invariant properties, they require low computational time and are robust to noise.

(B) Detailed remarks about the experimental review. Some observations and recommendations are now provided with regard to each of the platforms on which the 3D recognition will be implemented and the requirements of the 3D recognition application (computational costs, pose estimation accuracy).

• Platform 1: Strategies based on the Wr and Ar parameters could be used under:
  – High-accuracy pose:
    * Low computational cost: With regard to the Wr parameter, two choices are available: FD combined with CB or SVM. In the case of the Ar parameter, the same choices are viable if ED is also added. The rates for Ar are better and more robust than those for Wr.


    * Computational cost irrelevant: Complex Moments are definitely the best representation method, combined with stochastic measures for Wr strategies or with any measure for Ar strategies. In the latter case, GFD also shows a good performance when combined with SVM and Mahalanobis measures.

  – Low-accuracy pose:
    * Low computational cost: II and SC descriptors combined with SVM should be a good choice for Wr- and Ar-based systems.
    * Computational cost irrelevant: Apart from HM, all shape descriptors can be combined with SVM in Wr systems. GFD and deterministic measures are feasible for Ar systems.

• Platform 2: Low Wr rates for most of the RMS suggest that this strategy should be rejected, since only the CM descriptor shows acceptable values. The analysis with this database is thus focused solely on Ar strategies.

  – High-accuracy pose estimation:
    * Low computational cost: Fourier Descriptors with SVM are the best choice.
    * Computational cost irrelevant: CM and GFD descriptors with stochastic measures, and the ED and C deterministic measures, improve on the other RMS.
  – Low-accuracy pose estimation:
    * Low computational cost: we recommend SC and SVM.
    * Computational cost irrelevant: ZM should be combined with the MD and BD distances.

None of the evaluation parameters (Hr, Wr and Ar) recognizes an object, but rather a particular view; in other words, they recognize a particular 2D shape. The recognition task is performed solely on a dataset composed of all available views of all the considered objects. Of course, in an active recognition system the object can be recognized when several particular views of the object are recognized. Therefore, rather than the concept "object", which is not explicitly used in the document, the concept "shape" (meaning 2D shape) is used. Section 5.5 (comparison of results for different object datasets), in which the concept of object is present, makes the document more complete in this respect. Of course, this is only an initial study, but we believe that analyzing the performance of the recognition methods with regard to different types of objects, considering 3D properties such as shape complexity, symmetry, curvature, etc., is an interesting task to be tackled in the future.
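The step from recognized views to a recognized object can be made explicit with a simple evidence-accumulation rule. The following is only a sketch of the idea (majority voting over per-view decisions; not the authors' method, and the handling of ambiguous views is an assumption):

```python
from collections import Counter

def recognize_object(view_matches):
    """Aggregate per-view 2D shape recognitions into an object-level
    decision by majority voting. Each element of view_matches is the
    object label assigned to one captured view, or None when that view
    was ambiguous (an 'Ar'-type outcome)."""
    votes = Counter(label for label in view_matches if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # accept only a strict majority of the informative (non-ambiguous) views
    return label if count > sum(votes.values()) / 2 else None
```

An active system would keep capturing views until this function returns a label, which is exactly the regime in which the Ar parameter becomes the relevant performance indicator.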

Although this paper shows an exhaustive analysis of the performance of 3D systems using one view, it is important to bear in mind that the selection of the strategies with which to develop any 3D recognition system must consider the kind of objects in the dataset, that is, the degree of similarity between them.

Another key point is the evaluation of the environmental conditions in which the 3D recognition system will be performed. If these conditions are controlled (background, illumination, etc.), the objects can be represented with real images such as those in Platform 1, and the expected recognition rates will be higher than when using synthetic objects (Platform 2).

Moreover, of the results obtained with Platform 2, which is a more realistic situation than Platform 1, only Ar strategies yield reliable results. But these strategies only associate a view with a cluster of silhouettes. For a realistic 3D object recognition problem, several views must therefore be taken to reliably identify an object, and the "best next view" algorithm may play a fundamental role in the entire recognition process.
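A "best next view" policy can be as simple as moving to the viewpoint that best discriminates between the objects still compatible with the evidence gathered so far. The sketch below is greedy and distance-based; the database layout descriptors[obj][v] and all names are hypothetical assumptions, not the algorithm used in the paper:

```python
import numpy as np

def next_best_view(candidates, descriptors, visited):
    """Greedy choice of the next viewpoint: among unvisited tessellation
    nodes, pick the one whose stored views best separate the objects that
    are still candidates. Assumes at least two candidates remain."""
    best_v, best_score = None, -1.0
    n_views = len(descriptors[candidates[0]])
    for v in range(n_views):
        if v in visited:
            continue
        feats = [descriptors[o][v] for o in candidates]
        # mean pairwise descriptor distance between the candidates at node v
        score = np.mean([np.linalg.norm(a - b)
                         for i, a in enumerate(feats)
                         for b in feats[i + 1:]])
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# toy database: two candidate objects seen from three nodes (made-up values)
descriptors = {
    "a": [np.zeros(2), np.array([0.0, 5.0]), np.array([0.0, 1.0])],
    "b": [np.zeros(2), np.array([0.0, -5.0]), np.array([0.0, 0.0])],
}
```

In this toy database the policy would steer the camera to node 1, where the two candidates' appearances differ the most.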

Acknowledgments

This research was supported by the Spanish Government Research Programme via Projects DPI2009-14024-C02-01 and DPI2009-09956 (MCyT), by the Junta de Comunidades de Castilla-La Mancha via project PCI-08-0135, and by the European Social Fund.
