3-D to 2-D Pose Determination with Regions


    International Journal of Computer Vision 34(2/3), 123–145 (1999). © 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

    3-D to 2-D Pose Determination with Regions

    DAVID JACOBS, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA

    [email protected]

    RONEN BASRI, Department of Applied Math., The Weizmann Inst. of Science, Rehovot, 76100, Israel

    [email protected]

    Received July 23, 1997; Revised September 4, 1998; Accepted March 30, 1999

    Abstract. This paper presents a novel approach to parts-based object recognition in the presence of occlusion. We focus on the problem of determining the pose of a 3-D object from a single 2-D image when convex parts of the object have been matched to corresponding regions in the image. We consider three types of occlusions: self-occlusion, occlusions whose locus is identified in the image, and completely arbitrary occlusions. We show that in the first two cases this is a convex optimization problem, derive efficient algorithms, and characterize their performance. For the last case, we prove that the problem of finding valid poses is computationally hard, but provide an efficient, approximate algorithm. This work generalizes our previous work on region-based object recognition, which focused on the case of planar models.

    Keywords: object recognition, pose determination, linear programming, line traversal, occlusion, regions, parts, convexity

    1. Introduction

    Recognizing a known 3-D object using a single 2-D image is a central and difficult problem in visual recognition. One of the key issues is developing adequate representations to support flexible recognition of general objects. Existing approaches are often well-suited to only a small class of objects (e.g., polyhedra, rotationally symmetric objects, low-order algebraic surfaces). In this paper we show how to make use of a very simple and general representation of the parts of 3-D objects to determine their pose. At the same time, we also provide results on the fundamental computational

    A preliminary version of this paper has appeared as "3-D to 2-D Recognition with Regions" by D. Jacobs and R. Basri, IEEE Conf. on Computer Vision and Pattern Recognition, San Juan 1997: 547–553. A brief overview of these and related results has appeared (Basri and Jacobs, 1996).

    complexity of determining the pose of 3-D objects in the presence of occlusion.

    A good representation for 3-D recognition should: (1) be rich enough to describe the shape of 3-D objects; (2) have a 2-D analog that can be reliably computed from an image; and (3) allow us to understand aspects of the relationship between the 3-D representation and its 2-D projection that are needed to perform useful recognition tasks. In this paper, we focus on the capability of a representation to support pose determination, one of the most basic problems faced by a complete recognition system.

    1.1. Past Work

    Interesting methods have been developed for 3-D to 2-D recognition. However, the representations these methods use still have significant limitations. Many


    previous approaches have relied on finding a correspondence between simple geometric features, such as points or lines. Lowe (1985) and Clemens (1991), for example, determine pose based on a match between line segments, while Fischler and Bolles (1981), Huttenlocher and Ullman (1990), Horaud (1987), Ullman and Basri (1991), Jacobs (1997), Rothwell et al. (1993), and Alter and Jacobs (1998) use point features to determine pose, and Thompson and Mundy (1987) make use of vertices. It is fairly well understood how to use local features for pose determination or indexing, but they have significant weaknesses. Local features often do not capture the shape of complex, curved 3-D objects. And it may be quite difficult to locate 2-D image features that correspond to the local features of a non-polyhedral 3-D object, since the contour generator of such objects is completely viewpoint dependent.

    There has been some recent work that extracts and matches point features from the outlines of smooth objects. Forsyth et al. (1992) use points derived from the bitangents of objects to derive an invariant description of the contour. This work, however, is limited to rotationally symmetric objects. Vijayakumar et al. (1995) show how to build an indexing function using bitangents for more general curved 3-D objects. They show the surprising result that a description based on bitangents can be represented as 1-D curves in a lookup table. These approaches use only a limited amount of the structural information available, however.

    Another approach to recognizing smooth 3-D objects involves describing the 3-D object and 2-D image with algebraic surfaces and curves, and then registering these algebraic descriptions. Kriegman and Ponce (1990) have taken this approach, using elimination methods to solve for object pose. While this approach has provided significant insight into how the overall problem may be solved, it has the disadvantage of requiring a somewhat complex, iterative solution method. Specifically, their method requires a good estimate of pose to begin with, and then uses a variation of Newton's method to converge to the locally optimal pose. Forsyth (1996) has shown how to use an algebraic description of an image contour to determine the projective shape of the algebraic surface that produced it. This result is not practical, however, as it is extremely sensitive to noise. In general, while algebraic descriptions may be used to accurately represent a 3-D model, it is extremely difficult to derive a corresponding description of an image, since such descriptions may be very sensitive to noise.

    A number of other approaches attempt to represent 3-D models using a specific vocabulary of shapes, typically based on an algebraic description, such as generalized cylinders (Binford, 1971; Brooks, 1981; Gross and Boult, 1990; Marr and Nishihara, 1978; Ponce et al., 1989; Shafer and Kanade, 1983; Ulupinar and Nevatia, 1995; Zerroug and Nevatia, 1994), superquadrics (Pentland, 1987; Rivlin et al., 1995; Solina and Bajcsy, 1990; Terzopoulos and Metaxas, 1991), and geons (Bergevin and Levine, 1993; Biederman, 1985). Often these approaches handle only a limited class of objects. For example, when generalized cylinders are used, a major difficulty lies in computing the 2-D projection of the 3-D axis and sweeping rule. Image occlusion and noise make this problem especially difficult. Considerable effort has led to solutions of this problem for only some restricted classes of shapes.

    Moment-based methods are somewhat related to ours, in that they compute a description of image regions to match to model volumes. These methods might align regions based on their center of mass, or on higher order moments. Examples of this approach can be found in (Dudani et al., 1977; Hu, 1962; Nayar and Bolle, 1996; Nagao and Grimson, 1994; Persoon and Fu, 1977; Reeves et al., 1984; Richard and Hemami, 1974; Sadjadi and Hall, 1980). These methods do not extend to the recognition of a 3-D object from a single 2-D image, however. First of all, volumes of 3-D points always produce self-occlusion, since different subsets of the surface of the volume are visible from different viewpoints. Therefore, the center of mass of the projection of a model volume will not be the projection of the center of mass of the 3-D point set. Second, the center of mass of a surface that is curved in 3-D does not project to the center of mass of its image, even when there is no self-occlusion. This is because the extent to which different portions of the 3-D surface are foreshortened will depend on the viewing direction. A final disadvantage of methods based on moments is that they are sensitive to occlusion by other objects in the scene.

    In summary, previous research has shown the difficulty of finding general representations to support object recognition. Representations capable of describing very large classes of objects have proven difficult to use. Representations that facilitate tasks such as pose determination are often restricted to particular classes of objects, such as polyhedral, planar or rotationally symmetric objects.


    1.2. Our Approach

    We address the problem of 3-D recognition from a 2-D image in a parts-based framework. That is, like most work on the recognition of general, curved 3-D objects, we divide a 3-D object into its component parts, and expect that a bottom-up grouping system will identify image regions that are candidate matches for these parts. We use a simple, direct representation of an object's parts as general volumes in 3-D, using 2-D areas to represent their image projection. This representation can be applied to any 3-D shape (we discuss restrictions concerning convexity later), and derived from a 2-D image without any need to fit algebraic constructs (e.g., conics, lines, corners) to parts located in an image, or to compute other intermediate properties of parts such as their axes. Therefore, our first two goals for a representation are satisfied very generally, with no assumptions except those imposed by a parts-based approach to recognition.

    The bulk of this paper will attack the third goal of a representation by showing how we can relate 3-D volumes to 2-D regions for the important problem of pose determination. This extends our previous results, which focused on planar objects (Basri and Jacobs, 1997). At the same time, we use this general representation to show novel results about the fundamental complexity of pose determination for general 3-D objects. To do this, we divide the possible types of occlusions that may make pose determination difficult into three conditions. First, we consider self-occlusions. We provide a simple algorithm that uses linear programming to find pose, given correspondences between 3-D volumes and 2-D regions. This is the same method that we applied to planar objects in (Basri and Jacobs, 1997), although we provide some new results demonstrating when this algorithm will produce correct results. Second, we consider occlusions whose position has been identified in the image. That is, we assume that each edge bounding an image region is labeled as either a region boundary, or an occlusion boundary. We show that a variation on our original algorithm can handle this case too. Third, we consider arbitrary occlusions of unknown location. That is, we assume that a 2-D region is a subset of the projection of the corresponding 3-D volume, but that any of the boundaries of the region may be due to occlusion. We show that the problem of pose determination in this case is fundamentally harder, by showing that this is a superset of a problem in computational geometry that is known to be hard. We

    stress that this result is not specific to our algorithms. It shows that for parts-based object recognition, the problem of pose determination is provably more difficult when occlusions are of unknown location in an image than when their position is known. This tells us, for example, that many locally optimal poses can exist. This can present problems for gradient descent algorithms. However, we then provide an approximate algorithm, which is computationally efficient, and show that in a number of cases this leads to accurate results. Finally, we show how to handle degenerate solutions, which can especially occur with this approximate solution.

    Our choice of representation is also general in that it can include as a subset several commonly used local geometric features, such as points, corners, line segments, or lines. Each of these features can be used by our methods as a 3-D volume and a corresponding 2-D region. Therefore, when local features are present in an object, these can be integrated into a common algorithm with extended features.

    While we focus on the pose determination problem, we expect these results to fit into a complete recognition system as follows. At compile time, we divide an object up into component parts, preferably convex. At run time, we use a grouping system to identify candidate image groups. One might use intensity-based segmentation or a system that finds salient convex sets of edges (Jacobs, 1996). We then consider matches between image groups and model parts, with a search that may be directed with the addition of cues such as color, as was done by Nayar and Bolle (1996). Pose is determined using our current work, and then a hypothetical projection of the model may be confirmed or rejected using additional cues. The steps of this process are illustrated in numerous experiments described in Jacobs (1992). That system robustly matched convex object parts, but used a feature-based indexing system not suitable for non-polyhedral 3-D objects. In our current paper, all of these steps are also implemented, except we do not experiment with search to match object parts, focusing instead on determining the capability of our system to produce accurate poses once the correct match is found.

    In Section 2 we describe algorithms for determining pose in the case of self-occlusion alone (Section 2.1), occlusion of known locus (Section 2.2), and arbitrary occlusion (Section 2.3). Section 2.3 shows that arbitrary occlusion is difficult, and describes an approximate algorithm. In Section 3 we describe how much information each algorithm needs to uniquely determine


    pose. Section 4 describes how to recover from degenerate solutions, and Section 5 presents our experiments.

    2. Using Volumes and Regions to Determine Pose

    We assume that a hypothesized match exists between a set of model volumes and image regions. Our goal is to use this match to determine the pose of the model. We will assume these volumes and regions are convex throughout most of this section, and discuss issues related to non-convexity at the end.

    We assume that the model consists of a set of 3-D volumes denoted V_1, …, V_k ⊂ ℝ³, where each volume is an arbitrary subset of ℝ³. Similarly, we assume that the image consists of 2-D regions, which are each subsets of ℝ², and which we denote by R_1, …, R_k ⊂ ℝ². Our solution methods will apply to the case where the model and image sets are convex; if we wish to make use of non-convex volumes or regions we should first take their convex hulls. This means that our methods can naturally apply also when some or all of the correspondences are between point features, or (possibly partially occluded) line segments, since these are convex.
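The convex-hull preprocessing described above can be sketched as follows; this is a minimal illustration using SciPy's `ConvexHull` (the L-shaped example region is our own, not from the paper):

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_vertices(points):
    """Vertices of the convex hull of a point set (2-D regions and
    3-D volumes alike); the pose constraints below only need these."""
    points = np.asarray(points, dtype=float)
    return points[ConvexHull(points).vertices]

# Hypothetical non-convex, L-shaped 2-D region; its reflex corner
# (1, 1) disappears when we take the hull.
region = np.array([[0, 0], [2, 0], [2, 1], [1, 1], [1, 2], [0, 2]])
print(hull_vertices(region))
```

The same call with an (n, 3) array handles 3-D model volumes, so point sets, segments, and full volumes all pass through the same preprocessing.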

    Next, we suppose that the image was generated by applying some transformation, T, that maps points in the 3-D model to points in the image. For the most part we will assume that this is a 3-D to 2-D affine transformation. We denote a point in model space by p = (x, y, z) and in image space by q = (u, v). If q = T(p) then we denote u = T_u(p) and v = T_v(p). Our goal is to identify the transformation that will best explain the image regions as the projection of their corresponding model volumes.

    As we point out in (Basri and Jacobs, 1997), the problem of finding a transformation that perfectly matches a set of model volumes to their corresponding image regions is a non-convex optimization problem. This follows from the fact that the set of feasible transformations need not be convex, or even connected. Consider for example the case of a model square matched to an identical image square. Matching the model exactly to the image can be performed in four ways (separated from each other by a 90° rotation). Obviously, no intermediate transformation provides a solution to this matching problem. While non-convex optimization problems are often attacked using tools such as gradient descent, we instead take the approach of showing that different problem formulations can make the matching process convex, and therefore easier

    to optimize. In Section 3 we discuss the conditions under which these formulations will produce the correct model pose.

    2.1. The Forward Constraints: Self-occlusion

    First, we show that our problem becomes convex if we merely require that every model point projects inside the corresponding image region, by reviewing results that we have shown in (Basri and Jacobs, 1997). Formally, the forward constraints are satisfied by the transformation, T, if and only if for all p ∈ V_i, T p ∈ R_i (that is, T V_i ⊆ R_i). These constraints allow a volume to occlude itself (i.e., two model points may project to the same image point). They do not capture all possible constraints, however, since they do not require that each image point be explained by a corresponding model point.

    We first consider a projection model consisting of a 3-D affine transformation followed by an orthographic projection, then we consider perspective projection. Denote the linear part of T by A, where A is a non-singular 2 × 3 matrix with elements t_ij, and the translation part by t = (tx, ty). Then:

    u = t11 x + t12 y + t13 z + tx
    v = t21 x + t22 y + t23 z + ty        (1)

    This projection model and its equivalents have been used recently by a number of researchers (Lamdan and Wolfson, 1988; Ullman and Basri, 1991; Koenderink and van Doorn, 1991; Tomasi and Kanade, 1992; Jacobs, 1997). It is also equivalent to applying scaled orthographic projection followed by a 2-D affine transformation (Jacobs, 1997), that is, taking a picture of a picture. Alternately, it is equivalent to a paraperspective projection followed by translation (Basri, 1996), where paraperspective is a first-order approximation to perspective projection (Poelman and Kanade, 1997; Sugimoto, 1996).
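Equation (1) can be written compactly as q = Ap + t. A small sketch (the function name and example values are ours):

```python
import numpy as np

def affine_project(A, t, points3d):
    """Eq. (1): map 3-D points p to image points (u, v) = A p + t,
    where A is the 2x3 linear part and t = (tx, ty)."""
    return np.atleast_2d(points3d) @ np.asarray(A).T + np.asarray(t)

# Orthographic special case: A simply drops the z coordinate, t = 0.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
uv = affine_project(A, np.zeros(2), [[1.0, 2.0, 3.0]])
print(uv)  # [[1. 2.]]
```

Any 2 × 3 matrix A of rank 2 gives a valid instance of this camera model, which is why the equivalences cited above hold.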

    To express the forward constraints, we note that since R is convex, there exists a set of lines L_R bounding R from all directions such that for every point q ∈ R and for every line l ∈ L_R we can write

    l(q) ≥ 0.        (2)

    Let the line l be expressed by the equation Au + Bv + C = 0, where (A, B) is the unit vector normal to the line, oriented so that Au + Bv + C ≥ 0 for points inside R. Then the constraint T V ⊆ R can be written as

  • 3-D to 2-D Pose Determination with Regions 127

    follows. Every point p ∈ V should be mapped by T to some point q ∈ R, and so

    A(t11 x + t12 y + t13 z + tx) + B(t21 x + t22 y + t23 z + ty) + C ≥ 0.        (3)

    This constraint is linear in the transformation parameters. Denote by

    wᵀ = (t11, t12, t13, tx, t21, t22, t23, ty)

    the vector of unknown transformation parameters, and

    gᵀ = (Ax, Ay, Az, A, Bx, By, Bz, B).

    We can rewrite the forward constraints as

    gᵀw ≥ −C.        (4)

    We can similarly handle affine transformations followed by perspective projection. In that case

    u = f (t11 x + t12 y + t13 z + tx) / (t31 x + t32 y + t33 z + tz)

    v = f (t21 x + t22 y + t23 z + ty) / (t31 x + t32 y + t33 z + tz)

    where f is the focal length. The forward constraint Au + Bv + C ≥ 0 now contains the term t31 x + t32 y + t33 z + tz in the denominator. This term must be positive since we require the object to appear in front of the camera. So, we can multiply both sides of the inequality by this term, again obtaining a linear constraint with the same general form as Eq. (4), with different definitions of w and g.

    The set of forward constraints consists of all such constraints obtained for each pairing of a bounding line l ∈ L_R and a model point p ∈ V. This gives us one constraint for every point in the model volumes and for every line tangent to the image regions. For curved objects, therefore, the number of constraints is infinite, but we may sample them as accurately as we desire. The issue of sampling is addressed in (Basri and Jacobs, 1995). For polyhedral volumes and polygonal regions the number of independent constraints is finite, and given by the product of the number of vertices of the model volumes and the sides of the image regions. The rest of the constraints are redundant.
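For a polyhedral volume and polygonal region, assembling the finite constraint set of Eq. (3) can be sketched as follows; the data layout (vertex triples, inward-normal line triples) is our own choice:

```python
import numpy as np

def forward_constraint_rows(model_vertices, region_lines):
    """Rows of G and entries of c such that G w >= c encodes Eq. (3)
    for every (vertex, bounding line) pair, with
    w = (t11, t12, t13, tx, t21, t22, t23, ty).

    region_lines holds (A, B, C) with (A, B) the inward unit normal,
    so image points inside the region satisfy A*u + B*v + C >= 0.
    """
    G, c = [], []
    for (x, y, z) in model_vertices:
        for (A, B, C) in region_lines:
            G.append([A * x, A * y, A * z, A, B * x, B * y, B * z, B])
            c.append(-C)
    return np.array(G), np.array(c)

# Toy check: a small cube inside the square [-1, 1]^2 under the
# identity-like transformation u = x, v = y satisfies every constraint.
cube = [(sx * .5, sy * .5, sz * .5)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
square = [(1, 0, 1), (-1, 0, 1), (0, 1, 1), (0, -1, 1)]
G, c = forward_constraint_rows(cube, square)
w = np.array([1., 0., 0., 0., 0., 1., 0., 0.])
print(np.all(G @ w >= c))  # True
```

Here 8 cube vertices × 4 square sides give 32 rows, matching the vertex-times-sides count stated above.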

    We therefore seek a set of transformation parameters that satisfy a set of linear constraints of the form:

    gᵢᵀw ≥ cᵢ,   i = 1, …, n.        (5)

    This can be written in matrix notation as:

    Gw ≥ c.        (6)

    We may find a set of parameters that satisfy these linear constraints using linear programming. To do this, we must also specify a linear objective function. A common way of doing this is by introducing an additional unknown, λ, in the following way.

    max λ
    s.t. Gw ≥ c + λ1        (7)

    where 1 denotes the vector of all ones.

    A solution to Eq. (6) exists if and only if a solution to Eq. (7) with λ ≥ 0 exists. (Note that other objective functions, e.g., the perceptron function, can be used for recovering w; see, e.g., (Duda and Hart, 1973) for a discussion of solutions to the linear discriminant functions problem.)

    When λ ≥ 0, its value represents the minimal distance of a point to any line bounding the region (Fig. 1). Maximizing λ amounts to attempting to contract the model volume inside the image region as much as possible. When λ < 0 this attempt fails. In this case any model point that violates the constraints is mapped to a distance of no more than |λ| from a line bounding its target regions.
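Program (7) fits the standard LP form directly. A sketch using SciPy's `linprog` (which minimizes, so we negate λ); the single-point example is our own:

```python
import numpy as np
from scipy.optimize import linprog

def solve_forward_lp(G, c):
    """max lambda s.t. G w >= c + lambda*1  (Eq. (7)).
    Variables x = (w, lambda); rewrite as -G w + lambda*1 <= -c."""
    n, m = G.shape
    obj = np.zeros(m + 1)
    obj[-1] = -1.0                          # minimize -lambda
    A_ub = np.hstack([-G, np.ones((n, 1))])
    res = linprog(obj, A_ub=A_ub, b_ub=-c,
                  bounds=[(None, None)] * (m + 1))
    return res.x[:m], res.x[-1]

# A single model point matched to the square [-1, 1]^2: only the
# translation (tx, ty) is constrained, the optimum centers the point,
# and the margin is the distance to the nearest side, lambda = 1.
square = [(1, 0, 1), (-1, 0, 1), (0, 1, 1), (0, -1, 1)]
G = np.array([[0., 0., 0., A, 0., 0., 0., B] for (A, B, C) in square])
c = np.array([-C for (_, _, C) in square], dtype=float)
w, lam = solve_forward_lp(G, c)
print(round(lam, 6))  # 1.0
```

Note that a single point matched to a single region leaves most of w undetermined; this is exactly the over-contraction behavior analyzed later in this section.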

    Enforcing the forward constraints is therefore very similar to finding a transformation that minimizes the directed Hausdorff distance, h, from the transformed model volume to the image region. This can be

    Figure 1. The dark circles are positioned by the similarity transformation that maximizes λ relative to the larger, shaded circles.


    defined as:

    h(M, I) = max_{m ∈ M} min_{i ∈ I} ‖m − i‖

    Hausdorff matching has previously been effectively used in computer vision by, for example, Huttenlocher et al. (1993a, 1993b), and Rucklidge (1997). These papers primarily focus on the use of robust variants of the undirected Hausdorff distance, which is defined as max(h(M, I), h(I, M)), for aligning objects undergoing 2-D transformations. Amenta (1994) specifically discusses the efficient Hausdorff matching of convex shapes undergoing translation and scaling. While related to these approaches, our work is different in that we demonstrate new, efficient algorithms that compute something similar to the directed Hausdorff distance. These algorithms compute the pose of 3-D objects in single 2-D images.
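For finite point samples, the directed distance h and its asymmetry can be illustrated with SciPy; the point sets here are invented for the example:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# h(M, I) = max over m in M of the distance from m to its nearest i in I.
M = np.array([[0.0, 0.0]])                  # one transformed model sample
I = np.array([[0.0, 0.0], [3.0, 0.0]])      # image-region samples
h_MI = directed_hausdorff(M, I)[0]          # every m lies on some i: 0
h_IM = directed_hausdorff(I, M)[0]          # but (3, 0) is far from M: 3
print(h_MI, h_IM)  # 0.0 3.0
```

The asymmetry is the point: h(M, I) = 0 tolerates unexplained image points, which is why the forward constraints behave like minimizing the model-to-image direction only.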

    More specifically, solving the system Eq. (7) may result in over-contraction. Consider, for example, the case of matching a single volume V to a single region R. The forward constraints restrict the set of possible transformations to those that map every point p ∈ V inside the region R. Assume T is a feasible transformation, that is, T V ⊆ R; then applying any contracting factor 0 ≤ s ≤ 1 to V would also generate a valid solution; namely, T(sV) ⊆ R. (We assume here without loss of generality that the origin of the model is set at the centroid of V.) Consequently, the case of matching one volume with one region necessarily introduces multiple solutions. The solution picked by Eq. (7) is the one with s = 0. This will contract V to a point, which is then translated to the point inside R furthest from any of its bounding tangent lines. This solution produces the largest value of λ. Clearly, the case of matching one volume to one region cannot be solved by the forward constraints alone. However, we will show that we can determine pose accurately when we match a larger number of volumes and regions.

    2.2. The Forward Constraints with Known Occlusion

    The forward constraints are suitable when there is only self-occlusion; the backward constraints, as we will discuss, hold in the presence of arbitrary occlusion. We now consider one other possibility, that some model volumes may be partially occluded in the image, but that the extent of this occlusion is known. That is, we assume that some portion of the boundary of the image

    regions may be caused by an occluding object in front of the volume, rather than the boundary of the volume itself, but that we have identified those portions of the region boundaries. This is a situation of considerable interest; for example, if we identify several different parts of an object we may be able to determine that some of these parts lie in front of and occlude others (e.g., by identifying concave sections in the boundaries of a region corresponding to a convex volume). Also, it is well-known in the psychology literature that the knowledge that certain boundaries are due to occlusions can greatly assist human perception (Rock, 1983). Suppose we have identified a region in the image, but we know that some of the boundary of this region is due to another, occluding object and is not in fact the boundary of the region itself. By allowing for the region to be extended in the direction of such occlusions, we can construct a larger convex region which we know should contain the projection of the corresponding model volume.

    This can be implemented with very little modification to the above algorithm. Suppose we approximate the border of a detected region with a set of line segments, some of which we know come from the boundary of the region, and some of which come from an occlusion. We apply the forward constraints, as described above, using only those line segments that originate due to the boundary of the object. Implicitly, this restricts the model volume to project within the largest possible convex region that is consistent with the known region boundary.
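The modification amounts to discarding occlusion-labeled segments before building the constraint rows; a minimal sketch, in which the labeling interface is our own assumption:

```python
def boundary_lines(labeled_segments):
    """Keep only tangent lines arising from the true region boundary.

    labeled_segments: pairs ((A, B, C), is_occlusion), where (A, B, C)
    describes a bounding line as in Eq. (2). Dropping occlusion-labeled
    lines implicitly enlarges the region to the largest convex set
    consistent with the known boundary, as described above.
    """
    return [line for line, is_occ in labeled_segments if not is_occ]

segs = [((1, 0, 1), False), ((-1, 0, 1), True), ((0, 1, 1), False)]
print(boundary_lines(segs))  # [(1, 0, 1), (0, 1, 1)]
```

The surviving lines are then fed to the same linear program as in the self-occlusion case; nothing else changes.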

    2.3. The Backward Constraints: Unknown Occlusions

    We can allow for arbitrary, unidentified occlusions in the image with the backward constraints. The backward constraints are the inverse of the forward. Instead of requiring that each model volume project completely inside each corresponding image region, the backward constraints require that each image region lie completely inside the projection of each corresponding model volume. That is, the backward constraints are satisfied by the transformation, T, if and only if for every point, q, in every image region, q ∈ T V_i (that is, R_i ⊆ T V_i).

    It is preferable to enforce the backward constraints, rather than the forward constraints, when recognizing objects in the presence of occlusion. Typically, when a model volume is occluded by a different object or a


    different model volume, a region detector will locate image regions that are subsets of the true, unoccluded regions. In this case, the true transformation may map a model volume so that part of the volume accounts for the visible portion of the region, and part of the model volume is mapped outside this region. Therefore, the backward constraints express precisely our state of knowledge about the pose when we allow for arbitrary occlusion of the image regions.

    As we show in (Basri and Jacobs, 1997), in the case of planar objects we may solve the backward constraints using linear programming. This is because the transformations that we use to relate a planar model to a planar image (similarity, affine or projective) are invertible. Therefore, requiring that the viewing transformation map the model volumes so that they completely enclose the corresponding image regions is equivalent to requiring that the inverse transformation map the image regions completely inside the corresponding model regions. For planar models, the forward and backward constraints have an identical form, with the roles of the image and model reversed, and so we may find poses that satisfy them using the same method.

    In the case of a 3-D model and a 2-D image, the 3-D to 2-D viewing transformation is no longer invertible, however. It is therefore not obvious how one can apply the backward constraints in this case.

    In fact, we can prove that it is not possible to directly extend our planar results on the backward constraints to 3-D models. To do this, we now show that there is a close connection between the problem of solving the backward constraints and the computational geometry problem of finding a line traversal, i.e., finding a line that intersects a set of 3-D solids. We will show that for one special case, finding a pose that satisfies the backward constraints is equivalent to finding a line traversal of the model volumes. This will show that dealing with the backward constraints is at least as difficult as the problem of finding a line traversal of the volumes.

    Suppose first that we have an arbitrary set of model volumes, but that every image region consists of only a single point, and that these points are identical, denoted by q̄ (see Fig. 2). The backward constraints are satisfied by a transformation if and only if the 3-D line that projects to q̄ intersects every model volume. Such a transformation exists if and only if there exists a single line that intersects each of the model volumes. This shows that the problem of solving the backward constraints is at least as hard as the problem of determining

    Figure 2. On the left, we show three model volumes projecting to three, overlapping image regions. On the right, we imagine that these regions are occluded, except for one point which they all share. In this case, the problem of solving the backward constraints is equivalent to that of finding a line traversal of the model volumes.

    whether a line traversal exists for a set of 3-D volumes. Note that this reasoning applies equally to orthographic or perspective projection.

    Unfortunately, Amenta (personal communication) has shown that finding a line traversal to n arbitrary 3-D volumes is not an LP-type problem, because it can contain O(n) solutions that are disconnected in the space of all possible lines. This implies that gradient descent methods of solving the backward constraints may produce locally optimal solutions that are not globally optimal. Moreover, existing algorithms for finding line traversals are not nearly as efficient as our linear programs, and do not appear to be extendible to solving the backward constraints without adding considerably greater complexity. For example, Pellegrini (1990) gives an algorithm for finding a line traversal of general polyhedra with edges that have only a constant number of different possible directions. This algorithm requires O(n² log n) time for n polyhedra. It is not clear how to improve upon this result if the polyhedra are restricted to be convex. Solving the forward constraints for such polyhedra would require only O(n) expected time, because each polyhedron would produce a constant number of constraints, yielding a total of O(n) constraints, and because fixed-dimension linear programs can be solved in O(n) expected time (Seidel, 1990). And solving the full backward constraints will be much harder than solving the line traversal problem, because the image regions will not, in general, all consist of a single point, and because a full 3-D to 2-D projection must be considered.

    Our result is significant because it is not merely a comment on our algorithms; it reflects on the difficulty


    of a basic problem in recognition. We have shown that the general problem of determining 3-D model pose in a 2-D image, when one allows for arbitrary occlusions of unknown locus, is at least as hard as a difficult computational geometry problem. Moreover, Amenta's construction (personal communication) uses only volumes that are lines in 3-D. This implies that the problem is difficult even when we restrict ourselves to very simple shapes.

    Our result is also somewhat surprising because the forward and backward constraints each encode some of the constraints that together imply a perfect fit between model and image. The backward constraints allow for partial occlusion of the model, while the forward constraints, in some sense, allow for partial occlusion of the image (some of the image region may go unexplained). Moreover, for the case of planar objects, these constraints are symmetric, and give rise to equally complex problems. It is therefore somewhat unexpected that they would give rise to problems of such different complexity for the case of 3-D objects.

There is, however, a special case for which the line traversal problem can be solved efficiently. This is when the 3-D volumes are all axial rectanguloids (i.e., their sides are aligned with the x-, y-, and z-axes). Amenta (1992) gives an efficient algorithm for this case, and subsequently Megiddo (1996) refined this result, showing that this problem can be solved by linear programming. We will now extend this result to propose a useful approximation algorithm for finding object pose, in which we approximate model parts using their bounding axial rectanguloids.

Based on Megiddo's method (1996), we can solve the backward constraints using linear programming, for the case where the model volumes are axial rectanguloids and the projection model is affine. (Notice that since we are free to change the coordinate frame of the model, in fact we only require the models to be rectanguloids with parallel sides.) To do this, it is helpful to think of affine projection in the following way. First, the set of 3-D model points is extended in any one direction to a set of parallel lines; then a 3-D affine transformation is applied to this set of lines so that they are made normal to the image plane. Each 3-D point projects to the 2-D point where its corresponding line intersects the image plane. Now we may think of the inverse of this transformation as applying a 3-D affine transformation to the set of lines normal to the image plane that intersect each image point. Specifically, let $\{p_i\}$ denote a set of image points. Let $\{n_i\}$ denote the corresponding set of lines, so that $n_i$ is normal to the image plane and includes $p_i$. Then $\{p_i\}$ are the projection of a set of model points, $\{q_i\}$, if and only if there exists a 3-D affine transformation so that each line, $n_i$, is transformed to include $q_i$. The backward constraints, then, are satisfied by an affine transformation that maps each normal line so that it intersects the corresponding model volume.

We now show how to express the constraint that an image point $p_i \in R_i$ must be the projection of some point in an axial rectanguloid, $V_i$, as a linear constraint on the inverse of the 3-D affine transformation that maps the model into the scene. We let $T^{-1}$ denote this inverse transformation, with components denoted by $t^{-1}_{jk}$. Suppose $V_i$ has a lower corner of $(x_l, y_l, z_l)$ and an upper corner of $(x_h, y_h, z_h)$. Then the transformed line intersects this rectanguloid if and only if there exists a point on the line whose $x$, $y$ and $z$ coordinates lie within these ranges. Consider a line normal to the $z = 0$ image plane passing through the point $(u, v)$. If we transform this line by $T^{-1}$, and parameterize it by $\lambda$, then a point on this transformed line has the coordinates:

$$p_i = \left(t^{-1}_{11} u + t^{-1}_{12} v + t^{-1}_x + \lambda t^{-1}_{13},\; t^{-1}_{21} u + t^{-1}_{22} v + t^{-1}_y + \lambda t^{-1}_{23},\; \lambda t^{-1}_{33}\right),$$

for any choice of $\lambda$. The backward constraint $p_i \in T V_i$ is satisfied if and only if it is possible to satisfy the following inequalities:

$$x_l \le t^{-1}_{11} u + t^{-1}_{12} v + t^{-1}_x + \lambda t^{-1}_{13} \le x_h$$
$$y_l \le t^{-1}_{21} u + t^{-1}_{22} v + t^{-1}_y + \lambda t^{-1}_{23} \le y_h$$
$$z_l \le \lambda t^{-1}_{33} \le z_h.$$

Assume now that $t^{-1}_{13}, t^{-1}_{23} > 0$ (in practice we must run four linear programs to account for the different possible signs of $t^{-1}_{13}$ and $t^{-1}_{23}$). Then we have:

$$\frac{x_l - t^{-1}_{11} u - t^{-1}_{12} v - t^{-1}_x}{t^{-1}_{13}} \le \lambda \le \frac{x_h - t^{-1}_{11} u - t^{-1}_{12} v - t^{-1}_x}{t^{-1}_{13}}$$
$$\frac{y_l - t^{-1}_{21} u - t^{-1}_{22} v - t^{-1}_y}{t^{-1}_{23}} \le \lambda \le \frac{y_h - t^{-1}_{21} u - t^{-1}_{22} v - t^{-1}_y}{t^{-1}_{23}}$$
$$\frac{z_l}{t^{-1}_{33}} \le \lambda \le \frac{z_h}{t^{-1}_{33}}.$$

These inequalities are satisfied if and only if every quantity constrained to be less than $\lambda$ is smaller than every quantity constrained to be bigger than $\lambda$. Namely, we can replace them with nine inequalities (three of which are trivial) that do not mention $\lambda$ at all. These inequalities appear non-linear, but we can linearize them with a change of variables. Let:

$$s_{11} = \frac{t^{-1}_{11}}{t^{-1}_{13}}, \quad s_{12} = \frac{t^{-1}_{12}}{t^{-1}_{13}}, \quad s_{13} = \frac{1}{t^{-1}_{13}}, \quad w_x = \frac{t^{-1}_x}{t^{-1}_{13}},$$
$$s_{21} = \frac{t^{-1}_{21}}{t^{-1}_{23}}, \quad s_{22} = \frac{t^{-1}_{22}}{t^{-1}_{23}}, \quad s_{23} = \frac{1}{t^{-1}_{23}}, \quad w_y = \frac{t^{-1}_y}{t^{-1}_{23}},$$
$$s_{33} = \frac{1}{t^{-1}_{33}},$$

and we get six non-trivial constraints:

$$s_{13} x_l - s_{11} u - s_{12} v - w_x \le s_{23} y_h - s_{21} u - s_{22} v - w_y$$
$$s_{13} x_l - s_{11} u - s_{12} v - w_x \le s_{33} z_h$$
$$s_{23} y_l - s_{21} u - s_{22} v - w_y \le s_{13} x_h - s_{11} u - s_{12} v - w_x$$
$$s_{23} y_l - s_{21} u - s_{22} v - w_y \le s_{33} z_h$$
$$s_{33} z_l \le s_{13} x_h - s_{11} u - s_{12} v - w_x$$
$$s_{33} z_l \le s_{23} y_h - s_{21} u - s_{22} v - w_y.$$

These inequalities are homogeneous, reflecting the fact that some components of the transformation are redundant. We can get around this by setting $t^{-1}_{33} = 1$, creating difficulties only in a degenerate case. We can therefore express the backward constraints for rectanguloids with parallel sides as a set of linear inequalities, and solve them using linear programming, as we did the forward constraints. For the case of non-rectanguloid model volumes, we can approximate the backward constraints by replacing each model volume with the axial rectanguloid that bounds it.
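To make the reduction concrete, the six inequalities can be assembled into a feasibility linear program. The sketch below is our own illustration, not code from the paper: it assumes SciPy's `linprog` is available, fixes $s_{33} = 1$ to remove the homogeneity, bounds $s_{13}, s_{23}$ away from zero to encode the sign case $t^{-1}_{13}, t^{-1}_{23} > 0$, and takes one sample image point per region (in practice one would add constraints for every vertex of each region's convex hull and run all four sign cases).

```python
import numpy as np
from scipy.optimize import linprog

# Variable ordering for the LP: x = (s11, s12, s13, wx, s21, s22, s23, wy, s33).
IDX = {name: i for i, name in enumerate(
    ["s11", "s12", "s13", "wx", "s21", "s22", "s23", "wy", "s33"])}

def backward_lp(points, boxes, eps=1e-6):
    """Feasibility LP for the rectangular backward constraints.

    points: image points (u, v); boxes: the matching axial rectanguloids,
    each ((xl, yl, zl), (xh, yh, zh)).  Handles only the sign case
    t13^-1, t23^-1 > 0; the other three cases need their own LPs.
    """
    rows, rhs = [], []
    for (u, v), ((xl, yl, zl), (xh, yh, zh)) in zip(points, boxes):
        # Lower/upper bounds on lambda from each axis, as coefficient dicts.
        lo_x = {"s13": xl, "s11": -u, "s12": -v, "wx": -1.0}
        hi_x = {"s13": xh, "s11": -u, "s12": -v, "wx": -1.0}
        lo_y = {"s23": yl, "s21": -u, "s22": -v, "wy": -1.0}
        hi_y = {"s23": yh, "s21": -u, "s22": -v, "wy": -1.0}
        lo_z, hi_z = {"s33": zl}, {"s33": zh}
        # Each lower bound on lambda must not exceed each upper bound from
        # another axis; the three same-axis pairs are trivially satisfied.
        for lo, hi in [(lo_x, hi_y), (lo_x, hi_z), (lo_y, hi_x),
                       (lo_y, hi_z), (lo_z, hi_x), (lo_z, hi_y)]:
            r = np.zeros(len(IDX))
            for k in set(lo) | set(hi):
                r[IDX[k]] = lo.get(k, 0.0) - hi.get(k, 0.0)
            rows.append(r)
            rhs.append(0.0)
    bounds = [(None, None)] * len(IDX)
    bounds[IDX["s13"]] = (eps, None)   # sign assumption t13^-1 > 0
    bounds[IDX["s23"]] = (eps, None)   # sign assumption t23^-1 > 0
    bounds[IDX["s33"]] = (1.0, 1.0)    # fix the homogeneous scale (t33^-1 = 1)
    return linprog(np.zeros(len(IDX)), A_ub=np.array(rows),
                   b_ub=np.array(rhs), bounds=bounds, method="highs")
```

For example, `backward_lp([(1, 1), (2, 2)], [((0, 0, 0), (2, 2, 2)), ((1, 1, 0), (3, 3, 2))])` reports feasibility, since a unit shear of the identity satisfies all twelve inequalities.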

With this approximation the backward constraints require that the bounding rectanguloid of each model volume project into the image so that it completely contains the corresponding image region. These provide correct constraints on the transformation, since if a volume projects into the image so that it contains a corresponding region, so must its bounding rectanguloid. The accuracy of the rectangular backward constraints will depend on the specific shape of the model volumes. However, we note that without loss of generality we may apply an arbitrary 3-D affine transformation to the set of model volumes before bounding them, to improve our approximation. In Section 3 we will describe the consequences of this approximation in more detail, showing that in some cases of interest we can find the correct model pose in spite of this approximation.

We are required to run four linear programs, which each correspond to a different visual aspect of the rectanguloids. It is not the case, however, that we can use more complex shapes to approximate the volumes, running a different linear program for each aspect, since the reduction to linear programming depends on decoupling the $x$ and $y$ coordinates, not on treating each aspect separately. This is clear also from Amenta's proof (personal communication) that the general problem is not an LP-type problem.

    2.4. Local Geometric Features

We now point out that the solution methods described above may be applied to local geometric features as well, provided these can be described as convex sets. There are many existing methods for finding pose using local geometric features. Our solution methods will be suitable only when one wishes to combine information from these features with information based on regions and volumes.

We consider in particular the use of points, corners, and line segments. With the exception of corners, each of these can be thought of as a convex subset of $\mathbb{R}^3$ that projects to a convex subset of $\mathbb{R}^2$. A corner can be thought of as a point with associated vectors, to indicate the directions of the lines forming the corner. These vectors are also convex subsets of the space of vectors.

To apply the forward constraints, one must describe each model feature with one or more points, and each image feature as the intersection of a set of half-planes. Obviously, a model point is trivially described as a point. An image point can be described as the intersection of three half-planes intersecting at the point. A model line segment can be described by its endpoints, which form its convex hull. An image line segment can be described by the intersection of four half-planes. Two half-planes can be along the line segment, but in opposite directions, so that their intersection is the line containing the line segment. Each end point of the line segment can be delimited by another half-plane. Either or both of these can be omitted if it is known that one or both of the line segment's end points is due to an occlusion. The vector giving the orientation of a model corner may also be described using any point along the 3-D ray that describes the corner, which is constrained to lie along a ray in the image, given by the position and direction of the corner. Given such descriptions of local geometric features, the forward constraints are then applied as described above.
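As an illustration, the four half-planes describing an image line segment can be generated mechanically. The helper below is our own sketch; the `(n, c)` convention, meaning $n \cdot x \le c$, is an assumption of ours, not notation from the paper.

```python
def segment_halfplanes(p0, p1, open_ends=(False, False)):
    """Four half-planes (n, c), with n . x <= c, whose intersection is the
    2-D segment p0-p1.  Two oppositely-oriented half-planes along the
    segment pin x to the supporting line; one cap per endpoint delimits
    it.  A cap is dropped when that endpoint is known to be due to an
    occlusion (open_ends flag), leaving a ray or a full line."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    n = (-dy, dx)                                  # normal to the segment
    hp = [(n, n[0] * p0[0] + n[1] * p0[1]),        # one side of the line
          ((-n[0], -n[1]), -(n[0] * p0[0] + n[1] * p0[1]))]  # other side
    if not open_ends[0]:                           # cap at p0
        hp.append(((-dx, -dy), -(dx * p0[0] + dy * p0[1])))
    if not open_ends[1]:                           # cap at p1
        hp.append(((dx, dy), dx * p1[0] + dy * p1[1]))
    return hp

def inside(halfplanes, x):
    return all(n[0] * x[0] + n[1] * x[1] <= c + 1e-9 for n, c in halfplanes)
```

For the segment from (0, 0) to (2, 0), the resulting half-planes admit exactly the points with $y = 0$ and $0 \le x \le 2$.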


Similarly, we may apply the rectangular backward constraints after describing image features using points, and after placing axial rectanguloids about model features. Note that a 3-D point feature can always be exactly described by an axial rectanguloid. A 3-D line segment can be exactly described if it is aligned with one of the axes. Otherwise it may be approximated. The direction vector of a corner can be exactly described in the model if it is aligned with an axis; otherwise an approximation to it will be so large as to be of little value.

    2.5. Non-convex Shapes

For the most part, the solution methods we describe can also be applied to non-convex shapes, if we first approximate them by taking their convex hull. We can apply the forward constraints by requiring that each model volume project inside the convex hull of the corresponding image region. To do this, it is sufficient to require every vertex of the convex hull of the model volume to project inside the half-plane corresponding to every edge bounding the convex hull of the corresponding image region. The rectangular backward constraints can similarly be applied by approximating each model volume using a rectangular box, and using the vertices of the convex hull of the image region.
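A sketch of the forward constraints as a feasibility linear program, in the spirit just described (our own illustrative code, again assuming SciPy; `forward_lp` and its argument conventions are ours): each convex-hull vertex of each model volume, paired with each bounding half-plane of the corresponding image region, contributes one linear inequality in the eight affine parameters.

```python
import numpy as np
from scipy.optimize import linprog

def forward_lp(volumes, regions):
    """Feasibility LP for the forward constraints under 3-D to 2-D affine
    projection.  volumes: lists of hull vertices (qx, qy, qz); regions:
    matching lists of half-planes ((nx, ny), c), meaning nx*x + ny*y <= c,
    that bound the convex hull of each image region.  The unknowns are the
    eight parameters (a11, a12, a13, tx, a21, a22, a23, ty)."""
    rows, rhs = [], []
    for verts, halfplanes in zip(volumes, regions):
        for (qx, qy, qz) in verts:
            for (nx, ny), c in halfplanes:
                # nx*(a1 . q + tx) + ny*(a2 . q + ty) <= c
                rows.append([nx * qx, nx * qy, nx * qz, nx,
                             ny * qx, ny * qy, ny * qz, ny])
                rhs.append(c)
    return linprog(np.zeros(8), A_ub=np.array(rows), b_ub=np.array(rhs),
                   bounds=[(None, None)] * 8, method="highs")
```

For two unit cubes matched to the two unit squares they project to orthographically, the LP is feasible (orthographic projection itself satisfies every constraint).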

However, a complication arises in applying the forward constraints with known occlusion. In that case, we cannot bound the possible extent of a partially occluded image region. If the model volume is non-convex, we cannot assume that by extending the corresponding image region convexly in the direction of an occlusion we will cover all of the region that has been occluded. It is possible that the occluded portion of the region contains concavities, and extends outside of the largest possible convex region containing the detected region. This is illustrated in Fig. 3. We will not consider this problem further.

Figure 3. Suppose the cross-hatched, polygonal area is an image region known to be occluded by the dark rectangle (left). When the corresponding model volume is convex, we may apply the forward constraints, using the largest possible convex region consistent with this occlusion (middle). However, this may be incorrect if the occluded region is not convex (right), and we wish to use the convex hull of the region for the forward constraints.

    3. Uniqueness of Solutions

We have described two different types of constraints that we may use to efficiently determine model pose. The forward constraints are correct when there is only self-occlusion, or occlusion whose location is known. The backward constraints are correct even in the presence of unknown occlusion. However, these constraints are not complete: the forward constraints do not express the constraint that each model volume should explain every point in each matching image region, while the backward constraints only bound the true set of backward constraints, through the use of axial rectanguloids. We will nonetheless show that, although incomplete, these constraints are sufficient in many realistic situations to correctly determine model pose.

We first consider the performance of the forward constraints, in the presence of only self-occlusion. We show that, in general, when we have matched three or four volumes, we may determine the correct model pose with linear programming, while, in general, two matches are not sufficient to determine pose uniquely. Next we consider the performance of the backward constraints when the model volumes are each planar (but not mutually coplanar). In this case, some subset of the exact constraints will be present, and we show that in many cases these can determine the correct solution. We also point out that when we apply the rectangular backward constraints to curved 3-D volumes they can only produce an approximate solution, and not an exactly correct one. Finally, we point out that the forward constraints with known occlusions can also produce the correct model pose. Throughout our discussion we will mention situations in which degenerate solutions may be found instead of the correct ones. By restricting the solution to represent a rigid transformation we can in many cases use a degenerate solution to determine the correct one. This will be shown in Section 4.

    3.1. The Forward Constraints

In this section, we suppose that $T$ is a 3-D to 2-D affine transformation that maps the model volumes to the corresponding image regions. This transformation maps curves on the 3-D volumes to the bounding curves of the 2-D regions. These 3-D curves are called the contour generators. Our proofs assume the contour generators are closed 1-D curves, which is generically true for convex volumes. With some complication, the proofs may be extended to handle the non-generic case.


We then ask under what circumstances $T$ is the unique transformation that satisfies the forward constraints. When $T$ is unique, this means that using a linear program to find a transformation satisfying the forward constraints must in fact produce $T$, the correct solution. We had previously (Basri and Jacobs, 1997) reported preliminary results on this problem for the special case of planar model volumes that are not mutually coplanar. We now extend those results to the case of fully three-dimensional model volumes.

Theorem 1. Suppose the transformation $T$ maps the four model volumes $V_0, V_1, V_2, V_3$ to the four image regions $R_0, R_1, R_2, R_3$, and there does not exist a plane that intersects all four volumes. Then $T$ is the only transformation that satisfies the forward constraints.

    This is proven in Appendix A.

Theorem 2. $T$ is not the only transformation satisfying the forward constraints if and only if, for the contour generators $T$ produces, there exists a plane, $P$, such that, for all contour generators, either $P$ contains the contour generator, or $P$ intersects it in two points, such that the orthographic projections onto $P$ of the tangents to the contour generators at the intersection points are all parallel. This implies that for three model volumes viewed in general position, $T$ will be the unique transformation that satisfies the forward constraints.

This is proven in Appendix A.

We now consider the case in which only two model volumes are matched to two image regions. We do not provide necessary and sufficient conditions for these to determine a unique transformation. However, we do note that any two model volumes may be viewed so that their corresponding image regions intersect. When this occurs, the forward constraints will be satisfied by any transformation that shrinks the model volumes to a very small size, and projects them inside this region of intersection. Therefore, any two model volumes will lead to a non-unique solution when viewed from a significant range of viewpoints.

    3.2. The Rectangular Backward Constraints

We now address the problem of determining when the rectangular backward constraints are sufficient to correctly determine the pose of a model. Recall that the backward constraints bound each model volume with an axial rectanguloid, and then require a valid transformation to map this rectanguloid into the image so that it completely contains the corresponding image region.

Let $C_i$ denote the axial rectanguloid that bounds the volume $V_i$. Note that the contour generator for $C_i$ will consist of six line segments, or four line segments for those special viewpoints in which only one face of the rectanguloid is visible. This means that the projection of $C_i$ will be a six- (or four-) sided convex polygon. The rectangular backward constraints require that $R_i \subseteq T C_i$. Since $V_i \subseteq C_i$, the correct transformation will satisfy these constraints.

In general, $T V_i$ may lie completely inside $T C_i$. In fact, $T V_i$ and $T C_i$ will touch only in the case where $V_i$ touches one of the line segments that form the boundary of $C_i$ (or for special viewpoints in which a side of $C_i$ projects to a line segment in the image). For a general smooth 3-D volume, no bounding rectanguloid (or polygon of any sort) will touch the volume in one of its edges, since this would require a discontinuity in the volume. On the other hand, a convex planar volume can always be oriented so that a bounding rectanguloid is also planar, and touches it in at least four points. Therefore, cases of interest exist in which either $T V_i$ is always completely inside $T C_i$ or in which $T V_i \cap T C_i$ contains at least four points.

Suppose first that $R_i = T V_i$ lies entirely inside $T C_i$, for all volumes $V_i$. In this case, it is obvious that any small perturbation to the transformation $T$ will not violate the rectangular backward constraints. Therefore, in such a situation the backward constraints will not uniquely determine the correct pose, although they may still produce a good approximation to this pose.

This is not surprising; after all, the rectangular backward constraints involve an approximation, and so one expects them to produce poses with some error. Surprisingly, though, we can also show that for an interesting class of volumes the rectangular backward constraints will produce exactly the correct answer. Suppose that the volumes are all planar, and that they each lie in either of two different planes. This is a common occurrence if the volumes are either surface markings on two different faces of a polyhedron, or on the walls or ceiling of a room. We show that such volumes can lead to a unique solution to the backward constraints.

In particular, we show that when two or more model volumes are planar, and lie in the same plane, there can be a unique transformation of that plane that satisfies the rectangular backward constraints. If in addition there is another planar model volume, lying in a different plane, there can generically be a unique solution to the rectangular backward constraints. A precise statement and proof of this can be found in Appendix B. In brief, when planar volumes lie in two different planes we can, without loss of generality, assume that these planes are axial. In this case, axial rectangles touch the boundary of the model volumes at points that always project to the boundary of image regions. Therefore, the rectangular backward constraints include some tight constraints that can allow only a unique transformation to be valid.

    3.3. The Forward Constraints,with Known Occlusion

We now point out that our discussion of the backward constraints also provides an example of a situation in which the forward constraints, with known occlusion, can lead to a unique transformation. When we use the forward constraints with known occlusions we are making use of a subset of the constraints that would be available in the absence of occlusion. As we have described, the rectangular backward constraints can include a small subset of the complete backward constraints. For planar volumes, those points that are extremal in the axial directions lead to tight constraints on the model pose. Suppose now that the same extremal points that we make use of in Section 3.2 are visible in the image. In this case, exactly the same reasoning applies to show that a unique transformation will satisfy the forward constraints. Consequently, if any but these extremal points are occluded, we may still apply the forward constraints to find the correct transformation. This provides an example that shows that even with a large amount of occlusion, the forward constraints can produce the correct transformation, provided that this occlusion is identified.

In general the forward constraints with known occlusions will be much more effective in determining a unique solution than will the rectangular backward constraints, since every unoccluded point on the boundary of the image regions will provide a tight constraint on the transformation. It is, however, beyond the scope of this paper to fully characterize when these constraints lead to a correct solution.

    4. Recovering from Degeneracies

We now consider how we may handle certain situations in which the model volumes lead to a non-unique solution. The most common cause of this occurs when the model itself is in a sense degenerate, and so a 3-D affine transformation that satisfies analogs to the forward and backward constraints may be found. However, we will show that in this case we can often reconstruct the correct transformation, undoing the effects of this degeneracy.

Suppose there exists a 3-D to 3-D affine transformation, $A \neq I$, such that $A V_i \subseteq V_i$ for all $i$. In this case, the forward constraints can almost never lead to a unique solution. If the correct transformation is $T$, then the transformation $T A$ will clearly also satisfy the forward constraints. Moreover, $T A \neq T$ except for special cases when projection completely removes the effects of $A$.

If the model contains three distinct volumes, any such transformation must contain three fixed points. If the volumes are not traversed by a single line, $A$ must consist of a fixed plane, with a contraction in some direction toward that plane.

Similarly, a degeneracy will almost always occur in the rectangular backward constraints when there is an affine transformation $A$ such that $V_i \subseteq A C_i$. This typically occurs when a plane exists that intersects the $C_i$ rectanguloids on sides that can be divided into two parallel sets. In this case $A$ can be the affine transformation that leaves this plane fixed, and expands the model in the direction shared by the rectanguloid sides intersected by the plane. For example, suppose all the rectanguloids have minimum $x$ coordinates less than zero, and maximum $x$ coordinates greater than zero. In this case, the $x = 0$ plane will intersect each $C_i$ in two sides parallel to the $y = 0$ plane and in two sides parallel to the $z = 0$ plane. In this case, we may expand the model rectanguloids in the $x$ direction (which is the intersection of the $y = 0$ plane and the $z = 0$ plane) so as to leave the $x = 0$ plane fixed. This transformation has the form:

$$A \vec{p} = \begin{pmatrix} 1 + \epsilon & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \vec{p}$$

for any $\epsilon > 0$. In this example, $V_i \subseteq C_i \subseteq A C_i$. In such a situation, whenever $T$ satisfies the rectangular backward constraints, so will $T A$.

However, since the model is accessible to us in advance, we may detect when such degeneracies occur, and undo their effect. To determine the possibility of such a degeneracy, we may apply the forward or backward constraints for the case of 3-D affine transformations. For the forward constraints we compare the model to itself. For the backward constraints we compare the model volumes to their bounding rectanguloids. Since 3-D affine transformations are linear, these constraints can be solved for using linear programming, and will reveal the presence of a transformation such as $A$. However, in the experiments described below, we have simply detected this possibility by hand.

If a model can be contracted or expanded perpendicularly in a single direction, we first preprocess the model so that this direction is aligned with one of the axes. Without loss of generality, assume this is the $x$-axis. Let the contraction/expansion transformation applied to the model be given by

$$A \vec{p} + \vec{b} = \begin{pmatrix} 1 + \epsilon & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \vec{p} + \begin{pmatrix} a_x \\ 0 \\ 0 \end{pmatrix}.$$

Notice that we must allow for an unknown translation of $a_x$ in the $x$ direction, as a part of the contracting/expanding transformation. This is because, although we assume that contraction or expansion is in the $x$ direction, and that there is a fixed plane perpendicular to the direction, we do not assume that this is the $x = 0$ plane. In some cases there will not be a unique plane that might be the fixed plane; there might be a family of parallel planes, any one of which might have been fixed in the contraction/expansion of the model. Now, suppose the image is obtained by applying a rigid transformation to the model followed by a scaled orthographic projection. The imaging process can be written as

$$S \vec{p}\,' + \vec{t} = \begin{pmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \end{pmatrix} \vec{p}\,' + \begin{pmatrix} t_x \\ t_y \end{pmatrix},$$

where the entries $s_{ij}$ are the first two rows of a scaled rotation matrix. Our solution methods, then, will produce a transformation $T$ that is a composition of the two transformations, as follows:

$$T \vec{p} = S(A \vec{p} + \vec{b}) + \vec{t} = \begin{pmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \end{pmatrix} \left( \begin{pmatrix} 1 + \epsilon & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \vec{p} + \begin{pmatrix} a_x \\ 0 \\ 0 \end{pmatrix} \right) + \begin{pmatrix} t_x \\ t_y \end{pmatrix}.$$

Given the transformation, $T$, we wish to recover the matrix $S$ that indicates the true scaled orthographic projection of the model. This is easily done, since

$$s_{12} = t_{12}, \quad s_{22} = t_{22}, \quad s_{13} = t_{13}, \quad s_{23} = t_{23}.$$

We may then readily determine the values of $s_{11}, s_{21}$ which will satisfy the rigidity constraints of the matrix, so that:

$$s_{11} s_{21} + s_{12} s_{22} + s_{13} s_{23} = 0$$
$$s_{11}^2 + s_{12}^2 + s_{13}^2 = s_{21}^2 + s_{22}^2 + s_{23}^2.$$

Thus we can recover the scaled rotation matrix and also the $y$ translation that produced the image regions. We cannot directly recover the translation $a_x$. However, once we have determined the rest of the transformation, it is easy to determine the appropriate translation. For example, we could run linear programming again, allowing for only an $x$ translation.
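The recovery step above can be written out directly. The routine below is our own sketch; note that the rigidity equations determine the first column of $S$ only up to a global sign, with the two choices corresponding to reflected poses.

```python
import math

def recover_scaled_rotation(T):
    """Given the 2x3 matrix part of T = S . diag(1 + eps, 1, 1), recover
    the scaled-rotation matrix S.  Columns 2 and 3 of T already equal
    those of S; s11 and s21 come from the rigidity constraints
    s11*s21 + s12*s22 + s13*s23 = 0 and equal row norms.  The first
    column is determined only up to sign."""
    s12, s13 = T[0][1], T[0][2]
    s22, s23 = T[1][1], T[1][2]
    c = -(s12 * s22 + s13 * s23)                        # = s11 * s21
    d = (s22 ** 2 + s23 ** 2) - (s12 ** 2 + s13 ** 2)   # = s11^2 - s21^2
    # s21^2 solves b^2 + d*b - c^2 = 0; take the non-negative root.
    b = (-d + math.sqrt(d * d + 4 * c * c)) / 2.0
    s21 = math.sqrt(b)
    s11 = c / s21 if s21 > 1e-12 else math.sqrt(max(d, 0.0))
    return [[s11, s12, s13], [s21, s22, s23]]
```

For example, with $S = [[2, 2, 1], [-2, 1, 2]]$ (three times the first two rows of a rotation matrix) and $\epsilon = 0.5$, the observed matrix is $T = [[3, 2, 1], [-3, 1, 2]]$, and the routine returns $S$ with the sign of its first column flipped; both choices satisfy the rigidity constraints.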

    5. Experiments

We now present some experiments that demonstrate the feasibility of our approach to recognition. These experiments will provide useful information about the accuracy of the poses that we can recover using both the forward and the rectangular backward constraints.

In these experiments a model was first constructed by hand, using images of the object and knowledge of its structure. Then, the Canny edge detector (Canny, 1986) was run on a new image of the object. We automatically extracted sets of edges that formed salient convex groups, using the grouping system described in (Jacobs, 1996). The localization of groups in the image will therefore contain errors due to running a real edge detector and grouping system on real images. A subset of these groups was then extracted and matched by hand to groups in the model. See Nayar and Bolle (1996) for one suggestion about how to use intensity information to match such regions automatically. Our system then used these matches to determine the model poses shown here.

In the first set of experiments we use a black box with different shaped regions painted on the side. Although this object is somewhat artificial, it allows us to experiment with a variety of different conditions. In Figs. 4–10 we show the poses derived by the forward and backward constraints using different combinations


Figure 4. This shows poses derived using the forward and backward constraints. The image is shown, with the regions used to derive the pose marked with white hatching. Superimposed over the image are the white outlines of all model groups, to indicate the pose that has been derived using the forward constraints (left) and backward constraints (right). This figure shows the poses derived using five regions.

Figure 5. Another example using five regions. The pose derived using the forward constraints is on the left, and the pose from the backward constraints is on the right.

Figure 6. An example using four regions. Note that one of the regions is partially occluded, leading to a poor solution with the forward constraints (left).

Figure 7. An example using three regions. The backward constraints (right), which are approximate, lead to a much noisier solution than the forward constraints (left).


Figure 8. An example using three regions, one of them occluded. This leads to a poor solution for the forward constraints (left), which do not allow for occlusion. The backward constraints (right) produce a much more accurate pose. Note that in spite of the occlusion, the backward constraints produce a more accurate solution than they did in Fig. 7, because a more stable triple of image regions is matched.

Figure 9. An example using three regions, with two of them occluded. This leads to a very poor solution for the forward constraints (left), while the solution for the backward constraints also degrades. Occlusion may cause the correct solution to become non-unique, and indeed we can see that the pose found satisfies the backward constraints very well.

    Figure 10. This example shows poses derived from two regions. We can see that these are not sufficient to determine the correct pose.

of regions. This shows that both sets of constraints are able to produce accurate model poses when at least three region correspondences are used. It also illustrates how the poses degrade as we use fewer regions, and as the amount of occlusion increases. Of course the forward constraints are especially vulnerable to unidentified occlusions.

In Figs. 11–13 we demonstrate the method discussed in Section 4 for determining the correct scaled orthographic projection from a degenerate transformation. The model regions used in these examples are coplanar, and thus lead to a degeneracy. Note, however, that degeneracies can occur even when 3-D model volumes are used.

Figure 14 shows the forward and backward constraints being used to recognize two other objects. These experiments show that even with noise due to the edge detection and grouping process, we can use


Figure 11. Here we show a set of regions that produce a degenerate solution. The forward constraints are applied. However, since the model regions matched are really coplanar, a solution is found that contracts the 3-D model into a single plane (left). Since this potential degeneracy can be detected ahead of time, we may postprocess the pose to uncontract it, producing a scaled orthographic transformation (right) that matches the model and image better.

    Figure 12. Similarly, we show the backward constraints applied to produce a degenerate solution (left), along with the uncontracted pose.

    Figure 13. Similar results for a different object. The pose of the soda can found by the forward constraints (left) is significantly contracted.After uncontracting the pose (right) into a scaled orthographic projection, we obtain a better fit.

region matches to accurately determine pose without explicit correspondences between local features such as points or line segments.

Finally, Figs. 15–18 show the results of our algorithms for a synthetic model composed of 3-D volumes. The object is composed of seven volumes: the four legs, the body, the neck and the head. That is, each volume is a part; the faces of volumes are not used as separate parts. We segment and match these parts perfectly in the images. Being synthetic, this object is simple to model accurately. Moreover, since the only source of error is digitization in the projection, any inaccuracies that we find are due to limitations of the methods proposed. Figure 16 shows that the backward constraints can produce very accurate poses, in spite of the approximations made by taking bounding rectanguloids. Note that none of the model parts are at all rectangular; the legs and neck, for example, are


Figure 14. These figures show the performance of the system on more realistic objects. On the left, the system uses the forward constraints to accurately determine the pose of the soda can. Four regions are used in this case. The regions are surface markings on the cylinder of the can, and a circular region from the top of the can. On the right, the backward constraints are used to locate the pen box. Note that the soda can is occluding some of the regions used.

Figure 15. A simple synthetic animal, seen from along the x-axis (left), y-axis (center) and z-axis (right). Bounding boxes for the rectangular backward constraints were built with the model in this reference frame.

Figure 16. These figures show the rectangular backward constraints applied to a synthetic, 3-D object. We show the image used with some of the boundary of the volumes of the projected model superimposed in white. On the left, the pose found using all seven model volumes. In the center, we use four volumes: the head, the two left legs, and the front right leg. In both cases, we find accurate poses. On the right, we use only the head and the two left legs, and a poorer pose is found.

Figure 17. The results of the forward constraints with the same image and volume/region matches as in the previous figure. Since there is little occlusion of object parts by other parts, the forward constraints produce accurate poses. When only three volumes (right) are used, they continue to produce an accurate pose. Note that the backward constraints lead to a much less accurate pose in this case, as shown in the previous figure.


Figure 18. Here we show a view of the object in which the leg volumes are significantly occluded by the body. We match all seven volumes to regions. The rectangular backward constraints (left) produce a fairly accurate pose. The forward constraints (right), which do not allow for such occlusion, produce a much noisier pose.

nine- and seven-sided polygons, respectively. Figure 17 shows that the forward constraints do indeed produce accurate results when only three volumes are matched. Finally, Fig. 18 shows the performance of the methods when some model volumes are partially occluded, in this case by other parts of the object.

    6. Conclusion

Recognition of 3-D objects in 2-D images has been hampered by the difficulty of finding representations that can faithfully model complex 3-D objects and still be used to determine pose based on their 2-D images. In this work we make use of a simple representation, which divides objects into parts and then represents each part as a volume of points. This representation can clearly be applied to a large class of objects. Our contribution is to show that it can also be used to accurately determine the pose of these objects. We show that even without specific correspondences between local geometric features, we may use region matches to determine the correct model pose. At the same time our method allows us to incorporate correspondences between points, lines or line segments, should they be available.

Specifically, we present new results that show that the forward constraints, which allow for self-occlusion, may correctly determine the pose of a 3-D object, typically when a correspondence has been found between three 3-D parts of the object and three matching 2-D image regions. We have also shown that it is a more difficult problem to find poses that satisfy the backward constraints, which allow for unidentified occlusions. This problem can have multiple disconnected solutions. These results apply not just to our algorithm

but to any parts-based recognition system which determines pose while allowing for arbitrary image occlusions of unknown locus. However, we have also devised a novel algorithm, based on work in computational geometry, that finds solutions that satisfy an approximation to the backward constraints, using linear programming. And we show that in some cases of interest, this approximate solution leads to our finding exactly the correct model pose. These results demonstrate that we can recognize 3-D objects using a very simple and novel representation of their structure.

    Appendix A: Uniqueness of Forward Constraints

In this section, we prove Theorems 1 and 2. We first prove the following useful lemma:

Lemma 1. Suppose that $T'$ is a transformation that also satisfies the forward constraints. Then for each model volume $V_i$ and image region $R_i$, there exists a point $\vec{p}_i \in V_i$ such that $T\vec{p}_i = T'\vec{p}_i$.

Proof: Throughout this section, we will assume that the contour generator of $V_i$ is a 1-D curve on $V_i$ that projects to the 1-D boundary of $R_i$. This will be true for smooth, generic, convex volumes in general position. Our reasoning can be readily extended to other cases; however, we will not consider those in this paper, to simplify our arguments. Clearly we can construct a 2-D surface, call it $S_i$, such that $S_i$ is bounded by the contour generator, such that $S_i \subset V_i$, and such that $TS_i = R_i$. Therefore, $T$ will define a continuous one-to-one mapping between $S_i$ and $R_i$. So even though $T$ is not in general invertible, we may invert $T$ when we restrict its domain to $S_i$. Let $\bar{T}$ denote the restriction of


$T$ to this domain, and let $\bar{T}^{-1}$ denote its inverse, a mapping from $R_i$ to $S_i$. Because $T'$ satisfies the forward constraints, $T'S_i \subseteq R_i$. Therefore $T'\bar{T}^{-1}$ defines a continuous mapping from $R_i$ into $R_i$. Brouwer's fixed point theorem, a basic result in functional analysis (see, for example, Conway (1990)), tells us that since $R_i$ is convex, and hence topologically equivalent to a disc, such a mapping must have a fixed point. That is, there exists some point $\vec{q}_i \in R_i$ such that $\vec{q}_i = T'\bar{T}^{-1}\vec{q}_i$. Therefore, $T(\bar{T}^{-1}\vec{q}_i) = T'(\bar{T}^{-1}\vec{q}_i)$, proving the lemma. □

    Using this lemma, we may now show the following:

Theorem 1. Suppose the transformation $T$ maps the four model volumes $V_0, V_1, V_2, V_3$ to the four image regions $R_0, R_1, R_2, R_3$, and there does not exist a plane that intersects all four volumes. Then $T$ is the only transformation that satisfies the forward constraints.

Proof: Let $T'$ map the four model volumes $V_0, \ldots, V_3$ to inside the corresponding image regions $R_0, \ldots, R_3$. By Lemma 1, for every volume $V_i$, $i = 0, \ldots, 3$, there exists a point $\vec{p}_i \in V_i$ such that $T'(\vec{p}_i) = T(\vec{p}_i)$. Since there exists no plane that intersects all four volumes, the points $\vec{p}_0, \ldots, \vec{p}_3$ are not coplanar. Consequently, since correspondences of four non-coplanar points determine a 3-D to 2-D affine transformation uniquely, $T = T'$. □
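As an illustration of the last step, a 3-D to 2-D affine transformation $T(\vec{p}) = A\vec{p} + \vec{b}$ has eight unknowns, and four non-coplanar point correspondences give exactly eight linear equations. The following sketch (with hypothetical values; this is not the paper's implementation) recovers the transformation with NumPy:

```python
import numpy as np

# A hypothetical 3-D-to-2-D affine map T(p) = A @ p + b, with A a 2x3
# matrix and b a 2-vector: eight unknowns in total.
A_true = np.array([[1.0, 0.2, -0.5],
                   [0.1, 0.9,  0.3]])
b_true = np.array([2.0, -1.0])

# Four non-coplanar model points (a tetrahedron) and their images.
pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
img = pts @ A_true.T + b_true

# Each correspondence contributes two linear equations; stacking the
# homogeneous coordinates gives an invertible 4x4 system exactly when
# the four points are non-coplanar.
M = np.hstack([pts, np.ones((4, 1))])   # rows (x, y, z, 1)
sol = np.linalg.solve(M, img)           # 4x2: columns are (A_row, b_entry)
A_rec, b_rec = sol[:3].T, sol[3]

assert np.allclose(A_rec, A_true) and np.allclose(b_rec, b_true)
```

If the points were coplanar, `M` would be singular and the solve would fail, mirroring the coplanarity condition in the proof.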

We now show that typically, only three matches are required to uniquely determine the pose. In (Basri and Jacobs, 1995) we have proven the following lemma:

Lemma 2. $T$ is uniquely determined for the set of volumes $V_i$ if and only if it is uniquely determined for the set of volumes $QV_i$, where $Q$ is any 3-D affine transformation.

This tells us that we may, without loss of generality, place the model volumes in any affine reference frame. This can significantly simplify our reasoning, since it allows us to assume without loss of generality that $T$ is the identity transformation, plus orthographic projection. We now suppose that we have matched three model volumes to three image regions, and that the image regions are not all intersected by a single line. Further, we suppose that the image regions were produced by applying the transformation, $T$, to the model volumes, and that a different transformation, $T' \neq T$,

also satisfies the forward constraints. Then there exist three non-collinear model points, $\vec{p}_0, \vec{p}_1, \vec{p}_2$, such that $T\vec{p}_0 = T'\vec{p}_0$, $T\vec{p}_1 = T'\vec{p}_1$, and $T\vec{p}_2 = T'\vec{p}_2$. Lemma 2 tells us that we may assume, without loss of generality, that these three points are in the $z = 0$ plane and fixed under the transformation $T$, and therefore also under $T'$. This means that the entire $z = 0$ plane is fixed under these two transformations. Further, without loss of generality, we may assume that $T(0, 0, 1) = (0, 0)$. Therefore, we may write:

$$T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \quad \text{and} \quad T' = \begin{pmatrix} 1 & 0 & k_1 \\ 0 & 1 & k_2 \end{pmatrix},$$

where either $k_1$, $k_2$, or both are non-zero.

By choosing this affine reference frame, we have

constructed things so that each contour generator, $s_i$, projects orthographically to form the boundary of the corresponding region. Let $\vec{p}$ be an arbitrary point on $V_0$'s contour generator, with coordinates $(x, y, z)$. Then $T\vec{p} = (x, y)$, and $T'\vec{p} = (x, y) + z(k_1, k_2)$. This tells us that $T'$ maps the contour generator so that it is displaced from the region boundary in either the direction $(k_1, k_2)$ or $(-k_1, -k_2)$, depending on whether the point is above or below the image plane.
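This displacement can be checked numerically. The short sketch below (with arbitrary illustrative values for $k_1$ and $k_2$) confirms that a point at height $z$ is displaced by exactly $z(k_1, k_2)$, while points in the $z = 0$ plane are fixed by both transformations:

```python
import numpy as np

# T is the identity plus orthographic projection; T' differs only in
# the third column, as in the derivation above. Values are arbitrary.
k1, k2 = 0.3, -0.2
T  = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
Tp = np.array([[1.0, 0.0, k1],
               [0.0, 1.0, k2]])

p = np.array([0.7, -1.1, 2.5])          # a contour point with z != 0
disp = Tp @ p - T @ p                   # displacement of the projection
assert np.allclose(disp, p[2] * np.array([k1, k2]))

q = np.array([0.4, 0.8, 0.0])           # a point in the z = 0 plane
assert np.allclose(T @ q, Tp @ q)       # fixed under both maps
```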

We may use this fact to place constraints on the direction of the contour generator's tangent. If one of the contour generators is either entirely above, or entirely below, the image plane, then clearly $T'$ will map some of these points outside the corresponding region, violating our assumptions.

Next, suppose that the image plane intersects the contour generator in at least two discrete points, but not in an entire subcurve of the contour generator. Let $R_i \cap s_i$ be the points $\vec{p}_i^{\,1}, \vec{p}_i^{\,2}$. Let the tangent to $R_i$ at $\vec{p}_i^{\,j}$ be $\vec{w} = (w_x, w_y, 0)$. Let the tangent to $s_i$ at the point $\vec{p}_i^{\,j}$ have the direction $\vec{v}$. Then the directions of $\vec{w}$, $T\vec{v}$ and $T'\vec{v}$ must all be the same. That is true because if a point in the model projects to the boundary of the image region, and the tangents of the region point and the projected model point differ, then the projected model volume will not be contained in the image region. So, since $\vec{w} = T\vec{v}$, we must have $\vec{v} = (w_x, w_y, v_z)$ for some $v_z$. The points $\vec{p}_i^{\,j}$ are also fixed under $T'$, since they lie in the $z = 0$ plane which is fixed under $T'$. Therefore, the tangent to $T'V_i$ at $\vec{p}_i^{\,j}$


is $(w_x + k_1 v_z, w_y + k_2 v_z)$. Again, the condition that $T'V_i \subseteq R_i$ implies that $T'\vec{v}$ must have the same direction as $\vec{w}$. Since either $k_1 \neq 0$ or $k_2 \neq 0$, it follows that the directions of $(w_x, w_y)$ and $(k_1, k_2)$ must be parallel. Therefore, the tangents to each region $R_i$ at a point $\vec{p}_i^{\,j}$ must all be parallel to $(k_1, k_2)$, and so they must all be parallel to each other.

If the image plane intersects the contour generator in an entire curve, we may apply the same reasoning to the two end points of the curve. In the special case that the entire contour generator lies in the image plane, that volume will not constrain $k_1$ and $k_2$, for $T'$ close to $T$.

    This has shown:

Theorem 2. $T$ is not the only transformation satisfying the forward constraints if and only if, for the contour generators $T$ produces, there exists a plane, $P$, such that, for all contour generators, either $P$ contains the contour generator, or $P$ intersects it in two points, such that the orthographic projections onto $P$ of the tangents to the contour generators at the intersection points are all parallel.

Simple variable counting now tells us that for general shapes, in general position, this will not be possible. Given a set of model volumes and a particular viewpoint, the contour generators will be fixed. First, for general objects the contour generator will not be planar. Next, we have three degrees of freedom in choosing a plane to intersect the contour generators. Each plane will determine the direction of six tangent vectors projected into that plane. For all these tangents to be parallel provides five degrees of constraint; however, we have only three degrees of freedom available to satisfy these constraints. Therefore, in general, given a set of model volumes and a view of them, there will be a unique transformation that satisfies the forward constraints, provided that the image regions cannot be intersected by a single line.

    Appendix B: Uniqueness of Backward Constraints

We begin our analysis by supposing that at least two of the model volumes, $V_0, V_1$, are planar and lie in the same plane. Without loss of generality, we assume that this is the $z = 0$ plane. Note that we are free to preprocess the model volumes with an affine transformation

to achieve this before approximating them with rectangles. The volumes will then be approximated by 2-D rectangles in this plane, which we call $C_0$ and $C_1$. Each side of each rectangle will touch the corresponding volume in at least one point. These will be the points with the highest and lowest values in their $x$ and $y$ coordinates. We will call these points $\vec{x}_{0,l}, \vec{x}_{0,h}, \vec{y}_{0,l}, \vec{y}_{0,h}, \vec{x}_{1,l}, \vec{x}_{1,h}, \vec{y}_{1,l}, \vec{y}_{1,h}$, where, for example, $\vec{x}_{0,h}$ is the point in $V_0$ with the highest $x$ coordinate (see Fig. 19). Note that if there is a line segment on the boundary of one of the volumes with constant $x$ or $y$ values we can pick any of these points, or we can easily extend the reasoning given below to include this case. Also, one point may be extremal in both the $x$ and $y$ direction; this does not affect the argument given below. At these eight points, the backward constraints will require that the corresponding region point found in the image will lie on the appropriate side of the projection of a horizontal or vertical line passing through that point in the model. Therefore, we will have eight constraints that are not approximate, but that are subsets of the constraints we would have if we

Figure 19. Two coplanar volumes $V_0$ and $V_1$ bounded with flat rectanguloids $R_0$ and $R_1$. The points $\vec{x}_{0,l}, \vec{x}_{0,h}, \vec{y}_{0,l}, \vec{y}_{0,h}, \vec{x}_{1,l}, \vec{x}_{1,h}, \vec{y}_{1,l}, \vec{y}_{1,h}$ are the contact points of the volumes and the rectanguloids. In the case on the top, the two pairs of points $\vec{y}_{0,l}, \vec{y}_{1,l}$ and $\vec{y}_{0,h}, \vec{y}_{1,h}$ are linearly separable, as indicated by the dashed line. On the bottom, the volumes lead to a unique solution in the $z = 0$ plane.


could apply the backward constraints exactly. A transformation infinitesimally different from the correct one will not violate any of the other rectangular backward constraints, but it may violate these.
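The eight contact points described above are simply the argmin/argmax of the boundary points in $x$ and $y$. A minimal sketch, assuming each planar volume is given as a list of 2-D boundary vertices (the polygons and the helper name `contact_points` are hypothetical):

```python
import numpy as np

def contact_points(verts):
    """Return the extremal points (x-low, x-high, y-low, y-high) at which
    an axis-aligned bounding rectangle touches a planar volume, given its
    boundary vertices as an (n, 2) array."""
    v = np.asarray(verts, dtype=float)
    return {'x_l': v[v[:, 0].argmin()], 'x_h': v[v[:, 0].argmax()],
            'y_l': v[v[:, 1].argmin()], 'y_h': v[v[:, 1].argmax()]}

# Two hypothetical coplanar volumes (convex polygons in the z = 0 plane).
V0 = [(0, 0), (2, 1), (1, 3)]
V1 = [(4, 0), (6, 2), (5, -1)]
c0, c1 = contact_points(V0), contact_points(V1)

assert tuple(c0['x_h']) == (2, 1)     # rightmost point of V0
assert tuple(c1['y_l']) == (5, -1)    # lowest point of V1
```

Together the two volumes yield the eight contact points at which the rectangular backward constraints are exact rather than approximate.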

Now we consider when these eight constraints will suffice to uniquely determine that portion of the projection that affects the $z = 0$ plane. From Lemma 2 we may assume, without loss of generality, that $T$ leaves the $z = 0$ plane fixed;¹ we must then determine the circumstance under which there exists a different transformation, $T'$, which has a different effect on the $z = 0$ plane while respecting the rectangular backward constraints. It will be convenient to focus on the inverse of $T'$ when we restrict its effect to the $z = 0$ plane. We ask when there is an inverse transformation that can map each image point to the appropriate side of the bounding model rectangle. We denote this inverse transformation $T'^{-1}$, with components $t'^{-1}$ given the appropriate subscript.

Note that the $x$ coordinates of points in the $z = 0$ plane are affected only by $t'^{-1}_{11}$, $t'^{-1}_{12}$ and $t'^{-1}_x$, while the $y$ coordinates are changed by $t'^{-1}_{21}$, $t'^{-1}_{22}$ and $t'^{-1}_y$. Since the $\vec{x}_{0,h}, \vec{x}_{0,l}, \vec{x}_{1,h}, \vec{x}_{1,l}$ points are constrained only in the $x$ direction, and the corresponding $y$ points are constrained only in the $y$ direction, we may consider these two sets of points separately. A non-unique solution exists if and only if either there exist values of $t'^{-1}_{11}$, $t'^{-1}_{12}$ and $t'^{-1}_x$ that map the points $\vec{x}_{0,h}, \vec{x}_{0,l}, \vec{x}_{1,h}, \vec{x}_{1,l}$ within the minimum and maximum $x$ values of their bounding rectanguloids, or if there similarly exist values of $t'^{-1}_{21}$, $t'^{-1}_{22}$ and $t'^{-1}_y$ that map the $y$ points in the appropriate $y$ directions.

The transformation $T'^{-1}$ will change the $x$ coordinate of the point $(x, y)$ by the amount:

$$T'^{-1}_x(x, y, 0) - x = (t'^{-1}_{11} - 1)x + t'^{-1}_{12}y + t'^{-1}_x.$$

Therefore, the line

$$(t'^{-1}_{11} - 1)x + t'^{-1}_{12}y + t'^{-1}_x = 0$$

will have its $x$ coordinates fixed (note that in the case of pure translation we may think of the equation $t'^{-1}_x = 0$ as describing a vertical line at $x = \infty$). On one side of this line, the value of $(t'^{-1}_{11} - 1)x + t'^{-1}_{12}y + t'^{-1}_x$ is positive; on the opposite side this value is negative. This means that either all points will shift their $x$ coordinate towards this fixed line, or all points will shift away from it. Therefore, the only way such a transformation can

satisfy the backward constraints is if we can draw a line that separates the points $\vec{x}_{0,l}, \vec{x}_{1,l}$ from the points $\vec{x}_{0,h}, \vec{x}_{1,h}$. In that case any such separating line can be the fixed line, and $T'^{-1}$ can map all points towards that line, in the $x$ direction. It may or may not be possible to find such a separating line, depending on the configuration of the model.
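This separability condition is easy to test for small point sets. The sketch below is a brute-force check, not the paper's algorithm (which uses linear programming): it enumerates candidate separating lines whose normals are parallel or perpendicular to segments between input points, which covers hull-edge and closest-vertex-pair separators; touching sets count as separable under the non-strict test.

```python
import itertools

import numpy as np

def linearly_separable(A, B):
    """Return True if 2-D point sets A and B lie on opposite (closed)
    sides of some line. Every candidate normal is verified before
    returning True, so a positive answer is always correct."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    candidates = []
    for p, q in itertools.combinations(np.vstack([A, B]), 2):
        d = q - p
        if d.any():
            # Normals parallel and perpendicular to the segment p-q.
            candidates += [d, np.array([-d[1], d[0]])]
    for n in candidates:
        if (A @ n).max() <= (B @ n).min() or (B @ n).max() <= (A @ n).min():
            return True
    return False

# Two parallel pairs of points are separable; crossing diagonals are not.
assert linearly_separable([(0, 0), (1, 0)], [(0, 2), (1, 2)])
assert not linearly_separable([(0, 0), (2, 2)], [(0, 2), (2, 0)])
```

In the notation above, a non-unique pose exists only if `linearly_separable` holds for the low contact points versus the high contact points.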

Exactly the same reasoning holds for the points $\vec{y}_{0,l}, \vec{y}_{0,h}, \vec{y}_{1,l}, \vec{y}_{1,h}$. Therefore, there exist models for which the rectangular backward constraints uniquely determine the transformation of the $z = 0$ plane, as illustrated in Fig. 19. While it is somewhat difficult to find such a configuration in two model volumes, similar reasoning holds when more than two volumes are coplanar, except that now a larger set of points must be linearly separable for a transformation that satisfies the backward constraints to be non-unique.

Suppose now that the model contains a set of volumes for which the backward constraints uniquely determine the transformation of the $z = 0$ plane. This tells us that any transformation $T' \neq T$ satisfying the backward constraints must have the form:

$$T'\vec{p} = \begin{pmatrix} 1 & 0 & t'_{13} \\ 0 & 1 & t'_{23} \end{pmatrix} \vec{p},$$

with either $t'_{13} \neq 0$ or $t'_{23} \neq 0$. We now ask when an additional planar model volume, not in the $z = 0$ plane, will suffice to completely determine the model-to-image transformation. Without loss of generality we may assume that this volume lies in the $y$-$z$ plane, so that it is bounded by a rectangle in this plane, and has four extremal points in the $y$ and $z$ directions. Furthermore, we may assume that the transformation $T$ has the form:

$$T\vec{p} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \vec{p}.$$

That is, $T$ is the identity transformation when applied to the $x$-$y$ plane, while it maps the $y$-$z$ plane to the $x$-$y$ image plane by mapping $(0, y, z)$ to $(z, y)$. Therefore, $T$ acts also like an identity transformation on the $y$-$z$ plane, while converting it into the $x$-$y$ plane. Now consider the effect that $T'$ has as it maps the $y$-$z$ plane into the $x$-$y$ plane. It leaves the $z = 0$ line fixed, so $T'$ can consist only of a contraction of the $y$-$z$ plane towards this line, along with an interchange of coordinates. If both extremal points of the model volume in the $z$ direction are on the same side of the line $z = 0$, then


any contraction or expansion towards or away from that line will violate one of the constraints imposed by the bounding rectanguloid. That is, once the backward constraints imply a unique solution within one plane, they fail to produce a unique solution only when this plane intersects the remaining volumes. This will never happen if the volumes are surface markings on a convex polyhedron, as for example, when they lie on the walls or ceiling of a room.

We have now shown that, though approximate, the rectangular backward constraints may still uniquely determine the correct model-to-image transformation in situations of real interest. At the same time, unlike our previous 2-D formulation of the backward constraints, we can now integrate information from planar volumes that are not coplanar, or from non-planar volumes.

    Acknowledgment

The authors wish to thank Baskaran Vijayakumar for his assistance in building the experimental system and in running the experiments. This research was supported by the United States-Israel Binational Science Foundation, Grant No. 94-100. The vision group at the Weizmann Inst. is supported in part by the Israeli Ministry of Science, Grant No. 8504. Ronen Basri is an incumbent of the Arye Dissentshik Career Development Chair at the Weizmann Institute.

    Note

1. This point is somewhat subtle. We may assume this because the bounding rectanguloids can be axial to any affine reference frame, which need not be the natural Euclidean reference frame in which we store the model.

    References

Alter, T.D. and Jacobs, D. 1998. Uncertainty propagation in model-based recognition. International Journal of Computer Vision, 27(2):127–159.

Amenta, N. 1992. Finding a line traversal of axial objects in three dimensions. In Proc. of the Third ACM-SIAM Symp. on Discrete Alg., pp. 66–71.

Amenta, N. Personal communication.

Amenta, N. 1994. Bounded boxes, Hausdorff distance, and a new proof of an interesting Helly-type theorem. In Proceedings of the 10th Annual ACM Symposium on Computational Geometry, pp. 340–347.

Basri, R. 1996. Paraperspective affine. International Journal of Computer Vision, 19(2):169–179.

Basri, R. and Jacobs, D.W. 1995. Recognition using region correspondences. Technical Report CS95-33, The Weizmann Institute of Science.

Basri, R. and Jacobs, D. 1996. Matching convex polygons and polyhedra, allowing for occlusion. First ACM Workshop on Applied Computational Geometry, pp. 57–66.

Basri, R. and Jacobs, D. 1997. Recognition using region correspondences. International Journal of Computer Vision, 25(2):145–166.

Bergevin, R. and Levine, M. 1993. Generic object recognition: Building and matching coarse descriptions from line drawings. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(1):19–36.

Biederman, I. 1985. Human image understanding: Recent research and a theory. Computer Graphics, Vision, and Image Processing, 32:29–73.

Binford, T. 1971. Visual perception by computer. In IEEE Conf. on Systems and Control.

Brooks, R. 1981. Symbolic reasoning among 3-D models and 2-D images. Artificial Intelligence, 17:285–348.

Canny, J. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.

Clemens, D. 1991. Region-based feature interpretation for recognizing 3D models in 2D images. MIT AI TR-1307.

Conway, J.B. 1990. A Course in Functional Analysis. Springer-Verlag.

Duda, R.O. and Hart, P.E. 1973. Pattern Classification and Scene Analysis. Wiley-Interscience Publication, John Wiley and Sons, Inc.

Dudani, S.A., Breeding, K.J., and McGhee, R.B. 1977. Aircraft identification by moment invariants. IEEE Transactions on Computers, C-26(1):39–46.

Fischler, M.A. and Bolles, R.C. 1981. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Com. of the A.C.M., 24(6):381–395.

Forsyth, D. 1996. Recognizing algebraic surfaces from their outlines. International Journal of Computer Vision, 18(1):21–40.

Forsyth, D., Mundy, J., Zisserman, A., and Rothwell, C. 1992. Recognising rotationally symmetric surfaces from their outlines. European Conf. on Comp. Vis., pp. 639–647.

Gross, A. and Boult, T. 1990. Recovery of generalized cylinders from a single intensity image. In Proc. of the DARPA IU Workshop, pp. 557–564.

Horaud, R. 1987. New methods for matching 3-D objects with single perspective views. IEEE Trans. Pattern Anal. Machine Intell., 9(3):401–412.

Hu, M.K. 1962. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, IT-8:169–187.

Huttenlocher, D., Klanderman, G., and Rucklidge, W. 1993a. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863.

Huttenlocher, D., Noh, J., and Rucklidge, W. 1993b. Tracking non-rigid objects in complex scenes. In 4th Int. Conf. on Computer Vision, pp. 93–101.

Huttenlocher, D.P. and Ullman, S. 1990. Recognizing solid objects by alignment with an image. Int. J. Computer Vision, 5(2):195–212.

Jacobs, D. 1992. Recognizing 3-D objects using 2-D images. MIT AI Memo 1416.


Jacobs, D. 1996. Robust and efficient detection of salient convex groups. IEEE Trans. Pattern Anal. Machine Intell., 18(1):23–37.

Jacobs, D. 1997. Matching 3-D models to 2-D images. Int. J. Computer Vision, 21(1/2):123–153.

Koenderink, J. and van Doorn, A. 1991. Affine structure from motion. Journal of the Optical Society of America, 8(2):377–385.

Kriegman, D. and Ponce, J. 1990. On recognizing and positioning curved 3-D objects from image contours. IEEE Trans. Pattern Anal. Machine Intell., 12(12):1127–1137.

Lamdan, Y. and Wolfson, H.J. 1988. Geometric hashing: A general and efficient model-based recognition scheme. In Second International Conference on Computer Vision, pp. 238–249.

Lowe, D. 1985. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers: The Netherlands.

Marr, D. and Nishihara, H. 1978. Representation and recognition of the spatial organization of three dimensional structure. In Proceedings of the Royal Society of London B, 200:269–294.

Megiddo, N. 1996. Finding a line of sight thru boxes in d-space in linear time. Research Report RJ 10018, IBM Almaden Research Center, San Jose, California.

Nagao, K. and Grimson, W. 1994. Object recognition by alignment using invariant projections of planar surfaces. In 12th Int. Conf. on Pattern Rec., pp. 861–864.

Nayar, S. and Bolle, R. 1996. Reflectance based object recognition. Int. J. of Comp. Vis., 17(3):219–240.

Pellegrini, M. 1990. Stabbing and ray shooting in three-dimensional space. In Proc. of the 6th Annual Symp. on Comp. Geometry, pp. 177–186.

Pentland, A. 1987. Recognition by parts. In Proceedings of the First International Conference on Computer Vision, pp. 612–620.

Persoon, E. and Fu, K.S. 1977. Shape discrimination using Fourier descriptors. IEEE Transactions on Systems, Man and Cybernetics, 7:534–541.

Poelman, C.J. and Kanade, T. 1997. A paraperspective factorization method for shape and motion recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence.