
International Journal of Computer Vision 67(2), 233–246, 2006. © 2006 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

DOI: 10.1007/s11263-005-3962-9

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

JING XIAO
Epson Palo Alto Laboratory, 3145 Porter Drive, Palo Alto, CA 94304, USA

[email protected], [email protected] (www.cs.cmu.edu/∼jxiao)

JINXIANG CHAI AND TAKEO KANADE
Robotics Institute, CMU, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

[email protected]

[email protected]

Received September 21, 2004; Revised April 19, 2005; Accepted May 3, 2005

First online version published in February, 2006

Abstract. Recovery of three-dimensional (3D) shape and motion of non-static scenes from a monocular video sequence is important for applications like robot navigation and human computer interaction. If every point in the scene randomly moves, it is impossible to recover the non-rigid shapes. In practice, many non-rigid objects, e.g. the human face under various expressions, deform with certain structures. Their shapes can be regarded as a weighted combination of certain shape bases. Shape and motion recovery under such situations has attracted much interest. Previous work on this problem (Bregler, C., Hertzmann, A., and Biermann, H. 2000. In Proc. Int. Conf. Computer Vision and Pattern Recognition; Brand, M. 2001. In Proc. Int. Conf. Computer Vision and Pattern Recognition; Torresani, L., Yang, D., Alexander, G., and Bregler, C. 2001. In Proc. Int. Conf. Computer Vision and Pattern Recognition) utilized only orthonormality constraints on the camera rotations (rotation constraints). This paper proves that using only the rotation constraints results in ambiguous and invalid solutions. The ambiguity arises from the fact that the shape bases are not unique. An arbitrary linear transformation of the bases produces another set of eligible bases. To eliminate the ambiguity, we propose a set of novel constraints, basis constraints, which uniquely determine the shape bases. We prove that, under the weak-perspective projection model, enforcing both the basis and the rotation constraints leads to a closed-form solution to the problem of non-rigid shape and motion recovery. The accuracy and robustness of our closed-form solution are evaluated quantitatively on synthetic data and qualitatively on real video sequences.

Keywords: non-rigid structure from motion, shape bases, rotation constraint, ambiguity, basis constraints, closed-form solution

1. Introduction

Many years of work in structure from motion have led to significant successes in recovery of 3D shapes and motion estimates from 2D monocular videos. Many reliable methods have been proposed for reconstruction of static scenes (Tomasi and Kanade, 1992; Zhang et al., 1995; Triggs, 1996; Hartley and Zisserman, 2000; Mahamud and Hebert, 2000). However, most biological objects and natural scenes vary their shapes: expressive faces, people walking beside buildings, etc. Recovering the structure and motion of these non-rigid objects is a challenging task. The effects of rigid motion, i.e. 3D rotation and translation, and non-rigid shape deformation, e.g. stretching, are coupled together in the image measurements. While it is impossible to reconstruct the shape if the scene deforms arbitrarily, in practice, many non-rigid objects, e.g. the human face under various expressions, deform with a class of structures.

One class of solutions models non-rigid object shapes as weighted combinations of certain shape bases that are pre-learned by off-line training (Bascle and Blake, 1998; Blanz and Vetter, 1999; Brand and Bhotika, 2001; Gokturk et al., 2001). For instance, the geometry of a face is represented as a weighted combination of shape bases that correspond to various facial deformations. Then the recovery of shape and motion is simply a model fitting problem. However, in many applications, e.g. reconstruction of a scene consisting of a moving car and a static building, the shape bases of the dynamic structure are difficult to obtain before reconstruction.

Several approaches have been proposed to solve the problem without a prior model (Bregler et al., 2000; Torresani et al., 2001; Brand, 2001). Instead, they treat the model, i.e. the shape bases, as part of the unknowns to be computed. They try to recover not only the non-rigid shape and motion, but also the shape model. This class of approaches has so far utilized only the orthonormality constraints on camera rotations (rotation constraints) to solve the problem. However, as shown in this paper, enforcing only the rotation constraints leads to ambiguous and invalid solutions. These approaches thus cannot guarantee the desired solution. They have to either require a priori knowledge of shape and motion, e.g. constant speed (Han and Kanade, 2001), or need non-linear optimization that involves a large number of variables and hence requires a good initial estimate (Torresani et al., 2001; Brand, 2001).

Intuitively, the above ambiguity arises from the non-uniqueness of the shape bases: an arbitrary linear transformation of a set of shape bases yields a new set of eligible bases. Once the bases are determined uniquely, the ambiguity is eliminated. Therefore, instead of imposing only the rotation constraints, we identify and introduce another set of constraints on the shape bases (basis constraints), which implicitly determine the bases uniquely. This paper proves that, under the weak-perspective projection model, when both the basis and rotation constraints are imposed, a linear closed-form solution to the problem of non-rigid shape and motion recovery is achieved. Accordingly we develop a factorization method that applies both metric constraints to compute the closed-form solution for the non-rigid shape, motion, and shape bases.

2. Previous Work

Recovering 3D object structure and motion from 2D image sequences has a rich history. Various approaches have been proposed for different applications. The discussion in this section will focus on the factorization techniques, which are most closely related to our work.

The factorization method was originally proposed by Tomasi and Kanade (1992). First it applies the rank constraint to factorize a set of feature locations tracked across the entire sequence. Then it uses the orthonormality constraints on the rotation matrices to recover the scene structure and camera rotations in one step. This approach works under the orthographic projection model. Poelman and Kanade (1997) extended it to work under the weak-perspective and para-perspective projection models. Triggs (1996) generalized the factorization method to the recovery of scene geometry and camera motion under the perspective projection model. These methods work for static scenes.

The factorization technique has been extended to non-rigid cases (Costeira and Kanade, 1998; Han and Kanade, 2000; Wolf and Shashua, 2001; Kanatani, 2001; Wolf and Shashua, 2002; Vidal and Soatto, 2002; Zelnik-Manor and Irani, 2003; Sugaya and Kanatani, 2004; Vidal and Hartley, 2004). Costeira and Kanade (1998) proposed a method to reconstruct the structure of multiple independently moving objects under the weak-perspective camera model. This method first separates the objects according to their respective factorization characteristics and then individually recovers their shapes using the original factorization technique. A similar reconstruction problem was also studied in Kanatani (2001), Zelnik-Manor and Irani (2003), and Sugaya and Kanatani (2004). Wolf and Shashua (2001) derived a geometrical constraint, called the segmentation matrix, to reconstruct a scene containing two independently moving objects from two perspective views. Vidal and his colleagues (Vidal et al., 2002; Vidal and Hartley, 2004) generalized this approach for reconstructing dynamic scenes containing multiple independently moving objects. In cases where the dynamic scenes consist of both static objects and objects moving straight along respective directions, Han and Kanade (2000) proposed a weak-perspective reconstruction method that achieves a closed-form solution, under the assumption of constant moving velocities. A more generalized solution for reconstructing shapes that deform at constant velocity is presented in Wolf and Shashua (2002).


Bregler and his colleagues (Bregler et al., 2000) for the first time incorporated the basis representation of non-rigid shapes into the reconstruction process. By analyzing the low rank of the image measurements, they proposed a factorization-based method that enforces the orthonormality constraints on camera rotations to reconstruct the non-rigid shape and motion. Torresani and his colleagues (Torresani et al., 2001) extended the method in Bregler et al. (2000) to a trilinear optimization approach. At each step, two of the three types of unknowns, bases, coefficients, and rotations, are fixed and the remaining one is updated. The method in Bregler et al. (2000) is used to initialize the optimization process. Brand (2001) proposed a similar non-linear optimization method that uses an extension of the method in Bregler et al. (2000) for initialization. All three methods enforce the rotation constraints alone and thus cannot guarantee the optimal solution. Note that because both of the non-linear optimization methods involve a large number of variables, e.g. the number of coefficients equals the product of the number of images and the number of shape bases, their performance relies greatly on the quality of the initial estimates.

3. Problem Statement

Given 2D locations of P feature points across F frames, {(u, v)^T_{fp} | f = 1, ..., F, p = 1, ..., P}, our goal is to recover the motion of the non-rigid object relative to the camera, including rotations {R_f | f = 1, ..., F} and translations {t_f | f = 1, ..., F}, and its 3D deforming shapes {(x, y, z)^T_{fp} | f = 1, ..., F, p = 1, ..., P}. Throughout this paper, we assume:

– the deforming shapes can be represented as weighted combinations of shape bases;

– the 3D structure and the camera motion are non-degenerate;

– the camera projection model is the weak-perspective projection model.

We follow the representation of Blanz and Vetter (1999) and Bregler et al. (2000). The non-rigid shapes are represented as weighted combinations of K shape bases {B_i, i = 1, ..., K}. The bases are 3 × P matrices controlling the deformation of P points. Then the 3D coordinate of the point p at frame f is

X_{fp} = (x, y, z)^T_{fp} = \sum_{i=1}^{K} c_{fi} b_{ip}, \quad f = 1, \ldots, F, \; p = 1, \ldots, P \qquad (1)

where b_{ip} is the pth column of B_i and c_{fi} is its combination coefficient at frame f. The image coordinate of X_{fp} under the weak-perspective projection model is

x_{fp} = (u, v)^T_{fp} = s_f (R_f \cdot X_{fp} + t_f) \qquad (2)

where R_f stands for the first two rows of the fth camera rotation and t_f = [t_{fx} \; t_{fy}]^T is its translation relative to the world origin. s_f is the scalar of the weak-perspective projection.
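To make Eqs. (1) and (2) concrete, the following minimal sketch (Python/NumPy, not from the original paper) synthesizes one frame of a deforming shape from made-up bases and coefficients and projects it with a weak-perspective camera; all variable values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 2, 10                       # number of shape bases and of feature points

# Hypothetical shape bases B_i (3 x P) and the coefficients c_{fi} of one frame f
bases = [rng.standard_normal((3, P)) for _ in range(K)]
c_f = rng.standard_normal(K)

# Eq. (1): the 3D shape at frame f is a weighted combination of the bases
X_f = sum(c_f[i] * bases[i] for i in range(K))            # 3 x P

# A weak-perspective camera for frame f: the first two rows of a rotation,
# a 2D translation t_f, and a scalar s_f (all values made up for illustration)
theta = 0.3
R_full = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
R_f = R_full[:2, :]                                       # 2 x 3
t_f = np.array([[0.1], [0.2]])
s_f = 1.5

# Eq. (2): image coordinates x_{fp} = s_f (R_f X_{fp} + t_f)
x_f = s_f * (R_f @ X_f + t_f)                             # 2 x P
print(x_f.shape)                                          # (2, 10)
```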

Replacing X_{fp} using Eq. (1) and absorbing s_f into c_{fi} and t_f, we have

x_{fp} = (c_{f1} R_f \;\cdots\; c_{fK} R_f) \begin{pmatrix} b_{1p} \\ \vdots \\ b_{Kp} \end{pmatrix} + t_f \qquad (3)

Suppose the image coordinates of all P feature points across F frames are known. We form a 2F × P measurement matrix W by stacking all image coordinates. Then W = MB + T[1 ... 1], where M is a 2F × 3K scaled rotation matrix, B is a 3K × P bases matrix, and T is a 2F × 1 translation vector,

M = \begin{pmatrix} c_{11} R_1 & \cdots & c_{1K} R_1 \\ \vdots & \ddots & \vdots \\ c_{F1} R_F & \cdots & c_{FK} R_F \end{pmatrix}, \quad
B = \begin{pmatrix} b_{11} & \cdots & b_{1P} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KP} \end{pmatrix}, \quad
T = (t_1^T \; \cdots \; t_F^T)^T \qquad (4)

As in Han and Kanade (2000) and Bregler et al. (2000), since all points on the shape share the same translation, we position the world origin at the scene center and compute the translation vector by averaging the image projections of all points. This step does not introduce ambiguities, provided that the correspondences of all the points across all the images are given, i.e. there are no missing data. We then subtract it from W and obtain the registered measurement matrix W̃ = MB.
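A brief sketch of this registration step (Python/NumPy, mine rather than the authors' code) follows; the assumption that the tracked points are stored in an (F, P, 2) array named `tracks` is my own convention.

```python
import numpy as np

def register_measurements(tracks):
    """tracks: array of shape (F, P, 2) holding the tracked image coordinates.

    Returns the registered 2F x P measurement matrix (W with the per-frame
    translation removed) and the 2F x 1 translation estimate, computed as the
    mean image projection of all points in each frame."""
    F, P, _ = tracks.shape
    W = np.empty((2 * F, P))
    W[0::2, :] = tracks[:, :, 0]          # u coordinates of frame f in row 2f-1
    W[1::2, :] = tracks[:, :, 1]          # v coordinates of frame f in row 2f
    T = W.mean(axis=1, keepdims=True)     # translation = mean projection per row
    return W - T, T
```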

Under non-degenerate cases, the 2F × 3K scaled rotation matrix M and the 3K × P shape bases matrix B are both of full rank, respectively min{2F, 3K} and min{3K, P}. Their product, W̃, is of rank min{3K, 2F, P}. In practice, the frame number F and the point number P are usually much larger than the basis number K, such that 2F > 3K and P > 3K. Thus the rank of W̃ is 3K and K is determined by K = rank(W̃)/3. We then factorize W̃ into the product of a 2F × 3K matrix M̃ and a 3K × P matrix B̃, using Singular Value Decomposition (SVD). This decomposition is only determined up to a non-singular 3K × 3K linear transformation. The true scaled rotation matrix M and bases matrix B are of the form,

M = \tilde{M} \cdot G, \quad B = G^{-1} \cdot \tilde{B} \qquad (5)

where the non-singular 3K × 3K matrix G is called the corrective transformation matrix. Once G is determined, M and B are obtained and thus the rotations, shape bases, and combination coefficients are recovered.

All the procedures above, except obtaining G, are standard and well understood (Blanz and Vetter, 1999; Bregler et al., 2000). The problem of non-rigid shape and motion recovery is now reduced to: given the measurement matrix W̃, how can we compute the corrective transformation matrix G?
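The rank-3K factorization step described above can be written compactly; the following sketch (Python/NumPy, not the authors' implementation) splits the singular values evenly between the two factors, which is one common but not unique choice.

```python
import numpy as np

def factorize(W_reg, K):
    """Rank-3K factorization of the registered measurement matrix (Section 3).

    Returns M_tilde (2F x 3K) and B_tilde (3K x P) with
    W_reg ~= M_tilde @ B_tilde; these factors differ from the true M and B by
    the unknown 3K x 3K corrective transformation G."""
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    r = 3 * K
    sqrt_s = np.sqrt(s[:r])
    M_tilde = U[:, :r] * sqrt_s            # distribute singular values evenly
    B_tilde = sqrt_s[:, None] * Vt[:r, :]
    return M_tilde, B_tilde
```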

4. Metric Constraints

G is made up of K triple-columns, denoted as g_k, k = 1, ..., K. Each of them is a 3K × 3 matrix. They are independent of each other because G is non-singular. According to Eqs. (4) and (5), g_k satisfies,

\tilde{M} g_k = \begin{pmatrix} c_{1k} R_1 \\ \vdots \\ c_{Fk} R_F \end{pmatrix} \qquad (6)

Then,

\tilde{M} g_k g_k^T \tilde{M}^T =
\begin{pmatrix}
c_{1k}^2 R_1 R_1^T & c_{1k} c_{2k} R_1 R_2^T & \cdots & c_{1k} c_{Fk} R_1 R_F^T \\
c_{1k} c_{2k} R_2 R_1^T & c_{2k}^2 R_2 R_2^T & \cdots & c_{2k} c_{Fk} R_2 R_F^T \\
\vdots & \vdots & \ddots & \vdots \\
c_{1k} c_{Fk} R_F R_1^T & c_{2k} c_{Fk} R_F R_2^T & \cdots & c_{Fk}^2 R_F R_F^T
\end{pmatrix} \qquad (7)

We denote g_k g_k^T by Q_k, a 3K × 3K symmetric matrix. Once Q_k is determined, g_k can be computed using SVD. To compute Q_k, two types of metric constraints are available and should be imposed: rotation constraints and basis constraints. While using only the rotation constraints (Bregler et al., 2000; Brand, 2001) leads to ambiguous and invalid solutions, enforcing both sets of constraints results in a linear closed-form solution for Q_k.

4.1. Rotation Constraints

The orthonormality constraints on the rotation matrices are among the most powerful metric constraints and they have been used in reconstructing the shape and motion for static objects (Tomasi and Kanade, 1992; Poelman and Kanade, 1997), multiple moving objects (Costeira and Kanade, 1998; Han and Kanade, 2000), and non-rigid deforming objects (Bregler et al., 2000; Torresani et al., 2001; Brand, 2001).

According to Eq. (7), we have,

\tilde{M}_{2i-1:2i} Q_k \tilde{M}_{2j-1:2j}^T = c_{ik} c_{jk} R_i R_j^T, \quad i, j = 1, \ldots, F \qquad (8)

where \tilde{M}_{2i-1:2i} represents the ith bi-row of M̃, i.e. its rows 2i−1 and 2i. Due to the orthonormality of the rotation matrices, we have,

\tilde{M}_{2i-1:2i} Q_k \tilde{M}_{2i-1:2i}^T = c_{ik}^2 I_{2×2}, \quad i = 1, \ldots, F \qquad (9)

where I_{2×2} is a 2 × 2 identity matrix. The two diagonal elements of Eq. (9) yield one linear constraint on Q_k, since c_{ik} is unknown. The two off-diagonal constraints are identical, because Q_k is symmetric. For all F frames, we obtain 2F linear constraints as follows,

\tilde{M}_{2i-1} Q_k \tilde{M}_{2i-1}^T - \tilde{M}_{2i} Q_k \tilde{M}_{2i}^T = 0, \quad i = 1, \ldots, F \qquad (10)

\tilde{M}_{2i-1} Q_k \tilde{M}_{2i}^T = 0, \quad i = 1, \ldots, F \qquad (11)

We consider the entries of Q_k as variables and neglect the nonlinear constraints implicitly embedded in the formation of Q_k = g_k g_k^T, where g_k has only 9K free entries. Since Q_k is symmetric, it contains (9K^2 + 3K)/2 independent variables. It appears that, when enough images are given, i.e. 2F ≥ (9K^2 + 3K)/2, the rotation constraints in Eqs. (10) and (11) should be sufficient to determine Q_k via the linear least-squares method. However, this is not true in general. We will show that most of the rotation constraints are redundant and that they are inherently insufficient to resolve Q_k.
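As a concrete illustration of how Eqs. (10) and (11) translate into a linear system, the sketch below (Python/NumPy, not part of the paper) builds one pair of constraint rows per frame on the vectorized Q_k, using the identity a Q b^T = kron(a, b) · vec(Q); the function name and the assumption that M̃ is stored row-wise as a 2F × 3K array are my own conventions.

```python
import numpy as np

def rotation_constraint_rows(M_tilde):
    """Rows A such that A @ Q_k.flatten() = 0 encodes Eqs. (10) and (11).

    Uses the identity a Q b^T = kron(a, b) @ vec(Q) (row-major vec): each
    frame contributes the diagonal-difference constraint (10) and the
    off-diagonal constraint (11)."""
    F = M_tilde.shape[0] // 2
    rows = []
    for i in range(F):
        a = M_tilde[2 * i, :]          # row 2i-1 in the paper's 1-based indexing
        b = M_tilde[2 * i + 1, :]      # row 2i
        rows.append(np.kron(a, a) - np.kron(b, b))   # Eq. (10)
        rows.append(np.kron(a, b))                   # Eq. (11)
    return np.array(rows)              # shape (2F, (3K)^2)
```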


4.2. Why are Rotation Constraints Not Sufficient?

Under specific assumptions such as a static scene or a constant speed of deformation, the orthonormality constraints are sufficient to reconstruct the 3D shapes and camera rotations (Tomasi and Kanade, 1992; Han and Kanade, 2000). In general cases, however, no matter how many images or feature points are given, the solution of the rotation constraints in Eqs. (10) and (11) is inherently ambiguous.

Definition 1. A 3K × 3K symmetric matrix Y is called a block-skew-symmetric matrix if all the diagonal 3 × 3 blocks are zero matrices and each off-diagonal 3 × 3 block is a skew-symmetric matrix.

Y_{ij} = \begin{pmatrix} 0 & y_{ij1} & y_{ij2} \\ -y_{ij1} & 0 & y_{ij3} \\ -y_{ij2} & -y_{ij3} & 0 \end{pmatrix} = -Y_{ij}^T = Y_{ji}^T, \quad i \neq j \qquad (12)

Y_{ii} = 0_{3×3}, \quad i, j = 1, \ldots, K \qquad (13)

Each off-diagonal block consists of 3 independent elements. Because Y is symmetric and has K(K−1)/2 independent off-diagonal blocks, it includes 3K(K−1)/2 independent elements.

Definition 2. A 3K × 3K symmetric matrix Z is called a block-scaled-identity matrix if each 3 × 3 block is a scaled identity matrix, i.e. Z_{ij} = λ_{ij} I_{3×3}, where λ_{ij} is the only variable.

Because Z is symmetric, the total number of variables in Z equals the number of independent blocks, K(K+1)/2.

Theorem 1. The general solution of the rotation constraints in Eqs. (10) and (11) is G H_k G^T, where G is the desired corrective transformation matrix, and H_k is the summation of an arbitrary block-skew-symmetric matrix and an arbitrary block-scaled-identity matrix, given a minimum of (K^2+K)/2 images containing different (non-proportional) shapes and another (K^2+K)/2 images with non-degenerate (non-planar) rotations.

Proof: Let us denote Q_k as the general solution of Eqs. (10) and (11). Since G is a non-singular square matrix, Q_k can be represented as G H_k G^T, where H_k = G^{-1} Q_k G^{-T}. We then prove that H_k must be the summation of a block-skew-symmetric matrix and a block-scaled-identity matrix.

According to Eqs. (5) and (9),

c_{ik}^2 I_{2×2} = \tilde{M}_{2i-1:2i} Q_k \tilde{M}_{2i-1:2i}^T = \tilde{M}_{2i-1:2i} G H_k G^T \tilde{M}_{2i-1:2i}^T = M_{2i-1:2i} H_k M_{2i-1:2i}^T, \quad i = 1, \ldots, F \qquad (14)

H_k consists of K^2 blocks of size 3 × 3, denoted as H_{kmn}, m, n = 1, ..., K. Combining Eqs. (4) and (14), we have,

R_i \left( \sum_{m=1}^{K} \left( c_{im}^2 H_{kmm} + \sum_{n=m+1}^{K} c_{im} c_{in} (H_{kmn} + H_{kmn}^T) \right) \right) R_i^T = c_{ik}^2 I_{2×2}, \quad i = 1, \ldots, F \qquad (15)

Denote the 3 × 3 symmetric matrix \sum_{m=1}^{K} ( c_{im}^2 H_{kmm} + \sum_{n=m+1}^{K} c_{im} c_{in} (H_{kmn} + H_{kmn}^T) ) by Φ_{ik}. Then Eq. (15) becomes R_i Φ_{ik} R_i^T = c_{ik}^2 I_{2×2}. Let Γ_{ik} be its homogeneous solution, i.e. R_i Γ_{ik} R_i^T = 0_{2×2}. So we are looking for a symmetric matrix Γ_{ik} in the null space of the map X ↦ R_i X R_i^T. Because the two rows of the 2 × 3 matrix R_i are orthonormal, we have,

Γ_{ik} = r_{i3}^T δ_{ik} + δ_{ik}^T r_{i3} \qquad (16)

where r_{i3} is a unitary 1 × 3 vector that is orthogonal to both rows of R_i, and δ_{ik} is an arbitrary 1 × 3 vector, indicating that the null space of the map X ↦ R_i X R_i^T has 3 degrees of freedom. Φ_{ik} = c_{ik}^2 I_{3×3} is a particular solution of R_i Φ_{ik} R_i^T = c_{ik}^2 I_{2×2}. Thus the general solution of Eq. (15) is,

\sum_{m=1}^{K} \left( c_{im}^2 H_{kmm} + \sum_{n=m+1}^{K} c_{im} c_{in} (H_{kmn} + H_{kmn}^T) \right) = Φ_{ik} = c_{ik}^2 I_{3×3} + α_{ik} Γ_{ik}, \quad i = 1, \ldots, F \qquad (17)

where α_{ik} is an arbitrary scalar. Let us denote N = (K^2+K)/2. Each image yields a set of N coefficients (c_{i1}^2, c_{i1} c_{i2}, ..., c_{i1} c_{iK}, c_{i2}^2, c_{i2} c_{i3}, ..., c_{iK-1} c_{iK}, c_{iK}^2) in Eq. (17). We call them the quadratic-form coefficients. The quadratic-form coefficients associated with images with non-proportional shapes are generally independent. In our test, the samples of N randomly generated different (non-proportional) shapes always yield N independent sets of quadratic-form coefficients. Thus, given N images with different (non-proportional) shapes, the associated N quadratic-form coefficient sets span the space of the quadratic-form coefficients in Eq. (17). The linear combinations of these N sets compose the quadratic-form coefficients associated with any other images. Since H_{kmm} and H_{kmn} are common in all images, for any additional image j = 1, ..., F, Φ_{jk} is described as follows,

Φ_{jk} = \sum_{l=1}^{N} λ_{lk} Φ_{lk}
⇒ c_{jk}^2 I_{3×3} + α_{jk} Γ_{jk} = \sum_{l=1}^{N} λ_{lk} ( c_{lk}^2 I_{3×3} + α_{lk} Γ_{lk} )
⇒ α_{jk} Γ_{jk} = \sum_{l=1}^{N} λ_{lk} ( α_{lk} Γ_{lk} ) \qquad (18)

where λ_{lk} are the combination weights. We substitute Γ_{jk} and Γ_{lk} by Eq. (16) and absorb α_{jk} and α_{lk} into δ_{jk} and δ_{lk} respectively due to their arbitrary values. Eq. (18) is then rewritten as,

r_{j3}^T δ_{jk} + δ_{jk}^T r_{j3} - \sum_{l=1}^{N} λ_{lk} \left( r_{l3}^T δ_{lk} + δ_{lk}^T r_{l3} \right) = 0 \qquad (19)

Due to symmetry, Eq. (19) yields 6 linear constraints on δ_{lk}, l = 1, ..., N, and δ_{jk}, which in total include 3N + 3 variables. These constraints depend on the rotations in the images. Given x additional images with non-degenerate (non-planar) rotations, we obtain 6x different linear constraints and 3N + 3x free variables. When 6x ≥ 3N + 3x, i.e. x ≥ N, these constraints lead to a unique solution, δ_{lk} = δ_{jk} = 0, l = 1, ..., N, j = 1, ..., x. Therefore, given a minimum of N images with non-degenerate rotations together with the N images containing different shapes, Eq. (17) can be rewritten as follows,

Φ_{ik} = \sum_{m=1}^{K} \left( c_{im}^2 H_{kmm} + \sum_{n=m+1}^{K} c_{im} c_{in} (H_{kmn} + H_{kmn}^T) \right) = c_{ik}^2 I_{3×3}, \quad i = 1, \ldots, 2N \qquad (20)

Eq. (20) depends completely on the coefficients. According to Eqs. (18) and (20), for any image j = 2N+1, ..., F, Φ_{jk} = \sum_{l=1}^{N} λ_{lk} Φ_{lk} = \sum_{l=1}^{N} λ_{lk} c_{lk}^2 I_{3×3} = c_{jk}^2 I_{3×3}. Therefore Eq. (20) is satisfied for all F images. Denote (H_{kmn} + H_{kmn}^T) by Θ_{kmn}. Because the right side of Eq. (20) is a scaled identity matrix, for each off-diagonal element h^o_{kmm} of H_{kmm} and θ^o_{kmn} of Θ_{kmn}, we achieve the following linear equation,

\sum_{m=1}^{K} \left( c_{im}^2 h^o_{kmm} + \sum_{n=m+1}^{K} c_{im} c_{in} θ^o_{kmn} \right) = 0, \quad i = 1, \ldots, F \qquad (21)

Using the N given images containing different shapes, Eq. (21) leads to a non-singular set of linear equations on h^o_{kmm} and θ^o_{kmn}. The right sides of the equations are zeros. Thus the solution is a zero vector, i.e. the off-diagonal elements of H_{kmm} and Θ_{kmn} are all zeros. Similarly, we can derive a constraint of the form of Eq. (21) on the difference between any two diagonal elements. Therefore the difference is zero, i.e. the diagonal elements are all equal. We thus have,

H_{kmm} = λ_{kmm} I_{3×3}, \quad m = 1, \ldots, K \qquad (22)

H_{kmn} + H_{kmn}^T = λ_{kmn} I_{3×3}, \quad m = 1, \ldots, K, \; n = m+1, \ldots, K \qquad (23)

where λ_{kmm} and λ_{kmn} are arbitrary scalars. According to Eq. (22), the diagonal block H_{kmm} is a scaled identity matrix. Due to Eq. (23), H_{kmn} - (λ_{kmn}/2) I_{3×3} = -(H_{kmn} - (λ_{kmn}/2) I_{3×3})^T, i.e. H_{kmn} - (λ_{kmn}/2) I_{3×3} is skew-symmetric. Thus the off-diagonal block H_{kmn} equals the summation of a scaled identity block, (λ_{kmn}/2) I_{3×3}, and a skew-symmetric block, H_{kmn} - (λ_{kmn}/2) I_{3×3}. Since λ_{kmm} and λ_{kmn} are arbitrary, the entire matrix H_k is the summation of an arbitrary block-skew-symmetric matrix and an arbitrary block-scaled-identity matrix, given a minimum of N images with non-degenerate rotations together with N images containing different shapes, in total 2N = K^2 + K images. □

Let Y_k denote the block-skew-symmetric matrix and Z_k denote the block-scaled-identity matrix in H_k. Since Y_k and Z_k respectively contain 3K(K−1)/2 and K(K+1)/2 independent elements, H_k includes 2K^2 − K free elements, i.e. the solution of the rotation constraints has 2K^2 − K degrees of freedom. In rigid cases, i.e. K = 1, the degree of freedom is 1 (2·1^2 − 1 = 1), meaning that the solution is unique up to a scale, as suggested in Tomasi and Kanade (1992).

For non-rigid objects, i.e. K > 1, the rotation constraints result in an ambiguous solution space. For the many combinations of Z_k and Y_k for which H_k is positive semi-definite, g_k = G √H_k leads to the ambiguous solutions. Of course the space contains invalid solutions as well. Specifically, for those combinations for which H_k is not positive semi-definite, √H_k is not well defined and g_k = G √H_k refers to the invalid solutions. Such combinations indeed exist in the space. For example, when the block-scaled-identity matrix Z_k is zero, H_k equals the block-skew-symmetric matrix Y_k, which is not positive semi-definite.


4.3. Basis Constraints

The only difference between the non-rigid and rigid situations is that the non-rigid shape is a weighted combination of certain shape bases. The rotation constraints are sufficient for recovering rigid shapes, but they cannot determine a unique set of shape bases in the non-rigid cases. Instead, any non-singular linear transformation applied to the bases leads to another set of eligible bases. Intuitively, the non-uniqueness of the bases results in the ambiguity of the solution when enforcing the rotation constraints alone. We thus introduce the basis constraints that determine a unique basis set and resolve the ambiguity.

Because the deformable shapes lie in a K-basis linear space, any K independent shapes in the space form an eligible basis set, i.e. an arbitrary shape in the space can be uniquely represented as a linear combination of these K independent shapes. Note that these independent bases are not necessarily orthogonal. We measure the independence of the K shapes using the condition number of the matrix that stacks the 3D coordinates of the shapes like the measurement matrix W. The condition number is infinitely large if any of the K shapes is dependent on the other shapes, i.e. it can be described as a linear combination of the other shapes. If the K shapes are orthonormal, i.e. completely independent and equally dominant, the condition number achieves its minimum, 1. Between 1 and infinity, a smaller condition number means that the shapes are more independent and more equally dominant. The registered image measurements of these K shapes in W̃ are,

\begin{pmatrix} W_1 \\ W_2 \\ \vdots \\ W_K \end{pmatrix} =
\begin{pmatrix} R_1 & 0 & \cdots & 0 \\ 0 & R_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_K \end{pmatrix}
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{pmatrix} \qquad (24)

where W_i and S_i, i = 1, ..., K, are respectively the image measurements and the 3D coordinates of the K shapes. On the right side of Eq. (24), since the matrix composed of the rotations is orthonormal, the projecting process does not change the singular values of the 3D coordinate matrix. Therefore, the condition number of the measurement matrix has the same power as that of the 3D coordinate matrix to represent the independence of the K shapes. Accordingly we compute the condition number of the measurement matrix in each group of K images and identify a group of K images with the smallest condition number. The 3D shapes in these K images are specified as the shape bases. Note that so far we have not recovered the bases explicitly, but decided in which frames they are located. This information implicitly determines a unique set of bases.
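As an illustration of the selection step just described, here is a brief sketch (Python/NumPy, mine rather than the authors') that scores every group of K frames by the condition number of their stacked registered measurements; the exhaustive search is only practical for short sequences and small K, and one would likely prune or sample in practice.

```python
import numpy as np
from itertools import combinations

def select_basis_frames(W_reg, K):
    """Return the group of K frames whose stacked registered measurements have
    the smallest condition number (Section 4.3).  Exhaustive over all groups,
    so only practical for short sequences; prune or sample otherwise."""
    F = W_reg.shape[0] // 2
    best_group, best_cond = None, np.inf
    for group in combinations(range(F), K):
        stacked = np.vstack([W_reg[2 * f:2 * f + 2, :] for f in group])
        c = np.linalg.cond(stacked)
        if c < best_cond:
            best_group, best_cond = group, c
    return best_group
```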

We denote the selected frames as the first K images in the sequence. The corresponding coefficients are,

c_{ii} = 1, \quad i = 1, \ldots, K

c_{ij} = 0, \quad i, j = 1, \ldots, K, \; i \neq j \qquad (25)

Let Ω denote the set {(i, j) | i = 1, ..., K; i ≠ k; j = 1, ..., F}. According to Eqs. (8) and (25), we have,

\tilde{M}_{2i-1} Q_k \tilde{M}_{2j-1}^T = \begin{cases} 1, & i = j = k \\ 0, & (i, j) \in Ω \end{cases} \qquad (26)

\tilde{M}_{2i} Q_k \tilde{M}_{2j}^T = \begin{cases} 1, & i = j = k \\ 0, & (i, j) \in Ω \end{cases} \qquad (27)

\tilde{M}_{2i-1} Q_k \tilde{M}_{2j}^T = 0, \quad (i, j) \in Ω \text{ or } i = j = k \qquad (28)

\tilde{M}_{2i} Q_k \tilde{M}_{2j-1}^T = 0, \quad (i, j) \in Ω \text{ or } i = j = k \qquad (29)

These 4F(K−1) linear constraints are called the basis constraints.
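The basis constraints are also linear in the entries of Q_k, so they can be assembled with the same Kronecker-product trick used for the rotation constraints. The sketch below (Python/NumPy, not from the paper) assumes, as in the text, that the K basis frames have been reordered to be the first K frames; the 0-based index k and the helper name are my own conventions.

```python
import numpy as np

def basis_constraint_rows(M_tilde, K, k):
    """Linear equations A @ Q_k.flatten() = b encoding Eqs. (26)-(29),
    assuming the K selected basis frames are the first K frames.
    The basis index k is 0-based here (the paper counts from 1)."""
    F = M_tilde.shape[0] // 2
    A, b = [], []
    for i in range(K):
        for j in range(F):
            if i == k and j != k:
                continue                        # no constraint for i = k, j != k
            rows_i = (M_tilde[2 * i, :], M_tilde[2 * i + 1, :])
            rows_j = (M_tilde[2 * j, :], M_tilde[2 * j + 1, :])
            for ai_idx, ai in enumerate(rows_i):
                for bj_idx, bj in enumerate(rows_j):
                    if i == k and j == k:
                        rhs = 1.0 if ai_idx == bj_idx else 0.0   # Eqs. (26)-(29) at i = j = k
                    else:
                        rhs = 0.0                                # (i, j) in the set Omega
                    A.append(np.kron(ai, bj))
                    b.append(rhs)
    return np.array(A), np.array(b)
```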

5. A Closed-Form Solution

Due to Theorem 1, enforcing the rotation constraints on Q_k leads to the ambiguous solution G H_k G^T. H_k consists of K^2 blocks of size 3 × 3, H_{kmn}, m, n = 1, ..., K. H_{kmn} contains four independent entries as follows,

H_{kmn} = \begin{pmatrix} h_1 & h_2 & h_3 \\ -h_2 & h_1 & h_4 \\ -h_3 & -h_4 & h_1 \end{pmatrix} \qquad (30)

Lemma 1. H_{kmn} is a zero matrix if

R_i H_{kmn} R_j^T = 0_{2×2} \qquad (31)

where R_i and R_j are non-degenerate 2 × 3 rotation matrices.

Proof: First we prove that the rank of H_{kmn} is at most 2. Due to the orthonormality of the rotation matrices, from Eq. (31), we have,

H_{kmn} = r_{i3}^T δ_{ik} + δ_{jk}^T r_{j3} = \begin{pmatrix} r_{i3}^T & δ_{jk}^T \end{pmatrix} \begin{pmatrix} δ_{ik} \\ r_{j3} \end{pmatrix} \qquad (32)

where r_{i3} and r_{j3} respectively are the cross products of the two rows of R_i and those of R_j, and δ_{ik} and δ_{jk} are arbitrary 1 × 3 vectors. Because both matrices on the right side of Eq. (32) are at most of rank 2, the rank of H_{kmn} is at most 2.

Next, we prove h_1 = 0. Since H_{kmn} is a 3 × 3 matrix of rank at most 2, its determinant, h_1 (h_1^2 + h_2^2 + h_3^2 + h_4^2), equals 0. Therefore h_1 = 0, i.e. H_{kmn} is a skew-symmetric matrix.

Finally we prove h_2 = h_3 = h_4 = 0. Denote the rows of R_i and R_j as r_{i1}, r_{i2}, r_{j1}, and r_{j2} respectively. Since h_1 = 0, we can rewrite Eq. (31) as follows,

\begin{pmatrix} r_{i1} \cdot (h × r_{j1}) & r_{i1} \cdot (h × r_{j2}) \\ r_{i2} \cdot (h × r_{j1}) & r_{i2} \cdot (h × r_{j2}) \end{pmatrix} = 0_{2×2} \qquad (33)

where h = (−h_4, h_3, −h_2). Equation (33) means that the vector h is located in the intersection of the four planes determined by (r_{i1}, r_{j1}), (r_{i1}, r_{j2}), (r_{i2}, r_{j1}), and (r_{i2}, r_{j2}). Since R_i and R_j are non-degenerate rotation matrices, r_{i1}, r_{i2}, r_{j1}, and r_{j2} do not lie in the same plane, hence the four planes intersect only at the origin, i.e. h = (−h_4, h_3, −h_2) = 0_{1×3}. Therefore H_{kmn} is a zero matrix. □

Using Lemma 1, we derive the following theorem,

Theorem 2. Enforcing both the basis constraints and the rotation constraints results in a closed-form solution of Q_k, given a minimum of (K^2+K)/2 images containing different (non-proportional) shapes together with another (K^2+K)/2 images with non-degenerate (non-planar) rotations.

Proof: According to Theorem 1, given a minimum of (K^2+K)/2 images containing different shapes and another (K^2+K)/2 images with non-degenerate rotations, using the rotation constraints we achieve the ambiguous solution Q_k = G H_k G^T, where G is the desired corrective transformation matrix and H_k is the summation of an arbitrary block-skew-symmetric matrix and an arbitrary block-scaled-identity matrix.

Due to the basis constraints, replacing Q_k in Eqs. (26)–(29) with G H_k G^T,

\tilde{M}_{2k-1:2k} G H_k G^T \tilde{M}_{2k-1:2k}^T = M_{2k-1:2k} H_k M_{2k-1:2k}^T = I_{2×2} \qquad (34)

\tilde{M}_{2i-1:2i} G H_k G^T \tilde{M}_{2j-1:2j}^T = M_{2i-1:2i} H_k M_{2j-1:2j}^T = 0_{2×2}, \quad i = 1, \ldots, K, \; j = 1, \ldots, F, \; i \neq k \qquad (35)

From Eq. (4), we have,

M_{2i-1:2i} H_k M_{2j-1:2j}^T = \sum_{m=1}^{K} \sum_{n=1}^{K} c_{im} c_{jn} R_i H_{kmn} R_j^T, \quad i, j = 1, \ldots, F \qquad (36)

where H_{kmn} is the (m, n)th 3 × 3 block of H_k. According to Eq. (25),

M_{2i-1:2i} H_k M_{2j-1:2j}^T = R_i H_{kij} R_j^T, \quad i, j = 1, \ldots, K \qquad (37)

Combining Eqs. (34), (35) and (37), we have,

R_k H_{kkk} R_k^T = I_{2×2} \qquad (38)

R_i H_{kij} R_j^T = 0_{2×2}, \quad i, j = 1, \ldots, K, \; i \neq k \qquad (39)

According to Eq. (22), the kth diagonal block of H_k, H_{kkk}, equals λ_{kkk} I_{3×3}. Thus by Eq. (38), λ_{kkk} = 1 and H_{kkk} = I_{3×3}. We denote K of the (K^2+K)/2 given images with non-degenerate rotations as the first K images in the sequence. Due to Lemma 1 and Eq. (39), H_{kij}, i, j = 1, ..., K, i ≠ k, are zero matrices. Since H_k is symmetric, the other blocks of H_k, H_{kij}, i = k, j = 1, ..., K, j ≠ k, are also zero matrices. Thus,

G H_k G^T = (g_1 \; \cdots \; g_K) H_k (g_1 \; \cdots \; g_K)^T = (0 \; \cdots \; g_k \; \cdots \; 0)(g_1 \; \cdots \; g_K)^T = g_k g_k^T = Q_k \qquad (40)

i.e. a closed-form solution of the desired Q_k has been achieved. □

According to Theorem 2, we compute Q_k = g_k g_k^T, k = 1, ..., K, by solving the linear equations, Eqs. (10)–(11) and (26)–(29), via the least-squares method. We then recover g_k by decomposing Q_k via SVD. The decomposition of Q_k is determined only up to an arbitrary 3 × 3 orthonormal transformation Ψ_k, since (g_k Ψ_k)(g_k Ψ_k)^T also equals Q_k. This ambiguity arises from the fact that g_1, ..., g_K are estimated independently under different coordinate systems. To resolve the ambiguity, we need to transform g_1, ..., g_K to be under a common reference coordinate system.
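The sketch below (Python/NumPy, my own assembly of the steps just described rather than the authors' code) stacks the rotation- and basis-constraint rows from the earlier sketches, solves for Q_k in the least-squares sense, and extracts a rank-3 factor g_k; the clipping of small negative eigenvalues is an assumption for handling noise.

```python
import numpy as np

def recover_gk(M_tilde, K, k):
    """Solve the stacked rotation and basis constraints for Q_k in the
    least-squares sense and factor Q_k ~= g_k g_k^T (Section 5).
    Relies on the helper functions sketched earlier; names are illustrative."""
    A_rot = rotation_constraint_rows(M_tilde)
    A_bas, b_bas = basis_constraint_rows(M_tilde, K, k)
    A = np.vstack([A_rot, A_bas])
    b = np.concatenate([np.zeros(A_rot.shape[0]), b_bas])

    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    Q_k = q.reshape(3 * K, 3 * K)
    Q_k = 0.5 * (Q_k + Q_k.T)              # enforce symmetry

    # Rank-3 factor of the (ideally positive semi-definite) Q_k: keep the three
    # largest eigenvalues, clipping small negative values caused by noise.
    w, V = np.linalg.eigh(Q_k)
    top = np.argsort(w)[::-1][:3]
    g_k = V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
    return g_k                              # 3K x 3, up to a 3 x 3 rotation
```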


Figure 1. A static cube and 3 points moving along straight lines. (a) Input image. (b) Ground truth 3D shape and camera trajectory. (c) Reconstruction by the closed-form solution. (d) Reconstruction by the method in Bregler et al. (2000). (e) Reconstruction by the method in Brand (2001) after 4000 iterations. (f) Reconstruction by the tri-linear method (Torresani et al., 2001) after 4000 iterations.

Figure 2. The relative errors on reconstruction of a static cube and 3 points moving along straight lines. (a) By the closed-form solution. (b) By the method in Bregler et al. (2000). (c) By the method in Brand (2001) after 4000 iterations. (d) By the tri-linear method (Torresani et al., 2001) after 4000 iterations. The scaling of the error axis is [0, 100%]. Note that our method achieved zero reconstruction errors.


Due to Eq. (6), M̃_{2i-1:2i} g_k = c_{ik} R_i, i = 1, ..., F. Because the rotation matrix R_i is orthonormal, i.e. ‖R_i‖ = 1, we have R_i = ± M̃_{2i-1:2i} g_k / ‖M̃_{2i-1:2i} g_k‖. The sign of R_i determines which orientations are in front of the camera. It can be either positive or negative, determined by the reference coordinate system. Since g_1, ..., g_K are estimated independently, they lead to respective rotation sets, each two of which differ by a 3 × 3 orthonormal transformation. We choose one set of the rotations to specify the reference coordinate system. Then the signs of the other sets of rotations are determined in such a way that these rotations are consistent with the corresponding references. Finally the orthogonal Procrustes method (Schonemann, 1966) is applied to compute the orthonormal transformations from the rotation sets to the reference. One of the reviewers pointed out that computing such a transformation Ψ_k is equivalent to computing a homography at infinity; for details, please refer to Hartley and Zisserman (2000).

The transformed g_1, ..., g_K form the desired corrective transformation G. The coefficients are then computed by Eq. (6), and the shape bases are recovered by Eq. (5). Their combinations reconstruct the non-rigid 3D shapes.
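One way to carry out the alignment described above is sketched here (Python/NumPy, not the paper's implementation): the rotation set implied by g_1 serves as the reference, and each remaining g_k is rotated onto it with the orthogonal Procrustes solution; the per-row normalization and the sign handling are deliberately simplified assumptions.

```python
import numpy as np

def align_to_reference(M_tilde, g_list):
    """Bring the independently recovered g_1..g_K into one coordinate system.

    The rotations implied by g_list[0] serve as the reference; every other g_k
    is rotated onto it with the orthogonal Procrustes solution.  Sign handling
    and degenerate frames are ignored in this simplified sketch."""
    F = M_tilde.shape[0] // 2

    def rotations_from(g):
        # Eq. (6): M_tilde[2i-1:2i] g_k = c_ik R_i; normalize each 2 x 3 block
        blocks = []
        for i in range(F):
            blk = M_tilde[2 * i:2 * i + 2, :] @ g
            blocks.append(blk / np.linalg.norm(blk[0, :]))
        return np.vstack(blocks)            # 2F x 3

    ref = rotations_from(g_list[0])
    aligned = [g_list[0]]
    for g in g_list[1:]:
        cur = rotations_from(g)
        U, _, Vt = np.linalg.svd(cur.T @ ref)   # Procrustes: min ||cur P - ref||_F
        aligned.append(g @ (U @ Vt))
    return aligned
```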

Note that one could derive similar basis and rotation constraints directly on Q = GG^T, where G is the entire corrective transformation matrix, instead of on Q_k = g_k g_k^T, where g_k is the kth triple-column of G. It is easy to show that enforcing these constraints also results in a linear closed-form solution of Q. However, the decomposition of Q to recover G is determined only up to a 3K × 3K orthonormal ambiguity transformation. Eliminating this ambiguity alone is as hard as the original problem of determining G.

6. Performance Evaluation

The performance of the closed-form solution was evaluated in a number of experiments.

6.1. Comparison with Three Previous Methods

We first compared the solution with three related methods (Bregler et al., 2000; Brand, 2001; Torresani et al., 2001) in a simple noiseless setting. Figure 1 shows a scene consisting of a static cube and 3 moving points. The measurement included 10 points: 7 visible vertices of the cube and 3 moving points. The 3 points moved along the three axes simultaneously at varying speed. This setting involved K = 2 shape bases, one for the static cube and another for the linear motions. While the points were moving, the camera was rotating around the scene. A sequence of 16 frames was captured. One of them is shown in Fig. 1(a), and Fig. 1(b) demonstrates the ground truth shape in this frame and the ground truth camera trajectory from the first frame until this frame. The three orthogonal green bars show the present camera pose and the red bars display the camera poses in the previous frames. Figure 1(c) to (f) show the structures and camera trajectories reconstructed using the closed-form solution, the method in Bregler et al. (2000), the method in Brand (2001), and the tri-linear method (Torresani et al., 2001), respectively. While the closed-form solution achieved the exact reconstruction with zero error, all three previous methods resulted in apparent errors, even for such a simple noiseless setting.

Figure 2 demonstrates the reconstruction errors of the four methods on camera rotations, shapes, and image measurements. The error was computed as the percentage relative to the ground truth, ‖Reconstruction − Truth‖ / ‖Truth‖. Note that because the space of rotations is a manifold, a better error measurement for rotations is the Riemannian distance, d(R_i, R_j) = arccos( trace(R_i R_j^T) / 2 ). However, it is measured in degrees. For consistency, we used the relative percentage for all three reconstruction errors.
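For reference, the two error measures quoted above can be computed as follows (Python/NumPy sketch, mine; the 2 × 3 rotation blocks and the clipping of the arccos argument are assumptions):

```python
import numpy as np

def relative_error(estimate, truth):
    """Relative error ||Reconstruction - Truth|| / ||Truth|| used in Fig. 2."""
    return np.linalg.norm(estimate - truth) / np.linalg.norm(truth)

def rotation_distance(R_i, R_j):
    """Riemannian distance (radians) between two 2 x 3 rotation blocks,
    following the formula quoted in the text."""
    return np.arccos(np.clip(np.trace(R_i @ R_j.T) / 2.0, -1.0, 1.0))
```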

6.2. Quantitative Evaluation on Synthetic Data

Our approach was then quantitatively evaluated on synthetic data. We evaluated the accuracy and robustness with respect to three factors: deformation strength, number of shape bases, and noise level. The deformation strength shows how close to rigid the shape is. It is represented by the mean power ratio between each two bases, i.e. mean_{i,j}( max(‖B_i‖, ‖B_j‖) / min(‖B_i‖, ‖B_j‖) ). A larger ratio means weaker deformation, i.e. the shape is closer to rigid. The number of shape bases represents the flexibility of the shape. A bigger basis number means that the shape is more flexible. Assuming Gaussian white noise, we represent the noise strength level by the ratio between the Frobenius norms of the noise and the measurement, i.e. ‖noise‖/‖W‖. In general, when noise exists, a weaker deformation leads to better performance, because some deformation mode is more dominant and the noise relative to the dominant basis is weaker; a bigger basis number results in poorer performance, because the noise relative to each individual basis is stronger.
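The two ratios used as experimental controls can be computed directly from their definitions; a small sketch (Python/NumPy, illustrative only) follows:

```python
import numpy as np

def mean_power_ratio(bases):
    """Mean pairwise power ratio of the shape bases (deformation strength)."""
    ratios = []
    for a in range(len(bases)):
        for b in range(a + 1, len(bases)):
            na, nb = np.linalg.norm(bases[a]), np.linalg.norm(bases[b])
            ratios.append(max(na, nb) / min(na, nb))
    return float(np.mean(ratios))

def noise_level(noise, W):
    """Noise strength as the ratio of Frobenius norms ||noise|| / ||W||."""
    return np.linalg.norm(noise) / np.linalg.norm(W)
```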


Figure 3. (a) & (b) Reconstruction errors on rotations and shapes under different levels of noise and deformation strength. (c) & (d) Reconstruction errors on rotations and shapes under different levels of noise and various basis numbers. Each curve refers to a noise level. The scaling of the error axis is [0, 20%].

Figure 3(a) and (b) show the performance of our algorithm under various deformation strengths and noise levels in a two-basis setting. The power ratios were respectively 2^0, 2^1, ..., and 2^8. Four levels of Gaussian white noise were imposed. Their strength levels were 0, 5, 10, and 20% respectively. We tested a number of trials on each setting and computed the average reconstruction errors on the rotations and 3D shapes. The errors were measured by the relative percentage as in Section 6.1. Figure 3(c) and (d) show the performance of our method under different numbers of shape bases and noise levels. The basis number was 2, 3, ..., and 10 respectively. The bases had equal powers and thus none of them was dominant. The same noise as in the last experiment was imposed.

In both experiments, when the noise level was 0%, the closed-form solution recovered the exact rotations and shapes with zero error. When there was noise, it achieved reasonable accuracy, e.g. the maximum reconstruction error was less than 15% when the noise level was 20%. As we expected, under the same noise level, the performance was better when the power ratio was larger and poorer when the basis number was bigger. Note that in all the experiments, the condition number of the linear system consisting of both basis constraints and rotation constraints had an order of magnitude of O(10) to O(10^2), even if the basis number was big and the deformation was strong. This suggests that our closed-form solution is numerically stable.

6.3. Qualitative Evaluation on Real Video Sequences

Finally we examined our approach qualitatively on a number of real video sequences. One example is shown in Fig. 4. The sequence was taken of an indoor scene by a handheld camera. Three objects, a car, a plane, and a toy person, moved along fixed directions and at varying speeds. The rest of the scene was static.


Figure 4. Reconstruction of three moving objects in the static background. (a) & (d) Two input images with marked features. (b) & (e) Reconstruction by the closed-form solution. The yellow lines show the recovered trajectories from the beginning of the sequence until the present frames. (c) & (f) Reconstruction by the method in Brand (2001). The yellow-circled area shows that the plane, which should be on top of the slope, was mistakenly located underneath the slope.

Figure 5. Reconstruction of shapes of human faces carrying expressions. (a) & (d) Input images. (b) & (e) Reconstructed 3D face appearances in novel poses. (c) & (f) Shape wireframes demonstrating the recovered facial deformations such as mouth opening and eye closure.

The car and the person moved on the floor and the plane moved along a slope. The scene structure is composed of two bases, one for the static objects and another for the linear motions. 32 feature points tracked across 18 images were used for reconstruction. Two of them are shown in Fig. 4(a) and (d).

The rank of W is estimated in such a way that 99% of the energy of W remains after the factorization using the rank constraint. The number of bases is thus determined by K = rank(W)/3. Then the camera rotations and dynamic scene structures were reconstructed using the proposed method. With the recovered shapes, we could view the scene appearance from any novel direction. An example is shown in Fig. 4(b) and (e). The wireframes show the scene shapes and the yellow lines show the trajectories of the moving objects from the beginning of the sequence until the present frames. The reconstruction was consistent with our observation, e.g. the plane moved linearly on top of the slope. Figure 4(c) and (f) show the reconstruction using the method in Brand (2001). The recovered shapes of the boxes were distorted and the plane was incorrectly located underneath the slope, as shown in the yellow circles. Note that occlusion was not taken into account when rendering these images. Thus in the regions that should be occluded, e.g. the area behind the slope, the stretched texture of the occluding objects appeared.
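The 99%-energy rank rule mentioned at the start of the previous paragraph can be written compactly; the sketch below (Python/NumPy, my reading of the rule, with energy taken as the sum of squared singular values) is one way to do it:

```python
import numpy as np

def estimate_rank(W, energy=0.99):
    """Smallest rank retaining the given fraction of the energy of W
    (energy taken here as the sum of squared singular values); the basis
    number then follows as K = rank // 3."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)
```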

Human faces are highly non-rigid objects, and 3D face shapes can be regarded as weighted combinations of certain shape bases that refer to various facial expressions. Thus our approach is capable of reconstructing the deformable 3D face shapes from the 2D image sequence. One example is shown in Fig. 5. The sequence consisted of 236 images that contained facial expressions like eye blinking and mouth opening. 60 feature points were tracked using an efficient 2D Active Appearance Model (AAM) method (Baker and Matthews, 2001). Figure 5(a) and (d) display two of the input images with marked feature points. After reconstructing the shapes and poses, we could view the 3D face appearances in any novel poses. Two examples are shown respectively in Fig. 5(b) and (e). Their corresponding 3D shape wireframes, as shown in Fig. 5(c) and (f), exhibit the recovered facial deformations such as mouth opening and eye closure. Note that the feature correspondences in these experiments were noisy, especially for those features on the sides of the face. The reconstruction performance of our approach demonstrates its robustness to the image noise.

7. Conclusion and Discussion

This paper proposes a linear closed-form solution to the problem of non-rigid shape and motion recovery from a single-camera video. In particular, we have proven that enforcing only the rotation constraints results in ambiguous and invalid solutions. We thus introduce the basis constraints to resolve this ambiguity. We have also proven that imposing both sets of linear constraints leads to a unique reconstruction of the non-rigid shape and motion. The performance of our algorithm is demonstrated by experiments on both simulated data and real video data. Our algorithm has also been successfully applied to separate the local deformations from the global rotations and translations in 3D motion capture data (Chai et al., 2003).

Currently our approach does not consider degenerate deformations. A shape basis is degenerate if its rank is either 1 or 2, i.e. it limits the shape deformation within a 2D plane. Such planar deformations occur in structures like dynamic scenes or expressive human faces. For example, when a scene consists of several buildings and one car moving straight along a street, the shape basis referring to the rank-1 car translation is degenerate. It is conceivable that, in such degenerate cases, the basis constraints cannot completely resolve the ambiguity of the rotation constraints. We are now exploring how to extend the current method to reconstructing the shapes involving degenerate deformations. Another limitation of our approach is that we assume the weak-perspective projection model. It would be interesting to see how the proposed solution could be extended to the full perspective projection model.

Acknowledgments

We would like to thank Simon Baker, Iain Matthews, and Mei Han for providing the image data and feature correspondences used in Section 6.3, the ECCV and IJCV reviewers for their valuable comments, and Jessica Hodgins for proofreading the paper. Jinxiang Chai was supported by the NSF through EIA0196217 and IIS0205224. Jing Xiao and Takeo Kanade were partly supported by grant R01 MH51435 from the National Institute of Mental Health.

References

Baker, S. and Matthews, I. 2001. Equivalence and efficiency of image alignment algorithms. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Bascle, B. and Blake, A. 1998. Separability of pose and expression in facial tracking and animation. In Proc. Int. Conf. Computer Vision, pp. 323–328.

Blanz, V. and Vetter, T. 1999. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH'99, pp. 187–194.

Brand, M. 2001. Morphable 3D models from video. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Brand, M. and Bhotika, R. 2001. Flexible flow for 3D nonrigid tracking and shape recovery. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Bregler, C., Hertzmann, A., and Biermann, H. 2000. Recovering non-rigid 3D shape from image streams. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Chai, J., Xiao, J., and Hodgins, J. 2003. Vision-based control of 3D facial animation. Eurographics/ACM Symposium on Computer Animation.

Costeira, J. and Kanade, T. 1998. A multibody factorization method for independently moving objects. Int. Journal of Computer Vision, 29(3):159–179.

Gokturk, S.B., Bouguet, J.Y., and Grzeszczuk, R. 2001. A data driven model for monocular face tracking. In Proc. Int. Conf. Computer Vision.

Han, M. and Kanade, T. 2000. Reconstruction of a scene with multiple linearly moving objects. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Hartley, R.I. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.

Kanatani, K. 2001. Motion segmentation by subspace separation and model selection. In Proc. Int. Conf. Computer Vision.

Mahamud, S. and Hebert, M. 2000. Iterative projective reconstruction from multiple views. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Poelman, C. and Kanade, T. 1997. A paraperspective factorization method for shape and motion recovery. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(3):206–218.

Schonemann, P.H. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10.

Sugaya, Y. and Kanatani, K. 2004. Geometric structure of degeneracy for multi-body motion segmentation. ECCV Workshop on Statistical Methods in Video Processing.

Tomasi, C. and Kanade, T. 1992. Shape and motion from image streams under orthography: A factorization method. Int. Journal of Computer Vision, 9(2):137–154.

Torresani, L., Yang, D., Alexander, G., and Bregler, C. 2001. Tracking and modeling non-rigid objects with rank constraints. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Triggs, B. 1996. Factorization methods for projective structure and motion. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Vidal, R., Soatto, S., Ma, Y., and Sastry, S. 2002. Segmentation of dynamic scenes from the multibody fundamental matrix. ECCV Workshop on Vision and Modeling of Dynamic Scenes.

Vidal, R. and Hartley, R. 2004. Motion segmentation with missing data using power factorization and GPCA. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Wolf, L. and Shashua, A. 2001. Two-body segmentation from two perspective views. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Wolf, L. and Shashua, A. 2002. On projection matrices P^k → P^2, k = 3, ..., 6, and their applications in computer vision. Int. Journal of Computer Vision, 48(1):53–67.

Zelnik-Manor, L. and Irani, M. 2003. Degeneracies, dependencies, and their implications in multi-body and multi-sequence factorizations. In Proc. Int. Conf. Computer Vision and Pattern Recognition.

Zhang, Z., Deriche, R., Faugeras, O., and Luong, Q. 1995. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78(1/2):87–119.