JHU Center for Imaging Science · rvidal/publications/cvpr04-trifocal.pdf · 2004. 3. 21.

The Multibody Trifocal Tensor: Motion Segmentation from 3 Perspective Views

PAPER ID #583

Abstract

We propose a new geometric approach to 3-D motion segmentation from point correspondences in three perspective views. We demonstrate that after applying a polynomial embedding to the correspondences they become related by the so-called multibody trilinear constraint and its associated multibody trifocal tensor T . We first show how to linearly estimate T from point-point-point correspondences. Given T , we show that the derivatives of the multibody trilinear constraint at a correspondence enable us to transfer points and lines from two views to the other. We then show that one can estimate the epipolar lines associated with each image point from the common root of a set of univariate polynomials, and the epipoles by solving a plane clustering problem in R3. The individual trifocal tensors are then obtained from the second order derivatives of the multibody trilinear constraint. Given epipolar lines and epipoles, or trifocal tensors, one can immediately obtain an initial clustering of the correspondences. We use this initial clustering to initialize an iterative algorithm that finds an optimal estimate for the trifocal tensors and the segmentation of the correspondences using Expectation Maximization. We test our algorithm on various real and synthetic dynamic scenes.

1. Introduction

One of the most important problems in visual motion analysis is that of reconstructing a 3-D scene from a collection of images taken by a moving camera. At present, the algebra and geometry of the problem is very well understood, and it is usually described in terms of the so-called bilinear, trilinear, and multilinear constraints among two, three and multiple views, respectively. Also, there are various algorithms for performing the reconstruction task, both geometric and optimization-based [4].
All the above algorithms are, however, limited by the assumption that the scene is static, i.e. only the camera is moving and hence there is a single motion model to be estimated from the image measurements. In practice, however, most of the scenes we observe are dynamic, i.e. both the camera and multiple objects in the 3-D world are moving. Thus, one is faced with the more challenging problem of recovering multiple motion models from the image data, without knowing the assignment of data points to motion models.

Previous work on 3D motion segmentation [8] has addressed the problem using the standard probabilistic framework. Given an initial clustering of the image data, one estimates a motion model for each group using standard structure from motion algorithms [4]. Given the motion parameters, one can easily update the clustering of the image data. The algorithm then proceeds by iterating between these two steps, using the Expectation Maximization algorithm [2]. When the probabilistic model generating the data is known, the above iterative scheme indeed provides an optimal estimate in the maximum likelihood sense. Unfortunately, it is well-known that the EM algorithm is very sensitive to initialization [7].

In order to deal with the initialization problem, recent work on 3D motion segmentation has concentrated on the study of the geometry of multiple motion models. [11] derived a bilinear constraint in R6 which, together with a combinatorial scheme, segments two rigid-body motions from two perspective views. [10] proposed a generalization of the epipolar constraint and of the fundamental matrix to multiple rigid-body motions, which leads to a motion segmentation algorithm based on factoring products of epipolar constraints to retrieve the fundamental matrices associated with each one of the motions. The algorithms of [11] and [10] are algebraic, hence they do not require initialization.
In this paper, we consider the problem of estimating and segmenting multiple rigid-body motions from a set of point correspondences in three perspective views. In Section 2 we study the three-view geometry of multiple rigid-body motions. We demonstrate that, after a suitable embedding into a higher-dimensional space, the three views are related by the so-called multibody trilinear constraint and its associated multibody trifocal tensor. In Section 3, we show that one can use the multibody trifocal tensor to transfer points and lines from two views to the other. In Section 4, we propose a geometric algorithm for 3D motion segmentation that estimates the motion parameters (epipoles, epipolar lines and trifocal tensors) from the derivatives of the multibody trilinear constraint. This algebraic (non-iterative) solution is then used to initialize an optimal algorithm.

To the best of our knowledge, there is no previous work addressing this problem. The only existing work on multiframe 3D motion segmentation is for the case of affine cameras [1, 5], and requires a minimum of four views.

2. Multibody three-view geometry

This section establishes the basic geometric relationships among three perspective views of multiple rigid-body motions. We first review the trilinear constraint and its associated trifocal tensor for the case of a single motion. We then generalize these notions to multiple motions via a polynomial embedding that leads to the so-called multibody trilinear constraint and its associated multibody trifocal tensor.


2.1. Trilinear constraint and trifocal tensor

Let x ↔ `′ ↔ `′′ be a point-line-line correspondence in three perspective views with 3 × 4 camera matrices

    P = [I 0], P′ = [R′ e′] and P′′ = [R′′ e′′], (1)

where e′ ∈ P2 and e′′ ∈ P2 are the epipoles in the 2nd and 3rd views, respectively. Then, the multiple view matrix [6]

    [ `′>R′x    `′>e′  ]
    [ `′′>R′′x  `′′>e′′ ] ∈ R2×2                  (2)

must have rank 1, hence its determinant must be zero, i.e.

    `′>(R′xe′′> − e′x>R′′>)`′′ = 0.               (3)

This is the well-known point-line-line trilinear constraint among the three views [4], which we will denote as

    x`′`′′T = 0 (4)

where T ∈ R3×3×3 is the so-called trifocal tensor.

Notation. For ease of notation, we will drop the summation and the subscripts in trilinear expressions such as ∑ijk xi`′j`′′kTijk, and write them as shown above. Similarly, we will write xT to represent the matrix whose (jk)th entry is ∑i xiTijk, and x`′T to represent the vector whose kth entry is ∑ij xi`′jTijk. The notation is somewhat condensed, and inexact, since the particular indices that are being summed over are not specified. However, the meaning should in all cases be clear from the context.

Notice that one can linearly solve for the trifocal tensor T from the trilinear constraint (3) given at least 26 point-line-line correspondences. However, if we are given point-point-point correspondences x ↔ x′ ↔ x′′, then for each point x′ in the 2nd view we can obtain two lines `′1 and `′2 passing through x′, and similarly for the 3rd view. Since each correspondence gives 4 independent equations on T , we only need 7 correspondences to linearly estimate T .1

2.2. The multibody trilinear constraint

Consider now a scene with a known number n of rigid-body motions with associated trifocal tensors {Ti ∈ R3×3×3}ni=1, where Ti is the trifocal tensor associated with the motion of the ith object relative to the moving camera among the three views. We assume that the motions of the objects relative to the camera are such that all the trifocal tensors are different up to a scale factor. We also assume that the given images correspond to 3-D points in general configuration in R3, i.e. they do not all lie in any critical surface, for example.

Let x ↔ `′ ↔ `′′ be an arbitrary point-line-line correspondence associated with any of the n motions. Then there exists a trifocal tensor Ti satisfying the trilinear constraint in (3) or (4). Thus, regardless of the motion associated with the correspondence, the following constraint must be satisfied by the number of independent motions n, the trifocal tensors {Ti}ni=1 and the correspondence x ↔ `′ ↔ `′′:

    ∏ni=1 (x`′`′′Ti) = 0.                         (5)

1 We refer the reader to [4] for further details and more robust linear methods for computing T .

The above multibody constraint eliminates the feature segmentation stage from the motion segmentation problem by taking the product of all trilinear constraints. Although taking products is not the only way of algebraically eliminating feature segmentation, it has the advantage of leading to a polynomial equation in (x, `′, `′′) with a nice algebraic structure. Indeed, the multibody constraint is a homogeneous polynomial of degree n in each of x, `′ or `′′. Now, suppose x = (x1, x2, x3)>. We may enumerate all the possible monomials x1^n1 x2^n2 x3^n3 of degree n in (5) and write them in some chosen order as a vector

    x̃ = (x1^n, x1^(n−1)x2, x1^(n−1)x3, x1^(n−2)x2^2, . . . , x3^n)>.   (6)

This vector has dimension Mn = (n+1)(n+2)/2. The map x ↦ x̃ is known as the polynomial embedding of degree n in the machine learning community and as the Veronese map of degree n in the algebraic geometry community.
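A minimal sketch of this embedding (ours, not the paper's code): enumerating the degree-n monomials of x ∈ R3 in lexicographic order, matching (6).

```python
from itertools import combinations_with_replacement
import numpy as np

def veronese(x, n):
    """Degree-n polynomial (Veronese) embedding of x in R^3, as in (6)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.prod(x[list(idx)])
                     for idx in combinations_with_replacement(range(3), n)])

def M(n):
    """Dimension of the embedding: M_n = (n+1)(n+2)/2."""
    return (n + 1) * (n + 2) // 2

# For n = 2, x = (x1, x2, x3) maps to (x1^2, x1x2, x1x3, x2^2, x2x3, x3^2).
```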

Now, note that (5) is a sum of terms of degree n in each of x, `′ and `′′. Thus, each term is a product of degree n monomials in x, `′ and `′′. We may therefore define a 3-dimensional tensor T ∈ RMn×Mn×Mn containing the coefficients of each of the monomials occurring in the product (5) and write (5) as

    x̃ ˜̀′ ˜̀′′ T = 0, (7)

where summation over all the entries of the vectors x̃, ˜̀′ and ˜̀′′ is implied. We call equation (7) the multibody trilinear constraint, as it is a natural generalization of the trilinear constraint valid for n = 1. The important point to observe is that although (7) has degree n in the entries of x, `′ and `′′, it is in fact linear in the entries of x̃, ˜̀′ and ˜̀′′.

2.3. The multibody trifocal tensor

The array T is called the multibody trifocal tensor, defined up to indeterminate scale, and is a natural generalization of the trifocal tensor. Given a point-line-line correspondence x ↔ `′ ↔ `′′, one can compute the entries of the vectors x̃, ˜̀′ and ˜̀′′, and use the multibody trilinear constraint (7) to obtain a linear relationship in the entries of T . Therefore, we may estimate T linearly from Mn^3 − 1 point-line-line correspondences. That is 26 correspondences for one motion, 215 for two motions, 999 for three motions, etc.

Fortunately, as in the case of n = 1 motion, one may significantly reduce the data requirements by working with point-point-point correspondences x ↔ x′ ↔ x′′. Since each point x′ in the second view gives two lines `′1 and `′2 and each point x′′ in the third view gives two lines `′′1 and `′′2, a naive calculation would give 2^2 = 4 constraints per correspondence. However, due to the algebraic properties of the polynomial embedding, each correspondence provides in general (n+1)^2 independent constraints on the multibody


trifocal tensor. To see this, remember that the multibody trilinear constraint is satisfied by all lines `′ = `′1 + α`′2 and `′′ = `′′1 + β`′′2 passing through x′ and x′′, respectively. Therefore, for all α ∈ R and β ∈ R we must have

    ∏ni=1 (x(`′1 + α`′2)(`′′1 + β`′′2)Ti) = 0.     (8)

The above equation, viewed as a function of α, is a polynomial of degree n, hence its n + 1 coefficients must be zero. Each coefficient is in turn a polynomial of degree n in β, whose n + 1 coefficients must be zero. Therefore, each correspondence gives (n+1)^2 constraints on the multibody trifocal tensor T , hence we need only (Mn^3 − 1)/(n+1)^2 point-point-point correspondences to estimate T . That is only 7 correspondences for one motion, 24 for two motions, 63 for three motions, etc. This represents a significant improvement not only with respect to the case of point-line-line correspondences, as explained above, but also with respect to the case of two perspective views, which requires Mn^2 − 1 point-point correspondences for linearly estimating the multibody fundamental matrix [10], i.e. 8, 35 and 99 correspondences for one, two and three motions, respectively.
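The counts quoted above follow directly from Mn = (n+1)(n+2)/2; a quick check:

```python
import math

def M(n):
    """Dimension of the degree-n Veronese embedding."""
    return (n + 1) * (n + 2) // 2

# Correspondences needed for n = 1, 2, 3 motions:
point_line_line = [M(n)**3 - 1 for n in (1, 2, 3)]                   # 26, 215, 999
point_point_point = [math.ceil((M(n)**3 - 1) / (n + 1)**2)
                     for n in (1, 2, 3)]                             # 7, 24, 63
two_view = [M(n)**2 - 1 for n in (1, 2, 3)]                          # 8, 35, 99
```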

Given a correspondence x ↔ x′ ↔ x′′, one may generate the (n+1)^2 linear equations in the entries of T by choosing `′1, `′2, `′′1 and `′′2 passing through x′ and x′′, respectively, and then computing the coefficients of α^i β^j in (8). A simpler way is to choose at least n + 1 distinct lines passing through each of x′ and x′′ and generate the corresponding point-line-line equations. This leads to the following linear algorithm for estimating the multibody trifocal tensor.

Algorithm 1 (Estimating the multibody trifocal tensor T ) Given N ≥ (Mn^3 − 1)/(n+1)^2 point-point-point correspondences {xi ↔ x′i ↔ x′′i}Ni=1, with at least 7 correspondences per moving object, estimate T as follows:

1. Generate N` ≥ n + 1 lines {`′ij}N`j=1 and {`′′ik}N`k=1 passing through x′i and x′′i, respectively, for i = 1, . . . , N .

2. Compute T , interpreted as a vector in RMn^3, as the null vector of the matrix A ∈ RNN`^2×Mn^3, whose rows are computed as x̃i ⊗ ˜̀′ij ⊗ ˜̀′′ik ∈ RMn^3, for all i = 1, . . . , N and j, k = 1, . . . , N`, where ⊗ is the Kronecker product.

Notice that Algorithm 1 is essentially the same as the linear algorithm for estimating the trifocal tensor T . The only differences are that we need to generate more than 2 lines per point in the second and third views x′ and x′′, and that we need to replace the original correspondences x ↔ `′ ↔ `′′ by the embedded correspondences x̃ ↔ ˜̀′ ↔ ˜̀′′ in order to build the data matrix A, whose null-space is the multibody trifocal tensor.
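Algorithm 1 can be sketched for n = 2 motions with noise-free synthetic data (our illustration with hypothetical cameras; no robustness measures). Each correspondence contributes (n+1)^2 = 9 embedded rows, and T is recovered as the null vector of the stacked matrix.

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(1)
n = 2                                    # number of motions; M_n = 6

def veronese(v):
    """Degree-2 Veronese embedding of a (unit-normalized) vector in R^3."""
    v = v / np.linalg.norm(v)
    return np.array([np.prod(v[list(idx)])
                     for idx in combinations_with_replacement(range(3), n)])

# Two hypothetical rigid motions, each encoded by (R', e', R'', e'').
motions = [tuple(rng.standard_normal(s) for s in [(3, 3), 3, (3, 3), 3])
           for _ in range(n)]

def correspondence(motion):
    """A point-point-point correspondence obeying the given motion."""
    Rp, ep, Rpp, epp = motion
    X = rng.standard_normal(4)
    return X[:3], Rp @ X[:3] + X[3] * ep, Rpp @ X[:3] + X[3] * epp

rows = []
for m in motions:
    for _ in range(15):                  # 15 correspondences per motion
        x, xp, xpp = correspondence(m)
        lps = [np.cross(xp, rng.standard_normal(3)) for _ in range(n + 1)]
        lpps = [np.cross(xpp, rng.standard_normal(3)) for _ in range(n + 1)]
        for lp in lps:
            for lpp in lpps:
                rows.append(np.kron(veronese(x),
                                    np.kron(veronese(lp), veronese(lpp))))

A = np.array(rows)                       # 270 x 216 data matrix
U, s, Vt = np.linalg.svd(A)
T_multibody = Vt[-1]                     # null vector = multibody trifocal tensor
```

The recovered vector annihilates the embedded rows of any further correspondence from either motion, which is how it is used in the rest of the paper.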

3. Multibody transfer properties of T

An important property of the trifocal tensor T is that of transferring points and lines from a pair of views to the other [4]. For example, if `′ and `′′ are corresponding lines in the second and third views, then ` = `′`′′T is a corresponding line in the first view. Similarly, if x is a point in the first view and `′ is a corresponding line in the second view, then x′′ = x`′T is the corresponding point in the third view. Likewise, x′ = x`′′T is the point in the second view corresponding to (x, `′′).
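For a single motion these transfer formulas are easy to verify with synthetic cameras (our sketch, not the paper's code): with Tijk = R′ji e′′k − e′j R′′ki as in (3), x`′T recovers x′′ up to scale, and `′`′′T yields a line through x.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical cameras P = [I 0], P' = [R' e'], P'' = [R'' e''].
Rp, ep = rng.standard_normal((3, 3)), rng.standard_normal(3)
Rpp, epp = rng.standard_normal((3, 3)), rng.standard_normal(3)
T = np.einsum('ji,k->ijk', Rp, epp) - np.einsum('j,ki->ijk', ep, Rpp)

X = rng.standard_normal(4)               # a 3-D point
x = X[:3]                                # image in view 1
xp = Rp @ X[:3] + X[3] * ep              # image in view 2
xpp = Rpp @ X[:3] + X[3] * epp           # image in view 3

lp = np.cross(xp, rng.standard_normal(3))    # a line through x'
lpp = np.cross(xpp, rng.standard_normal(3))  # a line through x''

# Point transfer: x'' ~ x l' T (equal up to scale).
xpp_t = np.einsum('i,j,ijk->k', x, lp, T)

# Line transfer: l = l' l'' T passes through x.
l = np.einsum('j,k,ijk->i', lp, lpp, T)
```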

In this section, we discuss the transfer properties of the multibody trifocal tensor T . While in principle these properties are natural generalizations of the corresponding properties of the individual trifocal tensors {Ti}ni=1, in the multibody case the situation is more complex, because T incorporates information about all the motions at the same time. In order to obtain information about a specific motion, say the ith motion, without yet knowing its trifocal tensor Ti, we exploit the algebraic properties of the multibody trilinear constraint. In particular, we show that information about individual motions is encoded in its derivatives.

We begin by considering the derivative of the multibody trilinear constraint with respect to its first argument:

    ∂/∂x (x̃ ˜̀′ ˜̀′′T ) = ∂/∂x ∏ni=1 (x`′`′′Ti) = ∑ni=1 (`′`′′Ti) ∏k≠i (x`′`′′Tk).

We notice that if we evaluate this derivative at a correspondence x ↔ `′ ↔ `′′ associated with the ith motion, i.e. the correspondence is such that x`′`′′Ti = 0, then all the terms in the above summation but the ith vanish. Thus we obtain

    ∂/∂x (x̃ ˜̀′ ˜̀′′T ) |x`′`′′Ti=0 = (`′`′′Ti) ∏k≠i (x`′`′′Tk) ∼ (`′`′′Ti),

which from the properties of the trifocal tensor Ti gives a line ` in the first view. Notice that this line ` is transferred from the two lines in the second and third views according to the unknown ith trifocal tensor Ti. That is, the multibody trifocal tensor enables us to transfer corresponding lines according to their own motion, without having to know the motion with which the correspondence is associated. In a similar fashion, one may obtain point transfer properties by considering derivatives of the multibody trilinear constraint with respect to its second and third arguments. We therefore have the following results.

Theorem 1 (Line transfer from corresponding lines in the second and third views to the first) The derivative of the multibody trilinear constraint with respect to its first argument evaluated at a correspondence (x, `′, `′′) gives a line ` in the first view passing through x, i.e.

    ` = ∂/∂x (x̃ ˜̀′ ˜̀′′T )  and  `>x = 0.

Theorem 2 (Point transfer from first to second and third views) The derivative of the multibody trilinear constraint with respect to its second and third arguments evaluated at a correspondence (x, `′, `′′) gives the corresponding points in the second and third views, x′ and x′′, respectively, i.e.

    ∂/∂`′ (x̃ ˜̀′ ˜̀′′T ) ∼ x′  and  ∂/∂`′′ (x̃ ˜̀′ ˜̀′′T ) ∼ x′′.
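Theorem 1 can be illustrated with the product form of the multibody constraint, without ever forming T (a sketch of ours, with two hypothetical motions): at a motion-1 correspondence the gradient with respect to x collapses to `′`′′T1 times a scalar factor, a line through x.

```python
import numpy as np

rng = np.random.default_rng(3)

def trifocal(Rp, ep, Rpp, epp):
    """T_ijk = R'_ji e''_k - e'_j R''_ki, as in (3)."""
    return np.einsum('ji,k->ijk', Rp, epp) - np.einsum('j,ki->ijk', ep, Rpp)

# Two hypothetical motions.
m1 = [rng.standard_normal(s) for s in [(3, 3), 3, (3, 3), 3]]
m2 = [rng.standard_normal(s) for s in [(3, 3), 3, (3, 3), 3]]
T1, T2 = trifocal(*m1), trifocal(*m2)

# A point-line-line correspondence obeying motion 1.
Rp, ep, Rpp, epp = m1
X = rng.standard_normal(4)
x = X[:3]
lp = np.cross(Rp @ X[:3] + X[3] * ep, rng.standard_normal(3))     # through x'
lpp = np.cross(Rpp @ X[:3] + X[3] * epp, rng.standard_normal(3))  # through x''

def tri(x, lp, lpp, T):
    """The trilinear expression x l' l'' T."""
    return np.einsum('i,j,k,ijk->', x, lp, lpp, T)

def line(lp, lpp, T):
    """The vector l' l'' T (a line in the first view)."""
    return np.einsum('j,k,ijk->i', lp, lpp, T)

# Gradient of the product (5) w.r.t. x, via the sum rule from the text.
grad = line(lp, lpp, T1) * tri(x, lp, lpp, T2) \
     + line(lp, lpp, T2) * tri(x, lp, lpp, T1)

# tri(x, lp, lpp, T1) = 0, so grad ~ line(lp, lpp, T1) and grad . x = 0.
```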


4. Motion Segmentation from 3 views

In this section, we present a linear algorithm for estimating and segmenting multiple rigid-body motions. More specifically, we assume we are given a set of point correspondences {xj ↔ x′j ↔ x′′j}Nj=1, from which we can estimate the multibody trifocal tensor T , and would like to estimate the individual trifocal tensors {Ti}ni=1 and/or the segmentation of the correspondences according to the n motions.

Our algorithm proceeds as follows. In Section 4.1 we show how to estimate the epipolar lines in the second and third views, `′x and `′′x, respectively, associated with each point x in the first view by solving for the common root of a set of univariate polynomials. In Section 4.2 we show how to estimate the epipoles in the second and third views, {e′i}ni=1 and {e′′i}ni=1, respectively, by solving a plane clustering problem using a combination of Generalized Principal Component Analysis (GPCA) with spectral clustering. Given epipolar lines and epipoles, one may immediately cluster the correspondences into n groups and then estimate individual trifocal tensors and camera matrices from the data associated with each group. However, we also show in Section 4.3 that one may recover the individual trifocal tensors directly from the second order derivatives of the multibody trilinear constraint. In Section 4.4 we show how to refine the estimates of the linear algorithm by applying the EM algorithm to a mixture of trifocal tensors model.

4.1. From T to epipolar lines

Given the trifocal tensor T , it is well known how to compute the epipolar lines in the second and third views of a point x in the first view [4]. Specifically, notice that the matrix

    Mx = (xT ) = (R′xe′′> − e′x>R′′>) ∈ R3×3      (9)

has rank 2. In fact its left null-space is `′x = e′ × (R′x) and its right null-space is `′′x = e′′ × (R′′x), i.e. the epipolar lines of x in the second and third views, respectively. In brief:

Lemma 1 The epipolar line `′x in the second view corresponding to a point x in the first view is the line such that x`′xT = 0. Similarly, the epipolar line `′′x in the third view is the line satisfying x`′′xT = 0. Therefore, rank(xT ) = 2.
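Lemma 1 is easy to verify numerically for one motion (a sketch with hypothetical cameras): the left and right null vectors of Mx = xT are epipolar lines that pass through the corresponding point and the epipole of their view.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical cameras P = [I 0], P' = [R' e'], P'' = [R'' e''].
Rp, ep = rng.standard_normal((3, 3)), rng.standard_normal(3)
Rpp, epp = rng.standard_normal((3, 3)), rng.standard_normal(3)
T = np.einsum('ji,k->ijk', Rp, epp) - np.einsum('j,ki->ijk', ep, Rpp)

X = rng.standard_normal(4)
x = X[:3]
xp = Rp @ X[:3] + X[3] * ep
xpp = Rpp @ X[:3] + X[3] * epp

Mx = np.einsum('i,ijk->jk', x, T)        # the matrix (9)
U, s, Vt = np.linalg.svd(Mx)
lpx = U[:, 2]                            # left null vector:  epipolar line in view 2
lppx = Vt[2]                             # right null vector: epipolar line in view 3

# rank(Mx) = 2; l'x passes through x' and e', l''x through x'' and e''.
```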

In the case of multiple motions, we are faced with the more challenging problem of computing the epipolar lines `′x and `′′x without knowing the individual trifocal tensors {Ti}ni=1 or the segmentation of the correspondences. The question is then how to compute such epipolar lines from the multibody trifocal tensor T . To this end, we notice that with each point x in the first view we can associate n epipolar lines {`′ix}ni=1, each one of them corresponding to one of the n motions between the first and second views. One of these n epipolar lines, `′x, is the true epipolar line corresponding to x according to its own motion. We thus have x`′ixTi = 0, which implies that for any line `′′ in the third view x`′ix`′′Ti = 0. Now, since the span of ˜̀′′ for all `′′ ∈ R3 is RMn, we have that for all i = 1, . . . , n

    ∀`′′ [ ∏nk=1 (x`′ix`′′Tk) = (x̃ ˜̀′ix ˜̀′′T ) = 0 ] ⇐⇒ (x̃ ˜̀′ixT = 0).

    We have shown the following result.

Theorem 3 If `′ix and `′′ix are the epipolar lines in the second and third views corresponding to a point x in the first view according to the ith motion, then x̃ ˜̀′ixT = x̃ ˜̀′′ixT = 0 ∈ RMn. Therefore, rank(x̃T ) ≤ Mn − n.

This result alone does not help us to find `′ix according to a given motion, since any one of the n epipolar lines `′ix will satisfy the above condition. The question of determining the epipolar line `′x corresponding to a point x is not well posed as such, since the epipolar line `′x depends on which of the n motions the point x belongs to, which cannot be determined without additional information. We therefore pose the question a little differently, and suppose that we know the point x′ in the second view corresponding to x and wish to find the epipolar line `′x also in the second view. This epipolar line must of course pass through x′.

To solve this problem, we begin by noticing that `′x can be parameterized as

    `′x = `′1 + α`′2                              (10)

where, as before, `′1 and `′2 are two different lines passing through x′. From Theorem 3 we have that

    x̃ ˜(`′1 + α`′2)T = 0.                         (11)

Each of the Mn components of this vector is a polynomial of degree n in α. These polynomials must have a common root α∗ at which all the polynomials (and hence the vector) vanish. The epipolar line of x in the second view is then `′x = `′1 + α∗`′2. In practice, we do not need to consider all the Mn polynomials, but can instead find the common root of random linear combinations of these polynomials. We therefore have the following algorithm for computing epipolar lines from the multibody trifocal tensor.

Algorithm 2 (Estimating epipolar lines from T ) Given a point x in the first view,

1. Choose two different lines `′1 and `′2 passing through its corresponding point x′ in the second view. Choose N` ≥ 2 vectors {w′′k ∈ RMn}N`k=1 and build the polynomials q′k(α) = x̃ ˜(`′1 + α`′2)w′′kT , k = 1, . . . , N`. Compute the common root α∗ of these N` polynomials as the root of q′(α) = ∑N`k=1 q′k(α)^2 that minimizes q′(α). The epipolar line of x in the second view is given by `′x = `′1 + α∗`′2.

2. Given a correspondence x ↔ x′′, determine its epipolar line in the third view `′′x in an entirely analogous way.
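The common-root computation in step 1 can be illustrated in isolation (our sketch; the polynomials below are synthetic stand-ins for the q′k, with a hypothetical common root): minimize the sum of squares q′(α) over its real critical points.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha_star = 1.5                         # hypothetical common root

# Synthetic degree-2 polynomials sharing the root alpha_star.
qs = [np.polynomial.Polynomial.fromroots([alpha_star, r])
      for r in rng.standard_normal(4)]

q = sum(p * p for p in qs)               # q(a) = sum_k q_k(a)^2 >= 0
crit = q.deriv().roots()                 # critical points of q
crit = np.array([c.real for c in crit if abs(c.imag) < 1e-8])
alpha = crit[np.argmin(q(crit))]         # the minimizer is the common root
```

Since q is a nonnegative sum of squares that vanishes exactly at the common root, the minimizing critical point recovers it.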

We may apply the above process to all correspondences {xj ↔ x′j ↔ x′′j}Nj=1 and obtain the set of all N epipolar lines in the second and third views according to the motion associated with each correspondence. Notice, again, that this is done from the multibody trifocal tensor only, without knowing the individual trifocal tensors or the segmentation of the correspondences.

It is also useful to note that the only property of `′1 and `′2 that we used in the above algorithm was that the desired epipolar line `′x could be expressed as a linear combination of `′1 and `′2. If instead we knew the epipoles corresponding to the required motion, then we could choose `′1 and `′2 to be any two lines passing through the epipole. Algorithm 2 could then be used to determine the epipolar line `′x.

Observe, therefore, that once we know the set of epipoles corresponding to the n motions, we may compute the epipolar lines corresponding to any point x in the first image. Consequently, we can determine the fundamental matrices and, as we will see in Section 4.3, the individual trifocal tensors. Before proceeding, we need to show how the epipoles may be determined, which we do in the next section.

4.2. From T to epipoles

In the case of one rigid-body motion, the epipoles in the second and third views e′ and e′′ must lie on the epipolar lines in the second and third views, {`′xj}Nj=1 and {`′′xj}Nj=1, respectively. Thus we can obtain the epipoles from

    e′>[`′x1, . . . , `′xN] = 0 and e′′>[`′′x1, . . . , `′′xN] = 0.   (12)

Clearly, we only need 2 epipolar lines to determine the epipoles, hence we do not need to compute the epipolar lines for all points in the first view. However, it is better to use more than two lines in the presence of noise.

In the case of n motions there exist n epipole pairs {(e′i, e′′i)}ni=1, where e′i and e′′i are the epipoles in the second and third views corresponding to the ith motion. Now, given a set of correspondences {xj ↔ x′j ↔ x′′j} we may compute the multibody trifocal tensor, and then for each correspondence determine its epipolar lines `′xj and `′′xj by the method described in Section 4.1. Then, for each pair of epipolar lines (`′xj, `′′xj) there exists an epipole pair (e′i, e′′i) such that

    e′i>`′xj = 0 and e′′i>`′′xj = 0.              (13)

Our task is two-fold. First, we need to find the set of epipole pairs {(e′i, e′′i)}. Second, we need to determine which pair of epipoles lies on the epipolar lines (`′xj, `′′xj) derived from a given point correspondence.

If two point correspondences xj ↔ x′j ↔ x′′j and xk ↔ x′k ↔ x′′k both belong to the same motion, then the pair of epipoles can be determined easily by intersecting the epipolar lines. If the two motions are different, then the intersection points of the epipolar lines will have no geometric meaning, and will be essentially arbitrary. This suggests an approach to determining the epipoles based on RANSAC [3], in which we intersect pairs of epipolar lines to find candidate epipoles, and determine their degree of support among the other point correspondences. This method is expected to be effective with small numbers of motions.

In reality, we used a different method based on the idea of multibody epipoles proposed in [10] for the case of two views, which we now extend and modify for the case of three views. Notice from (13) that, regardless of the motion associated with each pair of epipolar lines, we must have

    ∏ni=1 (e′i>`′x) = c′>˜̀′x = 0 and ∏ni=1 (e′′i>`′′x) = c′′>˜̀′′x = 0,

where the multibody epipoles c′ ∈ RMn and c′′ ∈ RMn are the coefficients of the homogeneous polynomials of degree n, p′(`′x) = c′>˜̀′x and p′′(`′′x) = c′′>˜̀′′x, respectively. Similarly to (12), we may obtain the multibody epipoles from

    c′>[˜̀′x1, . . . , ˜̀′xN] = 0 and c′′>[˜̀′′x1, . . . , ˜̀′′xN] = 0.   (14)

In order to estimate the epipoles, similarly to our results in Section 3, we notice that if the pair of epipolar lines (l'_x, l''_x) corresponds to the i-th motion, then the derivatives of p' and p'' at (l'_x, l''_x) give the epipoles e'_i and e''_i, i.e.,

    d/dl'_x (c'^T ~l'_x) ~ e'_i   and   d/dl''_x (c''^T ~l''_x) ~ e''_i.    (15)
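The derivative-based epipole recovery in (15) can be checked numerically. The sketch below, again for n = 2 and with our own helper names (an assumption, not the paper's code), builds the coefficient vector c' of p' directly from two known epipoles and verifies that the gradient of p' at an epipolar line of the first motion is parallel to e'_1.

```python
# Sketch of (15): the gradient of p(l) = c^T ~l, taken at an epipolar line of
# motion i, is proportional to the epipole e_i. n = 2; helpers are our own.
import numpy as np

def veronese2(l):
    l1, l2, l3 = l
    return np.array([l1*l1, l1*l2, l1*l3, l2*l2, l2*l3, l3*l3])

def veronese2_jac(l):
    """Jacobian d(~l)/dl of the degree-2 Veronese map, a 6 x 3 matrix."""
    l1, l2, l3 = l
    return np.array([[2*l1, 0, 0], [l2, l1, 0], [l3, 0, l1],
                     [0, 2*l2, 0], [0, l3, l2], [0, 0, 2*l3]])

rng = np.random.default_rng(1)
e1, e2 = rng.standard_normal(3), rng.standard_normal(3)

# Coefficients of p(l) = (e1^T l)(e2^T l) in the monomial basis of veronese2.
a, b = e1, e2
c = np.array([a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0],
              a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[2]*b[2]])

# An epipolar line of the first motion: e1^T l = 0.
l = np.cross(e1, rng.standard_normal(3))

grad = veronese2_jac(l).T @ c        # gradient of p at l, as in (15)
grad /= np.linalg.norm(grad)
e1u = e1 / np.linalg.norm(e1)
print(abs(grad @ e1u))               # ~1: the gradient is parallel to e1
```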

In the noise-free case, this means that we can immediately obtain the epipoles by evaluating the derivatives of p' and p'' at different epipolar lines. Epipolar lines belonging to the same motion will give the same epipoles, hence we can automatically cluster all the correspondences. With noisy measurements, however, the derivatives of p' and p'' will not be equal for two pairs of epipolar lines corresponding to the same motion. Nevertheless, we may use (15) to compute the (unit) epipoles (e'_{x_j}, e''_{x_j}) and (e'_{x_k}, e''_{x_k}) from the derivatives of (p', p'') at (l'_{x_j}, l''_{x_j}) and (l'_{x_k}, l''_{x_k}), respectively. Then the similarity measure

    S_{jk} = (1/2) ( |e'_{x_j}^T e'_{x_k}| + |e''_{x_j}^T e''_{x_k}| )    (16)

is approximately one for points j and k in the same group, and strictly less than one for points in different groups. Given the so-defined similarity matrix S in R^{N x N}, one can apply any spectral clustering technique to obtain the segmentation of the correspondences, and then the epipoles, fundamental matrices and camera matrices for each group. We therefore have the following algorithm for computing the epipoles and clustering the correspondences.

Algorithm 3 (Estimating epipoles from T) Given a set of epipolar lines {(l'_{x_j}, l''_{x_j})}_{j=1}^N:

1. Compute the multibody epipoles c' and c'' from (14).

2. Compute the epipole at each epipolar line from the derivatives of the polynomials p' and p'' as in (15).

3. Define a pairwise similarity matrix as in (16) and apply spectral clustering to segment the epipolar lines, hence the original point correspondences.

4. Compute the epipoles e'_i and e''_i, i = 1, ..., n, for each one of the n groups of epipolar lines as in (12).
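A bare-bones sketch of steps 2-3 of Algorithm 3, assuming the per-correspondence unit epipoles have already been obtained from (15): we form the similarity of (16), restricted to the second view for brevity, and split two motions using the sign pattern of the second eigenvector of S. This is a minimal stand-in for a full spectral clustering routine, under our own synthetic setup.

```python
# Sketch of the spectral segmentation step, assuming per-point unit epipole
# estimates are already available. Synthetic data; our own construction.
import numpy as np

rng = np.random.default_rng(2)

def unit(v):
    return v / np.linalg.norm(v)

e1, e2 = unit(rng.standard_normal(3)), unit(rng.standard_normal(3))

# Noisy per-point epipole estimates: 8 points per motion.
est = [unit(e1 + 0.01 * rng.standard_normal(3)) for _ in range(8)] + \
      [unit(e2 + 0.01 * rng.standard_normal(3)) for _ in range(8)]
E = np.stack(est)                      # N x 3

S = np.abs(E @ E.T)                    # S_jk = |e_xj^T e_xk|, cf. (16)

# For two groups, S is close to a 2-block matrix with within-block entries ~1.
# The second principal eigenvector then separates the blocks by sign.
w, V = np.linalg.eigh(S)               # eigenvalues in ascending order
v2 = V[:, -2]
labels = (v2 > 0).astype(int)
print(labels)
```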


4.3. From T to trifocal tensors

The algorithm for motion segmentation that we have proposed so far computes the motion parameters (trifocal tensors, camera matrices and fundamental matrices) by first clustering the image correspondences using the geometric information provided by epipoles and epipolar lines. In this section, we demonstrate that one can estimate the individual trifocal tensors without first clustering the image correspondences. The key is to look at the second order derivatives of the multibody trilinear constraint. Therefore, we contend that all the geometric information about the multiple motions is already encoded in the multibody trifocal tensor.

Let x be an arbitrary point in P^2 (not necessarily a point in the first view). Also let l'_{ix} and l''_{ix} be its corresponding epipolar lines in the second and third views according to the i-th motion. Then we have the following result.

Theorem 4 (Slices of the trifocal tensors from second order derivatives of the multibody trilinear constraint) The second order derivative of the multibody trilinear constraint with respect to its second and third arguments, evaluated at (x, l'_{ix}, l''_{ix}), gives the matrix M_{ix} ~ x T_i in R^{3x3}, i.e.,

    d^2(~x ~l' ~l'' T) / dl' dl'' |_{(x, l'_{ix}, l''_{ix})} = M_{ix}.    (17)

Proof. A simple calculation shows that

    d^2(~x ~l' ~l'' T) / dl' dl''
      = sum_{j=1}^n (x T_j) prod_{k != j} (x l' l'' T_k)
      + sum_{j=1}^n (x l' T_j) sum_{k != j} (x l'' T_k) prod_{m != j,k} (x l' l'' T_m).

Since l'_{ix} and l''_{ix} are the epipolar lines associated with the i-th motion, we have x l'_{ix} T_i = x l''_{ix} T_i = 0, so every term containing a factor of T_i other than (x T_i) vanishes. Therefore,

    d^2(~x ~l' ~l'' T) / dl' dl'' |_{(x, l'_{ix}, l''_{ix})} = (x T_i) prod_{j != i} (x l'_{ix} l''_{ix} T_j) ~ (x T_i).
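Theorem 4 can be verified numerically. In the sketch below (our own construction, not from the paper), each trilinear form is written as l'^T M_j l'' with M_j = sum_k x_k T_j[k]; the tensors are random stand-ins whose slices are forced to rank 2, mimicking genuine trifocal tensor slices so that the epipolar lines (null vectors of x T_i) exist, and the mixed second derivative is taken by central differences.

```python
# Numerical check of Theorem 4 for n = 2 motions. Random stand-in tensors,
# with rank-2 slices so that left/right null vectors (epipolar lines) exist.
import numpy as np

rng = np.random.default_rng(3)
n = 2
T = rng.standard_normal((n, 3, 3, 3))   # stand-ins for trifocal tensors
x = rng.standard_normal(3)

# Slices x T_i as 3x3 matrices: (x T_i)_{ab} = sum_k x_k T_i[k, a, b].
M = np.einsum('k,kab->ab', x, T[0]), np.einsum('k,kab->ab', x, T[1])

def rank2(A):
    """Force rank 2 by zeroing the smallest singular value."""
    U, s, Vt = np.linalg.svd(A)
    s[-1] = 0.0
    return U @ np.diag(s) @ Vt

M = np.stack([rank2(Mi) for Mi in M])
i = 0
lp = np.linalg.svd(M[i])[0][:, -1]      # l'_ix:  lp^T M_i = 0
lpp = np.linalg.svd(M[i])[2][-1]        # l''_ix: M_i lpp = 0

# Multibody trilinear constraint as a function of the two lines.
f = lambda a, b: np.prod([a @ Mj @ b for Mj in M])

# Mixed second derivative of f at (lp, lpp) by central differences.
eps, I3 = 1e-5, np.eye(3)
H = np.zeros((3, 3))
for p in range(3):
    for q in range(3):
        da, db = I3[p] * eps, I3[q] * eps
        H[p, q] = (f(lp + da, lpp + db) - f(lp + da, lpp - db)
                   - f(lp - da, lpp + db) + f(lp - da, lpp - db)) / (4 * eps**2)

# By the product rule, H collapses to M_i times the surviving factors (j != i).
scale = np.prod([lp @ M[j] @ lpp for j in range(n) if j != i])
print(np.max(np.abs(H - M[i] * scale)))  # ~0
```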

Thanks to Theorem 4, we can immediately outline an algorithm for computing the individual trifocal tensors. We just take x = (1, 0, 0)^T, x = (0, 1, 0)^T and x = (0, 0, 1)^T, and we immediately obtain the slices of the i-th trifocal tensor T_i according to its first coordinate. Unfortunately, the above procedure suffers from the following two problems:

1. Since the point x is not necessarily a point in the first view, we cannot compute its epipolar lines (l'_x, l''_x) using Algorithm 2, because we do not know the corresponding points in the other views. Furthermore, (17) needs the epipolar lines according to all n motions.

2. The slice x T_i of the i-th trifocal tensor is only obtained up to a scale factor. Thus, we need an additional procedure to obtain the relative scales among the slices.

The first problem can be easily solved, because the epipolar lines we are looking for must pass through the epipoles, which are already known. Therefore, for each epipole in the second view, e'_i, we compute two lines l'_{i1} and l'_{i2} passing through e'_i and apply Algorithm 2 to compute l'_{ix}. Similarly, we compute two lines passing through e''_i and then apply Algorithm 2 to compute l''_{ix}. By repeating this process for the n epipoles, we obtain the epipolar lines {l'_{ix}}_{i=1}^n and {l''_{ix}}_{i=1}^n for any x in P^2, whether or not it is a point in the first view.

The second problem can also be easily solved. Let e_1 = (1, 0, 0)^T, e_2 = (0, 1, 0)^T and e_3 = (0, 0, 1)^T be the standard basis for R^3. Also let x = (x_1, x_2, x_3)^T be any point in P^2 not equal to e_1, e_2 or e_3. Then, with the above procedure, we can compute the four 3 x 3 matrices M_{i,e_1} = lambda_1^{-1} (e_1 T_i), M_{i,e_2} = lambda_2^{-1} (e_2 T_i), M_{i,e_3} = lambda_3^{-1} (e_3 T_i) and M_{ix} = lambda^{-1} (x T_i), up to unknown scale factors lambda_1, lambda_2, lambda_3 and lambda, respectively. We therefore have

    lambda M_{ix} = lambda_1 M_{i,e_1} x_1 + lambda_2 M_{i,e_2} x_2 + lambda_3 M_{i,e_3} x_3,    (18)

which gives a total of 9 linear equations in the 4 unknown scales. After solving for the scales lambda_1, lambda_2, lambda_3 from (18), one can recover the individual trifocal tensors as

    T_i = [lambda_1 M_{i,e_1}  lambda_2 M_{i,e_2}  lambda_3 M_{i,e_3}],   i = 1, ..., n.    (19)
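The scale recovery in (18)-(19) amounts to a 9 x 4 homogeneous least-squares problem. The following sketch, with random stand-ins for the slices e_k T_i (not tensors estimated from data), recovers the hidden scales from the null vector of the stacked system and reassembles T_i up to a single common scale.

```python
# Sketch of solving (18): stack the 9 entries of
# lambda*M_x - sum_k lambda_k M_k x_k = 0 into a 9x4 homogeneous system and
# take its null vector. The slices A[k] are random stand-ins for e_k T_i.
import numpy as np

rng = np.random.default_rng(4)

A = rng.standard_normal((3, 3, 3))    # true slices e_k T_i, k = 1, 2, 3
x = rng.standard_normal(3)
Ax = np.einsum('k,kab->ab', x, A)     # true x T_i = sum_k x_k (e_k T_i)

lam_true = rng.uniform(0.5, 2.0, size=4)     # hidden scales (lambda, lambda_1..3)
Mx = Ax / lam_true[0]
M = [A[k] / lam_true[k + 1] for k in range(3)]

# Columns multiply the unknowns (lambda, lambda_1, lambda_2, lambda_3).
B = np.column_stack([Mx.ravel()] + [-(M[k] * x[k]).ravel() for k in range(3)])
lam = np.linalg.svd(B)[2][-1]         # null vector, defined up to global scale

# Recover T_i = [lambda_1 M_1, lambda_2 M_2, lambda_3 M_3] as in (19) and
# compare with the true slices up to one common scale factor.
T_rec = np.stack([lam[k + 1] * M[k] for k in range(3)])
s = (T_rec * A).sum() / (T_rec * T_rec).sum()   # best common scale
print(np.max(np.abs(s * T_rec - A)))            # ~0
```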

Once the individual trifocal tensors have been computed, one may cluster the correspondences by assigning each feature to the trifocal tensor T_i that minimizes the Sampson error. Alternatively, one may first reconstruct the 3D structure by triangulation, project those 3D points onto the three views, and then assign each point to the trifocal tensor T_i that minimizes the reprojection error. We refer the reader to [4] for details of the computation of both errors.

4.4. Iterative refinement by EM

The motion segmentation algorithm we have proposed so far is purely geometric and provably correct in the absence of noise. Since most of the steps of the algorithm involve solving linear systems, the algorithm will also work with a moderate level of noise (as we show in the experiments), provided that one solves each step in a least-squares fashion.

In order to obtain an optimal estimate for the trifocal tensors and the segmentation of the correspondences, we assume a generative model in which the rigid-body motions occur with probabilities {0 <= pi_i <= 1}_{i=1}^n, sum_{i=1}^n pi_i = 1, and the correspondences are corrupted with zero-mean i.i.d. Gaussian noise with variances {sigma_i^2}_{i=1}^n. Let z_{ij} = 1 denote the event that the j-th correspondence belongs to the i-th motion. Also let eps_{ij} = RepErr(x_j, x'_j, x''_j, T_i) be the reprojection error for correspondence j according to motion i. Then the complete log-likelihood (neglecting constant factors) of the data (x_j, x'_j, x''_j) and the latent variables z_{ij} is given by

    log prod_{j=1}^N prod_{i=1}^n ( (pi_i / sigma_i) exp( -eps_{ij} / (2 sigma_i^2) ) )^{z_{ij}}
      = sum_{j=1}^N sum_{i=1}^n z_{ij} ( log(pi_i / sigma_i) - eps_{ij} / (2 sigma_i^2) ).


In order to obtain a maximum likelihood estimate for the parameters theta = {(T_i, sigma_i, pi_i)}_{i=1}^n given the data X = {(x_j, x'_j, x''_j)}_{j=1}^N, we maximize the above function using the Expectation-Maximization algorithm [2] as follows.

E-step: Computing the expected log-likelihood. Given a current estimate of the parameters, we compute the expected value of the latent variables

    w_{ij} := E[z_{ij} | X, theta] = P(z_{ij} = 1 | X, theta)
            = (pi_i / sigma_i) exp( -eps_{ij} / (2 sigma_i^2) ) / sum_{k=1}^n (pi_k / sigma_k) exp( -eps_{kj} / (2 sigma_k^2) ).

Then the expected complete log-likelihood is given by

    sum_{j=1}^N sum_{i=1}^n w_{ij} ( log(pi_i) - log(sigma_i) ) - w_{ij} eps_{ij} / (2 sigma_i^2).    (20)

M-step: Maximizing the expected log-likelihood. After differentiating (20) w.r.t. the trifocal tensors, we notice that the optimization problem decouples into n problems of the form min sum_{j=1}^N w_{ij} eps_{ij}, each of which can be solved using standard structure from motion algorithms with the correspondences weighted according to w_{ij}. Then we solve for pi_i using Lagrange multipliers, maximizing

    sum_{i=1}^n sum_{j=1}^N w_{ij} log(pi_i) + lambda (1 - sum_{i=1}^n pi_i)   ==>   pi_i = (sum_{j=1}^N w_{ij}) / N.

Finally, after differentiating (20) w.r.t. sigma_i we obtain

    sigma_i^2 = (sum_{j=1}^N w_{ij} eps_{ij}) / (sum_{j=1}^N w_{ij}).
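The E- and M-steps above have the following shape in code. This is a schematic sketch with scalar stand-in residuals eps[i, j] in place of RepErr(x_j, x'_j, x''_j, T_i); the trifocal-tensor update of the M-step is omitted, leaving only the closed-form pi_i and sigma_i^2 updates derived above.

```python
# One EM loop for the mixture of Section 4.4, with synthetic stand-in
# residuals eps[i, j] instead of true reprojection errors.
import numpy as np

rng = np.random.default_rng(5)
n, N = 2, 100
eps = rng.uniform(0.0, 1.0, size=(n, N))   # stand-in reprojection errors
eps[0, :50] *= 0.01                        # first half fits motion 0 well
eps[1, 50:] *= 0.01                        # second half fits motion 1 well

pi = np.full(n, 1.0 / n)
sigma2 = np.full(n, 1.0)

for _ in range(10):
    # E-step: w_ij ~ (pi_i / sigma_i) exp(-eps_ij / (2 sigma_i^2)), normalized over i.
    w = (pi / np.sqrt(sigma2))[:, None] * np.exp(-eps / (2 * sigma2[:, None]))
    w /= w.sum(axis=0, keepdims=True)
    # M-step (closed-form part): update pi_i and sigma_i^2.
    pi = w.sum(axis=1) / N
    sigma2 = (w * eps).sum(axis=1) / w.sum(axis=1)

labels = w.argmax(axis=0)   # hard segmentation from the responsibilities
print(pi)
```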

The EM algorithm proceeds by iterating between the E-step and the M-step until the estimates converge to a local maximum.

5. Experiments

In this paper, we consider the following algorithms.

1. Algebraic I: this algorithm clusters the correspondences using epipoles and epipolar lines computed from the multibody trifocal tensor, as in Algorithms 1-3.

2. Algebraic II: this algorithm first computes the epipoles using Algorithms 1-3 and the trifocal tensors as in Section 4.3, and then clusters the point correspondences into motion classes according to the Sampson-distance residual to the different motions.

3. K-means: this algorithm alternates between computing (linearly) the trifocal tensors for the different motion classes and clustering the point correspondences using the Sampson-distance residual to the different motions.

4. EM: this algorithm refines the classification and the motion parameters as described in Section 4.4. For ease of computation, in the M-step we first compute each trifocal tensor linearly as in Section 2.1. If the error (20) increases, we recompute the trifocal tensors using the linear algebraic algorithm in [4]. If the error (20) still increases, we use Levenberg-Marquardt to solve for the trifocal tensors optimally.

Figure 1 shows three views of the Tshirt-Book-Can sequence, which contains two rigid-body motions, those of the camera and of the can, for which we manually extracted a total of N = 140 correspondences, 70 per motion. Figure 1 also shows the relative displacement of the correspondences between pairs of frames. We first ran 1000 trials of the K-means algorithm starting from different random classifications. On average, the K-means algorithm needs 39 iterations (the maximum was set to 50) to converge, and yields a misclassification error of about 24.6%, as shown in Table 1. The (non-iterative) algebraic algorithms, on the other hand, give misclassification errors of 24.3% and 23.6%. Running the K-means algorithm starting from the clustering produced by the second algebraic algorithm resulted in convergence after 3 iterations to a misclassification error of 7.1%. Finally, after 10 iterations of the EM algorithm, the misclassification error reduced to 1.4%.

Figure 1: Top: views 1-3 of a sequence with two rigid-body motions. Bottom: 2D displacements of the 140 correspondences from the current view ('o') to the next ('->').

We also tested the performance of our algorithm on two sequences with transparent motions, so that the only cue for clustering the correspondences is the motion cue. Given a set of correspondences from a sequence with one rigid-body motion, we generated a second set of correspondences by flipping the x and y coordinates of the first set. In this way, we obtain a set of correspondences with no spatial separation, undergoing two different rigid-body motions. Figure 2 shows frames 1, 4 and 7 of the Wilshire sequence and the inter-frame displacement of the N = 164 x 2 = 328 correspondences. As shown in Table 1, the K-means algorithm gives a mean misclassification error (over 1000 trials) of 35.5%, with a mean number of iterations of 47.1. The algebraic algorithms give errors of 4.1% and 2.5%. Following them with K-means did not improve the classification, while following them with EM achieved a perfect segmentation. Figure 3 shows two out of three frames from the Tea-Tins sequence and the inter-frame displacement of the N = 42 x 2 = 84 correspondences. As shown in Table 1, the K-means algorithm gives a mean error of 32.0%, with a mean number of iterations of 49.8. The algebraic algorithms give errors of 15.4% and 10.7%, which are reduced to 9.5% by K-means and to 0% by EM.


Table 1: Percentage of misclassification of each algorithm.

    Sequence         | K-means | Alg. I | Alg. II | Alg. II + K-means | Alg. II + K-means + EM
    Tshirt-Book-Can  |  24.6%  | 24.3%  |  23.6%  |       7.1%        |          1.4%
    Wilshire         |  39.5%  |  4.1%  |   2.5%  |       2.5%        |          0.0%
    Tea-Tins         |  32.0%  | 15.4%  |  10.7%  |       9.5%        |          4.8%

Figure 2: Top: frames 1, 4 and 7 of the Wilshire sequence. Bottom: 2D displacement of the 328 original and flipped correspondences from the current view ('o') to the next ('->').

Figure 3: Two (out of three) views of the Tea-Tins sequence (frames 83 and 94) and the 2D displacement of the 84 original and flipped correspondences between them.

We also tested our algorithm on synthetic data. We randomly generated two groups of 100 3D points each, with a depth variation of 100-400 units of focal length (u.f.l.). The two rigid-body motions were chosen at random, with an inter-frame rotation of 5 degrees and an inter-frame translation of 30 u.f.l. We added zero-mean i.i.d. Gaussian noise with standard deviation between 0 and 1 pixels, for an image size of 1000 x 1000. Figure 4 shows the percentage of misclassified correspondences and the error in the estimation of the epipoles (in degrees) over 100 trials. The K-means algorithm usually converges to a local minimum due to bad initialization. The algebraic algorithms (I and II) achieve misclassification ratios of about 20.2% and 9.1% and translation errors of 22.4 and 11.7 degrees, respectively, for 1 pixel of noise. These errors are reduced to about 2.9% and 3.8 degrees by the K-means algorithm, and to 2.4% and 2.8 degrees by the EM algorithm. This is expected, as the algebraic algorithms do not enforce the nonlinear algebraic structure of the multibody trifocal tensors. The K-means algorithm improves the estimates by directly clustering the correspondences using the trifocal tensors. The EM algorithm further improves the estimates in a probabilistic fashion, at the expense of a higher computational cost.

Figure 4: Motion segmentation (misclassification, %) and motion estimation (translation, degrees) errors as a function of the noise level (pixels), for K-means, LinAlg1, LinAlg2, LinAlg2+Kmeans and LinAlg2+Kmeans+EM.

6. Conclusions

The multibody trifocal tensor is effective in the analysis of dynamic scenes involving several moving objects. The algebraic method of motion classification involves computing the multibody tensor, computing the epipoles for the different motions, and classifying the points according to the compatibility of their epipolar lines with the different epipoles. Our reported implementation of this algorithm is sufficiently good to provide an initial classification of points into different motion classes. This classification can be refined using a K-means or EM algorithm, with excellent results. It is likely that more careful methods of computing the tensor (analogous to the best methods for the single-body trifocal tensor) could give a better initialization.

    The algebraic properties of the multibody trifocal tensorare in many respects analogous to those of the single-bodytensor, but provide many surprises and avenues of researchthat we have not yet exhausted.

References

[1] J. Costeira and T. Kanade. Multi-body factorization methods for motion analysis. In ICCV, pages 1071-1076, 1995.

[2] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38, 1977.

[3] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 26:381-395, 1981.

[4] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[5] K. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, volume 2, pages 586-591, 2001.

[6] Y. Ma, K. Huang, R. Vidal, J. Kosecká, and S. Sastry. Rank conditions on the multiple view matrix. IJCV, 2004. To appear.

[7] P. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(3):297-303, 2001.

[8] P. H. S. Torr. Geometric motion segmentation and model selection. Phil. Trans. Royal Society of London A, 356(1740):1321-1340, 1998.

[9] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In CVPR, 2003.

[10] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. International Journal of Computer Vision, 2004.

[11] L. Wolf and A. Shashua. Two-body segmentation from two perspective views. In CVPR, pages 263-270, 2001.
