Intrinsic Dense 3D Surface Tracking - Max Planck...

Intrinsic Dense 3D Surface Tracking

Yun Zeng1, Chaohui Wang2,3, Yang Wang1, Xianfeng Gu1, Dimitris Samaras1, Nikos Paragios2,3

1Computer Science Department, Stony Brook University, New York, USA2Laboratoire MAS, Ecole Centrale Paris, Chatenay-Malabry, France

3Equipe GALEN, INRIA Saclay - Ile-de-France, Orsay, France

Abstract

This paper presents a novel intrinsic 3D surface dis-tance and its use in a complete probabilistic tracking frame-work for dynamic 3D data. Registering two frames of a de-forming 3D shape relies on accurate correspondences be-tween all points across the two frames. In the general casesuch correspondence search is computationally intractable.Common prior assumptions on the nature of the deforma-tion such as near-rigidity, isometry or learning from a train-ing set, reduce the search space but often at the price ofloss of accuracy when it comes to deformations not in theprior assumptions. If we consider the set of all possible3D surface matchings defined by specifying triplets of cor-respondences in the uniformization domain, then we intro-duce a new matching cost between two 3D surfaces. Thelowest feature differences across this set of matchings thatcause two points to correspond, become the matching costof that particular correspondence. We show that for sur-face tracking applications, the matching cost can be effi-ciently computed in the uniformization domain. This match-ing cost is then combined with regularization terms thatenforce spatial and temporal motion consistencies, into amaximum a posteriori (MAP) problem which we approx-imate using a Markov Random Field (MRF). Compared toprevious 3D surface tracking approaches that either assumeisometric deformations or consistent features, our methodachieves dense, accurate tracking results, which we demon-strate through a series of dense, anisometric 3D surfacetracking experiments.

1. Introduction

Dynamic 3D data has become increasingly popular withthe advances in 3D reconstruction techniques [1, 12, 20, 26,33]. An important prerequisite for most applications is toregister the 3D data among frames. For applications suchas facial expression analysis/transfer [7, 23], dense and ac-curate registration is highly desired for capturing subtle de-

tails. However, achieving dense, accurate registration re-mains challenging when there is noise, large deformationsand lack of reliable features. In this paper, we addressthe challenging problem of tracking a deformable templatefrom dynamic, markerless 3D data.

According to the well-known Riemann uniformizationtheorem [8], any simply-connected surface with a Rieman-nian metric can be conformally deformed onto one of threecanonical spaces: the sphere, the plane and the hyperbolicdisk. By using uniformization, 3D geometric problems arenaturally converted to 2D ones, which in general simpli-fies computation. Most importantly, when the deformationsbetween surfaces are isometric, matching between two sur-faces can be greatly simplified in the uniformization domainby only searching for a few correspondences [15].

Previously work [25, 26, 29, 30] relied on consistent fea-ture boundaries or points to determine prescribed sparsecorrespondences, in order to match two surfaces in theuniformization domain. Once the sparse correspondencesare established, a single conformal energy is minimized tomatch between the whole surfaces. To avoid relying onconsistent feature points, Lipman et al. [16] observed thatwhen two surfaces are isometrically deformed, only threecorrespondences are needed to determine a unique confor-mal mapping which is described by a Mobius transforma-tion. Sparse correspondences are optimized by voting fromdifferent Mobius transformations induced by different com-binations of correspondences. Recently, based on the factthat every three correspondences determine a unique con-formal mapping, leading to a candidate matching point forevery point on the surface, Zeng et al. [31] formulated ahigh-order graph matching problem to search for the opti-mal dense matching result by combining multiple matchingcriteria. Despite its success in combining multiple match-ing criteria to handle more than isometric deformations, thesingleton term that defines textural and geometric similar-ities for each matching candidate is evaluated only point-wise and hence such a term is sensitive to noise. Most re-cently, Lipman et al. [15] proposed a new distance functionthat compares two neighborhoods (i.e., disks) around two

1225

points in the disk uniformization domain, which improvesthe robustness of the matching cost for each candidate cor-respondence. Nevertheless, this distance cannot handle gen-eral surface matching when the surfaces have inconsistentboundaries or are anisometrically deformed, because com-paring two neighborhoods directly in the uniformization do-main is not straightforward since a disk is no longer mappedto a disk under Mobius transformations [18].

In this paper, we define a new distance that compares theneighborhoods between any candidate matching pair evenwhen the two surfaces have inconsistent boundaries and arenot isometrically deformed. Since every three correspon-dences determine a unique conformal mapping between thetwo surfaces, we can define a matching cost based on fea-ture differences for every such possible match. However,globally searching for the best three correspondences thatmatch the two surfaces is limited because the two surfacescan be exactly matched only when they are isometricallydeformed and have consistent boundaries. Hence, we de-fine a cost function for a particular correspondence by thelowest feature differences across the set of transformationsthat cause the two points to match, which only involvessearching for the correspondences of another two points onthe surface. By restricting the comparison of feature dif-ferences only between the neighborhoods of the correspon-dence, we can handle surfaces with inconsistent boundariesor anisometric deformations. A matching cost between twoneighborhoods can be therefore efficiently computed sinceonly one conformal mapping is needed for one surface andthe other conformal mappings induced from different corre-spondence matches are computed in a closed form.

With the above mentioned matching cost for any candi-date correspondence, it is not enough to simply output suchlocally best match for each point due to multiple optima,noisy data and numerical errors. Therefore, regularizationis necessary for a plausible result. In this paper, we formu-late the surface tracking problem in a unified probabilisticinference framework that takes into account spatio-temporalconsistency as well as the possibility for drift error. Weshow that such an inference problem can be approximatedby standard MRF optimization methods in a discrete settingand occlusion can be appropriately handled. Combinatorialmethods based on graphical models have become populardue to their capabilities of solving for more complicateddeformations [10, 11, 21, 24] and avoiding local optimalsolutions [13, 14]. Besides, occlusion handling can be con-veniently modeled in the same framework [22].

In summary, the primary contributions of this paper area robust intrinsic distance function for measuring the costof matching two points and a unified framework for intrin-sic 3D surface tracking. To achieve a robust 3D trackingsystem, our framework includes an intrinsic spatial defor-mation prior that constrains consistency in local deforma-

tions among neighboring points as well as drift and occlu-sion handling. Our tracking method is computed in the uni-formization domain, so it is robust to large deformations andscale changes. Compared to existing tracking algorithmssuch as [3, 26], we do not require prescribed feature detec-tors and do not rely on consistent boundaries. Thereforeour method is able to handle surface tracking under chal-lenging situations as shown in our experiments with a de-forming sponge. Unlike the system of Weise et al. [28], wedo not require a pre-defined training set for PCA learning,which is important for accurately tracking both large pre-viously unseen variations in object deformation as well assubtle but significant differences in the case of facial expres-sion changes. Quantitative results show that our algorithmachieves a high level of accuracy.

The remainder of this paper is organized as follows. Thenew distance of matching correspondences is defined inSec. 2. In Sec. 3 we introduce our probabilistic 3D surfacetracking framework. The implementation details are givenin Sec. 4. Experimental results and validation are part ofSec. 5. Finally we conclude our work in Sec. 6.

2. A robust correspondence matching distanceWe assume a shape is represented in a metric feature

space (M, dM, fM), where M is a compact connected andcomplete Riemannnian surface, dM : M × M → R is ameasure of distances between pairs of points on M, andfM : M → Rn is the mapping of each point on M intothe feature space (such as curvatures, texture, etc.). Previ-ously, a number of metrics have been proposed for measur-ing the similarities between any two shapes based on geo-metric information only, e.g., the geodesics distance [4], thediffusion distance [5] and distances based on the conformalfactor [15, 30]. To compare two shapes (M, dM, fM) and(N , dN , fN ), we denote the set of possible mappings (e.g.,diffeomorphism) between them as TM→N . A distance be-tween any two shapes (M, dM, fM) and (N , dN , fN ) canbe defined as follows:

dT (M,N ) = inft∈TM→N

∫M

|fM(x)− fN (t(x))|dx. (1)

Such a definition resembles the sum of absolute differences(SAD) metric commonly used in motion estimation. Fur-thermore, in the context of surface registration, it providesa flexible way of handling a wide range of deformations be-tween surfaces. For example, when the feature is the con-formal factor [30] and the mappings TM→N are restrictedto Mobius transformations in the uniformization domain, ithandles isometric or near-isometric surface matching. Todeal with more general deformations, we may use other fea-tures such as texture and curvature [31].

In this paper, based on Eq. 1, we consider the followingdistance function for measuring the quality of a correspon-

1226

dence between a point p ∈ M and a point q ∈ N :

dTM,N (p, q) = inft∈TM→Nt(p)=q

∫M

|fM(x)− fN (t(x))|dx, (2)

which is defined by the cost of matching the two surfacesby fixing the particular correspondence. When there is nomapping in the group TM→N that maps p to q, we definethe distance to be infinite. Since the distance function ofEq. 2 is defined on the whole surface, when the transforma-tion group TM→N is confined to mappings with boundedarea distortions, a small deviation from the true matchingthat minimizes Eq. 1 would cause the energy measure todeviate significantly from the optimum, which guaranteesthe robustness of this distance measure. Hence such a dis-tance is more robust compared to distances based on localfeatures. It is easy to see that dTM,N (p, q) ≥ dT (M,N ) forany p ∈ M, q ∈ N and the lower bound is achieved when pis mapped to q under the transformation that minimizes theenergy of Eq. 1. In the problem of dense surface registra-tion, the goal is to find on N the correspondences of a pointset P = {pi|pi ∈ M, i = 1, . . . , n}. Since we have

dT (M,N ) ≤∑

p∈P dTM,N (p, t(p))

|P |,∀t ∈ TM→N , (3)

the problem of shape registration can be formulated as

arg inft∈TM→N

∑p∈P

dTM,N (p, t(p)). (4)

In the following, we show how the distance function can beefficiently approximated in the uniformization domain.

2.1. Approximation in the uniformization domain

Although the distance defined in Eq. 2 gives us a ro-bust way to evaluate the matching cost between points, itis in general computationally intractable to evaluate sucha distance function directly in the 3D embedding spacesince it involves searching among all possible matchingsbetween two surfaces given a correspondence. For exam-ple, in [32] the matching cost given a few sparse correspon-dences is measured by deforming the whole surface to thetarget based on a particular deformation energy; minimizingsuch an energy is computational possible for only approx-imately 10 correspondences. In this paper, we propose anefficient way to approximate the distance function of Eq. 2by considering a subset of the mapping set TM→N (notethat such an approximation is only used for evaluating thecorrespondence cost and the global shape registration prob-lem of Eq. 4 is considered in a general set of mappings).

In order to take into account mappings between two sur-faces with inconsistent boundaries, we consider a neighbor-hood N(p) of p and the points on its boundary ∂N(p) =

{p1, . . . , pr}. For each possible mapping of the neighbor-ing points p1, . . . , pr ∈ ∂N(p) , the distance function ofEq. 2 can be approximated by warping the neighborhoodN(p) to the neighborhood of the target. Directly comput-ing such warping is very costly. However, if we notice thata mapping between the two surfaces can be computed byspecifying a few feature correspondences and optimizinga conformal energy to find the mappings of all the otherpoints [25, 26, 30], we then have an efficient way to ap-proximate the distance function of Eq. 2. This motivates usto consider the mappings of the neighborhood N(p) in theuniformization domain U ⊂ C, where C denotes the com-plex domain.

Formally, we denote the uniformization (conformal map-ping) of any surface M as ϕM : M → U . Also we considerthe set of mappings T UNI that is induced by specifying threecorrespondences between two surfaces in the uniformiza-tion domain [16]. For any point p ∈ M, we define theimage of a point p as Img(p) = {t(p)|t ∈ TM→N }, whereTM→N can be arbitrary diffeomorphisms and we only re-quire that T UNI ⊂ TM→N . For any two points p1, p2 ∈∂N(p), q ∈ Img(p), q1 ∈ Img(p1) and q2 ∈ Img(p2), let usdenote by Mo : pp1p2 → qq1q2 the Mobius transformationthat maps (p, p1, p2) to (q, q1, q2) in U . We then approxi-mate the distance of Eq. 2 in the uniformization domain asfollows:

dUNIM,N (p, q) = inf

p1,p2∈∂N(p),q1∈Img(p1),q2∈Img(p2),Mo:pp1p2→qq1q2∫

ϕM(N(p))⊂U |fM(ϕ−1M (z))− fN (ϕ−1

N (Mo(z)))|dzArea(N(p))

. (5)

Here ϕM(N(p)) denotes the mapping of the neighborhoodN(p) to the uniformization domain and Area(N(p)) de-notes the area of the neighborhood. When the feature isonly based on geometry (e.g., the conformal factor [15] orthe gaussian curvature), the distance measures the devia-tion from isometric deformation. However, when the fea-ture is based on other measures such as texture, the distancemeasures how accurately the source surface has been de-formed onto the target surface, not necessarily by an iso-metric deformation. Therefore, our definition is general andsubsumes the isometric deformation as a special case.

3. Probabilistic 3D surface trackingThe intrinsic distance dUNI

M,N (·, ·) measures the likeli-hood of matching between individual points. Such a dataterm is then combined with regularization terms towardssolving the problem of surface registration. To this end,we investigate the 3D surface tracking problem in a proba-bilistic framework which takes into account geometric andtextural similarities, spatio-temporal consistencies as wellas error drift.

1227

Let us denote by M1:t ≡ {M1, . . . ,Mt} the dynamic3D data up to time t, and x1:t ≡ {xi ⊂ Mi|i = 1, . . . , t} asthe trajectory of the given initial dense points x0 = {x0

i ∈R3|i = 1, . . . , n} where xt = {xt

i ∈ R3|i = 1, . . . , n}. Inorder to utilize the intrinsic measure defined in the previoussection, we assume the initial points x0 are represented as a2-manifold mesh, i.e., a planar graph G = (V, E).

The task of tracking the trajectory of x0 at time t giventhe dynamic data M1:t and the previous trajectory x1:t−1

therefore becomes the MAP problem

argmaxxt

p(xt|M1:t,x1:t−1). (6)

Here we assume M1:t′ to be independent of xt given x1:t′

whenever t′ < t. From the Bayes’ theorem, we havep(xt|M1:t,x1:t−1)

=p(M1:t|x1:t)p(xt|x1:t−1)

p(M1:t|x1:t−1)

∝ p(M1:t|x1:t)p(xt|x1:t−1)

= p(Mt|x1:t,M1:t−1)p(M1:t−1|x1:t)p(xt|x1:t−1)

∝ p(Mt|x1:t,M1:t−1)︸︷︷︸Data fidelity

p(xt|x1:t−1)︸︷︷︸Spatio-temporal prior

. (7)

Here p(Mt|x1:t,M1:t−1) denotes the data likelihood de-fined by intrinsic similarities. p(xt|x1:t−1) denotes thespatio-temporal priors that ensure the smoothness of the re-sult. In the following we discuss each of the two terms indetail.

3.1. Data fidelity terms

The data fidelity terms consider the fidelity of the 3Ddata M1:t given the tracking results x1:t. For dense track-ing, we assume the tracking points x0 are dense enough tocapture the detailed geometry of the surfaces and thus thetrajectory x1:t′ and the data M1:t′ are independent of thetrajectory xt given xt−1 and Mt−1 when t′ < t − 1. Wehave

p(Mt|x1:t,M1:t−1)

=p(Mt,x1:t−2,M1:t−2|xt,xt−1,Mt−1)

p(x1:t−2,M1:t−2|xt,xt−1,Mt−1)

∝ p(Mt,x1:t−2,M1:t−2|xt,xt−1,Mt−1)

= p(Mt|xt,xt−1,Mt−1)×p(x1:t−2,M1:t−2|xt,xt−1,Mt−1,Mt).

Here p(M t|xt,xt−1,Mt−1) denotes the reg-istration between successive frames andp(x1:t−2,M1:t−2|xt,xt−1,Mt−1,Mt) denotes theconsistency between the current frame and the previouslytracked frames, which avoids loss of tracking caused byaccumulation of local registration errors.

3.1.1 Geometry and texture similarities

Intrinsic comparison takes into account both geometry andtexture (if available) consistencies between frames. We de-fine the inter-frame data similarity term as follows:

p(Mt|xt,xt−1,Mt−1)

∝n∏

i=1

N (dUNIMt,Mt−1(x

ti, x

t−1i )|0, σdata). (8)

3.1.2 Error drift term

The intrinsic distance dUNIMt,Mt′ (x

ti, x

t′

i ) allows us to con-sider the likelihood of a correspondence between time tand t′. We assume that xt is consistent with the previouslytracked frames if it “agrees” with the majority of them. Foreach point, we select the median distance of the set of thepoint to its previous matches. This is similar to the use ofmedian filter, and formulated as

p(x1:t−2,M1:t−2|xt,xt−1,Mt−1,Mt) (9)

∝n∏

i=1

mediant′∈{1,...,t−2}N (dUNIMt,Mt′ (x

ti, x

t′

i )|0, σdrift)

However, computing consistency between the current frameand all the previous frames is very costly. An approximatesampling scheme is to consider a subset of {1, . . . , t − 2},namely, I. Hence Eq. 9 can be approximated as follows:

n∏i=1

mediant′∈IN (dUNIMt,Mt′ (x

ti, x

t′

i )|0, σdrift). (10)

3.2. Spatiotemporal priors

The probability p(xt|x1:t−1) represents the prior knowl-edge of the trajectory x1:t, and also regularizes the trackingresult. We decompose the probability into two terms:

p(xt|x1:t−1) =p(xt|xt−1)p(x1:t−2|xt,xt−1)

p(x1:t−2|xt−1)

∝ p(xt|xt−1)p(x1:t−2|xt,xt−1). (11)

Here p(xt|xt−1) denotes the spatial deformation consis-tency between consecutive frames and p(x1:t−2|xt,xt−1)denotes the dynamic motion consistency.

3.2.1 Intrinsic spatial deformation prior

The spatial prior p(xt|xt−1) takes into account the plau-sible deformation of the surface between frames. In thedefinition of the distance in Eq. 5, only the matching costsfor each correspondence xt−1

i 7→ xti are considered. There

is no constraint on the consistency between two neighbor-ing correspondences xt−1

i 7→ xti and xt−1

j 7→ xtj where

1228

(i, j) ∈ E . Since each of the distance functions (Eq. 5)takes into account the locally best Mobius transformationmapping a neighborhood of xt−1

i to a neighborhood of xti,

it is reasonable to assume that such a locally optimal trans-formation also maps xt−1

i ’s neighbor xt−1j to a position

nearby xtj . Let Moopt

p,q denote the optimal Mobius trans-formation that results in the distance defined in Eq. 5. Toconstrain the deformation consistency between neighboringpoints (i, j) ∈ E , we define the following distance in theuniformization domain:

dti→j = |(Moopt

xt−1i ,xt

i

(ϕt−1(xt−1j )))− ϕt(xt

j)|. (12)

Here ϕt(·) denotes the uniformization of the data Mt. Thisdistance measures how close ϕt(xt

j) is to the point z, wherez is obtained by transforming ϕt−1(xt−1

j ) using the optimaltransformation Moopt

xt−1i ,xt

i

(·) that maps xt−1i to xt

i. Whenthis distance is small, it means such optimal transforma-tion produces consistent mappings with the neighbors. For-mally, we define,

p(xt|xt−1) ∝∏

(i,j)∈E

N ((dti→j + dtj→i)/2|0, σspa). (13)

3.2.2 Dynamic motion prior

The dynamic prior imposes temporal consistency of eachvertex i by assuming the curve traced by each vertex i to besmooth, i.e., we assume the acceleration to be small. If wedefine the angle between the vectors xt

i − xt−1i and xt−1

i −xt−2i as Angt

i, the dynamic prior can be defined as

p(x1:t−2|xt,xt−1) ∝n∏

i=1

N (Angt−1i |Angt

i, σdyn). (14)

In practice, the smoothness assumption is only applicablewhen the motion between two frames is not too large.

4. Implementation detailsThe above mentioned objective function (Eq. 7) does not

have a closed form in continuous space, making global op-timization difficult. Instead, in this paper, we employ thediscrete MRF framework [9] to take into account the ob-jectives discussed in the previous section. For each frameat time t, we select L matching candidates for each pointxti, i = 1, . . . , n. As a result, by applying the − log operator

to the probability of Eq. 7, our tracking problem is equiva-lent to solving for the optimal configuration xt ∈ Ln:

argminxt

∑i∈V

θi(xti) +

∑(i,j)∈E

θij(xti, x

tj), (15)

where the energy functions θi(xti), θij(x

ti, x

tj) are de-

fined according to the probabilistic framework discussed in

(a) Base mesh (b) Dense mesh

Figure 1. Mesh template. The base mesh (a) is resampled to pro-duce the dense mesh (b).

Sec. 3. In this section, we give the detailed definition of theenergy.

4.1. Initialization

In the first frame M0, an initial mesh template x0 ={x0

1, . . . , x0n} is constructed. The template can be con-

structed either automatically [28] or manually [3]. In ourexperiments, we constructed the template using the re-topology tool provided in MeshLab1. In this way, we spec-ify less than 100 base vertices and obtain a dense mesh witharound 1000 vertices (Fig. 1).

4.2. Candidate selection

To decide the matching candidates for each node xti in

the next frame t + 1, we consider the embedding spaceneighborhood, the view space neighborhood and the in-trinsic space neighborhood. (1) For the embedding spaceneighborhood, we uniformly sample L1 points on Mt+1 inthe neighborhood of each xt

i within radius R. (2) For theview space neighborhood, we project the point xt

i to an im-age plane where xt

i is visible. Then we back-project L2

neighboring points of the the projection of xti on the image

plane back to the surface Mt+1. For 3D data obtained fromthe depth map of 2D images, the selection of the neighborson the image plane is done in a hierarchical manner in orderto take into account large deformations as in [11]. (3) Forthe intrinsic space neighborhood, we randomly select L3

triplets of initial correspondences (by closest point registra-tion or results from previous iteration) among the vertices ofthe base mesh and obtain a correspondence for each pointfrom every triplets of correspondences [31]. Such intrinsicspace sampling can achieve sub-sample accuracy [31]. Inour experiments, we consider L = 64 candidates for eachpoint, L1 = 24 from embedding space sampling, L2 = 25from view space sampling and L3 = 15 from intrinsic spacesampling. R is set to be 0.1 of the diameter of the mesh M0.

1http://meshlab.sourceforge.net/

1229

4.3. Computation of intrinsic distance

The computation of the intrinsic distance dUNIM,N (p, q) for

any correspondence p 7→ q involves selection of the neigh-borhood N(p), sampling of points p1, . . . , pr ∈ ∂N(p)and the numerical approximation of the integration in Eq. 5from the N(p) to all its possible mappings on the targetsurface in the uniformization domain. We select the bound-ary points p1, . . . , pr among the vertices of the base mesh(Fig. 1(a)). If p is not on the boundary of the template andpapbpc is the face of the base mesh that covers p, we chooseN(p) to be the largest triangle among the three triangles△ppapb, △ppbpc and △ppcpa and we evaluate the integra-tion of Eq. 5 only in the region covered by the selected tri-angle. If p is on the boundary, we choose N(p) to be the twolargest triangles among its neighboring triangles mentionedabove. In our implementation, we select 81 sampling pointsin each neighborhood. The distance function can be effi-ciently computed in parallel because each of the functionsdUNIM,N (p, ·) is independent of the others.

4.4. Occlusion handling

Occlusions are handled by introducing an additional la-bel {Occ} for each vertex xt

i, with a cost function:

docc(xti, x

t′

i ) =

0 if dUNI

Mt,Mt′ (xti, x

t′

i ) < δ, xti = Occ

E1 if dUNIMt,Mt′ (x

ti, x

t′

i ) > δ, xti = Occ

E2 otherwise

.

Here t′ is the last frame before time t that the point i is vis-ible and we set E1 = 1, E2 = 10 and δ = 0.05. Intuitively,when the cost of matching a correspondence xt′

i 7→ xti is

higher than a threshold δ, it is likely that i is occluded atframe t, where a constant penalty E1 is imposed. When oc-clusion occurs at a point, we set its default position as theposition computed by the ICP algorithm [2] registering theun-occluded points.

4.5. Composite MRFs and optimization

Combining the energy functions defined above, we opti-mize for the MAP problem of Eq. 6 under the discrete MRFoptimization framework. The singleton terms include thedata-fidelity, dynamic motion consistency, error drift andocclusion handling priors, i.e., for all i ∈ V , we define

θi(xti) =

dUNIMt,Mt−1 (x

ti,x

t−1i )2

σdata+

(Angt−1i −Angti)

2

σdyn

+(mediant′∈T dUNI

Mt,Mt′ (xti,x

t′i ))2

σdrift

if dUNIMt,Mt′ (x

ti, x

t′

i ) < δ, xti = Occ

E1 if dUNIMt,Mt′ (x

ti, x

t′

i ) > δ, xti = Occ

E2 otherwise

The pairwise terms include the spatial deformation prior inSec. 3.2.1 and the smoothness of the occluded part, i.e.,

θij(xti, x

tj) =

(dt

i→j+dtj→i)

2

4σspaif xt

i, xtj = Occ

0 if xti = xt

j = Occ

E3 otherwise

(16)

In our experiments, E3 = 1. Intuitively, such an energyencourages the smoothness of the occluded part. We em-ploy the TRW-S algorithm [13] for the optimization. Theenergy is optimized iteratively until convergence or up to amaximum allowed number of iterations. For the drift han-dling term of Eq. 10, we randomly select 5 frames fromprevious tracking results {1, . . . , t− 1}. The weights of theenergy are selected as σdata = 1, σdyn = 500, σdrift = 2,σspa = 20.

5. ResultsData: We test our tracking system on a dynamic face

data set captured by the 3D scanning system described in[27]. The data set consists of four actors with 24 differentfacial expressions, including coy flirtation, devious smirk,soft affection and fake smile, etc. Each of the expressions iscaptured with a frame rate of 24fps for around 10−20 sec-onds. The number of vertices for each frame is 79, 000 onaverage with only gray-scale textural information (the graylevel is normalized as in [0, 1]). Some of the captured raw3D data frames suffer from scale ambiguities. In such cases,we remove the dynamic prior defined in Sec. 3.2.2. In ourexperiments, we use texture as the feature for computing theintrinsic distance (Eq. 5) as it is less sensitive to anisomet-ric deformation. Fig. 2 shows four different sequences fromfour different actors2. Fig. 3 shows a tracking result in avery challenging situation (largely inconsistent boundaries,occlusions and anisometric deformations between frames).For this example, the average texture difference per pointbetween every frame and the first frame is 0.0235. The max-imal average area ratio change (to the first frame) is 1.26and the maximal percentage of surface occlusion occurringbetween two frames is around 30%.

Analysis of intrinsic distance function: A key factorto achieve high accuracy of surface tracking is the distancefunction dUNI

M,N (·, ·) defined in Sec. 2. To see the sensitivityof the distance in distinguishing subtle differences in cor-respondences for a given point p, we sample 7 × 7 closestneighboring points in the next frame as matching candidatesin the embedding space (Sec. 4.2). Fig. 4(a) shows the eval-uation of the 49 values of the function dUNI

M,N (p, ·) for dif-ferent points on the surface. We compare the distance withsimple per-point texture difference.

Furthermore, we compare our cost function with the re-sult obtained by the optical flow algorithm in [17] basedon [6]. We project the left part of the face to a 640 × 480perspective view selected to maximize visibility and apply

2See the complete sequences: http://www.cs.sunysb.edu/∼ial/.

1230

Figure 2. Tracking results selected from our data set.

Figure 3. A challenging result with both anisometric deformationand inconsistent boundaries.

02

46

8

02

46

80

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Co

st f

un

ctio

n v

alu

e\

Pe

r-p

oin

t tx

ture

dis

tan

ce

02

46

8

0

5

100

0.05

0.1

0.15

0.2

0.25

02

46

8

0

5

10

0

0.02

0.04

0.06

0.08

0.1

Co

st f

un

ctio

n v

alu

e\

Pe

r-p

oin

t tx

ture

dis

tan

ce

0

5

10

02

46

80

0.005

0.01

0.015

0.02

0.025

0.03

0.035

Per-point texture distanceCo

st f

un

ctio

n v

alu

e\

Pe

r-p

oin

t tx

ture

dis

tan

ce

Co

st f

un

ctio

n v

alu

e\

Pe

r-p

oin

t tx

ture

dis

tan

ce

0 50 100 150 200 250 300 350 4000

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Frame Number

Av

era

ge

Te

xtu

re D

i!e

ren

ce

Result by optical "ow

Result by our cost function

(a) dUNIM,N (p, ·) (b)

Figure 4. (a) Our cost function of Eq. 5 vs. per-point texture dis-tance on distinguishing subtle differences in correspondences and(b) comparison with optical flow method for inter-frame registra-tion (details of comparison are described in the text).

the optical flow algorithm to establish correspondences be-

tween two frames. For the template points that belong tothe projected part, we compute the cost function and choosethe correspondence with lowest cost as the matching result.We linearly interpolate the correspondence of other pointswithin the template that are visible. We compare the av-erage texture per-point differences based on the correspon-dences obtained by optical flow and by our method (Fig. 4(b) shows the comparison for one sequence). It can be seenthat when the deformation between two frames is large, theoptical flow degrades more significantly than our method.

Error and performance analysis: Fig. 5(a) shows theerror evaluation based on the complete 24 tracking results.The error measures the average per-point texture differencecompared to the first frame. Fig. 5(b) shows the amount ofarea ratio change (anisometry) between the first frame thethe current frame for 23, 000 randomly selected trianglesamong the tracking results.

We compare the influence of the regularization terms inthe optimization of Eq. 15. The error is evaluated based onthe average texture difference between every frame and thefirst frame. Fig. 6 (a) shows the comparisons for sequenceA coyfirtion. Even by considering the data term only, ourmethod is more accurate for most frames than a previousintrinsic surface tracking method based on Harmonic maps[26]. Fig 6 (b) is the comparison for 5 more sequences onthe average per-point texture difference.

Computation time: Our algorithm is implemented onan Intelr Core(TM) 2 Duo 3.16G PC with 4G RAM anda NVIDIAr Geforce 9800GTX+ graphics card with 128CUDA cores. The preprocessing (mesh loading, nearestneighbor search data structure construction, candidate se-lection) takes 2–3s. The computation of the mid-edge uni-formization [19] for each mesh takes less than 1s using GPUimplementation. With the hardware acceleration describedin Sec. 4.3, the computation of the L = 64 cost functionsfrom Eq. 5 for one tracking point takes only 3ms on aver-age. The MRF optimization using the TRW-S algorithm[13] takes around 1–3s for the 1000 template points de-scribed in Sec. 4.1 with 65 (64 for matching candidate and1 for occlusion) labels per point. Therefore tracking oneframe with 5 look-back frames takes only 18 – 25s for eachiteration. In our experiments, we observe that the algorithmoften converges within 5 iterations.

6. ConclusionWe proposed a new cost function that compares two

neighborhoods of a correspondence by searching among allthe possible mappings in the uniformization domain. Thiscost function is then combined with regularization termsthat take into account spatio-temporal consistency, errordrift and occlusion problems, into a unified 3D trackingframework. By employing existing MRF optimization tech-niques and hardware accelerations, our algorithm becomes

1231

0 50 100 150 2000

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Frame Number

Av

era

ge

Te

xtu

re D

i!e

ren

ce

0.5 1 1.5 2 2.5 30

200

400

600

800

1000

1200

1400

1600

1800

Area Ratio

Fre

qu

en

cy(a) (b)

Figure 5. (a) shows the evaluation of the average per-point texturedifference between every frame and the first frame on the whole 24data set. (b) is the frequency of the area ratio of randomly selectedtriangles between current frame and the first frame, which showsthat a significant number of them deforms anisometrically.

0 50 100 150 200 2500

0.02

0.04

0.06

0.08

0.1

0.12

Frame Number

Av

era

ge

Te

xtu

re D

i!e

ren

ce

Harmonic map

Composite result

With data term only

Without drift handling

HM

D.Term

W.o. Drift

Our Res.

B_F D_C B_C A_C C_A

0.0810

0.0573

0.0364

0.0317

0.0987 0.0796 0.0983 0.0704

DataMethod

0.0423

0.0315

0.0262 0.0150

0.0431

0.0505

0.0299 0.0321

0.0428 0.0396

0.0452 0.0564

(a) Per-frame error for one sequence (b) Average error for 5 sequences

Figure 6. Results show influence of the regularization term used inthe optimization of Eq. 15 and comparison with previous intrinsictracking method used in Harmonic maps [26].

practical for applications where high accuracy is essential.In the near future, we would like to apply our algorithm totrack additional dynamic 3D databases and explore applica-tions such as facial expression analysis/transfer, etc.

AcknowledgementsWe thank Przemyslaw Szeptycki for help with data cap-

ture and many useful discussions. This work was partiallysupported by grants from: NSF: CCF-TF 0830550, CCF-0916235, IIS: III 0916286, ONR N000140910228, NSF:CNS-0959979, CNS-0627645, NIH 1R01EB0075300A1.

References[1] Microsoft c⃝ Kinect, 2010.[2] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes.

TPAMI, 14(2):239–256, 1992.[3] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution

passive facial performance capture. ACM Trans. Graph., 29(4):1–10, 2010.

[4] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Generalizedmultidimensional scaling: a framework for isometry-invariant partialsurface matching. Proc. National Academy of Sciences, 103:1168–1172, 2006.

[5] M. M. Bronstein and A. M. Bronstein. Shape recognition with spec-tral distances. TPAMI, 33:1065–1071, 2011.

[6] A. Bruhn, J. Weickert, and C. Schnorr. Lucas/kanade meetshorn/schunck: Combining local and global optic flow methods.IJCV, 61:211–231, 2005.

[7] R. J. Campbell and P. J. Flynn. A survey of free-form object repre-sentation and recognition techniques. Comput. Vis. Image Underst.,81(2), 2001.

[8] H. M. Farkas and I. Kra. Riemann Surfaces. Springer, 2004.[9] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions

and the bayesian restoration of images. TPAMI, 6(6):721–741, 1984.[10] B. Glocker, H. Heibel, N. Navab, P. Kohli, and C. Rother. Triangle-

flow: Optical flow with triangulation-based higher-order likelihoods.In ECCV, 2010.

[11] B. Glocker, N. Komodakis, G. Tziritas, N. Navab, and N. Paragios.Dense image registration through MRFs and efficient linear program-ming. In Medical Image Analysis, volume 12, pages 731–741, 2008.

[12] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, andR. Cipolla. Non-rigid photometric stereo with colored lights. InICCV, 2007.

[13] V. Kolmogorov. Convergent tree-reweighted message passing for en-ergy minimization. TPAMI, 28(10):1568–1583, 2006.

[14] N. Komodakis, G. Tziritas, and N. Paragios. Performance vs compu-tational efficiency for optimizing single and dynamic mrfs: Settingthe state of the art with primal-dual strategies. CVIU, 112(1):14–29,2008.

[15] Y. Lipman and I. Daubechies. Surface comparison with mass trans-portation. Technical report, Princeton University, 2010.

[16] Y. Lipman and T. Funkhouser. Mobius voting for surface correspon-dence. ACM Trans. Graph., 28(3):1–12, 2009.

[17] C. Liu. Beyond Pixels: Exploring New Representations and Appli-cations for Motion Analysis. PhD thesis, MIT, 2009.

[18] T. Needham. Visual Complex Analysis. Oxford University Press,1999.

[19] U. Pinkall and K. Polthier. Computing discrete minimal surfaces andtheir conjugates. Experimental Mathematics, 2(1):15–36, 1993.

[20] A. Shaji, A. Varol, L. Torresani, and P. Fua. Simultaneous pointmatching and 3d deformable surface reconstruction. In CVPR, 2010.

[21] A. Shekhovtsov, I. Kovtun, and V. Hlavac. Efficient MRF deforma-tion model for non-rigid image matching. CVPR, 2007.

[22] E. B. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky.Visual hand tracking using nonparametric belief propagation. InCVPRW ’04, volume 12, page 189, 2004.

[23] D. Vlasic, M. Brand, H. Pfister, and J. Popovic. Face transfer withmultilinear models. ACM Trans. Graph., 24:426–433, 2005.

[24] C. Wang, M. M. Bronstein, and N. Paragios. Discrete minimum dis-tortion correspondence problems for non-rigid shape matching. InResearch Report 7333, INRIA, 2010.

[25] S. Wang, Y. Wang, M. Jin, X. D. Gu, and D. Samaras. Conformalgeometry and its applications on 3D shape matching, recognition,and stitching. TPAMI, 29(7):1209–1220, 2007.

[26] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, andP. Huang. High resolution tracking of non-rigid 3D motion of denselysampled data using harmonic maps. In ICCV, 2005.

[27] Y. Wang, X. Huang, C. Lee, S. Zhang, Z. Li, D. Samaras, D. Metaxas,A. Elgammal, and P. Huang. High resolution acquisition, learningand transfer of dynamic 3-D facial expressions. In Computer Graph-ics Forum, pages 677–686, 2004.

[28] T. Weise, H. Li, L. Van Gool, and M. Pauly. Face/off: live facialpuppetry. In SCA ’09, pages 7–16, 2009.

[29] W. Zeng, D. Samaras, and X. D. Gu. Ricci flow for 3D shape analy-sis. TPAMI, 32:662–677, 2010.

[30] W. Zeng, Y. Zeng, Y. Wang, X. Yin, X. Gu, and D. Samaras. 3Dnon-rigid surface matching and registration based on holomorphicdifferentials. In ECCV, 2008.

[31] Y. Zeng, C. Wang, Y. Wang, X. Gu, D. Samaras, and N. Paragios.Dense non-rigid surface registration using high-order graph match-ing. In CVPR, 2010.

[32] H. Zhang, A. Sheffer, D. Cohen-Or, Q. Zhou, O. van Kaick, andA. Tagliasacchi. Deformation-driven shape correspondence. Com-puter Graphics Forum (SGP), 27(5):1431–1439, 2008.

[33] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz. Spacetime faces:high resolution capture for modeling and animation. ACM Trans.Graph., 23(3):548–558, 2004.

1232

Intrinsic Dense 3D Surface Tracking - Max Planck...

Documents

Transcript of Intrinsic Dense 3D Surface Tracking - Max Planck...