Segregation of moving objects using elastic matching


Vishal Jain *, Benjamin B. Kimia, Joseph L. Mundy

Division of Engineering, Brown University, Providence, RI, USA

Received 8 May 2005; accepted 6 November 2006. Available online 6 April 2007.

Communicated by James Maclean

Computer Vision and Image Understanding 108 (2007) 230–242. doi:10.1016/j.cviu.2006.11.024

* Corresponding author. E-mail addresses: [email protected] (V. Jain), [email protected] (B.B. Kimia), [email protected] (J.L. Mundy).

Abstract

We present a method for figure-ground segregation of moving objects from monocular video sequences. The approach is based on tracking extracted contour fragments, in contrast to traditional approaches which rely on feature points, regions, and unorganized edge elements. Specifically, a notion of similarity between pairs of curve fragments appearing in two adjacent frames is developed and used to find the curve correspondence. This similarity metric is elastic in nature and in addition takes into account both a novel notion of transitions in curve fragments across video frames and an epipolar constraint. This yields a performance rate of 85% correct correspondence on a manually labeled set of frame pairs. Color/intensity of the regions on either side of the curve is also used to reduce the ambiguity and improve efficiency of curve correspondence. The retrieved curve correspondence is then used to group curves in each frame into clusters based on the pairwise similarity of how they transform from one frame to the next. Results on video sequences of moving vehicles show that using curve fragments for tracking produces a richer segregation of figure from ground than current region or feature-based methods.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Curves; Elastic curve matching; IHS color space; Tracking; Vehicle; Similarity transform; Fragmentation

1. Introduction

The key goal of this paper is to investigate tracking continuous edge curves in video sequences as a basis for figure-ground segmentation. The approach to be described here exploits the spatial continuity of contour fragments (connected edgel chains) in order to achieve more robust tracking and to obtain a more complete representation of the segmented moving object than that produced by traditional optical-flow or point-based methods. The algorithm reported here is based on finding minimum-energy curve deformation transformations from one frame to the next. Figure-ground classes are defined in terms of groups of curves transforming in a similar fashion, as measured by curve matching distances.


The problem of object tracking in video has been extensively studied. The approaches can be loosely organized by the primary spatial dimension of the tracked feature, i.e. points [21,13,5,27], curves [17,15,9,16,6,10] or regions [19,12,23,4,7]. Point-based features are typically used in the context of 3D reconstruction, where points are matched across frames on the basis of Euclidean distance and image correlation in a local neighborhood around matching pairs. The epipolar constraint is used to eliminate erroneous matches using robust fitting algorithms such as RANSAC [14]. More recently, there has been considerable interest in regions, where affine invariance derived from intensity operators [20] is used to define salient patches that can be recovered from multiple views of the same surface feature. These affine patches can be used for tracking as well as recognition.

Much of the prior literature on curve-based tracking exploits various types of deformable contours, such as polygons or cubic splines, which are tracked by optimizing image-based energy cost or likelihood functions.


In most cases, these curves are initialized by hand in the first video frame. An early application of directly tracking edgels was demonstrated by Huttenlocher [15], who formed object templates from edge points and tracked them by minimizing the Hausdorff distance between point sets. However, this work did not exploit the spatial continuity of the edge curves, which were treated as discrete point sets.

There seems to be little work on using connected edgel chains directly as the tracked representation. The edgel chains can be obtained by linking edges obtained from an edge detector as in [24], or from iso-intensity contours [22] with non-zero gradient along the contour, which are claimed to be robust to illumination changes and do not require any fixed threshold. The closest work is that of Folta et al. [9], who use edge curve matching to form the outlines of moving objects, the key objective of this paper. Our algorithmic approach is closest to that of Freedman [10], who considered curve tracking as a problem of optimum geometric alignment of detected intensity edges. A key difference from this paper is that Freedman assumes a model curve, which is supplied by hand initialization or by learning from a hand-picked set of example curves.

No such assumption is made in the current algorithm, where curves, as segmented, are tracked across frames. In this regard, the tracking process is similar to that for points, e.g., Harris corners, but in addition we exploit the order and continuity provided by segmented edgel chains. A key reason that edgel curves have not received much attention is that it is difficult to define local correspondences between smooth curve segments. Here the correspondence problem is solved by exploiting both epipolar constraints from the object motion as well as optimizing a global deformation energy. The epipolar constraint provides dense local matching constraints within a global curve deformation energy that incorporates shape constraints of the entire curve.

The problem of tracking extracted edge contours in video appears straightforward, but significant tracking complexity arises from the presence of smooth and discontinuous curve transformations and structural changes. An extracted curve can change smoothly over a number of frames but then undergo a singular visual event such as occlusion or a momentary specularity, or fall below the detection threshold. Under these discontinuous viewing conditions a curve can split, or merge with another curve, or simply disappear altogether. In practice, we expect 50–100 curve fragments from a typical frame, and only about half depict smooth changes. A curve in the remaining set undergoes a transition as fully classified in Fig. 1.

To understand the nature of deformations and topological evolution of image curves it is necessary to consider the underlying mechanisms that give rise to them. Image curves arise from discontinuities in the projection of 3D structure such as occluding contours (depth discontinuity), surface reflectance discontinuities, ridges on the object, illumination discontinuities (e.g., highlights, shadows), and other effects.

Movement of these curve fragments in a video sequence can in turn arise from changes in the viewpoint (e.g., object or camera), illumination changes (e.g., movement of a light source) or inter-reflections on the moving object surface. Most of these effects lead to complex motions, which can be modeled only in the simplest cases. One tractable case is when the curves are due to fixed reflectance discontinuities on the moving object. In this case various approximations to perspective projection, e.g. planar affine motion, can provide a reasonable prediction of curve shape over time. However, for inter-reflections and occluding contours, there is no simple model to predict dynamic curve geometry. As a final level of difficulty, outdoor scenes present a large intensity dynamic range which leads to fluctuating edge recovery from frame to frame. It is impossible to find edge detection parameters that will recover all the relevant edge segments as a moving object moves through shadows or is subject to specular reflection.

Our approach relies on the following key observation: if each extracted curve fragment is sufficiently distinct from other extracted curves in the same frame, and if the inter-frame deformation for each curve fragment is small enough as compared to intra-frame curve differences, then the similarity between curve pairs in the two frames provides a basis for the recovery of curve correspondence by solving an assignment problem. However, the design of an effective similarity metric to capture both smooth and abrupt changes presents a challenge. We use an elastic matching metric based on the notion of an alignment curve which is symmetric in the order of the two curves matched [25], and modify it in three ways to address this issue. First, since only about half the curve fragments change smoothly, we modify the elastic metric to implicitly include a notion of curve transitions and use it to rank-order correspondence candidates. Second, transitions are explicitly handled so that broken curve fragments can be joined and the curve similarity matrix is suitably adjusted. Third, ambiguities arising from cases which violate our main assumption, i.e., when the inter-frame curve distance is less than the intra-frame distance to their corresponding curves, are handled by bringing to bear a vanishing-point constraint to spatially constrain the correspondence. The resulting similarity matrix is then converted into an assignment via a greedy best-first solution. We validated the resulting correspondence on a set of four labeled frame pairs and noted improvements from an average correspondence rate of 48% for classical elastic matching, to 56%, 68%, and 85%, respectively and cumulatively, after augmenting it with the three steps described above.

The recovered curve correspondence is the basis of figure-ground segregation. The main assumption is that curves belonging to the same object transform from one frame to another more similarly to each other than to curves from other objects or background. Specifically, from the inter-frame correspondence between each pair of curves a similarity transformation is recovered, and a notion of transform similarity is defined between two pairs of curves in a common frame based on how they transform in the next frame.

Fig. 1. Typical changes in curve fragments extracted from two frames of a video sequence using the topologically driven edge operator [24] are illustrated for two frames of the UHAUL sequence. Typically, about half of the curve fragments change smoothly as illustrated in (a). However, the remaining half can be expected to undergo abrupt changes as classified into six transitions: (b) a curve fragment can be split into two, or two can be joined into one. (c) The formation or disappearance of a T-junction. (d) The complete disappearance or appearance of a curve. (e) Compound fragmentation when two curve fragments join and split differently, a combination of two transitions of type "b". (f) Compound T-junction, a combination of transitions "c" and "b". (g) Compound fragmentation of closed curves, a combination of two transitions of type "b".

Fig. 2. The contour fragments of a moving vehicle are segregated from the background using only the two adjacent frames shown.


This transform-induced similarity matrix is then converted into clusters which define objects and background in an image. The results on video frames of moving vehicles are very encouraging, as previewed in Fig. 2, and are illustrated on a number of video sequences of moving vehicles later in this paper.


2. Curve tracking via transition-based elastic matching

In this section we describe a similarity-based method for finding the correspondence between contour fragments in two video frames. Specifically, we first describe how contour fragments are extracted from each frame, then describe how a correspondence is obtained from a pairwise elastic similarity of these curve fragments, and finally describe three modifications to induce the notion of transitions and the vanishing-point constraint.

2.1. Extracting contours

The contour detector used in these experiments is based on a modification of the Canny algorithm [2] as proposed in [24]. As is well known, the performance of the original Canny step edge detector is poor near corners and junctions. The algorithm developed in [24] focuses on extending the edgel chains at corners and junctions so that better topological connections are achieved by relaxing the constraints of the step-edge model and searching for paths with the greatest intensity variation. The edges are located to sub-pixel accuracy using weighted parabolic interpolation with respect to the edge direction. Examples of these contour fragments are shown in Fig. 1.

2.2. From a similarity metric to curve correspondence

A similarity metric $S_{nm}$ is computed between curve fragment $C_n$ in the first frame and curve fragment $C_m$ in the second frame, as described further below. The resulting similarity matrix $S_{nm}$ is converted into an assignment in a greedy best-first fashion: the highest-similarity pair in the matrix is made into a correspondence, and the remaining items in the corresponding row and column are removed to retain a one-to-one mapping. Furthermore, a second similarity metric is used as an additional check to veto those curve pairs which are not sufficiently similar. Specifically, this second metric is based on the Hausdorff distance after the curves have been aligned by the optimal alignment-based similarity transformation between the curves.

Fig. 3. From [25]. The alignment curve (left) represents a correspondence between two curves (right). The notion of an alignment curve allows for predicting correspondences mapping an entire interval to a point; this aspect of the correspondence is crucial here as it does occur in the context of transitions. The optimal alignment curve $\alpha$ is efficiently found by dynamic programming [25].

The process of selecting the most likely corresponding curve pairs eliminates the corresponding rows and columns. This continues until either no rows or no columns remain. This greedy approach can potentially be further improved by achieving a globally optimal assignment, e.g., by using graduated assignment [11,26], but this is not the focus of this paper.
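For concreteness, the following is a minimal sketch of this greedy best-first conversion of a similarity matrix into a one-to-one correspondence. The function name and the optional veto matrix (standing in for the secondary Hausdorff check) are illustrative, not the authors' implementation.

```python
import numpy as np

def greedy_assignment(S, veto=None):
    """Greedy best-first conversion of a similarity matrix into a
    one-to-one curve correspondence.

    S[n, m] is the similarity between curve n in frame 1 and curve m
    in frame 2 (higher means more similar); `veto` is an optional
    boolean matrix of pairs rejected by a secondary check.
    """
    S = S.astype(float).copy()
    if veto is not None:
        S[veto] = -np.inf          # vetoed pairs can never be selected
    matches = []
    while np.isfinite(S).any():
        n, m = np.unravel_index(np.argmax(S), S.shape)
        matches.append((n, m))
        S[n, :] = -np.inf          # remove the row ...
        S[:, m] = -np.inf          # ... and the column, keeping a 1-1 map
    return matches
```

A globally optimal assignment (e.g., the Hungarian algorithm or graduated assignment [11,26]) could replace this loop at higher cost, as the text notes.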

2.3. Transition-sensitive elastic matching

We begin with an elastic curve-matching algorithm [25,28] which minimizes an elastic energy functional over all possible alignments between two curves $C$ and $\bar{C}$, by using an alignment curve $\alpha$ mediating between the two curves (Fig. 3),

$$\alpha(\xi) = (h(\xi), \bar{h}(\xi)), \quad \xi \in [0, \tilde{L}], \qquad \alpha(0) = (0, 0), \quad \alpha(\tilde{L}) = (L, \bar{L}), \tag{1}$$

where $\xi$ is the arc-length along the alignment curve, $h$ and $\bar{h}$ represent arc-lengths on $C$ and $\bar{C}$, respectively, $L$ and $\bar{L}$ represent the lengths of $C$ and $\bar{C}$, respectively, and $\tilde{L}$ is the length of the alignment curve $\alpha$. The alignment curve can be specified by a single function, namely $\psi(\xi)$, $\xi \in [0, \tilde{L}]$, where $\psi$ denotes the angle between the tangent to the alignment curve and the x-axis. The arc-lengths of $C$ and $\bar{C}$ can then be obtained by integration from $\psi$,

$$h(\xi) = \int_0^{\xi} \cos(\psi(\eta))\, d\eta, \qquad \bar{h}(\xi) = \int_0^{\xi} \sin(\psi(\eta))\, d\eta, \quad \xi \in [0, \tilde{L}]. \tag{2}$$

The optimal alignment $\alpha$ between the curves can be found by minimizing an energy functional $\mu$,

$$\mu[\psi] = \int \left[\, |\cos(\psi) - \sin(\psi)| + R_1\, |\kappa(h)\cos(\psi) - \bar{\kappa}(\bar{h})\sin(\psi)| \,\right] d\xi, \tag{3}$$

where $\kappa$ and $\bar{\kappa}$ are the curvatures of the curves. The first term describes differences in the arc-length as defined in Eq. (2) and thus penalizes "stretching".


Fig. 6. Epipolar lines through the sample points of the first curve should pass closely to the corresponding sample points on the second curve, and vice versa. The distances between these corresponding points and the lines through the original sample points indicate deviation from the epipolar constraint and are used as an additional cue towards finding the correct curve correspondence.


The second term is the difference in the angular extent associated with each infinitesimal pair of corresponding curve pieces and thus penalizes "bending"; $R_1$ relates the two terms. The "edit distance" between the curves $C$ and $\bar{C}$ is defined as the cost of the optimal alignment, $d(C, \bar{C}) = \min_{\psi} \mu[\psi]$, which is found by dynamic programming [25].
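To make the dynamic program concrete, the following is a minimal discrete sketch under our own simplifying assumptions (three grid moves per step, finite-difference curvature, and no sub-curve discount at the curve ends); [25] describes the actual algorithm.

```python
import numpy as np

def curvature(pts):
    """Discrete curvature from finite differences of the tangent angle."""
    d = np.diff(pts, axis=0)
    theta = np.arctan2(d[:, 1], d[:, 0])
    dtheta = np.diff(np.unwrap(theta))
    ds = np.linalg.norm(d, axis=1)
    kappa = np.zeros(len(pts))
    kappa[1:-1] = dtheta / (0.5 * (ds[:-1] + ds[1:]))
    return kappa

def elastic_match(c1, c2, R1=10.0):
    """DP alignment of two polyline curves (n x 2 and m x 2 arrays).

    Each DP move pairs an arc-length step on one curve with a step on
    the other (diagonal), or lets one curve pause (horizontal/vertical),
    paying a stretching cost |ds1 - ds2| plus a bending cost
    R1 * |k1*ds1 - k2*ds2|, mirroring Eq. (3).
    """
    k1, k2 = curvature(c1), curvature(c2)
    ds1 = np.linalg.norm(np.diff(c1, axis=0), axis=1)
    ds2 = np.linalg.norm(np.diff(c2, axis=0), axis=1)
    n, m = len(ds1), len(ds2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:            # advance on curve 1 only (curve 2 pauses)
                cost = ds1[i-1] + R1 * abs(k1[i-1] * ds1[i-1])
                D[i, j] = min(D[i, j], D[i-1, j] + cost)
            if j > 0:            # advance on curve 2 only
                cost = ds2[j-1] + R1 * abs(k2[j-1] * ds2[j-1])
                D[i, j] = min(D[i, j], D[i, j-1] + cost)
            if i > 0 and j > 0:  # advance on both curves together
                cost = (abs(ds1[i-1] - ds2[j-1])
                        + R1 * abs(k1[i-1]*ds1[i-1] - k2[j-1]*ds2[j-1]))
                D[i, j] = min(D[i, j], D[i-1, j-1] + cost)
    return D[n, m]   # edit distance d(C1, C2)
```

The sub-curve modification described next would multiply the horizontal/vertical move costs by the factor m near the corners of the DP table.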

When this similarity metric is used to rank-order all curves in a frame with respect to a curve in another frame, the top-ranking curve typically (72% for our database) yields the right correspondence when only gradual changes are involved. The overall curve correspondence performance is defined as

$$\text{curve correspondence}(\%) = \frac{\text{no. of correctly assigned curve pairs}}{\text{total no. of corresponding curve pairs}} \times 100, \tag{4}$$

which is measured after the greedy assignment described earlier on a set of four manually labeled pairs of video frames. The overall curve-correspondence performance is 48%, with errors arising mainly because transitions mislead the similarity metric, especially when a portion of one curve is matched to an entire curve of which it is a fragment, Fig. 4(a and d), e.g., as occurs in the fragmentation transition, Fig. 1(b).

Observe, however, that in such cases there often remains sufficient shape similarity information in the remaining portion to correctly identify it as a sub-curve of the other curve. This requires that the energy cost be appropriately modified to allow for the possibility of such transitions. The removal or addition of a contour segment during the match is represented as a vertical or horizontal segment at the beginning or the end of the alignment curve, since either $h$ is constant and $\bar{h}$ is varying, or vice versa. To avoid discouraging such alignments, the elastic energy on these segments is diminished by a factor m (m = 0.3 for all our experiments). Fig. 4(b and e) illustrates that the alignment is correctly identified from a sub-curve to an entire curve, and this is typically the case when the fragment has sufficient structure on it. The significance of the above modification is twofold.

Fig. 4. (a,d) The elastic curve matching alignments are incorrect in the presence of a transition, but modifying the energy function to allow for such cases corrects the alignment (b,e) and furthermore recovers the fragmented "tail". The alignment when excluding the tail is then used to define a geometric transform (similarity) between the two curves, which in turn is used to find and recover the broken curve fragment (c,f).

First, the elastic energy arising from the new corrected alignment results in a corrected similarity measure which more often points to the correct corresponding curve. Second, it allows for a more precise similarity transformation, since in the corrected alignment the "tails" mapping an entire segment to a point are discarded from the Hausdorff distance computation, which is more sensitive to the presence of "tails". This modification of the energy functional aimed at handling sub-curve matching increases the performance from 48% to 56%.

2.4. Explicit handling of transitions

The above modification works well for sub-curves which have sufficient shape content, but not so well for smaller sub-curves. Thus, in stage two we incorporate transitions, e.g., as a single curve in one frame breaks into two sub-curves in the second frame, into the matching process. Specifically, assuming that the first stage has been successful in identifying the right correspondence between the original curve and one of the resulting sub-curves involved in the transition, the existence of a "tail" in the alignment is flagged (as shown in Fig. 4(b and e)) as an indicator that a transition has likely occurred. Recall that the similarity transformation between the two curve fragments is obtained in the verification step without involving the initial "tail".


Fig. 5. The stage-two similarity metric fails to identify the corresponding pair when multiple similar structures exist (a) or when curves do not depict significant structure (b).

Fig. 7. The curve C(s) is shown in blue while C+(s) and C−(s) are shown in red and green, respectively. The remaining contours encode other information and should be ignored in this context. (a) Non-occluding curve and (b) occluding curve.


We now transform the "tail" accordingly in search of a mate in the other frame. If a third curve (the second sub-curve) exists that is sufficiently similar, the two sub-curves in frame two are merged and identified as a single curve. While this can be done iteratively for multiply fragmented curves, Fig. 1(c, e–g), our current implementation only joins two curve fragments. With this improvement, correspondence performance increases from 56% to about 68%.

Fig. 8. The attributes at point p are computed in $C^+_p$ and $C^-_p$ neighborhoods as indicated.

2.5. Use of epipolar constraint to reduce ambiguity

The above transition-sensitive, shape-based similarity fails in two cases. First, when numerous similar structures are present, as in the seven rectangles in the front grill of the vehicle in Fig. 5(a), the alignment between any pair of curve fragments is excellent and of low energy, so that the intrinsic nature of this shape metric does not


significantly differentiate them to rank-order matches according to extrinsic placement. The second case involves contours which do not have significant "shape content", as in the straight lines on the pavement in Fig. 5(b), so that there are numerous curve fragments with nearly equivalent alignments and energies. In such cases it is useful to introduce an extrinsic measure, namely, the epipolar constraint. We assume that within a limited neighborhood of frames, the motion of the object giving rise to the curve can be approximated as a translation, requiring the alignments between projected curves to pass through an epipole e, Fig. 6. This epipole is either available as a vanishing point of the scene or it can be estimated together with the alignment between a pair of curves. In our experiments the camera was fixed, so the edges on the road are used to find the vanishing point manually.

The epipolar constraint is incorporated in the curve energy using an epipolar term. Consider a point of the alignment curve relating point $P_1$ on the first curve to the point $P_2$ on the second curve, Fig. 6. The distance of the point $P_2$ from the epipolar line passing through $P_1$ is computed along the tangent direction at $P_2$. This tangential distance $d_e$ is estimated from the perpendicular distance $d$ using $d_e = d / \cos\phi$, where $\phi$ is the angle between the tangent at point $P_2$ and the perpendicular to the epipolar line, as shown in Fig. 6. Similarly, a second estimate of $d_e$ is computed with the roles of the points $P_1$ and $P_2$ reversed, and the maximum is used as the value of $d_e$.

The modified energy then takes the form

$$\mu[\psi] = \int \left[\, |\cos(\psi) - \sin(\psi)| + R_1\, |\kappa(h)\cos(\psi) - \bar{\kappa}(\bar{h})\sin(\psi)| + R_2\, \frac{d_e^{\,p}}{1 + d_e^{\,p}} \,\right] d\xi,$$

where p = 10 in our experiments. The performance after this stage increases from 68% to 85%.

2.6. Use of color and intensity to reduce ambiguity

Another powerful cue which can drastically reduce ambiguity and improve efficiency is color/intensity.

Fig. 9. (a) The set of curves in an image and (b) the distance transform indicating the largest possible distance from each curve.

Recall that imposing a maximum-speed constraint reduces the number of potential matching curve pairs in two frames by a factor of 10–20 (for example, 35 curve matches may remain out of 700 curves in a second frame). Despite this drastic reduction in the number of potential matches, the size of the remaining pool is large enough that the likelihood of a pair of non-corresponding curves with similar shapes is not negligible. We now suggest that the use of continuity of color/intensity over time (frames) leads to improved efficiency and reduced ambiguity. The central assumption is that the color/intensity of a narrow region surrounding a curve changes only slightly from one frame to the next, on one side of the curve if it is an occluding contour, and on both sides of the curve otherwise. Fig. 7(a and b) illustrates this point.

In the continuous domain, each point on the curve is attributed with color values in some color space, one for the left and one for the right side of the curve. We chose the HSV color space because of its ability to separate intensity from color in an intuitive fashion. Thus, each point on the curve C(s) is attributed with (H+(s), S+(s), V+(s)) and (H−(s), S−(s), V−(s)), where H is hue (color), S is saturation and V is value (intensity) as defined by the conversion in [8], and where ± denote the intensity to the immediate left and the immediate right of the curve, respectively. In practice, since a large number of edges are not step edges and the transition across an edge may be gradual, we opt to define

$$H^{\pm}(s) = H(C^{\pm}(s)), \quad S^{\pm}(s) = S(C^{\pm}(s)), \quad V^{\pm}(s) = V(C^{\pm}(s)), \tag{5}$$

where C±(s) = C(s) ± dN(s), N(s) is the normal to the curve, and d is a fixed constant; see Fig. 8. Care should be taken with closely spaced curves so that d does not exceed the space between curves. Thus, the signed distance transform is used to detect when d exceeds this limit, in which case no value is assigned to such points. In addition, to reduce the effect of noise we use the HSV values of a smoothed image, obtained by applying a Gaussian kernel to each of the components of HSV space individually (Fig. 9).
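The following sketch illustrates Eq. (5) on a discrete image, with OpenCV's 8-bit HSV conversion standing in for the conversion of [8]; the distance-transform check for closely spaced curves is omitted for brevity, and the function name and defaults are ours.

```python
import cv2
import numpy as np

def curve_side_hsv(image_bgr, curve, d=2.5, sigma=1.0):
    """Sample HSV attributes on both sides of a curve (cf. Eq. (5)):
    each sample point C(s) gets the HSV values at C(s) + d*N(s) and
    C(s) - d*N(s), where N is the unit normal, taken from an image
    whose H, S, V channels are each Gaussian-smoothed.

    `curve` is an (n, 2) array of (x, y) sample points.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    for c in range(3):   # smooth each HSV component individually
        hsv[:, :, c] = cv2.GaussianBlur(hsv[:, :, c], (0, 0), sigma)

    tangents = np.gradient(curve, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True)
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)

    h, w = hsv.shape[:2]
    sides = []
    for sign in (+1, -1):
        pts = np.round(curve + sign * d * normals).astype(int)
        pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)   # x
        pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)   # y
        sides.append(hsv[pts[:, 1], pts[:, 0]])    # (H, S, V) per point
    return sides[0], sides[1]    # "+" side, "-" side
```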


Fig. 10. (a) The HSV color space representation, (b) binning for building a 3D histogram.


These new attributes of a curve can now be used to discard unlikely matches; e.g., a curve separating, say, a red and a green region in one frame cannot be a match for a curve separating blue and brown regions (we can also use these attributes in the alignment process itself, which is work in progress). We propose to summarize the HSV attributes in a coarse fashion using a 3D histogram $S(r, z, \theta)$, where the height z represents "value", the radius r indicates "saturation", and the angle $\theta$ represents "hue", as shown in Fig. 10(a).

Fig. 11. Matched curves in a pair of video frames (top and bottom, on the left) and corresponding zoomed areas on the right. Corresponding curve fragments are shown in the same color.

Fig. 12. Green curves in frame 2 are modeled similarity transformations of the red curves in frame 1, while blue curves are the actual curves in frame 2.

The bins are defined with spacing along radius and value to give equal-volume bins.

The three-dimensional histogram $g(\theta, r, z)$ can be used to coarsely compare two curves. We use the Bhattacharyya distance to compute the dissimilarity measure between two histograms,

$$d_g(g_1, g_2) = -\ln\left( \sum_{r,\theta,z} \sqrt{g_1(r,\theta,z)\, g_2(r,\theta,z)} \right). \tag{6}$$

Since an occluding contour may have only one matching side and since the orientation of each curve is arbitrary, we compare four possibilities and assign the minimum dissimilarity as the distance between the two curves,

$$d_B(C_1, C_2) = \min\left( d_g(g_1^+, g_2^+),\; d_g(g_1^-, g_2^+),\; d_g(g_1^+, g_2^-),\; d_g(g_1^-, g_2^-) \right), \tag{7}$$

where $g_1^+$, $g_2^+$ are the histograms for the right side of the curves $C_1$ and $C_2$, respectively, and $g_1^-$, $g_2^-$ are the histograms for the left side of the curves $C_1$ and $C_2$, respectively. The Bhattacharyya distance between coarse HSV histograms of regions flanking a pair of matching candidate curves can then be used to discard unlikely matches by thresholding the distance, i.e., two curves for which



Table 1
Overall curve performance for four pairs of video frames

Image pair          % correct
SUV 67–68           73
Police-car 16–17    80
Police-car 21–22    85
Minivan 65–66       86


$$d_B(C_1, C_2) < \tau_B \tag{8}$$

for some threshold $\tau_B$ will be considered further for measuring fine-scale shape similarity. We select a very conservative threshold which ensures zero error in our ground-truth database ($\tau_B = 0.45$). The improvement in efficiency even with such a very conservative estimate is drastic: about 43% of the matching candidates are discarded. The reduction in ambiguity when using only the maximum-speed constraint amounts to roughly a 10% improvement in the correct correspondence rate. These measurements were not repeated for the epipolar constraint, as the latter constraint already removes much ambiguity.

Fig. 13. Results of figure-ground segregation based on two adjacent frames for a Van (first frame shown on the left) and an SUV (first frame shown on the right). The top row shows the original image, the second row shows the contours extracted, and the last row shows the segmented object.

However, in cases where this constraint is not expected to hold, the more generic inter-frame continuity plays a significant role.
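The color check of Eqs. (6)–(8) can be sketched as follows. The rectangular HSV bins here are a simplification of the cylindrical equal-volume bins of Fig. 10, the bin counts and value ranges (OpenCV 8-bit HSV) are our assumptions, and τ_B = 0.45 is the threshold reported above.

```python
import numpy as np

def hsv_histogram(hsv_samples, bins=(4, 4, 4)):
    """Coarse 3D histogram of the HSV samples on one side of a curve,
    normalized to sum to 1 (bin counts are illustrative)."""
    H, S, V = hsv_samples[:, 0], hsv_samples[:, 1], hsv_samples[:, 2]
    g, _ = np.histogramdd((H, S, V), bins=bins,
                          range=((0, 180), (0, 256), (0, 256)))
    return g / max(g.sum(), 1)

def bhattacharyya(g1, g2):
    """d_g(g1, g2) = -ln( sum sqrt(g1 * g2) ), as in Eq. (6)."""
    bc = np.sum(np.sqrt(g1 * g2))
    return -np.log(max(bc, 1e-12))

def color_compatible(g1_plus, g1_minus, g2_plus, g2_minus, tau_B=0.45):
    """Four-way minimum of Eq. (7), thresholded as in Eq. (8)."""
    d_B = min(bhattacharyya(g1_plus, g2_plus),
              bhattacharyya(g1_minus, g2_plus),
              bhattacharyya(g1_plus, g2_minus),
              bhattacharyya(g1_minus, g2_minus))
    return d_B < tau_B
```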

2.7. Results

Fig. 11 shows several examples of the final curve correspondence for adjacent frames taken from several video sequences. In order to formally evaluate the curve correspondence algorithm, a database of ground truth consisting of four image pairs was manually created. Along with this, each curve is also labeled as foreground or background for verification of the results in Section 3. As tabulated in Table 1, 80–90% of correspondences are correct in these four frames which, as we shall see in the next section, is sufficiently high to enable reliable figure-ground segregation. We also expect significant improvements when several other constraints are utilized in the similarity measure, including a measure of intensity and color match for each alignment, use of 3D geometric reconstruction, imposing spatial order among the curve fragments to disambiguate correspondences, and in particular when compound transitions are also explicitly handled.




3. Transformation-induced figure-ground segregation

In this section we describe a figure-ground segregation method based on the Gestalt cue of common fate. Specifically, since the curve correspondence has established how each curve transforms from one frame to another, curves with distinctly similar transforms should be grouped. These transforms are characterized in the domain of an expected geometric transform, in our case the similarity transform, although affine or projective transformations can also be used.

While it is tempting to measure the similarity between two transforms by measuring the distance between the parameter vectors describing each transform, it is much more meaningful to measure similarity not in the parameter space, but in the observation space. Specifically, consider a transform $T_1(T_{1x}, T_{1y}, \theta_1, k_1)$, where $(T_{1x}, T_{1y})$ are translation coordinates, $\theta_1$ is the angle of rotation,

Fig. 14. Two-frame segregation of moving vehicles in two subsequent video sequences. Observe how our segregation produces a rich description of the figure which can then be used for recognition.

and $k_1$ is the scaling, defined by an inter-frame curve pair $(C_1, \bar{C}_1)$; similarly, $T_2(T_{2x}, T_{2y}, \theta_2, k_2)$ is defined for the $(C_2, \bar{C}_2)$ pair. Rather than rely on differences between the parameters describing $T_1$ and $T_2$, we define the similarity of $T_1$ and $T_2$ by the extent to which $T_1 C_2$ is similar as a curve to $\bar{C}_2$, and analogously, $T_2 C_1$ is similar to $\bar{C}_1$, Fig. 12,

$$d_T(C_1, C_2) = \max\{\, d_H(T_1 C_2, \bar{C}_2),\; d_H(T_2 C_1, \bar{C}_1) \,\}, \tag{9}$$

where $d_H$ is the Hausdorff metric between two curves. This pairwise measure defines the degree to which two curves in one frame have "common fate" with respect to the second frame and is represented by an m × m matrix, where m is the number of curves in the first frame. Ideally, a moving object on a stationary background would lead to two distinct clusters in this matrix. However, since background curves can also shift in a wide range of movements resembling some of those on the object, e.g., tree branches moving in the wind, this distinction is smeared.
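A sketch of Eq. (9) on sampled curves is given below. The least-squares (Umeyama/Procrustes) similarity fit and the brute-force Hausdorff computation are our stand-ins for whatever solvers the authors used, and they assume each curve pair is given as corresponding sample points from the curve alignment.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform (scale k, rotation R,
    translation t) mapping corresponding samples src -> dst; a
    standard Umeyama/Procrustes solution."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    A, B = src - mu_s, dst - mu_d
    U, D, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # keep a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    k = D.sum() / (A ** 2).sum()
    t = mu_d - k * (R @ mu_s)
    return k, R, t

def hausdorff(P, Q):
    """Symmetric Hausdorff distance between two sampled curves."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def transform_distance(C1, C1bar, C2, C2bar):
    """d_T of Eq. (9): apply each pair's transform to the other pair's
    first-frame curve and compare with the observed second-frame curve."""
    k1, Rot1, t1 = fit_similarity(C1, C1bar)
    k2, Rot2, t2 = fit_similarity(C2, C2bar)
    T1C2 = (k1 * (Rot1 @ C2.T)).T + t1
    T2C1 = (k2 * (Rot2 @ C1.T)).T + t2
    return max(hausdorff(T1C2, C2bar), hausdorff(T2C1, C1bar))
```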

We adopt a simple clustering technique to determine cluster boundaries, namely, the seeded region growing method used for segmentation of intensity images [1].



Each curve is initialized as a cluster. The distance between two clusters is defined as the median of pairwise distances between their members. An iterative procedure then merges the two closest clusters into one until either the closest distance between clusters exceeds some threshold or the number of clusters falls below a minimum number of expected clusters. An additional spatial constraint is used to rule out clustering of curves which are far apart in Euclidean space: only clusters whose inter-cluster Euclidean distance is less than a threshold $\tau_s$ are considered for merging.
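A minimal sketch of this agglomerative grouping follows; here dT is the m × m transform-distance matrix of Eq. (9), dE a matrix of Euclidean distances between curves (e.g., centroid distances, our assumption), and the parameters mirror dmin and τ_s from Table 3.

```python
import numpy as np

def cluster_curves(dT, dE, d_min, tau_s, min_clusters=2):
    """Agglomerative grouping: every curve starts as its own cluster;
    the two closest clusters (median of pairwise transform distances
    between members) are merged until the closest distance exceeds
    d_min or min_clusters is reached. Pairs farther apart than tau_s
    in Euclidean distance are never merged (spatial gate)."""
    clusters = [[i] for i in range(dT.shape[0])]

    def linkage(a, b):
        if min(dE[i, j] for i in a for j in b) > tau_s:
            return np.inf                       # spatially too far apart
        return np.median([dT[i, j] for i in a for j in b])

    while len(clusters) > min_clusters:
        best, pair = np.inf, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if d < best:
                    best, pair = d, (x, y)
        if pair is None or best > d_min:
            break                               # no admissible merge left
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters
```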

Fig. 13 depicts the clusters associated with the foreground for two distinct frame pairs. Fig. 14 shows results for four subsequent frames in three different videos. Note that the figure-ground segregation is based on adjacent pairs of frames only.

Table 2
Performance of segregating curves in four pairs of video frames

Image pair          Object curves    False segregations    Correctly segregated curves
SUV 67–68           84               5                     51
Police-car 16–17    38               3                     18
Police-car 21–22    31               5                     16
Minivan 65–66       65               3                     29

Fig. 15. (a) Original video sequence, (b) segmentation of curves using our approach, (c) segmentation using the KLT tracker [27], (d) region-based segmentation using optical flow, and (e) edge-based tracking using geodesic active contours, which requires a periodic manual initialization.

As tabulated in Table 2, the segregation includes few non-object contours (5–10%) for our four ground-truth frame pairs, while capturing a significant collection of the curves on the object.

The computational complexity of the approach depends on a number of factors, including the number of curves and the number of sample points on each curve. There are generally 600–700 curve fragments per frame. The number of sample points on each curve varies from 40 to 200. The complexity of matching a pair of curve segments is O(n²), where n is the number of sample points on the curve segments, but multi-scale approaches can be used to speed this up. The complexity for matching curves in two frames is O(M²n²), where M is the number of curves in a frame. The overall analysis takes approximately 2–2.5 min to process a frame on a Pentium 4, 2 GHz machine.

3.1. Comparisons

We have compared our figure segmentation results with three different approaches. First we compare with the KLT tracker [27], in which robust feature points are tracked. Our segregation is richer than that of the above technique, as evident in Fig. 15(b and c).



Next, we compare it to an optical-flow-based approach. The optical flow is computed for the image and the pixels with velocity above a certain threshold are considered as figure. Note that the optical-flow segmentation is not robust to noise in the background, and it also suffers from the well-known aperture problem in uniform regions. As a result the segmentation has holes, as shown in Fig. 15(d). Last, we compare our results with active-contour-based tracking methods. A contour in the first frame is manually initialized and then snapped onto the object using the geodesic active contour approach [3,18]. A constant-velocity model is then used to propagate the contour to the following frames, where the active contour approach is applied again.

Fig. 16. (a) Original video sequence with an occlusion, (b) edge maps of the above video sequence, and (c) segregation of curves using our approach.

Table 3
System parameters are listed along with their sensitivities and effects

τ: Edge detector threshold for contrast. Increasing the value will increase the selectivity of the detector and reduce the number of edges detected.

e: Initial estimate of the epipole, obtained by extrapolating the sides of the roads in the scene. A rough estimate of the epipole is needed.

R1 and R2: Constants used in the elastic matching cost function, which were estimated empirically. R1 weighs the bending term and R2 weighs the epipolar term; R1 = 10 and R2 = 3 are used for all the experiments.

m: Energy reduction factor at the ends of the curve to enable sub-curve matching and handle transitions. Lower values favor sub-curve matching.

N: Initial number of clusters for clustering curves with similar transformation. A number larger than the expected number of objects would increase the fragmentation (overfitting) of the figures, and a smaller number would result in the opposite (underfitting).

dmin: Minimum inter-cluster distance in agglomerative clustering. Large values would lead to conservative clustering.

τ_s: Threshold for clustering curves which are spatially close to each other, based on their Euclidean distance. The value should be adjusted to the expected size of the object, as increasing the value would allow more false alarms in the segregation.

d: Used in computing {H, S, V} values at each point on the curve at distance d from the curve. Should be around 2–3 pixels; if it is too large we would cross over into the regions of other curves, and if it is too small the intensities would come from the edge region.

τ_B: Threshold for comparing curves based on their color values. Lower values would be more conservative and higher values would lead to an increased number of misses.

We also tested our approach for performance under occlusion by blocking a portion of the video sequence. As illustrated in Fig. 16, curves in the non-occluded part of the object are not affected by the occlusion, and the resulting figure is a rich description of the object.

We emphasize that these results, while already very encouraging, use only pairwise comparison of frames and can potentially be improved significantly further. Observe in Fig. 14 how each frame pair gives a segmentation that has many common curve fragments with its nearby frame pairs, but also features novel curves not seen before. We have not yet utilized this multi-frame regularity, which should lead to a dense and complete segmentation after a few frames.




Also, the emphasis has not been on using a sophisticated clustering method, although the use of one would certainly improve the results. We expect that the addition of regional motion information will also significantly improve the results. As the comparison in Fig. 15 shows, our curve-based approach is a promising direction for figure-ground segregation and tracking in a wide range of applications.

3.2. System parameters

A list of parameters for the approach and their effects is given in Table 3.

References

[1] R. Adams, L. Bischof, Seeded region growing, PAMI 16 (6) (1994) 641–647.

[2] J. Canny, A computational approach to edge detection, PAMI 8 (1986) 679–698.

[3] V. Caselles, R. Kimmel, G. Sapiro, Geodesic active contours, in: ICCV, 1995, pp. 156–162.

[4] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, 2000, vol. 2, pp. 142–149.

[5] R. Deriche, G. Giraudon, A computational approach for corner and vertex detection, IJCV (1993) 167–187.

[6] C.E. Erdem, A. Tekalp, B. Sankur, Video object tracking with feedback of performance measures, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, December 2001, pp. 593–600.

[7] V. Ferrari, T. Tuytelaars, L. van Gool, Real-time affine region tracking and coplanar grouping, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, 2001, pp. 226–233.

[8] J. Foley, A. van Dam, S.K. Feiner, J.F. Hughes, Computer Graphics: Principles and Practice, second ed., Addison-Wesley, Reading, MA, 1996.

[9] F. Folta, L.V. Eycken, L. van Gool, Shape extraction using temporal continuity, in: Proceedings of the European Workshop on Image Analysis for Multimedia Interactive Services, 1997, pp. 69–74.

[10] D. Freedman, Effective tracking through tree search, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 604–615.

[11] S. Gold, A. Rangarajan, E. Mjolsness, Learning with preknowledge: clustering with point and graph matching distance measures, Neural Computation 8 (4) (1996) 787–804.

[12] G. Hager, P. Belhumeur, Efficient region tracking with parametric models of geometry and illumination, PAMI 20 (10) (1998) 1025–1039.

[13] C. Harris, Determination of ego-motion from matched points, International Journal of Computer Vision (1993) 189–192.

[14] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, 2000.

[15] D.P. Huttenlocher, G.A. Klanderman, W.J. Rucklidge, Comparing images using the Hausdorff distance, PAMI 15 (1993) 850–863.

[16] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, IJCV 29 (1998) 2–28.

[17] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, International Journal of Computer Vision 1 (4) (1987) 321–331.

[18] S. Kichenassamy, A. Kumar, P.J. Olver, A. Tannenbaum, A.J. Yezzi, A geometric snake model for segmentation of medical imagery, IEEE Transactions on Medical Imaging 16 (2) (1997) 199–209.

[19] D. Koller, J. Weber, J. Malik, Robust multiple car tracking with occlusion reasoning, in: Proceedings of the Third European Conference on Computer Vision, vol. 1, Springer-Verlag, Berlin, 1994.

[20] T. Lindeberg, Feature detection with automatic scale selection, IJCV 30 (2) (1998) 77–116.

[21] H.P. Moravec, Visual mapping by a robot rover, in: Proceedings of the 6th International Joint Conference on Artificial Intelligence, 1979, pp. 598–600.

[22] P. Muse, F. Sur, F. Cao, Y. Gousseau, J. Morel, An a contrario decision method for shape element recognition, International Journal of Computer Vision 69 (3) (2006) 295–315.

[23] N. Paragios, R. Deriche, A PDE-based level set approach for detection and tracking of moving objects, in: Proceedings of the International Conference on Computer Vision, Bombay, India, January 1998.

[24] C. Rothwell, J. Mundy, W. Hoffman, V.-D. Nguyen, Driving vision by topology, in: IEEE International Symposium on Computer Vision, 1995, pp. 395–400.

[25] T. Sebastian, P. Klein, B. Kimia, On aligning curves, PAMI 25 (1) (2003) 116–125.

[26] D. Sharvit, J. Chan, H. Tek, B.B. Kimia, Symmetry-based indexing of image databases, Journal of Visual Communication and Image Representation 9 (4) (1998) 366–380.

[27] J. Shi, C. Tomasi, Good features to track, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.

[28] L. Younes, Computable elastic distance between shapes, SIAM Journal on Applied Mathematics 58 (1998) 565–586.