On Pairwise Costs for Network Flow Multi-Object Tracking...On Pairwise Costs for Network Flow...

1
On Pairwise Costs for Network Flow Multi-Object Tracking Visesh Chari 1 , Simon Lacoste-Julien 2 , Ivan Laptev 1 , Josef Sivic 1 1 WILLOW project-team, Départment d’Informatique de l’Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France 2 SIERRA project-team, Départment d’Informatique de l’Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France Multi-object tracking has been recently approached with the min-cost net- work flow optimization techniques. Such methods simultaneously resolve multiple object tracks in a video and enable modeling of dependencies among tracks. Min-cost network flow methods also fit well within the “tracking-by- detection” paradigm where object trajectories are obtained by connecting per-frame outputs of an object detector. Object detectors, however, often fail due to occlusions and clutter in the video. To cope with such situations, we propose to add pairwise costs to the min-cost network flow framework. While integer solutions to such a problem become NP-hard, we design a convex relaxation solution with an efficient rounding heuristic which empir- ically gives certificates of small suboptimality. We evaluate two particular types of pairwise costs and demonstrate improvements over recent tracking methods in real-world video sequences. Given a video with objects in motion, the goal is to simultaneously track K moving objects in a “detect-and-track” framework [2]. The input to the approach is two-fold. First a set of candidate object locations is assumed to be given, provided, for example, as output of an object detector. Hence- forth we refer to these locations as detections. The approach also requires a measure of correspondence between detections across video frames. This could be obtained for example from optical flow, or using some other form of correspondence. Based on these inputs, the tracking problem is setup as a joint optimization problem of simultaneously selecting detections of objects and connections between them across video frames. Such a problem can be modeled through a MAP objective [2] with specific constraints encoding the structure of the tracks. The MAP optimization problem can be cast as the following integer linear program (ILP): min x i c i x i + ijE c ij x ij (1) s.t. 0 x i 1 , 0 x ij 1 i : ijE x ij = x j = i : jiE x ji i x it = K = i x si x FLOW K x i , x ij are integer. The above formulation encodes the joint selection of K tracks using the fol- lowing selection variables: x i ∈{0, 1} is a binary indicator variable taking the value 1 when the detection i is selected in some track; x ij ∈{0, 1} is a binary indicator variable taking the value 1 when detection i and detection j are connected through the same track in nearby time frames. The index i ranges over possible detections across the whole video. c i denotes the cost of selecting detection i in a specific frame (and represents the negative de- tection confidence) while c ij represents the negative of the correspondence strength between detections i and j. The set of possible connections between detections is represented by E and could be a subset of all pairs of detections in nearby frames by using choice heuristics (such as spatial proximity). The quality of track selection is quantified by the objective in (1). The constraint i : ijE x ij = x j = i : jiE x ji , which has the structure of a flow conservation constraint [1], encodes the correct claimed semantic that x ij can take the value 1 if and only if both x i and x j take the value 1, and moreover, that each detection belongs to at most one track, enforcing the fact that two objects cannot occupy the same space. Finally, the con- straint i x it = K = i x si ensures that exactly K tracks are selected (dummy “source” and “sink” variables with the fixed value x s = x t = K are added; the connection variables x si and x it represent the start and end of tracks re- spectively). This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage. Basic Network Flow Network Flow with Pairwise Terms Figure 1: Results of network flow tracking using cost functions with/without pairwise terms. (Top Row): a pairwise term that penalizes the overlap between different tracks helps resolving ambiguous (red tracks) tracks in crowded scenes. (Bottom Row): a pairwise term that encourages the consistency between two signals (here head detections and body detec- tions) helps eliminating failures (red tracks) of object detectors. To summarize, the above optimization problem can be solved efficiently using existing network flow or linear algebra packages [1] when the integer constraint is relaxed, and provides a convenient framework to transform the tracking problem into a track selection problem. We use this conversion as a starting point to add additional constraints and costs on the selection process to influence it in desirable ways to address challenging scenarios. We make the following contributions in this paper: We propose a new non-greedy approach to optimize pairwise terms within a min-cost network flow framework. Our solution is generic and allows the simultaneous optimization of any type of pairwise costs. We propose a global optimization strategy with a convex relaxation that allows us to minimize pairwise costs using linear optimization, and a principled Frank-Wolfe style rounding procedure to obtain in- teger solutions with a certificate of suboptimality. The optimization procedure is empirically stable, allowing the practitioner to focus on modeling. To illustrate our method, we propose two particular examples of pair- wise costs: the first discourages significant overlaps between dis- tinct tracks; the second models the spatial co-occurrence of different types of detections. This allows us to better model complex dynamic scenes with substantial clutter and partial occlusions. Using our method, we show improved tracking results on several real-world videos. In addition, we propose a new strategy to evaluate tracking results that better measures the longevity of overlap between output tracks and ground truth. [1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network flows: theory, algorithms, and applications. Prentice-Hall, Inc., 1993. [2] Li Zhang, Yuan Li, and R. Nevatia. Global data association for multi- object tracking using network flows. In CVPR, 2008.

Transcript of On Pairwise Costs for Network Flow Multi-Object Tracking...On Pairwise Costs for Network Flow...

Page 1: On Pairwise Costs for Network Flow Multi-Object Tracking...On Pairwise Costs for Network Flow Multi-Object Tracking Visesh Chari1, Simon Lacoste-Julien2, Ivan Laptev1, Josef Sivic1

On Pairwise Costs for Network Flow Multi-Object Tracking

Visesh Chari1, Simon Lacoste-Julien2, Ivan Laptev1, Josef Sivic1

1WILLOW project-team, Départment d’Informatique de l’Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France2SIERRA project-team, Départment d’Informatique de l’Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France

Multi-object tracking has been recently approached with the min-cost net-work flow optimization techniques. Such methods simultaneously resolvemultiple object tracks in a video and enable modeling of dependencies amongtracks. Min-cost network flow methods also fit well within the “tracking-by-detection” paradigm where object trajectories are obtained by connectingper-frame outputs of an object detector. Object detectors, however, oftenfail due to occlusions and clutter in the video. To cope with such situations,we propose to add pairwise costs to the min-cost network flow framework.While integer solutions to such a problem become NP-hard, we design aconvex relaxation solution with an efficient rounding heuristic which empir-ically gives certificates of small suboptimality. We evaluate two particulartypes of pairwise costs and demonstrate improvements over recent trackingmethods in real-world video sequences.

Given a video with objects in motion, the goal is to simultaneously trackK moving objects in a “detect-and-track” framework [2]. The input to theapproach is two-fold. First a set of candidate object locations is assumedto be given, provided, for example, as output of an object detector. Hence-forth we refer to these locations as detections. The approach also requiresa measure of correspondence between detections across video frames. Thiscould be obtained for example from optical flow, or using some other formof correspondence. Based on these inputs, the tracking problem is setup as ajoint optimization problem of simultaneously selecting detections of objectsand connections between them across video frames. Such a problem can bemodeled through a MAP objective [2] with specific constraints encoding thestructure of the tracks. The MAP optimization problem can be cast as thefollowing integer linear program (ILP):

minx

∑i cixi +∑i j∈E ci jxi j (1)

s.t.

0≤ xi ≤ 1 , 0≤ xi j ≤ 1

∑i : i j∈E

xi j = x j = ∑i : ji∈E

x ji

∑i

xit = K = ∑i

xsi

x ∈ FLOWK

xi , xi j are integer.

The above formulation encodes the joint selection of K tracks using the fol-lowing selection variables: xi ∈ {0,1} is a binary indicator variable takingthe value 1 when the detection i is selected in some track; xi j ∈ {0,1} is abinary indicator variable taking the value 1 when detection i and detectionj are connected through the same track in nearby time frames. The index iranges over possible detections across the whole video. ci denotes the costof selecting detection i in a specific frame (and represents the negative de-tection confidence) while ci j represents the negative of the correspondencestrength between detections i and j. The set of possible connections betweendetections is represented by E and could be a subset of all pairs of detectionsin nearby frames by using choice heuristics (such as spatial proximity). Thequality of track selection is quantified by the objective in (1).

The constraint ∑i : i j∈E xi j = x j = ∑i : ji∈E x ji, which has the structureof a flow conservation constraint [1], encodes the correct claimed semanticthat xi j can take the value 1 if and only if both xi and x j take the value 1,and moreover, that each detection belongs to at most one track, enforcingthe fact that two objects cannot occupy the same space. Finally, the con-straint ∑i xit = K = ∑i xsi ensures that exactly K tracks are selected (dummy“source” and “sink” variables with the fixed value xs = xt = K are added;the connection variables xsi and xit represent the start and end of tracks re-spectively).

This is an extended abstract. The full paper is available at the Computer Vision Foundationwebpage.

Basic Network Flow Network Flow with Pairwise Terms

Figure 1: Results of network flow tracking using cost functionswith/without pairwise terms. (Top Row): a pairwise term that penalizesthe overlap between different tracks helps resolving ambiguous (red tracks)tracks in crowded scenes. (Bottom Row): a pairwise term that encouragesthe consistency between two signals (here head detections and body detec-tions) helps eliminating failures (red tracks) of object detectors.

To summarize, the above optimization problem can be solved efficientlyusing existing network flow or linear algebra packages [1] when the integerconstraint is relaxed, and provides a convenient framework to transform thetracking problem into a track selection problem. We use this conversion as astarting point to add additional constraints and costs on the selection processto influence it in desirable ways to address challenging scenarios.

We make the following contributions in this paper:

• We propose a new non-greedy approach to optimize pairwise termswithin a min-cost network flow framework. Our solution is genericand allows the simultaneous optimization of any type of pairwisecosts.

• We propose a global optimization strategy with a convex relaxationthat allows us to minimize pairwise costs using linear optimization,and a principled Frank-Wolfe style rounding procedure to obtain in-teger solutions with a certificate of suboptimality. The optimizationprocedure is empirically stable, allowing the practitioner to focus onmodeling.

• To illustrate our method, we propose two particular examples of pair-wise costs: the first discourages significant overlaps between dis-tinct tracks; the second models the spatial co-occurrence of differenttypes of detections. This allows us to better model complex dynamicscenes with substantial clutter and partial occlusions.

• Using our method, we show improved tracking results on severalreal-world videos. In addition, we propose a new strategy to evaluatetracking results that better measures the longevity of overlap betweenoutput tracks and ground truth.

[1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Networkflows: theory, algorithms, and applications. Prentice-Hall, Inc., 1993.

[2] Li Zhang, Yuan Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.