Robust Graph-Cut Scene Segmentation and Reconstruction for Free-Viewpoint Video of Complex Dynamic Scenes

Jean-Yves Guillemaut, Joe Kilner and Adrian Hilton
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK

{j.guillemaut, j.kilner, a.hilton}@surrey.ac.uk

Abstract

Current state-of-the-art image-based scene reconstruction techniques are capable of generating high-fidelity 3D models when used under controlled capture conditions. However, they are often inadequate when used in more challenging outdoor environments with moving cameras. In this case, algorithms must be able to cope with relatively large calibration and segmentation errors as well as input images separated by a wide baseline and possibly captured at different resolutions. In this paper, we propose a technique which, under these challenging conditions, is able to efficiently compute a high-quality scene representation via graph-cut optimisation of an energy function combining multiple image cues with strong priors. Robustness is achieved by jointly optimising scene segmentation and multiple-view reconstruction in a view-dependent manner with respect to each input camera. Joint optimisation prevents propagation of errors from segmentation to reconstruction, as is often the case with sequential approaches. View-dependent processing increases tolerance to errors in on-the-fly calibration compared to global approaches. We evaluate our technique on challenging outdoor sports scenes captured with manually operated broadcast cameras and demonstrate its suitability for high-quality free-viewpoint video.

1. Introduction

In recent years, tremendous progress has been made in the field of image-based scene reconstruction, to such an extent that, under controlled conditions and provided a sufficient number of input images are captured, the performance of such techniques is almost on a par with that of active techniques such as laser range scanners. A good testimony of the capabilities of the current state of the art is provided by the multiple-view Middlebury dataset [22], which shows top-performing algorithms capable of sub-millimetre accuracy. The increase in quality can be attributed to improvements in algorithm robustness and also to the emergence of more powerful optimisation techniques such as graph-cuts [25], which have enabled the optimisation of otherwise intractable cost functions.

Figure 1. Two moving broadcast camera views (top) and a locked-off camera view (bottom-left) from a rugby match.

Although these algorithms are capable of high-fidelity modelling in controlled studio settings, their performance is often sub-optimal when applied in less controlled conditions of operation, such as those of outdoor scene capture of stadium sports (see Fig. 1). In this case, reconstruction algorithms must be able to cope with a number of factors: calibration errors resulting from on-the-fly calibration of moving and zooming cameras with motion blur; wide baselines and resolution differences between camera views; and non-uniform dynamic backgrounds with multiple people of a relatively small size. Scene segmentation is more difficult because of increased background complexity and the likelihood of overlapping background and foreground distributions, which make standard chroma-keying techniques unusable. In addition, camera placement and framing are manually controlled and not optimised for 3D modelling; cameras are sparsely located, with manual operation of pan, tilt and zoom to frame the action.

In this paper, we present a robust approach to reconstruction of scenes captured under these challenging conditions. The approach combines multiple-view visual cues such as colour, contrast and photo-consistency information,



with strong shape priors, to achieve robust reconstruction for large-scale scenes with sparse, wide-baseline moving cameras at different resolutions, subject to errors in on-the-fly calibration. A key idea of the technique is to jointly perform segmentation and reconstruction, rather than the usual sequential approach where segmentation is followed by reconstruction. We argue that a joint formulation helps disambiguate segmentation of individual views by introducing information from multiple views, and prevents propagation of errors from segmentation leading to failure in reconstruction. View-dependent optimisation with respect to each input camera also increases robustness to errors in camera calibration which prohibit a globally consistent surface reconstruction. A fully automatic system is presented for multiple-view calibration, segmentation and reconstruction from moving broadcast cameras, allowing high-quality free-viewpoint rendering of sports such as rugby and soccer. Evaluation of simultaneous multiple-view segmentation and reconstruction in challenging outdoor scenes demonstrates considerable improvements over single-view segmentation and previous multiple-view reconstruction approaches. The contribution of this work is a robust framework for reconstruction of complex dynamic scenes from multiple moving cameras, which combines visual cues from multiple views to simultaneously segment and reconstruct the scene, overcoming limitations of previous approaches using static cameras at a similar resolution which require high-quality calibration, segmentation and matching.

2. Related work

2.1. Robust view-dependent scene reconstruction

Multiple-view reconstruction from images has received considerable research interest. A recent survey [22] classifies these algorithms into four categories: (i) volumetric surface extraction, (ii) surface evolution, (iii) depth map-based, and (iv) feature point-based surface extraction. Our algorithm belongs to the third category. In their seminal paper [18], Narayanan et al. introduced the concept of Virtualized Reality. They use a set of 51 cameras distributed on a hemispherical dome to compute 3D models of dynamic scenes by first computing a depth map for each camera and then merging the depth maps into a consistent 3D model. Since then, a number of techniques based on similar two-stage pipelines have been proposed. In [7], Goesele et al. use a window-based voting approach in order to increase depth map accuracy. The technique produces incomplete reconstructions, as ambiguous or occluded areas cannot be reconstructed, but achieves high accuracy in the reconstructed areas. In [5], Campbell et al. introduce multiple depth hypotheses at each pixel and an unknown depth label to cope with ambiguities caused by occlusions or lack of texture.

2.2. Joint segmentation and reconstruction

In [23], Snow et al. propose a generalisation of shape-from-silhouette reconstruction by integrating the segmentation into the voxel occupancy estimation process. Goldlucke and Magnor [8] proposed a joint graph-cut reconstruction and segmentation algorithm which generalises the work of Kolmogorov and Zabih [13] by adding a background layer to the formulation. Both approaches produce global scene reconstructions and are therefore prone to calibration errors. In [12], Kolmogorov et al. proposed a joint segmentation and reconstruction technique which, although structurally similar to ours, targets a different application (background substitution) and is limited to two-layer segmentation from narrow-baseline stereo cameras, while our approach handles the arbitrary number of layers required for multi-player occlusions in sports. In [9], Guillemaut et al. proposed a view-dependent graph-cut approach which can be viewed as a generalisation of Roy and Cox's method [21] with the introduction of a background layer. Finally, in the context of view interpolation, Zitnick et al. performed view-dependent reconstruction using colour-segmentation-based stereo [27]. Although similar in principle to the proposed technique, these methods assume static cameras separated by a relatively small baseline compared to those considered in this paper. Consequently, they lack robustness with respect to errors in input calibration.

2.3. Sports related research

There is growing interest in virtual replay production in sports using free-viewpoint video techniques. Transferring studio techniques to the sports arena is a notoriously difficult problem due to the large area, limited access and the cost of placing additional cameras. Ideally, solutions will work from the existing broadcast cameras, which are manually operated. An initial attempt, used in the Eye Vision system at the Super Bowl, used camera switching between a large number of slaved cameras to perform transitions. More recent approaches used view-interpolation techniques [20] or planar billboards [19]. These techniques are usually fast; however, the lack of explicit geometry can result in unrealistic artefacts. Recently, some more complex offline optimisation techniques based on graph-cuts [9] or deformable models [11] have been reported. These techniques were able to achieve visual quality comparable to that of the input images; however, results were reported with a dedicated set of 15 closely spaced static cameras located around a quarter of the pitch. Current commercial products include the Piero system, which is limited to a single camera input, and the Liberovision system, which is capable of photo-realistic interpolation between match cameras but remains limited to a single frame due to the requirement for manual intervention in calibration and segmentation. In contrast, the proposed


technique is able to render a full sequence from a set of sparsely located input cameras which are not required to be static.

3. View-dependent segmentation and reconstruction

3.1. Problem statement and notation

Given a reference camera and a set of n auxiliary cameras (referred to by the indices 1 to n), all synchronised and calibrated, we would like to (i) partition the reference image into its constituent background/foreground layers and (ii) estimate the depth at each pixel. This can be formulated as a labelling problem where we seek the mappings l : P → L and d : P → D, which respectively assign a layer label l_p and a depth label d_p to every pixel p in the reference image. P denotes the set of pixels in the reference image; L and D are discrete sets of labels representing the different layer and depth hypotheses. L = {l_1, ..., l_|L|} may consist of one background layer and one foreground layer (the classic segmentation problem) or of multiple foreground and background layers. In this paper we assume multiple foreground layers corresponding to players at different depths and a single background layer. More precisely, foreground layers are defined by first computing an approximate background segmentation and then computing a conservative visual hull estimate [11], from which the 3D connected components are extracted to define foreground layers. The set of depth labels D = {d_1, ..., d_|D|−1, U} is formed of depth values d_i obtained by sampling the optical rays, together with an unknown label U used to account for occlusions. Occlusions are common and can be severe when the number of cameras is small (Fig. 1), especially in the background, where large areas are often visible only in a single camera.
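As a small illustration (the function name and the uniform sampling strategy are our own assumptions, not specified in the paper), the depth label set D can be built by sampling depth values along the optical ray and appending the unknown label:

```python
def make_depth_labels(d_near, d_far, n_samples):
    """Discrete depth hypotheses d_1..d_{|D|-1} plus the unknown label U."""
    step = (d_far - d_near) / (n_samples - 1)
    depths = [d_near + k * step for k in range(n_samples)]
    return depths + ["U"]  # U accounts for occluded pixels

depth_labels = make_depth_labels(10.0, 50.0, 5)
```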

3.2. Energy formulation

We formulate the problem of computing the optimum labelling (l, d) as an energy minimisation problem with the following cost function:

E(l, d) = λ_colour E_colour(l) + λ_contrast E_contrast(l) + λ_match E_match(d) + λ_smooth E_smooth(l, d).  (1)

The different energy terms correspond to various cues derived from layer colour models, contrast, photo-consistency and smoothness priors, whose relative contributions are controlled by the parameters λ_colour, λ_contrast, λ_match and λ_smooth.
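Concretely, the weighted combination in Eq. (1) can be sketched as below. The toy term functions are placeholders for the real cues, not the paper's implementation; only the weight values are those reported in Section 4.

```python
def total_energy(l, d, terms, weights):
    """Weighted sum of segmentation and reconstruction energy terms (Eq. 1).

    `terms` maps a cue name to a callable evaluating that energy term on the
    labelling; `weights` holds the corresponding lambda parameters.
    """
    return (weights["colour"] * terms["colour"](l)
            + weights["contrast"] * terms["contrast"](l)
            + weights["match"] * terms["match"](d)
            + weights["smooth"] * terms["smooth"](l, d))

# toy stand-ins for the real cue terms (assumptions for illustration only)
terms = {
    "colour": lambda l: float(sum(l)),
    "contrast": lambda l: 0.0,
    "match": lambda d: float(sum(d)),
    "smooth": lambda l, d: 0.0,
}
# weight values reported in Section 4
weights = {"colour": 0.5, "contrast": 1.0, "match": 0.5, "smooth": 0.1}
```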

3.3. Description of the different energy terms

The colour and contrast terms are frequently used in segmentation problems, while the matching and smoothness terms are normally used in reconstruction. Here we minimise an energy function which simultaneously involves both types of terms. The colour and contrast terms are very similar in principle to [24], with the main distinction that we extend the formulation to an arbitrary number of layers.

3.3.1 Colour term

The colour term exploits the fact that different layers (or groups of layers) tend to follow different colour distributions. It encourages assignment of pixels to the layer with the most similar colour model, and is defined as

E_colour(l) = Σ_{p∈P} −log P(I_p | l_p),  (2)

where P(I_p | l_p = l_i) denotes the probability of pixel p in the reference image belonging to layer l_i. Similarly to [24], the model for a layer l_i is defined as a linear combination of a global colour model P_g(I_p | l_p = l_i) and a local colour model P_l(I_p | l_p = l_i):

P(I_p | l_p = l_i) = w P_g(I_p | l_p = l_i) + (1 − w) P_l(I_p | l_p = l_i),  (3)

where w is a real value between 0 and 1 controlling the contributions of the global and local models. A dual colour model combining global and local components allows for dynamic changes in the background. It should be noted that the local model is applicable only to static layers (this is often the case for background layers).

For a given layer l_i, the global component of the colour model is represented by a Gaussian Mixture Model (GMM):

P_g(I_p | l_p = l_i) = Σ_{k=1}^{K_i} w_ik N(I_p | µ_ik, Σ_ik),  (4)

where N is the normal distribution and the parameters w_ik, µ_ik and Σ_ik represent the weight, mean and covariance matrix of the k-th component for layer l_i. K_i is the number of components of the mixture model for layer l_i.

The local component of the colour model for a static layer l_i is represented by a single Gaussian distribution at each pixel p:

P_l(I_p | l_p = l_i) = N(I_p | µ_ip, Σ_ip),  (5)

where the parameters µ_ip and Σ_ip represent the mean and covariance matrix of the Gaussian distribution at pixel p. Learning of the colour models is described in Section 4.
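A minimal sketch of the dual colour model of Eqs. (3)-(5), using scalar intensities and 1D Gaussians for brevity (the paper uses full colour vectors with covariance matrices):

```python
import math

def gauss(x, mu, var):
    """1D normal density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def p_global(x, mixture):
    """GMM of Eq. (4): `mixture` is a list of (weight, mean, variance)."""
    return sum(w * gauss(x, mu, var) for w, mu, var in mixture)

def p_local(x, mu_p, var_p):
    """Per-pixel Gaussian of Eq. (5) for a static layer."""
    return gauss(x, mu_p, var_p)

def p_colour(x, mixture, mu_p, var_p, w=0.5):
    """Dual colour model of Eq. (3): blend of global and local components."""
    return w * p_global(x, mixture) + (1 - w) * p_local(x, mu_p, var_p)
```

The value w = 0.5 matches the setting reported in Section 4.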

3.3.2 Contrast term

The contrast term encourages layer discontinuities to occur at high-contrast locations. This naturally encourages low-contrast regions to coalesce into layers and favours discontinuities that follow strong edges. The term is defined as

E_contrast(l) = Σ_{(p,q)∈N} e_contrast(p, q, l_p, l_q), with  (6)

e_contrast(p, q, l_p, l_q) = 0 if l_p = l_q, and exp(−β C(I_p, I_q)) otherwise.  (7)

N denotes the set of interacting pairs of pixels in P (a 4-connected neighbourhood is assumed) and ||·|| is the L2 norm. C(·, ·) represents the squared colour distance between neighbouring pixels, and β is a parameter weighting the distance function. Although various distances are possible for C(·, ·), we use the attenuated contrast [24]

C(I_p, I_q) = ||I_p − I_q||² / [1 + (||B_p − B_q|| / K)² exp(−z(p, q)² / σ_z)],  (8)

where z(p, q) = max(||I_p − B_p||, ||I_q − B_q||). B_p is the background colour at pixel p; it is provided by the local component of the colour model defined in Section 3.3.1. β, K and σ_z are parameters which are set to the standard values suggested in [24]. This formulation uses background information to adaptively normalise the contrast, thereby encouraging layer discontinuities to fall on foreground edges.
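The attenuated contrast of Eqs. (7)-(8) can be sketched as follows. The parameter values (β, K, σ_z) here are illustrative assumptions, not the standard settings of [24]:

```python
import math

def attenuated_contrast(ip, iq, bp, bq, K=5.0, sigma_z=10.0):
    """Background-adaptive squared colour distance of Eq. (8).

    ip, iq: neighbouring image colours; bp, bq: background-plate colours.
    """
    def sq_norm(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    z2 = max(sq_norm(ip, bp), sq_norm(iq, bq))          # z(p, q)^2
    atten = 1.0 + (sq_norm(bp, bq) / K ** 2) * math.exp(-z2 / sigma_z)
    return sq_norm(ip, iq) / atten

def e_contrast(lp, lq, ip, iq, bp, bq, beta=0.01):
    """Pairwise contrast cost of Eq. (7): zero inside a layer."""
    if lp == lq:
        return 0.0
    return math.exp(-beta * attenuated_contrast(ip, iq, bp, bq))
```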

3.3.3 Matching term

The matching term is based on the idea that a correct depth estimate must have a similar appearance in all the images in which the point is visible. Similarity can be measured at each pixel based on photo-consistency, or on robust feature constraints when available. This term encourages depth assignments that maximise these multi-view similarity measures. Our formulation combines dense (photo-consistency-based) and sparse (SIFT-feature-based) constraints:

E_match(d) = E_dense(d) + E_sparse(d).  (9)

The dense matching score is defined as

E_dense(d) = Σ_{p∈P} e_dense(p, d_p), with  (10)

e_dense(p, d_p) = S(P(p, d_p)) if d_p ≠ U, and S_U if d_p = U.  (11)

P(p, d_p) denotes the coordinates of the 3D point along the optical ray passing through pixel p and located at a distance d_p from the reference camera. The function S(·) measures the similarity of the reference camera with the auxiliary cameras in which the hypothesised point P(p, d_p) is visible. For weakly textured scenes such as the ones considered in this paper, standard normalised cross-correlation similarity measures are inadequate. A more appropriate choice in this case is an error-tolerant photo-consistency measure similar to [15]. This computes photo-consistency over extended regions of radius r_tol rather than single pixels, and thereby compensates for calibration errors or non-uniform image sampling. The photo-consistency score between the reference image and auxiliary camera i is defined as

photo_i(X) = max_{(q − π_i(X))² < r_tol} (I_p − I_q^i)² / σ_i²,  (12)

where σ_i² normalises the photo-consistency measure for each auxiliary camera i, and the function π_i(X) projects the hypothesised 3D point X into the image plane of camera i. A robust combination rule is defined as the sum over the k most photo-consistent pairs, denoted by B_k:

S(X) = Σ_{i∈B_k} photo_i(X).  (13)
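The robust combination of Eqs. (11) and (13) can be sketched as below; the per-camera photo-consistency errors are taken as given, and the value of the unknown-label cost S_U is an assumption for illustration:

```python
def robust_similarity(photo_errors, k=2):
    """Robust combination of Eq. (13): sum over the k most photo-consistent
    auxiliary cameras, i.e. the k smallest normalised errors."""
    return sum(sorted(photo_errors)[:k])

def e_dense(dp, photo_errors, s_unknown=10.0, k=2):
    """Dense matching cost of Eq. (11); s_unknown is the constant cost S_U
    charged to the unknown depth label (value here is an assumption)."""
    if dp == "U":
        return s_unknown
    return robust_similarity(photo_errors, k)
```

Summing only the best k pairs discards cameras where the point is occluded or badly matched, which is the purpose of the robust rule in the text.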

The sparse matching score is defined as

E_sparse(d) = Σ_{p∈P} e_sparse(p, d_p), with  (14)

e_sparse(p, d_p) = 0 if F(p) = ∅ or d_p ∈ F(p), and ∞ otherwise.  (15)

F(p) denotes the set of depth labels located within a distance T of a sparse constraint at pixel p. This forces the reconstructed surface to pass near existing sparse 3D correspondences. Because of calibration errors, we do not require the reconstruction to match the sparse constraints exactly, but allow a tolerance controlled by the parameter T. We use affine-covariant features [17, 16], which are known to be robust to changes in viewpoint and illumination; in this paper, we used the Hessian-affine feature detector. Image features located at the foreground/background junction cannot be used to establish reliable correspondences; we use the learnt colour models defined in Section 3.3.1 to discard such features based on their ratio of foreground and background pixels. Features are represented using the SIFT descriptor and matched based on a nearest-neighbour strategy. Robust matching is ensured by restricting the search to areas within a tolerance distance of the epipolar lines. The left-right spatial consistency (reciprocity) constraint is enforced, together with a temporal consistency constraint which requires corresponding features between camera views to also be in correspondence temporally with the previous or the next frame.
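The nearest-neighbour matching with reciprocity and an epipolar gate can be sketched as below. The descriptor distance and the `epi_ok` predicate (which would encode the epipolar-line tolerance) are simplified stand-ins for the SIFT/epipolar machinery described above:

```python
def match_features(desc_a, desc_b, epi_ok):
    """Reciprocal nearest-neighbour matching restricted by an epipolar gate.

    desc_a, desc_b: lists of descriptor vectors for the two views.
    epi_ok(i, j): True if feature i in view A and j in view B are within the
    epipolar tolerance (a placeholder for the real geometric test).
    """
    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    def nearest(i, src, dst, gate):
        cands = [j for j in range(len(dst)) if gate(i, j)]
        if not cands:
            return None
        return min(cands, key=lambda j: dist(src[i], dst[j]))

    matches = []
    for i in range(len(desc_a)):
        j = nearest(i, desc_a, desc_b, epi_ok)
        # reciprocity: the reverse nearest neighbour must point back to i
        if j is not None and nearest(j, desc_b, desc_a,
                                     lambda jj, ii: epi_ok(ii, jj)) == i:
            matches.append((i, j))
    return matches
```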

3.3.4 Smoothness term

The smoothness term encourages the depth labels to vary smoothly within each layer. This is useful in situations where matching constraints are weak (poor photo-consistency or a low number of sparse constraints) and insufficient to produce an accurate reconstruction without support from neighbouring pixels. It is defined as

E_smooth(l, d) = Σ_{(p,q)∈N} e_smooth(l_p, d_p, l_q, d_q), with  (16)

e_smooth(l_p, d_p, l_q, d_q) = min(|d_p − d_q|, d_max) if l_p = l_q and d_p, d_q ≠ U; 0 if l_p = l_q and d_p = d_q = U; d_max otherwise.  (17)

Discontinuities between layers are assigned a constant smoothness penalty d_max, while within each layer the penalty is defined as a truncated linear distance. Such a distance is discontinuity-preserving, as it does not over-penalise large discontinuities within a layer; this is known to be superior to simpler distances that are not discontinuity-preserving (see [4, 14]). This term also encourages unknown labels to coalesce within each layer.

The choice of shape prior is crucial. A commonly used prior is to assume locally constant depth (the fronto-parallel assumption). In this case, a label (l_p, d_p) corresponds to the point from layer l_p located at a distance d_p from the reference camera centre along the ray emanating from pixel p. Although this yields good-quality results when supported by strong matching cues, it introduces a bias towards flat figure models which do not give good alignment between views. An alternative approach, which we use here, is to place samples along the iso-surfaces of the visual hull, which biases the reconstructed surface towards the visual hull iso-surfaces. We call this prior the iso-surface prior. In this case, a label (l_p, d_p) corresponds to the first point of intersection between the ray emanating from pixel p and the d_p-th iso-surface in the interior of the visual hull's connected component corresponding to layer l_p. To account for calibration and matting errors, we use the error-tolerant visual hull proposed in [11]. Unlike the fronto-parallel prior, the iso-surface prior is view-independent and, in the absence of strong matching cues, yields reconstructions that are more realistic and more likely to coincide across views. Note that the choice of a fronto-parallel or an iso-surface prior affects the correspondence between labels and the 3D points they represent; however, it does not change the formulation in Eq. (17), since the set of depth values remains an ordered set of discrete values.
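The pairwise smoothness cost of Eq. (17) is straightforward to implement; d_max = 20 matches the setting reported in Section 4:

```python
def e_smooth(lp, dp, lq, dq, d_max=20):
    """Pairwise smoothness cost of Eq. (17): truncated linear within a layer,
    constant penalty d_max across layer boundaries or mixed unknown labels."""
    if lp == lq:
        if dp == "U" and dq == "U":
            return 0                       # unknown labels coalesce freely
        if dp == "U" or dq == "U":
            return d_max                   # mixed unknown/known: full penalty
        return min(abs(dp - dq), d_max)    # truncated linear, edge-preserving
    return d_max                           # layer boundary
```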

3.4. Graph-cut optimisation

Optimisation of the energy defined by Eq. (1) is known to be NP-hard. However, an approximate solution can be computed using the expansion move algorithm based on graph-cuts [4]. Proof of the regularity of the energy function, required for alpha-expansion optimisation, is provided in [1]. The expansion move algorithm proceeds by cycling through the set of labels α = (l_α, d_α) in L × D and performing an α-expansion iteration for each label until the energy cannot be decreased (see [4]). An α-expansion iteration is a change of labelling such that each pixel p either retains its current value or takes the new label α. Each α-expansion iteration can be solved exactly by performing a single graph-cut using the min-cut/max-flow algorithm [3]. After convergence of the algorithm, the result obtained is guaranteed to be a strong local optimum [4]. The α-expansion algorithm was initialised with the visual hull estimate; convergence has been found to be insensitive to the choice of initialisation. In practice, convergence is usually achieved in 3 or 4 cycles of iterations over the label set. We improve computation and memory efficiency by dynamically reusing the flow at each iteration of the min-cut/max-flow algorithm [2]. This results in a speed-up of about a factor of two.

Figure 2. Camera layout for the different datasets: (a) rugby dataset (100 × 50 m pitch), (b) soccer dataset (105 × 70 m pitch), (c) evaluation dataset (105 × 70 m pitch). Broadcast cameras are represented as blue arrows, locked-off cameras as red arrows. Crossed-out cameras were not used due to inadequate framing. The ground-truth camera is indicated in yellow; additional left-out cameras are shown in light blue.
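A minimal sketch of the expansion-move cycle, shown on a 1D chain of pixels. On a chain, each binary α-expansion subproblem (every pixel keeps its label or takes α) can be solved exactly by dynamic programming, which stands in here for the min-cut/max-flow solve used on the image grid in the paper; the toy costs and signal are our own assumptions.

```python
def alpha_expansion_chain(init_labels, data_cost, pair_cost, label_set):
    """Expansion-move optimisation of a data + pairwise energy on a 1D chain."""
    def energy(lab):
        e = sum(data_cost(p, lab[p]) for p in range(len(lab)))
        e += sum(pair_cost(lab[p], lab[p + 1]) for p in range(len(lab) - 1))
        return e

    labels = list(init_labels)
    improved = True
    while improved:                    # cycle until no move lowers the energy
        improved = False
        for alpha in label_set:
            n = len(labels)
            # binary choice per pixel: index 0 = keep current label, 1 = take alpha
            cand = [(labels[p], alpha) for p in range(n)]
            # exact binary solve on the chain by dynamic programming (Viterbi)
            cost = [data_cost(0, cand[0][b]) for b in (0, 1)]
            back = []
            for p in range(1, n):
                row, new_cost = [], []
                for b in (0, 1):
                    best_cost, best_a = min(
                        (cost[a] + pair_cost(cand[p - 1][a], cand[p][b]), a)
                        for a in (0, 1))
                    new_cost.append(best_cost + data_cost(p, cand[p][b]))
                    row.append(best_a)
                cost = new_cost
                back.append(row)
            # backtrack the optimal binary assignment
            b = 0 if cost[0] <= cost[1] else 1
            choice = [b]
            for row in reversed(back):
                b = row[b]
                choice.append(b)
            choice.reverse()
            new_labels = [cand[p][choice[p]] for p in range(n)]
            if energy(new_labels) < energy(labels):
                labels = new_labels
                improved = True
    return labels

# toy example: denoise a step signal with label set {0, 5}
obs = [0, 0, 1, 0, 0, 5, 5, 4, 5, 5]
result = alpha_expansion_chain([0] * 10,
                               lambda p, l: abs(obs[p] - l),     # data cost
                               lambda a, b: min(abs(a - b), 2),  # truncated linear
                               [0, 5])
```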

4. Results

Testing of the algorithm was performed on two wide-baseline datasets: an international Six Nations rugby game (400 frames) and a European Championship soccer game (100 frames). In addition, a quantitative evaluation of the free-viewpoint capabilities of the different techniques was performed on a soccer game (100 frames) captured using a special camera rig providing additional ground-truth views. Due to space limitations, the examples shown in the paper are restricted to single frames; however, results on full sequences can be seen in the attached video. The reconstruction pipeline is the same for all datasets and can be broken into four stages: (i) camera calibration, (ii) initial segmentation using difference keying, (iii) initial scene reconstruction using a conservative visual hull, and (iv) joint refinement of segmentation and reconstruction based on the proposed approach.

Calibration is performed on-the-fly from pitch line features [26]. This typically produces calibration errors of the order of 1-3 pixels, which are rather large given the player resolution (a pixel roughly corresponds to 5 cm in widely framed views) and the pitch dimensions (100 × 50 m). Initial segmentation is obtained by computing a background image plate for each input view based on standard mosaicing techniques (this does not require separate background capture) and thresholding the colour difference between the background plate and the original image. The initial segmentation is inaccurate due to overlapping foreground and background colour distributions, which is exacerbated by the presence of compression artifacts and motion blur. For this reason, segmentation thresholds are set to low values to prevent foreground erosion. The initial reconstruction is obtained by error-tolerant shape-from-silhouette, which prevents scene truncation in the presence of calibration and segmentation errors [11].

Joint refinement of segmentation and reconstruction is performed in parallel for each input camera, using the set of neighbouring cameras as auxiliary cameras (usually no more than 1 or 2 auxiliary cameras are usable due to the wide baseline). We define two background layers, corresponding to the pitch and crowd areas, and as many foreground layers as there are components in the conservative visual hull. The local colour model used for the background is learnt automatically from the background plate and its variance. The global colour model for each layer is learnt using samples from the background plate in the case of the background, and from a single manually annotated key-frame for each camera in the case of the foreground; a mixture of 5 components was used for each layer. The parameters weighting the contributions of the different energy terms were set to λ_colour = 0.5, λ_match = 0.5, λ_contrast = 1.0 and λ_smooth = 0.1. The weight w balancing the local and global colour models was set to 0.5. Iso-surfaces were sampled every 5 mm and the maximum penalty d_max was set to 20. The same parameter settings were used for all datasets. Refinement run-time varies according to the number of foreground labels in the scene and is of the order of a minute per reference frame on an Intel Core 2 Duo 2.4 GHz CPU.

4.1. Rugby and soccer datasets

The rugby dataset consists of six locked-off cameras and four moving broadcast cameras, while the soccer dataset consists of only two locked-off cameras and four broadcast cameras (see Fig. 2). All cameras are high-definition (1920 × 1080), acquired at 25 Hz. The broadcast cameras are manually operated zooming and rotating cameras. The locked-off cameras have a wider framing to give coverage of different sections of the pitch and are therefore not all simultaneously usable.

In terms of segmentation, we compared our approach against three other techniques: (i) user-assisted chroma-keying, (ii) difference keying, and (iii) background cut [24].

Results for the four techniques are presented in Fig. 3 and the accompanying video. Clearly, a purely global technique such as chroma-keying produces poor results, as it is applicable only to grass areas. Difference keying produces better results due to the locality of the colour model it is based on; however, it fails in the crowd area, which is non-static, and in areas where foreground and background colours are similar. Background cut improves the results further by combining local and global models, thereby increasing robustness and adding tolerance to non-static background elements; however, it yields inaccurate results in ambiguous areas. The proposed approach, which combines multiple-view information to disambiguate the problem, produces a cleaner segmentation than all the other methods.

To evaluate reconstruction, our technique was compared to three standard techniques: (i) the conventional visual hull, (ii) the conservative visual hull (with 2-pixel tolerance), and (iii) stereo refinement of the conservative visual hull with no colour, contrast or smoothness term. Results are shown in Fig. 4. As expected, the visual hull produces large truncations in the presence of calibration and segmentation errors. These truncations are eliminated in the conservative visual hull but are replaced by protrusions and phantom volumes. Stereo refinement of the conservative visual hull results in a very noisy reconstruction; this illustrates the weakness of the available photo-consistency information. In contrast, the proposed technique yields a smooth reconstruction with accurate player boundaries and eliminates the phantom volumes.

The technique can be used to synthesise views from novel viewpoints where real cameras cannot be physically located, such as views from inside the pitch. Examples of free-viewpoint video sequences can be seen in the attached video. Each depth map is converted to a mesh and then rendered using view-dependent texture mapping techniques [6]. High-quality rendering from view-dependent geometry and images is made possible by the proposed surface reconstruction technique, which accurately aligns wide-baseline views based on stereo matches and gives a smooth surface approximation, based on iso-surfaces of the visual-hull shape prior, in regions of uniform appearance; such regions commonly occur on player shirts or where views are sampled at significantly different resolutions.
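The depth-map-to-mesh conversion above can be sketched as back-projecting each pixel through the camera intrinsics and triangulating the regular pixel grid. A minimal illustration under assumed names (the paper's view-dependent texturing [6] is omitted):

```python
import numpy as np

def depthmap_to_mesh(depth, K):
    """Back-project a depth map into a triangle mesh in camera coordinates.

    depth: HxW array of depths along the optical axis; K: 3x3 intrinsics.
    Hypothetical minimal sketch; real pipelines also discard triangles
    spanning large depth discontinuities.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    verts = (rays * depth.ravel()).T  # (h*w, 3) camera-space points
    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x
            faces.append((i, i + 1, i + w))          # upper-left triangle
            faces.append((i + 1, i + w + 1, i + w))  # lower-right triangle
    return verts, np.array(faces)
```
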

4.2. Evaluation dataset

A quantitative evaluation of free-viewpoint rendering accuracy was performed on a dataset of 15 closely spaced cameras. One camera (shown in yellow in Fig. 2) was excluded from reconstruction and used to provide ground truth. Three sets of reconstruction cameras were defined by excluding one, two or three neighbouring cameras (see Fig. 2), thus defining sets with increasing baseline. All four


Figure 3. Example of segmentation results on rugby (top) and soccer (bottom) data; columns show, left to right: original, chroma-key, difference key, background cut, and the proposed method (see attached video for full sequence).

Figure 4. Example of reconstruction results on rugby (top) and soccer (bottom) data; columns show, left to right: visual hull, conservative visual hull, stereo refinement, and the proposed method (see attached video for full sequence).

Figure 5. Quantitative evaluation of free-viewpoint video performance: completeness, shape and PSNR scores plotted against the number of cameras left out, for the visual hull, conservative visual hull, stereo refinement and the proposed method.

reconstruction techniques were then used to synthesise the sequences of images from the same viewpoint as the ground-truth camera. The quality of the synthesised images was assessed by measuring their similarity with the ground-truth images. Evaluation was performed over 100 frames based on the completeness, shape and PSNR scores defined in [10]. Results are shown in Fig. 5. The visual hull produces the worst results. The conservative visual hull produces a complete reconstruction but a poor estimate of shape and appearance. The stereo refinement of the conservative visual hull and the proposed technique both score highly on all measures, with marginal differences. Although scores are comparably high for these two techniques, the visual quality is usually greater for the proposed technique (see attached video). This is not reflected in the evaluation scores, which are unable to capture essential visual features such as temporal consistency and image smoothness.
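Of the three scores, PSNR is standard and simple to reproduce; the completeness and shape scores follow the definitions in [10] and are not reproduced here. A minimal PSNR sketch, assuming 8-bit images:

```python
import numpy as np

def psnr(rendered, ground_truth, peak=255.0):
    """Peak signal-to-noise ratio (dB) of a synthesised view against the
    real image from the held-out camera. Higher is better."""
    err = rendered.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

As noted above, such per-frame pixel scores cannot capture temporal consistency, so they understate the visual advantage of the proposed technique.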

5. Conclusions and future work

This paper introduced a novel approach for simultaneous multi-layer segmentation and reconstruction of complex dynamic scenes captured with multiple moving cameras at different resolutions. A view-dependent graph-cut optimisation is proposed, combining visual colour and contrast cues, previously used in 2D image segmentation, with multiple-view dense photometric correspondence and sparse spatio-temporal feature matching, previously used in reconstruction. Strong priors on shape and appearance are combined with multiple-view visual cues to achieve accurate segmentation and reconstruction which is robust to relatively large errors (1-3 pixels RMS) in on-the-fly camera calibration. A fully automatic system is presented for multiple-view calibration, segmentation and reconstruction from moving broadcast cameras, allowing high-quality free-viewpoint rendering of sports such as rugby and soccer. Evaluation of simultaneous multiple-view segmentation and reconstruction in challenging outdoor scenes demonstrates considerable improvements over independent single-view segmentation and conventional visual-hull or stereo reconstruction. Sports reconstruction presents a challenging problem, with a small number (6-12) of independently, manually operated panning and zooming broadcast cameras sparsely located around the stadium to cover a large area with multiple players. This results in multiple-view wide-baseline capture at different resolutions, with motion blur due to player and camera motion. The framework enables high-quality rendering of novel views from wide-baseline moving-camera acquisition of sports, overcoming limitations of previous multiple-view reconstruction algorithms.

Acknowledgements This work was supported by TSB/EPSRC projects "i3Dlive: interactive 3D methods for live-action media" (TP/11/CII/6/I/AJ307D, TS/G002800/1) and "iview: Free-viewpoint Video for Interactive Entertainment Production" (TP/3/DSM/6/I/15515, EP/D033926/1).

References

[1] http://www.guillemaut.org/publications/09/GuillemautICCV09_supplemental.pdf.
[2] K. Alahari, P. Kohli, and P. Torr. Reduce, reuse & recycle: Efficiently solving multi-label MRFs. In CVPR, 2008.
[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9):1124–1137, 2004.
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11):1222–1239, 2001.
[5] N. Campbell, G. Vogiatzis, C. Hernandez, and R. Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In ECCV, volume I, pages 766–779, 2008.
[6] P. Debevec, C. Taylor, and J. Malik. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In SIGGRAPH, pages 11–20, 1996.
[7] M. Goesele, B. Curless, and S. Seitz. Multi-view stereo revisited. In CVPR, pages 2402–2409, 2006.
[8] B. Goldlucke and M. Magnor. Joint 3D-reconstruction and background separation in multiple views using graph cuts. In CVPR, volume 1, pages 683–688, 2003.
[9] J.-Y. Guillemaut, A. Hilton, J. Starck, J. Kilner, and O. Grau. A Bayesian framework for simultaneous matting and 3D reconstruction. In 3DIM, pages 167–174, 2007.
[10] J. Kilner, J. Starck, J.-Y. Guillemaut, and A. Hilton. Objective quality assessment in free-viewpoint video production. SPIC, 24:3–16, 2008.
[11] J. Kilner, J. Starck, A. Hilton, and O. Grau. Dual-mode deformable models for free-viewpoint video of sports events. In 3DIM, pages 177–184, 2007.
[12] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Probabilistic fusion of stereo with color and contrast for bilayer segmentation. PAMI, 28(9):1480–1492, 2006.
[13] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In ECCV, volume III, pages 82–96, 2002.
[14] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI, 26(2):147–159, 2004.
[15] K. Kutulakos. Approximate N-view stereo. In ECCV, volume I, pages 67–83, 2000.
[16] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[17] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[18] P. Narayanan, P. Rander, and T. Kanade. Constructing virtual worlds using dense stereo. In ICCV, pages 3–10, 1998.
[19] Y. Ohta, I. Kitahara, Y. Kameda, H. Ishikawa, and T. Koyama. Live 3D video in soccer stadium. IJCV, 75(1):173–187, 2007.
[20] I. Reid and K. Connor. Multiview segmentation and tracking of dynamic occluding layers. In BMVC, volume 2, pages 919–928, 2005.
[21] S. Roy and I. Cox. A maximum-flow formulation of the N-camera stereo correspondence problem. In ICCV, pages 492–499, 1998.
[22] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, pages 519–528, 2006.
[23] D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. In CVPR, volume 1, pages 345–352, 2000.
[24] J. Sun, W. Zhang, X. Tang, and H.-Y. Shum. Background cut. In ECCV, volume 3954, pages 628–641, 2006.
[25] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. PAMI, 30(6):1068–1080, 2008.
[26] G. Thomas. Real-time camera tracking using sports pitch markings. J. Real-Time Image Proc., 2:117–132, 2007.
[27] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. In SIGGRAPH, pages 600–608, 2004.