
LS3D: Single-View Gestalt 3D Surface Reconstruction from Manhattan Line Segments

Yiming Qian¹, Srikumar Ramalingam², and James H. Elder¹

¹ York University, Toronto, Canada. {yimingq,jelder}@yorku.ca

² University of Utah, Salt Lake City, USA. [email protected]

To be published in the Proceedings of the 2018 Asian Conference on Computer Vision (ACCV).

Abstract. Recent deep learning algorithms for single-view 3D reconstruction recover rough 3D layout but fail to capture the crisp linear structures that grace our urban landscape. Here we show that for the particular problem of 3D Manhattan building reconstruction, the explicit application of linear perspective and Manhattan constraints within a classical constructive perceptual organization framework allows accurate and meaningful reconstructions to be computed. The proposed Line-Segment-to-3D (LS3D) algorithm computes a hierarchical representation through repeated application of the Gestalt principle of proximity. Edges are first organized into line segments, and the subset that conforms to a Manhattan frame is extracted. Optimal bipartite grouping of orthogonal line segments by proximity minimizes the total gap and generates a set of Manhattan spanning trees, each of which is then lifted to 3D. For each 3D Manhattan tree we identify the complete set of 3D 3-junctions and 3-paths, and show that each defines a unique minimal spanning cuboid. The cuboids generated by each Manhattan tree together define a solid model and the visible surface for that tree. The relative depths of these solid models are determined by an L1 minimization that is again rooted in a principle of proximity in both depth and image dimensions. The method has relatively few parameters and requires no training. For quantitative evaluation, we introduce a new 3D Manhattan building dataset (3DBM). We find that the proposed LS3D method generates 3D reconstructions that are both qualitatively and quantitatively superior to reconstructions produced by state-of-the-art deep learning approaches.

1 Introduction

Most 3D computer vision research focuses on multi-view algorithms or direct ranging methods (e.g., LiDAR, structured light). However, the human ability to appreciate the 3D layout of a scene from a photograph shows that our brains also make use of single-view cues, which complement multi-view analysis by providing instantaneous estimates, even for distant surfaces, where stereoscopic disparity signals are weak.

Recent work on single-view 3D reconstruction has focused on supervised deep learning to directly estimate depth from pixels. Here we take a different approach, focusing instead on identifying a small set of principles that together lead to a simple, unsupervised method for estimating 3D surface geometry for buildings that conform to a Manhattan constraint [1].


This approach has three advantages. First, it provides an interpretable scientific theory with a clear set of assumptions and domain of validity. Second, it leads to a well-specified hierarchical solid model 3D representation that may be more useful for downstream applications than a range map. Finally, as we will show, it generates results that are qualitatively and quantitatively superior to state-of-the-art deep learning methods for the domain of 3D Manhattan building reconstruction.

Single-view 3D reconstruction is ill-posed and thus solvable only for scenes that satisfy strong regularity conditions. Linear perspective - the assumption that features of the 3D scene are arranged over systems of parallel lines - is perhaps the most powerful of these constraints; its discovery is often cited as a defining achievement of the Early Renaissance. Linear perspective is a particularly valuable cue for urban environments, which abound with piecewise planar and often parallel surfaces that generate families of parallel 3D line segments in the scene.

A stronger constraint than linear perspective is the so-called Manhattan constraint [1], which demands that there be three dominant and mutually orthogonal families of parallel line segments. Application of this additional regularity allows line segments to be labelled with their 3D orientation, but does not directly provide an estimate of the 3D surfaces or solid shapes in the scene.

To bridge this gap, we appeal to a long history of research in Gestalt psychology that identifies a principle of proximity and related cues of connectedness and common region as dominant factors in the perceptual organization of visual information [2–5]. We use this principle of proximity repeatedly to construct successively more global and three-dimensional representations of the visual scene.

The proposed approach is anchored on a sparse line segment representation - Fig. 1 provides an overview of our Line-Segment-to-3D (LS3D) algorithm. Principles of proximity and good continuation are used to group local image edges into line segments, which are then labelled according to their Manhattan directions (Fig. 1(a)). A principle of proximity is then again employed to optimally group neighbouring orthogonal segments, forming local 2D minimal spanning Manhattan trees (MTs, Figs. 1(b-c)). Each of these local trees is then lifted to 3D using the Manhattan constraints. Note, however, that the relative depth of each 3D MT remains undetermined (Fig. 1(d)).

One of our main contributions is to show that each of these 3D Manhattan trees can be decomposed into a maximal set of non-subsuming 3D 3-junctions, 3-paths and L-junctions. Each of the 3-junctions and 3-paths defines a unique minimal covering cuboid, and each L-junction defines a unique minimal covering rectangle. The union of these cuboids and rectangles defines the 3D surface model for the tree (Fig. 1(e)).

The definition of these surfaces now allows the relative depth of these local 3D models to be resolved through a two-stage constrained optimization procedure. We first apply a principle of common region [5], minimizing the L1 distance between parallel planes from different models that overlap in the image, forming sets of compound 3D models corresponding to connected regions in the image (Fig. 1(f)). Finally, we apply a principle of proximity to resolve the relative depth of these disjoint compound 3D models, minimizing the L1 distance between parallel planes from distinct models, weighted by their inverse separation in the image. For both stages, occlusion constraints [6] play a crucial role in preventing physically unrealizable solutions. The resulting 3D scale model (Fig. 1(g)) can be used to generate a range map (Fig. 1(h)) for comparison with competing approaches.


Fig. 1. LS3D processing stages. (a) Detected Manhattan line segments. (b) Graphical structure of identified Manhattan spanning trees (MTs). Each vertex represents a line segment endpoint, and each edge represents either a real line segment or a junction between orthogonal segments. (c) MTs localized in the image. (d) MTs lifted to 3D; note that the relative depth of each MT remains unknown. (e) Minimal spanning cuboid/rectangle models. (f) Compound 3D models of connected structures. (g) Final model of visible surfaces. (h) Range map.


Unlike deep learning methods, LS3D is not designed to recover an estimate of absolute depth for every pixel in the image, but rather an estimate of the Euclidean 3D layout of the Manhattan structures in the scene, up to a single unknown scaling factor. We therefore introduce a new 3D ground-truth dataset of solid massing building models and an evaluation framework suitable for the evaluation of such algorithms.

To summarize, our contributions are threefold: 1) we introduce a novel, explainable single-view 3D reconstruction algorithm called LS3D that infers the 3D Euclidean surface layout of Manhattan buildings, up to an unknown scaling factor; 2) we introduce a new 3DBM ground-truth dataset of 3D Manhattan building models and a novel evaluation framework that allows single-view methods for 3D Manhattan building reconstruction to be evaluated and compared; and 3) using this dataset and framework, we find that the LS3D method outperforms state-of-the-art deep learning algorithms, both qualitatively and quantitatively. The goal of this work is not to reconstruct general scenes. This is consistent with the computer vision tradition of focusing on important sub-problems and making use of domain constraints. As we argue here, Manhattan structures are extremely common in our built environment, and many non-Manhattan scenes can be modelled as a mixture of Manhattan frames. Thus it makes sense to have a specialized module for their reconstruction. Any system that does not take explicit advantage of Manhattan regularity will, we expect, fail to reconstruct crisp orthogonal structure (see DNN output in Fig. 4).


2 Prior Work

Single-view 3D reconstruction is a classical computer vision problem that goes back to Roberts' PhD thesis [7–11]. More recent work has attempted to reconstruct piecewise planar 3D models of real scenes, but under somewhat stronger assumptions. In their Photo Pop-up work, Hoiem et al. [12] modeled scenes as comprising three types of surfaces: ground, vertical and sky. Boosted decision tree classifiers were used to label superpixels from the image into one of these three semantic classes using a feature descriptor that includes appearance and geometric cues. The set of polylines defining the ground/vertical boundary was identified to estimate the 3D orientations of the vertical surfaces in the scene. Subsequent work globally optimizes the ground/vertical boundary [13] and generalizes to a larger range of camera poses and more fine-grained surface estimation [14].

While Hoiem et al. allowed vertical surfaces of arbitrary orientation, Coughlan and Yuille [15] observed that in the built environment, 3D scenes are often dominated by three mutually orthogonal directions (vertical + 2 horizontal) and developed a probabilistic approach to recover the rotation of this so-called Manhattan frame relative to the camera. Subsequent work [16, 17] refined this model to deliver more accurate Manhattan frames and to label the lines in the image according to their Manhattan direction.

The Manhattan constraint has been productively exploited by numerous subsequent 3D reconstruction algorithms. Delage et al. [18] developed a Bayes net model to identify the floor/wall boundary in indoor scenes and thus to recover the Euclidean floor/wall geometry. Hedau et al. [19] employed an even stronger constraint for indoor 3D room reconstruction, assuming that the room could be modeled as a single cuboid with intervening clutter. Subsequent improvements to indoor reconstruction based on this cuboid constraint have relied on novel features [20–22], physics-based constraints [6], Bayesian modeling [23], better inference machinery [24, 25], larger fields of view [26], and supervised deep learning [27].

While these indoor room scenes are highly constrained, a more recent approach returns to the problem of reconstructing more general Manhattan scenes, indoor and outdoor [28]. Line segments are first detected and labelled with their Manhattan directions, and then a large set of potential 3D connectivities are identified between segment pairs. While many of these potential connectivities are false, an L1 minimization framework can identify the 3D solution that respects the maximal number of connection hypotheses, allowing the detected line segments to be back-projected into 3D space.

As an alternative to the ground/vertical and Manhattan constraints, one can assume that surfaces are linear 3D sweeps of lines or planar contours detected in the image. This constraint has led to interesting interactive systems, although fully automatic reconstruction remains challenging [29].

The main competing fully automatic approach to constrained piecewise planar models attempts to recover an unconstrained range map using supervised machine learning techniques. An early example is Make3D [30], which models range as a conditional random field (CRF) with naïve Bayes unary potentials over local edge, texture, and colour features and a data-dependent binary smoothness term.

More recent range map approaches tend to use deep neural networks [31–39]. For example, Eigen et al. train and evaluate a multi-scale CNN on their own NYU RGBD dataset of indoor scenes and the KITTI LiDAR dataset of road scenes [40, 31], while


Laina et al. train and evaluate a single-scale but deeper ResNet concatenated with upsampling layers [33] on the Make3D [30] and NYU2 [31] datasets. Joint estimation of depth with surface orientation and/or semantic category has been found to improve the accuracy of depth estimates [31, 37, 38].

One criticism of deep network approaches is the requirement for large amounts of labelled training data, but recent work demonstrates that deep networks for single-view range map estimation can be trained from calibrated stereo pairs [41, 39] or even uncalibrated video sequences [42], using reprojection error as the supervisory signal.

Recent research has also been exploring the fusion of deep networks with more traditional computer vision approaches. The IM2CAD system [43], for example, focuses on the modeling of room interiors, optimizing configurations of 3D CAD models of furnishings and wall features by projection error, using metrics trained by CNNs.

While deep networks have become the dominant approach to single-view 3D reconstruction, this approach has limitations. First, DNN models have millions of free parameters and are thus not easily interpretable. Second, while deep networks can provide an estimate of rough scene layout, they typically fail to deliver the crisp and accurate geometry that is typical of urban environments. Third, most deep network approaches deliver a range map, which may be appropriate for some applications (e.g., navigation), but for applications such as construction, interior design and architecture a succinct CAD model is more useful.

We thus return in this paper to the classical geometry-driven approach. In particular, we ask, for the particular problem of single-view 3D Manhattan building reconstruction, how much can be achieved by a method that uses geometry alone, without relying upon any form of machine learning or appearance features. While the geometric approach has been criticized as unreliable [20], we show here that by integrating several key novel ideas with state-of-the-art line segment detection [44], reliable single-view 3D reconstruction of Manhattan objects can be achieved. By keeping the model simple we keep it interpretable, and by focusing on geometry, we deliver the crisp surfaces we experience in built environments, in a highly compact 3D CAD model form.

The focus on geometry and application of the Manhattan constraint links the proposed LS3D approach most directly to the line lifting algorithm of Ramalingam & Brand [28]. However, in this prior work there was no explicit grouping of line segments into larger structures, no inference of surfaces or solid models, and no quantitative evaluation of 3D geometric accuracy. LS3D thus goes far beyond this prior work in delivering quantitatively-evaluated 3D surface models. This is achieved through three key contributions:

1. While prior approaches [45, 20] use a 'line sweeping' heuristic to go from line segments to independent Manhattan rectangles, here we introduce a novel, principled approach to identify more complex 3D Manhattan trees, solving a series of three optimal bipartite matching problems to deliver spanning tree configurations of orthogonal Manhattan line segments that together maximize proximity between grouped endpoints.

2. We introduce a novel method for converting these 3D Manhattan trees to surface models. The idea is based on decomposing each Manhattan tree into a maximal set of non-subsuming 3D 3-junctions, 3-paths and L-junctions. Each of the 3-junctions


and 3-paths defines a unique minimal spanning cuboid, and each L-junction defines a unique minimal spanning rectangle. The union of these cuboids and rectangles defines the 3D surface model for the tree.

3. These 3D surface models contain multiple planes, providing stronger cues for estimating the relative depth of disconnected structures. We introduce a novel two-stage L1 minimization approach that gives precedence to the Gestalt principle of common region [5], first forming compound 3D models of structures connected in the image, and only later resolving distances between these disjoint structures.

3 The LS3D Algorithm

The LS3D algorithm is summarized in Fig. 1 and detailed below. Line segments are first detected and labelled according to Manhattan direction (Fig. 1(a), Section 3.1). From these, Manhattan spanning trees (MTs) are recovered (Fig. 1(b-c), Section 3.2) and then lifted to 3D (Fig. 1(d), Section 3.3). A maximal set of minimal cuboids and rectangles that span each 3D MT is then identified (Fig. 1(e), Section 3.4) and their surfaces aligned in depth through a constrained L1 optimization, first for overlapping MTs (Fig. 1(f)) and finally for disjoint MTs (Section 3.5). The resulting 3D CAD model (Fig. 1(g)) can be rendered as a range map (Fig. 1(h)) to compare with algorithms that only compute range maps.

3.1 Manhattan Line Segment Detection

We employ the method of Tal & Elder [17] to estimate Manhattan lines, based upon probabilistic Houghing and optimization on the Gauss sphere, and then the MCMLSD line segment detection algorithm [44], which employs an efficient dynamic programming algorithm to estimate segment endpoints. MCMLSD produces line segment results that are quantitatively superior to prior approaches - Fig. 1(a) shows an example.

The MCMLSD algorithm identifies line segments that are collinear: LS3D groups nearby collinear segments into a single 'super-segment', but retains a record of the intermediate endpoints to support later surface creation (see below). We retain only segments over a threshold length.³

3.2 Manhattan Tree Construction

Prior line-based single-view 3D algorithms [45, 20] attempt to leap directly from line segments to 3D with no intermediate stages of perceptual organization. One of our main hypotheses is that the Gestalt principle of proximity coupled with sparsity constraints can yield a much stronger intermediate Manhattan tree representation that will subsequently facilitate global 3D model alignment.

First, a dense graph is formed by treating each segment as a vertex, and defining edges between pairs of vertices representing orthogonal segments with endpoints separated by less than a threshold distance.⁴ (Note that an endpoint can lie on the interior

³ We use a minimum segment length of 100 pixels, and a maximum gap between collinear segments of 300 pixels. Sensitivity to these thresholds is studied in Section 5.

⁴ We use a threshold distance of 100 pixels - sensitivity to this threshold is studied in Section 5.


of a super-segment.) To sparsify the graph we apply the constraint that each endpoint connects to at most one other endpoint in each of the two orthogonal Manhattan directions. This is achieved through a series of three optimal bipartite matchings, using a proximity-based objective function. Specifically, we seek the bipartite matching of all X segment endpoints to all Y segment endpoints that minimizes the total image distance between matched endpoints, and repeat for X and Z segments as well as Y and Z segments. These optimal bipartite matches are found in cubic time using the Hungarian algorithm [46]. We further sparsify the graph by computing the minimum spanning tree (MST) for each connected subgraph, generating what we will call local Manhattan trees (MTs, Fig. 1(b-c)).
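This grouping step reduces to standard combinatorial primitives. The following sketch (ours, not the authors' MATLAB implementation; the array and edge-list inputs are hypothetical) shows one way to realize the per-direction-pair matching and the final MST sparsification:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def match_endpoints(pts_a, pts_b, max_dist=100.0):
    """Optimal bipartite matching of endpoints of one Manhattan direction
    to endpoints of an orthogonal direction, minimizing total image
    distance (Hungarian algorithm, cubic time). pts_a: (N,2), pts_b: (M,2)."""
    cost = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    BIG = 1e9
    cost = np.where(cost <= max_dist, cost, BIG)   # forbid distant pairs
    rows, cols = linear_sum_assignment(cost)
    return [(r, c, cost[r, c]) for r, c in zip(rows, cols) if cost[r, c] < BIG]

def manhattan_trees(n_endpoints, edges):
    """Sparsify the matched-endpoint graph to a minimum spanning tree per
    connected component (one local MT each). edges: (i, j, gap) triples
    pooled from the three direction-pair matchings; gaps are positive."""
    i, j, w = zip(*edges)
    W = coo_matrix((w, (i, j)), shape=(n_endpoints, n_endpoints))
    return minimum_spanning_tree(W)
```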

3.3 Lifting 2D MTs to 3D

Each of the MTs can be back-projected from 2D to 3D space using the Manhattan direction constraints, up to an unknown distance/scaling constant λ. Assume a camera-centered world coordinate frame (X, Y, Z) in which the X and Y axes are aligned with the x and y axes of the image. Then any endpoint x_i = (x_i, y_i)^⊤ in the image back-projects to a 3D point X_i = λ(x_i, y_i, f)^⊤ in the scene, where f is the focal length of the camera. Note that while λ is unknown, it must be the same for all endpoints in the MT.

Due to noise, Manhattan line segments will never be perfectly aligned with the Manhattan directions. Lifting an MT thus entails rectifying each segment to the exact Manhattan direction. We employ a sequential least-squares process. One of the endpoints X_0 of the MT is first randomly selected as the 3D anchor of the tree: the 3D tree is assumed to pass exactly through X_0. Then a depth-first search from X_0 is executed, during which the Manhattan 3D location X′_j of each endpoint X_j is determined from the Manhattan location X′_i of its parent on the depth-first path by X′_j = X′_i + αλV_ij, where V_ij is the 3D vanishing point direction for segment (i, j) and α is determined by minimizing ||X′_i + αλV_ij − X_j||². (Note that λ factors out of this minimization.) Figs. 1(c-d) show the MTs for an example image, each lifted to 3D up to a random scaling constant λ.
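The inner minimization has a closed form: setting the derivative with respect to α to zero gives α = V_ij · (X_j − X′_i) / (λ ||V_ij||²). A minimal sketch of the traversal (our illustration; the tree and endpoint structures are hypothetical):

```python
import numpy as np

def lift_mt(tree, endpoints_2d, f, root=0, lam=1.0):
    """Lift a 2D MT to 3D by depth-first rectification. tree: dict mapping
    a node index to a list of (child, V) pairs, where V is the 3D Manhattan
    (vanishing-point) direction of the connecting segment."""
    n = len(endpoints_2d)
    # Back-project each endpoint: X_i = lam * (x_i, y_i, f)
    X = lam * np.hstack([endpoints_2d, np.full((n, 1), f)])
    Xp = {root: X[root]}          # anchor: the 3D tree passes through X_0
    stack = [root]
    while stack:
        i = stack.pop()
        for j, V in tree.get(i, []):
            if j in Xp:
                continue
            V = np.asarray(V, float)
            # Closed-form alpha minimizing ||Xp[i] + alpha*lam*V - X[j]||^2
            alpha = V @ (X[j] - Xp[i]) / (lam * (V @ V))
            Xp[j] = Xp[i] + alpha * lam * V
            stack.append(j)
    return Xp                     # rectified Manhattan 3D endpoint locations
```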

3.4 From Line Segments to Surfaces

A key contribution of our work is a novel method for inferring surface structure from 3D MTs. We define a Manhattan three-junction as a triplet of orthogonal line segments that meet at a vertex of the MT, and a Manhattan three-path as a sequence of three orthogonal segments meeting end-to-end (Fig. 2). Since each segment can radiate from a junction in two ways, there are eight types of three-junctions, four that may be observed below the horizon and four that may be observed above (Fig. 2, Columns 2-3). Each three-path must begin at one of two endpoints of one of three segment types, then continue to one of two endpoints of one of the remaining two segment types, and finally to one of two endpoints of the remaining segment type. This leads to 2 × 3 × 2 × 2 × 2 = 48 three-paths; however, each of these has a metamer path traversed in the opposite direction, so there are only 24 distinct three-paths, 12 that can be observed above the horizon and 12 below (Fig. 2, Columns 4-9). Our main insight is that this collection of 32 three-junctions and three-paths can be viewed as


the outcome of a generative process involving just four generic cuboid poses, two lying above the horizon and two below (Fig. 2, Column 1).
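The three-path count can be checked mechanically. One way to formalize it (our model, not from the paper): a directed three-path is an ordering of the three Manhattan axes plus a sign for the direction travelled along each axis, and a path and its reversal describe the same physical configuration:

```python
from itertools import permutations, product

# 6 axis orderings x 8 sign patterns = 48 directed three-paths.
directed = [(axes, signs)
            for axes in permutations('XYZ')
            for signs in product((+1, -1), repeat=3)]
assert len(directed) == 48

def reverse(path):
    """The metamer: the same path traversed in the opposite direction."""
    axes, signs = path
    return (axes[::-1], tuple(-s for s in signs[::-1]))

# Identify each path with its reversal; 24 distinct three-paths remain.
distinct = {min(p, reverse(p)) for p in directed}
assert len(distinct) == 24
```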

Fig. 2. The 32 unique classes of Manhattan three-junctions and three-paths, shown on the four classes of generic Manhattan cuboid poses.

This observation leads to a simple algorithm for bridging line segments to surfaces. We first decompose an MT into an exhaustive set of Manhattan three-junctions and three-paths. If any segments remain, these are used to form two-paths with neighbouring orthogonal segments. This collection of three-junctions, three-paths and two-paths spans the MT. Three-junctions and three-paths are then used to spawn minimal spanning Manhattan cuboids as per Fig. 2, and two-paths, if they exist, spawn minimal spanning Manhattan rectangles. (Note that at an intermediate endpoint of a super-segment, the entire super-segment is considered to support the generated cuboid or rectangle - this serves to complete occluded surfaces.) Together, these cuboids and rectangles form a surface model for the MT.
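Because the rectified segments are exactly axis-aligned, one natural reading of the minimal spanning cuboid of a three-junction or three-path is the componentwise bounding box of its endpoints (our interpretation, sketched below under that assumption):

```python
import numpy as np

def spanning_cuboid(endpoints_3d):
    """Minimal Manhattan cuboid spanning the rectified 3D endpoints of a
    three-junction or three-path: the axis-aligned bounding box, returned
    as its two opposite corners. For a two-path (two orthogonal segments)
    the same box degenerates to a Manhattan rectangle."""
    P = np.asarray(endpoints_3d, float)
    return P.min(axis=0), P.max(axis=0)
```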

It is important to distinguish our approach to inferring 3D surface models from prior work on recovering indoor scenes that 'sweeps' segments in orthogonal Manhattan directions [45, 20]. This sweeping approach estimates the 3D orientation of Manhattan rectangles, but not their relative depth, which must be resolved using strong constraints on the structure of the room (single floor and single ceiling connected by 'accordion' Manhattan walls).

By first connecting proximal orthogonal line segments into minimal Manhattan spanning trees, we provide the connectivity constraints necessary for producing more complex locally-connected 3D surface models, which generate much stronger constraints for resolving relative depth (next section). This approach can be considered a quantitative expression of the 3D reasoning philosophy advocated by Gupta et al. [6], who argued for the use of simple solid models to make qualitative inferences about 3D scenes. While their goal was to compute qualitative spatial relationships between independent cuboids, we show that it is possible to recover quantitative 3D scene structure involving much more complex compound objects composed of many cuboids and rectangular surfaces.

3.5 Constrained L1-Minimization for Manhattan Building Reconstruction

A typical building generates many MTs, and their relative distance/scaling must be determined. Our surface models allow us now to formulate a constrained L1 optimization that identifies the scaling parameters minimizing separation between parallel planes while respecting occlusion constraints [6]. We partition the process into two stages based upon the Gestalt principles of common region and proximity [5].

Stage 1 (Common Region): Let M represent the number of MTs in the model and let λ_1, ..., λ_M represent the unknown scaling parameters for these MTs. Visible rectangular facets from all MTs are projected to the image. The overlap in these projections


defines an undirected common-region graph G_cr = (V_cr, E_cr) in which each vertex i ∈ V_cr represents a facet and each edge (i, j) ∈ E_cr represents overlap between parallel facets from different MTs. Fig. 1(f) shows the MTs within each connected component of this graph.

For each connected component c ∈ [1, ..., C] of the graph we identify the MT m_c with the largest image projection and clamp its scaling parameter to λ_{m_c} = 1. Our goal is now to use linear programming (LP) to determine the remaining scaling parameters Λ = {λ_1, ..., λ_M} \ {λ_{m_1}, ..., λ_{m_C}} that minimize the weighted distance between overlapping parallel planes from different MTs.

This minimization must, however, respect depth-ordering constraints induced by the visibility of line segments. To encode these constraints, for each MT i we identify all line segment endpoints p_ijk ∈ P_ij that lie within a rectangular facet from another MT j. Letting d⁻_ijk represent the depth of endpoint p_ijk when λ_i = 1, and d⁺_ijk the distance to the overlapping facet from MT j along the ray from the camera centre to endpoint p_ijk when λ_j = 1, we have the depth-ordering constraint λ_i d⁻_ijk ≤ λ_j d⁺_ijk.

The resulting constrained optimization is thus:

$$
\min_{\Lambda} \sum_{(i,j) \in E_{cr}} |A_i \cap A_j| \cdot |\lambda_i d_i - \lambda_j d_j|
\qquad \text{s.t.} \quad \lambda_i d^{-}_{ijk} \le \lambda_j d^{+}_{ijk}, \;\; p_{ijk} \in P_{ij}
\tag{1}
$$

Here |λ_i d_i − λ_j d_j| is the distance between the two parallel planes. We weight this distance by the area of overlap |A_i ∩ A_j| of the two planar facets in the image.
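The absolute values make the objective piecewise linear, so each term can be linearized with a slack variable in the usual way (t_e ≥ ±(λ_i d_i − λ_j d_j)), giving a standard LP. A sketch of Stage 1 along these lines (our formulation; the input structures are hypothetical, and this is not the authors' solver):

```python
import numpy as np
from scipy.optimize import linprog

def align_depths(M, plane_pairs, occl, clamped):
    """Stage-1 L1 alignment as an LP. plane_pairs: (i, j, d_i, d_j, w) per
    pair of overlapping parallel facets (w = image overlap area); occl:
    (i, j, d_minus, d_plus) ordering constraints; clamped: MT indices with
    lambda fixed to 1. Variables: M lambdas followed by one slack per pair."""
    E = len(plane_pairs)
    n = M + E
    c = np.zeros(n)
    A_ub, b_ub = [], []
    for e, (i, j, di, dj, w) in enumerate(plane_pairs):
        c[M + e] = w                          # minimize sum_e w_e * t_e
        r1 = np.zeros(n); r1[i] = di;  r1[j] = -dj; r1[M + e] = -1.0
        r2 = np.zeros(n); r2[i] = -di; r2[j] = dj;  r2[M + e] = -1.0
        A_ub += [r1, r2]; b_ub += [0.0, 0.0]  # +/-(li*di - lj*dj) <= t_e
    for (i, j, dm, dp) in occl:               # li*d^- <= lj*d^+
        r = np.zeros(n); r[i] = dm; r[j] = -dp
        A_ub.append(r); b_ub.append(0.0)
    A_eq = np.zeros((len(clamped), n)); b_eq = np.ones(len(clamped))
    for k, m in enumerate(clamped):
        A_eq[k, m] = 1.0                      # clamp lambda_{m_c} = 1
    bounds = [(1e-6, None)] * M + [(0, None)] * E
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:M]                          # optimized scaling parameters
```

Stage 2 below has the same structure, with weights |A_i ∪ A_j| / l_ij over the disjoint-region edges.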

Stage 2 (Proximity): Even after the scaling parameters of MTs within each connected component of the common-region graph have been optimized to merge parallel planes in 3D, the relative scaling of each connected component remains unknown.

To resolve these remaining degrees of freedom, we first identify a disjoint-region graph G_dr = (V_dr, E_dr) in which each vertex i ∈ V_dr represents a facet and each edge (i, j) ∈ E_dr represents two parallel facets from different MTs and different components that do not overlap in the image.

We then identify the connected component c with the largest image area and clamp its scaling parameter to λ_{m_c} = 1. We again use LP to determine the remaining scaling parameters Λ_C = {λ_{m_1}, ..., λ_{m_C}} \ {λ_{m_c}} that minimize the weighted distance between non-overlapping parallel planes from different MTs and different components. We weight this minimization by the sum of the areas |A_i ∪ A_j|, and inversely by the minimum separation l_ij of the two planar facets in the image.

Note that although the pairs of planar facets entered into the minimization do not overlap, there may still be overlap between one or more visible line segments from one component and one or more facets from the other, and these must again be encoded as ordering constraints. The resulting constrained minimization is thus:

$$
\min_{\Lambda_C} \sum_{(i,j) \in E_{dr}} \frac{1}{l_{ij}} |A_i \cup A_j| \cdot |\lambda_i d_i - \lambda_j d_j|
\qquad \text{s.t.} \quad \lambda_i d^{-}_{ijk} \le \lambda_j d^{+}_{ijk}, \;\; p_{ijk} \in P_{ij}
\tag{2}
$$

Fig. 1(g) shows the 3D surface model that results from this two-stage constrained minimization for an example image.


4 Evaluation Dataset

To evaluate the LS3D algorithm and compare against the state of the art, we have created a new 3D ground-truth dataset of 57 urban buildings that largely conform to the Manhattan constraint. The 3D building massing models (3DBMs) were obtained through the City of Toronto Open Data project (www.toronto.ca/city-government/data-research-maps/open-data) and were simplified in MeshLab [47] to speed processing. Fig. 3 shows some examples.

Fig. 3. Some examples of 3DBM models in our dataset.

The number of images taken of each building depended upon access and the complexity of the architecture; 118 images were taken in total. We used a Sony NEX-6 camera with 4912 × 3264 pixel resolution. The camera was calibrated using the MATLAB Camera Calibration Toolbox to determine focal length (15.7 mm) and principal point. The NEX-6 corrects for barrel distortion; our calibration procedure confirmed that it is negligible.

The camera was held roughly horizontally, but no attempt was made to precisely control height, roll or tilt. We attempted to take generic views of the buildings, but the exact viewing distance and vantage depended upon access and foreground obstructions. This dataset will be made available at elderlab.yorku.ca/resources.

To use the 3DBM dataset to evaluate single-view 3D reconstruction algorithms, we need to determine the rotation Ω and translation τ of the camera relative to each of the 3DBMs. To this end, we manually identified between 5 and 20 point correspondences (w_i, x_i) between the 3DBM model and the 2D image, and then used a standard nonlinear optimization method (MATLAB fmincon) to minimize projection error.

5 Evaluation

We compare LS3D against the CRF-based Make3D algorithm [30] and four state-of-the-art deep learning approaches: the multi-scale deep network of Eigen et al. [48], the fully convolutional residual network (FCRN) of Laina et al. [33], the deep ordinal regression network (DORN) of Fu et al. [36], and the very recent PlaneNet algorithm of Liu et al. [49].

The LS3D method estimates range only up to an unknown scaling factor α. Although the FCRN and DORN algorithms are trained to estimate absolute range, the Eigen algorithm is trained to minimize a partially scale-invariant loss function, and therefore should not be expected to deliver accurate absolute range estimates. Moreover, global scaling error has been reported as a significant contributor to overall error for such methods [48]. For these reasons we estimate a global scaling factor α for each algorithm and image independently by fitting the range estimates to the 3DBM ground truth. In particular, we estimate the value of α that minimizes the RMS deviation of estimated range d̂ from ground-truth range d, over all pixels that project from the 3DBM model.
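Assuming the scale multiplies the estimate (i.e., minimizing Σ_p (α d̂_p − d_p)², our reading of the fitting step), α has the closed form α = Σ_p d̂_p d_p / Σ_p d̂_p²:

```python
import numpy as np

def global_scale(d_est, d_gt):
    """Least-squares global scale factor: the alpha minimizing the RMS
    deviation of alpha * d_est from d_gt over 3DBM pixels (closed form)."""
    d_est = np.asarray(d_est, float).ravel()
    d_gt = np.asarray(d_gt, float).ravel()
    return (d_est * d_gt).sum() / (d_est * d_est).sum()
```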


The LS3D algorithm is not guaranteed to return a range estimate for every pixel that projects from the 3DBM, particularly when foliage and other objects intervene. To account for this, we employ two different methods to compare error between the LS3D and competing methods. In the intersection method, we measure the RMS error for all algorithms only over the intersection of the pixel set that projects from the 3DBM and the pixel set for which LS3D returns a range estimate. In the diffusion method, we interpolate estimates of range at 3DBM pixels for which LS3D does not return an estimate by solving Laplace's equation, with boundary conditions given by the LS3D range estimates at pixels where estimates exist and reflection boundary conditions at the frame of the image. This allows us to compare RMS error for all algorithms over all pixels projecting from the 3DBMs. The input and output resolution of each algorithm varies; our 4912 × 3264 pixel images were resized to meet the input requirements of each algorithm.
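A minimal sketch of this harmonic in-filling (ours; a simple Jacobi relaxation standing in for whatever solver the authors used), where edge padding implements the reflecting boundary at the image frame:

```python
import numpy as np

def diffuse_range(d, known, n_iter=5000):
    """Fill pixels where LS3D returns no estimate by solving Laplace's
    equation: Dirichlet conditions at known pixels, reflecting (Neumann)
    conditions at the image frame. d: (H,W) range map; known: (H,W) bool."""
    d = d.astype(float).copy()
    d[~known] = d[known].mean()             # initial guess for unknown pixels
    for _ in range(n_iter):
        p = np.pad(d, 1, mode='edge')       # edge padding = reflection
        avg = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +
                      p[1:-1, :-2] + p[1:-1, 2:])
        d[~known] = avg[~known]             # relax only the unknown pixels
    return d
```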

Qualitative results are shown in Fig. 4. Make3D and the deep networks deliver range estimates that are sometimes correlated with ground truth, but these estimates are noisy and highly regularized. They generally fail to capture the dynamic range of depths over the 3DBM surface (deep red to dark blue). Moreover, in some cases the estimates seem wildly inaccurate. In Column 1, for example, both versions of FCRN completely fail. In Column 2, all competing algorithms except perhaps DORN estimate the left face of the building as farther away than the right. In Column 3, all networks seem to fail. In Column 4, all networks fail to capture the receding depth of the left wall of the building.

The LS3D results are qualitatively different. The crisp architectural structure of each building is captured, along with the full dynamic range of depths. As expected, where good connectivity is achieved, errors are minimal (Columns 2, 4). In Columns 1 and 3, however, some limitations can be seen, stemming from the failure to extract parts of the building occluded by vegetation.

We find that on average the LS3D method returns a range estimate for 83.3% of pixels projecting from the 3DBM model. For quantitative evaluation, we first average error over all images of a particular building, and then report the mean and standard error over the 57 buildings in the dataset. Table 1 shows quantitative results based on the intersection measure of error. Of the prior algorithms, we find that PlaneNet [49] performs best, achieving a mean error of 9.35 m (23.4%). However, LS3D beats this by a substantial margin (24.6%), achieving a mean error of 7.05 m (17.7%). Matched-sample t-tests confirm that this improvement is statistically significant (Table 1). A comparison of LS3D performance with and without occlusion constraints shows that these constraints yield a substantial improvement in performance. Mean errors are somewhat higher for all methods when using the diffusion method to evaluate over all pixels projecting from the 3DBM model (Table 2). PlaneNet is again the best of the deep networks, achieving a mean error of 10.6 m (26.1%). However, LS3D beats this by 22.9%, achieving a mean error of 8.17 m (19.4%). Matched-sample t-tests again confirm that this improvement is statistically significant (Table 2).

Our method is not intended to reconstruct an entire image or to operate on non-Manhattan structure. Nevertheless, we have evaluated its performance on the indoor NYU2 dataset. We achieve a mean error of 1.08 m on the subset of pixels for which a range estimate is returned. This is not competitive with deep networks trained on NYU,


Fig. 4. Example results for Make3D [30], Eigen [48], FCRN [33], DORN [36], PlaneNet [49], and the proposed LS3D method, with and without diffusion.


Methods                            RMSE (m)   RMSPE (%)   p (RMSE)   p (RMSPE)
Make3D [30]                        25.3       63.3        7.11E-18   4.36E-30
Eigen [48]                         11.9       31.4        2.59E-07   2.74E-10
FCRN (Make3D) [33]                 14.1       34.9        1.40E-10   1.32E-13
FCRN (NYU) [33]                    11.0       28.1        3.91E-08   1.42E-09
DORN [36]                          11.5       29.0        1.23E-08   1.05E-10
PlaneNet [49]                      9.33       24.0        8.1E-03    1.2E-03
LS3D (no occlusion constraint)     8.02       20.2        5.60E-03   3.21E-02
LS3D (with occlusion constraint)   7.03       18.0        N/A        N/A

Table 1. Quantitative results using the intersection method of evaluation. Errors are computed only for pixels where the LS3D method returns a range estimate. p-values for matched-sample t-tests of the LS3D method (with occlusion constraint) against competing algorithms are reported.

Methods                            RMSE (m)   RMSPE (%)   p (RMSE)   p (RMSPE)
Make3D [30]                        27.1       65.9        1.72E-19   7.52E-33
Eigen [48]                         13.2       34.2        2.88E-08   4.21E-12
FCRN (Make3D) [33]                 15.8       38.1        1.60E-12   5.76E-16
FCRN (NYU) [33]                    12.3       30.7        2.01E-09   1.64E-12
DORN [36]                          13.0       31.8        5.42E-12   3.60E-13
PlaneNet [49]                      10.6       26.5        1.98E-05   1.49E-04
LS3D (with occlusion constraint)   8.11       19.7        N/A        N/A

Table 2. Quantitative results using the diffusion method of evaluation. Errors are computed for all pixels projecting from the 3DBM model. p-values for matched-sample t-tests of the LS3D method (with occlusion constraint) against competing deep network algorithms are reported.

for which mean error is on the order of 0.5-0.64 m over the entire image, but is better than Make3D (1.21 m). We believe the higher performance of deep networks on NYU2 is due to deviation from Manhattan constraints and to the fact that DNNs overfit to the constant camera pose and the similarity of environments in the dataset.

Figure 5(a) shows best, median and worst case performance of our LS3D algorithm on our dataset. The worst case does not actually look that bad qualitatively, but the algorithm underestimates the depth of a small part of the building in the lower right corner of the image, and this leads to a large quantitative error.

LS3D has three main free parameters: 1) the minimum length of a line segment, 2) the maximum endpoint separation of connected orthogonal segments, and 3) the maximum gap between grouped collinear line segments. Both 1) and 2) are currently set to 100 pixels, and 3) is set to 300 pixels. The dependence of performance on the exact value of these parameters is shown in Fig. 5(b). This analysis shows that these threshold values are reasonable, and that variation of up to ±50% in threshold values leads to at most a 10% reduction in coverage and a 7% increase in error.

Our current MATLAB implementation of LS3D takes about 21 seconds to produce a 3D model from a 640 × 480 image. It could be made much faster by optimizing in C++.


Fig. 5. (a) Best, median and worst case LS3D performance on the 3DBM dataset (RGB image, LS3D reconstruction, ground truth). (b) LS3D parameter sensitivity analysis: RMSE (m) vs. coverage (%) as the maximum endpoint separation, minimum line segment length and collinearity threshold are varied.

6 Conclusion & Future Work

We have developed a novel algorithm called LS3D for single-view 3D Manhattan reconstruction. This geometry-driven method uses no appearance cues or machine learning, yet outperforms state-of-the-art deep learning methods on the problem of 3D Manhattan building reconstruction. While this algorithm is not designed to reconstruct general 3D environments, we believe it will be useful for architectural applications. Future work will explore a mixture-of-experts approach that fuses the LS3D approach to reconstructing Manhattan portions of the environment with deep learning approaches for estimating the 3D layout of non-Manhattan structure.

7 Acknowledgements

This research was supported by the NSERC Discovery program, the NSERC CREATE Training Program in Data Analytics & Visualization, the Ontario Research Fund, and the York University VISTA and Research Chair programs.


References

1. Coughlan, J.M., Yuille, A.L.: Manhattan World: Orientation and outlier detection by Bayesian inference. Neural Computation 15 (2003) 1063–1088
2. Kubovy, M., Wagemans, J.: Grouping by proximity and multistability in dot lattices: A quantitative Gestalt theory. Psychological Science 6 (1995) 225–234
3. Kubovy, M., Holcombe, A.O., Wagemans, J.: On the lawfulness of grouping by proximity. Cognitive Psychology 35 (1998) 71–98
4. Elder, J.H., Goldberg, R.M.: Ecological statistics of Gestalt laws for the perceptual organization of contours. Journal of Vision 2 (2002) 324–353
5. Wagemans, J., Elder, J.H., Kubovy, M., Palmer, S.E., Peterson, M.A., Singh, M., von der Heydt, R.: A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychological Bulletin 138 (2012) 1172
6. Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: Image understanding using qualitative geometry and mechanics. In: ECCV. (2010)
7. Roberts, L.G.: Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology (1963)
8. Guzman, A.: Computer recognition of three-dimensional objects in a visual scene. PhD thesis, MIT (1968)
9. Waltz, D.L.: Generating semantic descriptions from drawings of scenes with shadows. Technical Report AITR-271, MIT (1972)
10. Kanade, T.: A theory of Origami world. Artificial Intelligence 13 (1980) 279–311
11. Sugihara, K.: Machine interpretation of line drawings. Volume 1. MIT Press, Cambridge (1986)
12. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. International Journal of Computer Vision 75 (2007) 151–172
13. Barinova, O., Konushin, V., Yakubenko, A., Lee, K., Lim, H., Konushin, A.: Fast automatic single-view 3-D reconstruction of urban scenes. In: ECCV. (2008)
14. Haines, O., Calway, A.: Recognising planes in a single image. IEEE TPAMI 37 (2015) 1849–1861
15. Coughlan, J.M., Yuille, A.L.: Manhattan World: Compass direction from a single image by Bayesian inference. In: CVPR. Volume 2. (1999) 941–947
16. Denis, P., Elder, J., Estrada, F.J.: Efficient edge-based methods for estimating Manhattan frames in urban imagery. In: ECCV, Springer (2008) 197–210
17. Tal, R., Elder, J.H.: An accurate method for line detection and Manhattan frame estimation. In: ACCV, Springer (2012) 580–593
18. Delage, E., Lee, H., Ng, A.Y.: Automatic single-image 3D reconstructions of indoor Manhattan world scenes. In: Robotics Research. Springer (2007) 305–321
19. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV. (2009) 1849–1856
20. Gupta, A., Hebert, M., Kanade, T., Blei, D.M.: Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In: NIPS. Curran Associates, Inc. (2010)
21. Ramalingam, S., Pillai, J.K., Jain, A., Taguchi, Y.: Manhattan junction catalogue for spatial reasoning of indoor scenes. In: CVPR. (2013) 3065–3072
22. Mallya, A., Lazebnik, S.: Learning informative edge maps for indoor scene layout prediction. In: ICCV. (2015) 936–944
23. Pero, L.D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E., Barnard, K.: Bayesian geometric modeling of indoor scenes. In: CVPR. (2012) 2719–2726
24. Felzenszwalb, P.F., Veksler, O.: Tiered scene labeling with dynamic programming. In: CVPR. (2010) 3097–3104
25. Schwing, A.G., Urtasun, R.: Efficient exact inference for 3D indoor scene understanding. In: ECCV. (2012) 299–313
26. Yang, H., Zhang, H.: Efficient 3D room shape recovery from a single panorama. In: CVPR. (2016) 5422–5430
27. Dasgupta, S., Fang, K., Chen, K., Savarese, S.: DeLay: Robust spatial layout estimation for cluttered indoor scenes. In: CVPR. (2016) 616–624
28. Ramalingam, S., Brand, M.: Lifting 3D Manhattan lines from a single image. In: ICCV. (2013) 497–504
29. Kushal, A., Seitz, S.M.: Single view reconstruction of piecewise swept surfaces. In: 3DV. (2013) 239–246
30. Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE TPAMI 31 (2009) 824–840
31. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: CVPR. (2015) 2650–2658
32. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE TPAMI 38 (2016) 2024–2039
33. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3DV. (2016) 239–248
34. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: CVPR. (2015) 5162–5170
35. Zhuo, W., Salzmann, M., He, X., Liu, M.: 3D box proposals from a single monocular image of an indoor scene. In: AAAI. (2018)
36. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: CVPR. (2018)
37. Xu, D., Ouyang, W., Wang, X., Sebe, N.: PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: CVPR. (2018)
38. Qi, X., Liao, R., Liu, Z., Urtasun, R., Jia, J.: GeoNet: Geometric neural network for joint depth and surface normal estimation. In: CVPR. (2018)
39. Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: CVPR. (2018)
40. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (2013)
41. Garg, R., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: ECCV, Springer (2016) 740–756
42. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR. (2017)
43. Izadinia, H., Shan, Q., Seitz, S.M.: IM2CAD. In: CVPR, IEEE (2017) 2422–2431
44. Almazan, E.J., Tal, R., Qian, Y., Elder, J.H.: MCMLSD: A dynamic programming approach to line segment detection. In: CVPR. (2017)
45. Lee, D., Hebert, M., Kanade, T.: Geometric reasoning for single image structure recovery. In: CVPR, IEEE (2009) 2136–2143
46. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5 (1957) 32–38
47. Cignoni, P., Callieri, M., Corsini, M., Dellepiane, M., Ganovelli, F., Ranzuglia, G.: MeshLab: an open-source mesh processing tool. In: Eurographics Italian Chapter Conference. (2008)
48. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NIPS. (2014) 2366–2374
49. Liu, C., Yang, J., Ceylan, D., Yumer, E., Furukawa, Y.: PlaneNet: Piece-wise planar reconstruction from a single RGB image. In: CVPR. (2018) 2579–2588