Rent3D: Floor-Plan Priors for Monocular Layout Estimation

Kwokfung Tang

Given a floorplan and an image...

Predict:

● Room layout (3D box)
● The room and wall the camera is facing (localization)

Assumptions

● Manhattan world (there exist three dominant, mutually orthogonal vanishing points)
● The vanishing points can be reliably detected

Dataset

● Crawled from a London rental site
● 215 apartments that had a floor plan of sufficient resolution and at least one photo
● 1312 rooms, 6628 walls, 1923 doors, and 1268 windows
● 1570 images (1259 indoor; the others show the apartment building or public facilities inside the building)
● Roughly 10% of the rooms aren’t rectangular
● Ground truth annotated internally (room layout, and which room/wall the camera is facing in each image)
● Of the 1259 indoor photos, 71 do not have ground-truth room alignment and 11 do not have ground-truth wall alignment

Estimating layout

● Same as fitting a 3D box to the room; this can be hard because the edges may be occluded by clutter
● Plenty of prior work has been done on this
● Since the room is assumed to be a cuboid, once the locations of the vanishing points are known it is sufficient to specify the directions of the rays from vp_0 and vp_1 (4 degrees of freedom)

Finding vanishing points

Prior work by Schwing and Urtasun

y = (y1, y2, y3, y4) ∈ Y, where Y is the set of all possible scene layouts

yi ∈ [yi_min, yi_max], possible values of yi (discretized)

Inference:

Training: Structured SVM, with pixel-wise classification error as the loss

Prior work by Schwing and Urtasun

F = {left-wall, right-wall, ceiling, floor, front-wall}

Features:

Orientation map (OM) - Regions are classified as surfaces by orientation

Geometric context (GC) - For each pixel, the probability of belonging to {left-wall, right-wall, ceiling, floor, front-wall, object}

OM/GC

Columns 2, 5: Orientation map (OM) - Regions are classified as surfaces in F

Columns 3, 6: Geometric context (GC) - For each pixel, the probability of belonging to {left-wall, right-wall, ceiling, floor, front-wall, object}

Branch and bound

● Searching for the optimal y by brute force is expensive
● Much faster method: branch and bound
● Assume: f(Y) gives an upper bound on the scores of all members of Y
● Start with the set of all possible layouts Y, put (f(Y), Y) into a priority queue (PQ)
● At each iteration:
○ Take the best candidate Y_hat from the PQ, stop if |Y_hat| = 1 (optimum)
○ Otherwise split Y_hat into 2 disjoint sets Y1, Y2
○ Calculate the new bounds, put both sets back into the PQ
● Authors didn’t mention how Y_hat is split
● Does not explore regions which are not promising, allowing for efficient exact inference
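The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides say the authors do not describe how Y_hat is split, so the rule used here (halve the widest interval) is a guess, and `score`, `toy_bound`, and `TARGET` are hypothetical stand-ins for the learned layout score and its bound.

```python
import heapq
from itertools import count

def branch_and_bound(box, upper_bound, split):
    """Exact maximisation over a set of candidates using an upper bound f.

    box: tuple of (lo, hi) integer intervals, one per layout parameter.
    upper_bound: f(box) >= score of every member of box, and equal to the
        score when the box holds a single candidate.
    split: divides a box into two disjoint boxes that cover it.
    """
    tie = count()  # tie-breaker so heapq never has to compare boxes
    # heapq is a min-heap, so store negated bounds to pop the best first.
    queue = [(-upper_bound(box), next(tie), box)]
    while queue:
        neg_bound, _, best = heapq.heappop(queue)
        if all(lo == hi for lo, hi in best):  # |Y_hat| = 1: optimum found
            return tuple(lo for lo, _ in best), -neg_bound
        for child in split(best):
            heapq.heappush(queue, (-upper_bound(child), next(tie), child))
    raise ValueError("empty search space")

def split_widest(box):
    # Assumed splitting rule: halve the widest interval at its midpoint.
    i = max(range(len(box)), key=lambda k: box[k][1] - box[k][0])
    lo, hi = box[i]
    mid = (lo + hi) // 2
    left = box[:i] + ((lo, mid),) + box[i + 1:]
    right = box[:i] + ((mid + 1, hi),) + box[i + 1:]
    return left, right

# Toy score with a known maximiser, standing in for the layout score.
TARGET = (7, 2, 11, 5)

def score(y):
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, TARGET))

def toy_bound(box):
    # Per coordinate, the best value is reached at the point of the
    # interval closest to the target, so the sum bounds every member.
    total = 0
    for (lo, hi), t in zip(box, TARGET):
        nearest = min(max(t, lo), hi)
        total -= (nearest - t) ** 2
    return total

best_y, best_score = branch_and_bound(((0, 15),) * 4, toy_bound, split_widest)
```

Because the bound is tight on single-candidate sets, the first singleton popped from the queue is guaranteed to be the global optimum, which is what makes the search exact.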

How to define f?

● The features are all positive (counts/probabilities), so we can split the equation according to positive/negative weights
● Find the maximal positive and minimal negative possible contribution for each face individually (defined by the end-points of [yi_min, yi_max])
● Summing over the faces gives us f
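A minimal sketch of the positive/negative split, assuming the score is a weighted sum of non-negative features and that the smallest and largest value of each feature over the candidate set are already known. The function name and the numbers are hypothetical; how the feature ranges are obtained from the interval end-points is not shown.

```python
def upper_bound(weights, feat_min, feat_max):
    """Upper bound on the weighted feature sum over a set of layouts.

    feat_min[k] and feat_max[k] are the smallest and largest values that
    feature k can take over the set; all features are assumed non-negative.
    """
    bound = 0.0
    for w, lo, hi in zip(weights, feat_min, feat_max):
        # A positive weight contributes most at the largest feature value;
        # a negative weight costs least at the smallest feature value.
        bound += w * hi if w >= 0 else w * lo
    return bound

# Hypothetical numbers for one face: two features, one weight of each sign.
weights = [2.0, -1.5]
feat_min = [3.0, 1.0]
feat_max = [10.0, 4.0]
f_face = upper_bound(weights, feat_min, feat_max)  # 2.0 * 10.0 - 1.5 * 1.0
```

Applying the same computation to each face and adding the results gives a bound for the whole layout, since a sum of per-face upper bounds is an upper bound on the sum.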

Reduced parametrization

● Since the floor plans provide the aspect ratios of the walls, in this case we can parametrize the layout using just 3 parameters: (p1, p2, p3)

● Reduces search space and speeds up inference by an order of magnitude

Problem formulation

Where:

r ∈ {1, …, R} represents the room number
c_r ∈ {1, …, C_r} represents the wall in room r which the camera is facing
y = (y1, y2, y3) is the layout parametrization within the room

Inference strategy: Exhaustive branch and bound

We can do this because the number of rooms and walls is small and y has only 3 dimensions
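The exhaustive part of this strategy amounts to a double loop over rooms and walls, with a layout search run inside each iteration. The sketch below assumes that a higher score is better; `localize`, `toy_solver`, and the room data are hypothetical names and values used for illustration only.

```python
def localize(rooms, solve_layout):
    """Try every (room, wall) pair and keep the best-scoring layout.

    rooms: dict mapping a room id to the list of wall ids in that room.
    solve_layout: callable (room, wall) -> (layout, score); this is where
        a branch-and-bound search over the 3 layout parameters would run.
    """
    best = None
    for room, walls in rooms.items():
        for wall in walls:
            layout, score = solve_layout(room, wall)
            if best is None or score > best[3]:
                best = (room, wall, layout, score)
    return best

# Hypothetical floor plan and scores, for illustration only.
rooms = {"bedroom": [0, 1, 2, 3], "kitchen": [0, 1, 2, 3]}
toy_scores = {("bedroom", 2): 0.9, ("kitchen", 1): 0.7}

def toy_solver(room, wall):
    # Stand-in for the per-wall layout search; returns a dummy layout.
    return (0, 0, 0), toy_scores.get((room, wall), 0.1)

result = localize(rooms, toy_solver)
```

With a handful of rooms and four or so walls per room, the outer loops add only a small constant factor on top of the cost of each layout search.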

Layout energy and bounds

Features and bounds exactly follow Schwing’s previous work

Window energy and bounds

● Use the cross-ratio projective invariant provided by the floor plan to calculate vertical window rays
● Predict windows in the image by training a pixel-level classifier
● Features defined by:
○ fraction of the predicted window pixels falling within the window rays for each face
○ fraction of window pixels falling outside the window area
● Bounds: the details are not clear, but the authors say they can be computed efficiently
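The cross-ratio is useful here because it is preserved by perspective projection: four collinear points measured on the floor plan have the same cross-ratio as their images. The sketch below checks this numerically for points on a line. The projective map `project` and its coefficients are arbitrary assumptions chosen for the demonstration; this is not the procedure the paper uses to place the window rays.

```python
def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four collinear points given as 1D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def project(x, p=2.0, q=1.0, r=0.5, s=3.0):
    # An arbitrary 1D projective map x -> (p*x + q) / (r*x + s),
    # with p*s - q*r != 0 so that it is invertible.
    return (p * x + q) / (r * x + s)

# Four points along a wall: wall start, window start, window end, wall end.
wall = [0.0, 1.0, 3.0, 4.0]
image = [project(x) for x in wall]

cr_plan = cross_ratio(*wall)
cr_image = cross_ratio(*image)
```

Since the two values agree, knowing three of the four image points together with the floor-plan cross-ratio is enough to solve for the fourth.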

Aspect ratios of side walls

● Intuition: The ray between vp_1 and the bottom-left corner of the left wall shouldn’t intersect the image (same for the right wall)
● Additional constraint that trims B&B candidates
● Bounds: 0 as long as there is at least 1 feasible candidate in the set

Scene energy: Exhaustive search since it only depends on room r

Learning

● Structured SVM
● Loss function: Layout pixel-wise classification error
● Training set: 100 apartments (751 photos)
● Validation set: 30 apartments (222 photos)
● Test set: 85 apartments (597 photos)
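As an illustration of the loss named above, the pixel-wise classification error can be read as the fraction of pixels whose predicted face label differs from the ground truth. The sketch assumes label maps stored as nested lists of face names; the function name and the tiny 2x4 example are hypothetical.

```python
def pixelwise_error(pred, truth):
    """Fraction of pixels whose predicted face label differs from the truth."""
    total = 0
    wrong = 0
    for pred_row, true_row in zip(pred, truth):
        for p, t in zip(pred_row, true_row):
            total += 1
            wrong += p != t  # True counts as 1
    return wrong / total

# Hypothetical 2x4 label maps.
pred = [["ceiling", "ceiling", "front", "front"],
        ["floor", "floor", "front", "right"]]
truth = [["ceiling", "front", "front", "front"],
         ["floor", "floor", "floor", "right"]]

loss = pixelwise_error(pred, truth)  # 2 of 8 pixels disagree
```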

Experiments - Scene classifier

● Uses a pretrained Caffe model to extract features for each image (fc7, 4096 features), then trains a multi-class SVM on them
● 5 scene labels: Reception, Bedroom, Kitchen, Bathroom, Outdoor (485, 332, 213, 235, 305 photos respectively)
● 91.4% in the 5-class setting

Experiments - Window segmentation

● Trained a pixel-level classifier for door, window and other
● Metric: Intersection over Union
● Results on test set: 0.4% for door, 51.75% for window, and 95.6% for other
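For reference, per-class Intersection over Union counts the pixels where prediction and ground truth both carry a label and divides by the pixels where either does. The sketch below assumes label maps stored as nested lists; the function name and the example data are hypothetical.

```python
def class_iou(pred, truth, label):
    """Intersection over Union for one class over two label maps."""
    inter = 0
    union = 0
    for pred_row, true_row in zip(pred, truth):
        for p, t in zip(pred_row, true_row):
            in_pred = p == label
            in_true = t == label
            inter += in_pred and in_true
            union += in_pred or in_true
    # IoU is undefined when the class appears in neither map.
    return inter / union if union else float("nan")

# Hypothetical 2x3 label maps.
pred = [["window", "window", "other"],
        ["other", "window", "other"]]
truth = [["window", "other", "other"],
         ["other", "window", "window"]]

window_iou = class_iou(pred, truth, "window")  # 2 shared pixels, 4 in the union
```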

Experiments - Layout estimation

● Assumes we know which wall the camera is facing, so the model imposes the correct aspect ratio on the front wall
● 3 models
○ Aspect ratio only
○ Ground truth windows
○ Trained window classifier

Experiments

Experiments - Localization

● Accuracy defined by predicting the right (room, wall) combination
● +Scene: scene classifier included in the potential equation
● +Room: tells the model which room an image is in

Qualitative results

Qualitative results - Failure case

Takeaways

● Localization is a complex problem: even though the number of room/wall configurations is small, the model doesn’t perform well on this task

● Aspect ratio alone doesn’t provide enough information; humans would need to use details from the images to match them with the floor plan

● Should try training a CNN if enough training data can be obtained; it will probably give better results (possible for layout prediction)