Rent3D: Floor-Plan Priors for Monocular Layout Estimation
Kwokfung Tang
Given a floorplan and an image...
Predict:
● Room layout (3D box)
● The room and wall the camera is facing (localization)
Assumptions
● Manhattan world (there exist three dominant orthogonal vanishing points)
● The vanishing points can be reliably detected
Dataset
● Crawled from a London rental site
● 215 apartments that had a floor plan of sufficient resolution and at least one photo
● 1312 rooms, 6628 walls, 1923 doors, and 1268 windows
● 1570 images (1259 indoor; the others are pictures of the apartment building or of public facilities inside the building)
● Roughly 10% of the rooms aren’t rectangular
● Ground truth annotated internally (room layout, and which room/wall the camera is facing in each image)
● Of the 1259 indoor photos, 71 do not have ground-truth room alignment and 11 do not have ground-truth wall alignment
Estimating layout
● Same as fitting a 3D box to the room; this can be hard because the edges may be occluded by clutter
● Plenty of prior work has been done on this
● Since the room is assumed to be a cuboid, if the locations of the vanishing points are known, it is sufficient to specify the directions of the rays from vp_0 and vp_1 (4 degrees of freedom)
Finding vanishing points
Prior work by Schwing and Urtasun
y = (y1, y2, y3, y4) ∈ Y, the set of all possible scene layouts
yi ∈ [yi_min, yi_max], the possible values of yi (discretized)
Inference:
Training: Structured SVM, with pixel-wise classification error as the loss
Prior work by Schwing and Urtasun
F = {left-wall, right-wall, ceiling, floor, front-wall}
Features:
Orientation map (OM) - Regions are classified as surfaces by orientation
Geometric context (GC) - For each pixel, the probability of belonging to {left-wall, right-wall, ceiling, floor, front-wall, object}
OM/GC
Columns 2, 5: Orientation map (OM) - Regions are classified as surfaces in F
Columns 3, 6: Geometric context (GC) - For each pixel, the probability of belonging to {left-wall, right-wall, ceiling, floor, front-wall, object}
Branch and bound
● Searching for the optimal y by brute force is expensive
● Much faster method: branch and bound
● Assume f(Y) gives an upper bound on the scores of all members of Y
● Start with the set of all possible layouts Y; put (f(Y), Y) into a priority queue (PQ)
● At each iteration:
○ Take the best candidate Y_hat from the PQ; stop if |Y_hat| = 1 (optimum)
○ Otherwise split it into 2 disjoint sets Y1, Y2
○ Calculate the new bounds and put both sets back into the PQ
● The authors didn’t mention how Y_hat is split
● Does not explore regions that are not promising, allowing for efficient exact inference
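The loop above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the split rule (halving the widest interval) is an assumption, since the slides note that the split is not specified, and `toy_bound` with its hidden `TARGET` is a made-up score chosen so that the bound is easy to write down.

```python
import heapq
from itertools import count


def branch_and_bound(box, upper_bound):
    """Best-first branch and bound over a box of discretized parameters.

    box: tuple of inclusive integer ranges (lo, hi), one per parameter.
    upper_bound(box): a value >= the score of every layout in the box,
        and equal to the exact score when the box holds a single layout.
    """
    tie = count()  # tie-breaker so the heap never has to compare boxes
    heap = [(-upper_bound(box), next(tie), box)]
    while True:
        neg_bound, _, cur = heapq.heappop(heap)
        if all(lo == hi for lo, hi in cur):
            # A single layout with the best bound in the queue is optimal.
            return tuple(lo for lo, _ in cur), -neg_bound
        # Assumed split rule: halve the widest interval.
        i = max(range(len(cur)), key=lambda k: cur[k][1] - cur[k][0])
        lo, hi = cur[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = cur[:i] + (part,) + cur[i + 1:]
            heapq.heappush(heap, (-upper_bound(child), next(tie), child))


# Toy score (hypothetical): negative squared distance to a hidden target.
TARGET = (3, 7, 2, 9)


def toy_bound(box):
    # The point of each interval closest to the target gives the best score.
    return -sum((min(max(t, lo), hi) - t) ** 2 for t, (lo, hi) in zip(TARGET, box))


best_y, best_score = branch_and_bound(((0, 15),) * 4, toy_bound)
```

The search returns the target layout without scoring all 16^4 candidates one by one, because boxes whose bound is worse than the current best are never split.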
How to define f?
● The features are all non-negative (counts/probabilities), so we can split the equation according to positive/negative weights
● Find the maximal positive and minimal negative possible contribution for each face individually (defined by the end-points of [yi_min, yi_max])
● Summing over the faces gives us f
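A minimal sketch of the sign-splitting idea, assuming the per-feature minimum and maximum over a set of layouts are already known. In the original work these extremes come from the interval end-points for each face; here `feat_min` and `feat_max` are simply given, and all names are hypothetical.

```python
def upper_bound(weights, feat_min, feat_max):
    """Upper bound on sum_k w_k * phi_k, given feat_min[k] <= phi_k <= feat_max[k]."""
    total = 0.0
    for w, lo, hi in zip(weights, feat_min, feat_max):
        # A positive weight contributes most at the largest feature value,
        # a negative weight at the smallest.
        total += w * (hi if w > 0 else lo)
    return total


# Three candidate layouts with non-negative features (toy numbers).
weights = [2.0, -1.5, 0.5]
candidates = [[1, 4, 0], [3, 2, 1], [2, 3, 2]]
feat_min = [min(col) for col in zip(*candidates)]
feat_max = [max(col) for col in zip(*candidates)]

bound = upper_bound(weights, feat_min, feat_max)
exact = [sum(w * x for w, x in zip(weights, phi)) for phi in candidates]
```

Here the bound is 4.0 while the best exact score is 3.5: the bound is valid but not tight, because no single layout attains every extreme at once.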
Reduced parametrization
● Since the floor plans provide the aspect ratios of the walls, in this case we can parametrize the layout using just 3 parameters: (p1, p2, p3)
● Reduces the search space and speeds up inference by an order of magnitude
Problem formulation
Where:
r ∈ {1, …, R} represents the room number
c_r ∈ {1, …, C_r} represents the wall in room r that the camera is facing
y = (y1, y2, y3) is the layout parametrization within the room
Inference strategy: exhaustive branch and bound
This is feasible because the number of rooms and walls is small and y has only 3 dimensions
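One way to read this strategy, stated as an assumption rather than as the authors' exact procedure: enumerate every (room, wall) pair and, for each pair, find the best layout y with an inner solver such as branch and bound. The `best_layout` callback and the toy scores below are hypothetical.

```python
def localize(rooms, best_layout):
    """Pick the best (room, wall, layout) by enumerating rooms and walls.

    rooms: dict mapping room id r to its number of walls C_r.
    best_layout(r, c): assumed inner solver returning (score, y) for
        wall c of room r, e.g. branch and bound over y.
    """
    best = None
    for r, n_walls in rooms.items():
        for c in range(1, n_walls + 1):
            score, y = best_layout(r, c)
            if best is None or score > best[0]:
                best = (score, r, c, y)
    return best


# Toy inner solver: fixed scores for a two-room apartment.
toy_scores = {(1, 1): 0.2, (1, 2): 0.9, (1, 3): 0.1, (1, 4): 0.4,
              (2, 1): 0.5, (2, 2): 0.3, (2, 3): 0.7, (2, 4): 0.6}


def toy_solver(r, c):
    return toy_scores[(r, c)], (0, 0, 0)  # placeholder layout


result = localize({1: 4, 2: 4}, toy_solver)
```

The outer loop is cheap because an apartment has only a handful of rooms and walls; the cost is dominated by the inner search over y.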
Layout energy and bounds
Features and bounds exactly follow Schwing’s previous work
Window energy and bounds
● Use cross-ratio projective invariant provided by floor plan to calculate vertical window rays
● Predict windows in image by training a pixel-level classifier
● Features defined by:
○ Fraction of the predicted window pixels falling within the window rays for each face
○ Fraction of window pixels falling outside the window area
● Bounds: the details are not clear, but the authors state that they can be computed efficiently
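The cross-ratio is useful here because it is preserved under perspective projection: four collinear points along a wall in the floor plan have the same cross-ratio as their images in the photo. The sketch below only checks this invariance on a 1D toy example; the wall measurements and the projective map coefficients are made up, and it does not reproduce how the window rays are actually computed.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross-ratio of four collinear points given by 1D coordinates."""
    return Fraction((c - a) * (d - b), (c - b) * (d - a))


def project(x, p, q, r, s):
    """1D projective map x -> (p*x + q) / (r*x + s), with p*s - q*r != 0."""
    return Fraction(p * x + q, r * x + s)


# Toy wall: corners at 0 and 4, window edges at 1 and 3 (floor-plan units).
plan_points = [0, 1, 3, 4]
# The same four points after an arbitrary projective map (made-up coefficients).
image_points = [project(x, 2, 1, 1, 3) for x in plan_points]

cr_plan = cross_ratio(*plan_points)
cr_image = cross_ratio(*image_points)
```

Both values equal 9/8 even though the spacing between the projected points is no longer uniform, which is what lets known floor-plan positions constrain where the window edges fall in the image.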
Aspect ratios of side walls
● Intuition: the ray between vp_1 and the bottom-left corner of the left wall shouldn’t intersect the image (same for the right wall)
● Additional constraint that trims B&B candidates
● Bounds: 0 as long as there is at least 1 feasible candidate in the set
Scene energy: Exhaustive search since it only depends on room r
Learning
● Structured SVM
● Loss function: Layout pixel-wise classification error
● Training set: 100 apartments (751 photos)
● Validation set: 30 apartments (222 photos)
● Test set: 85 apartments (597 photos)
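A minimal sketch of the pixel-wise classification error used as the loss, assuming each layout is rendered as a label map with one face label per pixel. The tiny 2x4 maps and the single-letter labels are illustrative only.

```python
def pixelwise_error(pred, truth):
    """Fraction of pixels whose predicted face label differs from the ground truth."""
    total = wrong = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            wrong += p != t
    return wrong / total


# L = left wall, F = front wall, R = right wall, G = floor (toy label maps).
truth_map = [["L", "F", "F", "R"],
             ["G", "G", "G", "G"]]
pred_map = [["L", "L", "F", "R"],
            ["G", "G", "G", "R"]]

loss = pixelwise_error(pred_map, truth_map)
```

Two of the eight pixels disagree, so the loss is 0.25; identical maps give 0.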
Experiments - Scene classifier
● Uses a pretrained Caffe model to extract features for each image (fc7, 4096 features), then trains a multi-class SVM on them
● 5 scene labels: Reception, Bedroom, Kitchen, Bathroom, Outdoor (485, 332, 213, 235, and 305 photos respectively)
● 91.4% accuracy in the 5-class setting
Experiments - Window segmentation
● Trained a pixel-level classifier for door, window, and other
● Metric: Intersection over Union
● Results on test set: 0.4% for door, 51.75% for window, and 95.6% for other
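For reference, per-class Intersection over Union on label maps can be computed as below. This is a generic sketch of the metric, not the evaluation code behind the numbers above; the toy maps and the labels "w" (window) and "o" (other) are made up.

```python
def iou(pred, truth, label):
    """Intersection over Union for one class on two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == label) and (t == label)
            union += (p == label) or (t == label)
    # The class is absent from both maps: IoU is undefined.
    return inter / union if union else float("nan")


truth_seg = [["w", "w", "o"],
             ["o", "o", "o"]]
pred_seg = [["w", "o", "o"],
            ["w", "o", "o"]]

window_iou = iou(pred_seg, truth_seg, "w")
other_iou = iou(pred_seg, truth_seg, "o")
```

The window class scores 1/3 while the dominant "other" class scores 0.6, which mirrors why a rare class such as door can score very low even when the overall pixel accuracy looks fine.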
Experiments - Layout estimation
● Assumes we know which wall the camera is facing, so the model imposes the correct aspect ratio on the front wall
● 3 models
○ Aspect ratio only
○ Ground truth windows
○ Trained window classifier
Experiments - Localization
● Accuracy is defined as predicting the right (room, wall) combination
● +Scene: scene classifier included in the potential equation
● +Room: tells the model which room an image is in
Qualitative results
Qualitative results - Failure case
Takeaways
● Localization is a complex problem: even though the number of room/wall configurations is small, the model doesn’t perform well on this task
● Aspect ratio alone doesn’t provide enough information; humans would need to use details from the images to match them with the floor plan
● Worth trying a CNN if enough training data is available; it will probably give better results (possible for layout prediction)