Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus
ICCV 2011 Workshop on 3D Representation and Recognition
Courant Institute
Overview
Indoor Scene Recognition using the Kinect
- Introduce new Indoor Scene Depth Dataset
- Describe CRF-based model
  - Explore the use of RGB/depth cues
Motivation
- Indoor scene recognition is hard
  - Far less texture than outdoor scenes
  - More geometric structure
- The Kinect gives us a depth map (and RGB)
  - Direct access to shape and geometry information
Overview
Indoor Scene Recognition using the Kinect
- Introduce new Indoor Scene Depth Dataset
- Describe CRF-based model
  - Explore the use of RGB/depth cues
Capturing our Dataset
Statistics of the Dataset

| Scene Type  | Scenes | Frames  | Labeled Frames* |
|-------------|-------:|--------:|----------------:|
| Bathroom    |      6 |   5,588 |              76 |
| Bedroom     |     17 |  22,764 |             480 |
| Bookstore   |      3 |  27,173 |             784 |
| Cafe        |      1 |   1,933 |              48 |
| Kitchen     |     10 |  12,643 |             285 |
| Living Room |     13 |  19,262 |             355 |
| Office      |     14 |  19,254 |             319 |
| Total       |     64 | 108,617 |           2,347 |

\* Labels obtained via LabelMe
Dataset Examples
Living Room
RGB Raw Depth Labels
Dataset Examples
Living Room
RGB Depth* Labels
* Bilateral Filtering used to clean up raw depth image
Dataset Examples
Bathroom
RGB Depth Labels
Dataset Examples
Bedroom
RGB Depth Labels
Existing Depth Datasets
[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.
RGB-D Dataset [1]
Stanford Make3d [2]
Existing Depth Datasets
[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.
Point Cloud Data [1] B3DO [2]
Dataset Freely Available
http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Overview
Indoor Scene Recognition using the Kinect
- Introduce new Indoor Scene Depth Dataset
- Describe CRF-based model
  - Explore the use of RGB/depth cues
Segmentation using a CRF Model

Cost(labels) = Σ_{i ∈ pixels} Local Terms(label i) + Σ_{i,j ∈ pairs of pixels} Spatial Smoothness(label i, label j)

- Standard CRF formulation
- Optimized via graph cuts
- Discrete label set (~12 classes)
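As a rough illustration of this cost, here is a toy Python sketch that evaluates the energy of one candidate labeling with a Potts smoothness term. This is not the authors' implementation (in practice the cost is minimized with graph cuts, not evaluated exhaustively), and all names and values are illustrative:

```python
import numpy as np

def crf_cost(labels, unary, potts_weight=1.0):
    """Total CRF energy: sum of local (unary) terms plus a Potts
    smoothness penalty over 4-connected pixel pairs.

    labels: (H, W) int array of class assignments
    unary:  (H, W, C) array; unary[i, j, c] = cost of class c at pixel (i, j)
    """
    h, w = labels.shape
    # Local terms: cost of the chosen label at every pixel.
    local = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Potts smoothness: a constant penalty wherever adjacent labels disagree.
    smooth = (labels[:, 1:] != labels[:, :-1]).sum() \
           + (labels[1:, :] != labels[:-1, :]).sum()
    return local + potts_weight * smooth

labels = np.zeros((2, 2), dtype=int)
labels[0, 1] = 1                   # one pixel disagrees with its neighbors
unary = np.ones((2, 2, 3)) * 0.5   # uniform unary cost of 0.5 per pixel
print(crf_cost(labels, unary))     # 4 * 0.5 + 2 disagreeing pairs = 4.0
```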
Model

Cost(labels) = Σ_{i ∈ pixels} Local Terms(label i) + Σ_{i,j ∈ pairs of pixels} Spatial Smoothness(label i, label j)

Local Terms(label i) = Appearance(label i | descriptor i) · Location(i)
Appearance Term
Appearance(label i | descriptor i)
Several descriptor types to choose from:
- RGB-SIFT
- Depth-SIFT
- Depth-SPIN
- RGBD-SIFT
- RGB-SIFT/D-SPIN
Descriptor Type: RGB-SIFT

128-D SIFT descriptors extracted over a discrete grid on the RGB image from the Kinect.
Descriptor Type: Depth-SIFT

128-D SIFT descriptors extracted over a discrete grid on the Kinect depth image (with linear scaling).
Descriptor Type: Depth-SPIN

50-D spin-image descriptors (histograms over radius and depth) extracted over a discrete grid on the Kinect depth image (with linear scaling).

A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433–449, 1999.
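A minimal sketch of the spin-image idea from Johnson and Hebert: neighbors of a point are histogrammed by their radius from the point's normal axis and their signed depth along it. The 10 radius × 5 depth binning (to reach 50-D) and the support radius are assumptions for illustration; the slides do not specify them:

```python
import numpy as np

def spin_image(points, center, normal, n_radius=10, n_depth=5, support=0.5):
    """Spin-image descriptor: histogram the neighbors of `center` by
    (radius from the normal axis, signed depth along the normal).

    points: (N, 3) 3-D points; center: (3,); normal: unit-length (3,).
    n_radius * n_depth gives the descriptor length (10 * 5 = 50-D).
    """
    rel = points - center
    beta = rel @ normal                                           # depth along normal
    alpha = np.linalg.norm(rel - np.outer(beta, normal), axis=1)  # radius from axis
    hist, _, _ = np.histogram2d(
        alpha, beta,
        bins=(n_radius, n_depth),
        range=((0, support), (-support, support)),  # neighbors outside are dropped
    )
    return hist.ravel()  # flatten to a 50-D vector

pts = np.random.default_rng(0).normal(size=(200, 3))
desc = spin_image(pts, center=np.zeros(3), normal=np.array([0.0, 0.0, 1.0]))
print(desc.shape)  # (50,)
```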
Descriptor Type: RGBD-SIFT

256-D: SIFT computed on the RGB image from the Kinect, concatenated with SIFT computed on the linearly scaled depth image.
Descriptor Type: RGB-SIFT/D-SPIN

178-D: 128-D RGB-SIFT from the Kinect RGB image, concatenated with 50-D Depth-SPIN from the linearly scaled depth image.
Appearance Model

Appearance(label i | descriptor i) is modeled by a neural network with a single hidden layer, applied to the descriptor at each location.
Appearance Model
Appearance(label i | descriptor i)

Network architecture:
- Input: 128/178/256-D descriptor at each location
- Hidden layer: 1000-D
- Output: softmax layer over 13 classes
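A sketch of such a network's forward pass in NumPy. The tanh activation and the random weights are placeholders (the slides do not specify the activation, and this is of course not the authors' trained model):

```python
import numpy as np

def appearance_net(descriptor, W1, b1, W2, b2):
    """Single-hidden-layer network mapping a descriptor to a distribution
    over classes via a softmax output layer. Sizes follow the slides:
    128/178/256-D input, 1000-D hidden, 13 classes."""
    h = np.tanh(descriptor @ W1 + b1)      # hidden layer (activation assumed)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, n_cls = 128, 1000, 13
p = appearance_net(rng.normal(size=d_in),
                   rng.normal(size=(d_in, d_hid)) * 0.01, np.zeros(d_hid),
                   rng.normal(size=(d_hid, n_cls)) * 0.01, np.zeros(n_cls))
print(p.shape)  # (13,) -- a probability distribution over the 13 classes
```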
Appearance Model
The softmax output at each location is a probability distribution over the 13 classes: Appearance(label i | descriptor i) is interpreted as p(label | descriptor).
Appearance Model
The network is trained with backpropagation.
Model

Cost(labels) = Σ_{i ∈ pixels} Local Terms(label i) + Σ_{i,j ∈ pairs of pixels} Spatial Smoothness(label i, label j)

Local Terms(label i) = Appearance(label i | descriptor i) · Location(i)

The Location(i) term comes in two variants: 2D priors and 3D priors.
Location Priors: 2D
- 2D priors are histograms of P(class, location)
- Smoothed to avoid image-specific artifacts
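One way such priors could be built, sketched in Python. The coarse grid resolution and the additive smoothing constant are illustrative assumptions (the slides only say the histograms are smoothed, not how):

```python
import numpy as np

def location_priors_2d(label_maps, n_classes, grid_hw=(24, 32), eps=1e-3):
    """Estimate P(class | normalized image location): count class labels on
    a coarse spatial grid over the training label maps, then apply additive
    smoothing so no cell is dominated by a single training image."""
    gh, gw = grid_hw
    counts = np.zeros((n_classes, gh, gw))
    for lm in label_maps:
        h, w = lm.shape
        rows = np.arange(h) * gh // h        # map each pixel row to a grid row
        cols = np.arange(w) * gw // w        # map each pixel col to a grid col
        grid_r = np.broadcast_to(rows[:, None], (h, w))
        grid_c = np.broadcast_to(cols[None, :], (h, w))
        np.add.at(counts, (lm.ravel(), grid_r.ravel(), grid_c.ravel()), 1)
    counts += eps                            # additive smoothing
    return counts / counts.sum(axis=0, keepdims=True)

lm = np.zeros((48, 64), dtype=int)
lm[24:, :] = 1                               # class 1 fills the lower half
priors = location_priors_2d([lm], n_classes=2)
print(priors.shape)  # (2, 24, 32)
```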
Motivation: 3D Location Priors
- 2D priors don't capture 3D geometry
- 3D priors can be built from depth data
- Rooms are of different shapes and sizes; how do we align them?
Motivation: 3D Location Priors
- To align rooms, we'll use a normalized cylindrical coordinate system, based on the band of maximum depths along each vertical scanline
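A simplified per-scanline sketch of this normalization: each column's depths are divided by the maximum depth in that column, so every pixel gets a relative depth in [0, 1]. This is only an approximation of the paper's normalized cylindrical coordinates:

```python
import numpy as np

def relative_depth(depth):
    """Normalize depth into [0, 1] per vertical scanline: divide each
    column by the maximum depth in that column (a rough stand-in for the
    'band of maximum depths' used to align rooms of different sizes)."""
    col_max = depth.max(axis=0, keepdims=True)
    return depth / np.maximum(col_max, 1e-6)  # guard against empty columns

depth = np.array([[1.0, 2.0],
                  [2.0, 4.0]])
print(relative_depth(depth))  # [[0.5, 0.5], [1.0, 1.0]]
```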
Relative Depth Distributions

Density plots of relative depth (from 0 to 1) for four classes: Table, Television, Bed, and Wall.
Location Priors: 3D
Model

Cost(labels) = Σ_{i ∈ pixels} Local Terms(label i) + Σ_{i,j ∈ pairs of pixels} Spatial Smoothness(label i, label j)

Spatial Smoothness(label i, label j): a penalty for adjacent labels disagreeing (standard Potts model)
Spatial Modulation of Smoothness
- None
- RGB edges
- Depth edges
- RGB + depth edges
- Superpixel edges
- Superpixel + RGB edges
- Superpixel + depth edges
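For illustration, an edge-modulated smoothness weight might look like the following sketch: the Potts penalty between neighbors is reduced across strong RGB or depth edges. The exponential falloff and the beta values are assumptions, not the paper's exact modulation:

```python
import numpy as np

def pairwise_weights(rgb, depth, beta_rgb=10.0, beta_depth=2.0):
    """Smoothness weight between each horizontal neighbor pair: the
    penalty for disagreeing labels falls off exponentially with RGB
    color difference and with depth discontinuity."""
    d_rgb = np.linalg.norm(np.diff(rgb.astype(float), axis=1), axis=-1)
    d_depth = np.abs(np.diff(depth.astype(float), axis=1))
    return np.exp(-beta_rgb * d_rgb) * np.exp(-beta_depth * d_depth)

rgb = np.zeros((2, 3, 3)); rgb[:, 2, :] = 1.0   # vertical RGB edge at col 2
depth = np.zeros((2, 3)); depth[:, 2] = 0.5     # coincident depth edge
w = pairwise_weights(rgb, depth)
print(w.shape)  # (2, 2): one weight per horizontal neighbor pair per row
```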
Experimental Setup
- 60% train (~1,408 images), 40% test (~939 images)
- 10-fold cross-validation
- Images of the same scene never appear in both train and test
- Performance criterion: pixel-level classification accuracy (mean diagonal of the confusion matrix)
- 12 most common classes, plus 1 background class (formed from the rest)
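The "mean diagonal of the confusion matrix" criterion can be computed like so (a standard per-class average; the toy labels below are illustrative):

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, n_classes):
    """Build the confusion matrix over all pixels and average its
    row-normalized diagonal, i.e. the mean per-class recall."""
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true.ravel(), y_pred.ravel()), 1)
    row_sums = np.maximum(cm.sum(axis=1), 1)   # guard classes absent from y_true
    return (np.diag(cm) / row_sums).mean()

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
print(mean_class_accuracy(y_true, y_pred, 2))  # (0.5 + 1.0) / 2 = 0.75
```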
Evaluating Descriptors
Bar chart: pixel-level accuracy (percent, axis 30 to 50) for the 2D descriptor (RGB-SIFT) and the 3D descriptors (Depth-SIFT, Depth-SPIN, RGBD-SIFT, RGB-SIFT/D-SPIN), with unary-only and full-CRF results for each.
Evaluating Location Priors
Bar chart: pixel-level accuracy (percent, axis 30 to 55) for RGB-SIFT, RGB-SIFT + 2D priors, RGBD-SIFT, RGBD-SIFT + 2D priors, RGBD-SIFT + 3D priors, and RGBD-SIFT + 3D priors (abs), with unary-only and full-CRF results for each.
Conclusion
- The Kinect depth signal helps scene parsing
- Still a long way from great performance
- Shown: standard approaches on RGB-D data
- Lots of potential for more sophisticated methods
- No complicated geometric reasoning
- http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Preprocessing the Data
[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011
We use open-source calibration software [1] to infer:
- Parameters of the RGB and depth cameras
- The homography between the cameras
Preprocessing the data
- A bilateral filter is used to diffuse depth across regions of similar RGB intensity
- A naïve GPU implementation runs in ~100 ms
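A CPU sketch of this cross-bilateral idea (the paper's version is a GPU implementation; the sigma values and the wrap-around border handling from `np.roll` are simplifications of a real implementation):

```python
import numpy as np

def cross_bilateral_depth(depth, intensity, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Cross-bilateral filter: each depth value is replaced by a weighted
    average of its neighborhood, with weights falling off both with spatial
    distance and with difference in the guiding RGB intensity, so depth is
    diffused only within regions of similar appearance."""
    out = np.zeros_like(depth, dtype=float)
    norm = np.zeros_like(depth, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted_d = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
            shifted_i = np.roll(np.roll(intensity, dy, axis=0), dx, axis=1)
            w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))  # spatial
            w_r = np.exp(-((intensity - shifted_i) ** 2) / (2 * sigma_r ** 2))  # range
            out += w_s * w_r * shifted_d
            norm += w_s * w_r
    return out / norm

depth = np.random.default_rng(1).uniform(1, 3, size=(8, 8))
intensity = np.ones((8, 8)) * 0.5    # flat guide image -> plain Gaussian blur
print(cross_bilateral_depth(depth, intensity).shape)  # (8, 8)
```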
Motivation
Results from spatial pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% achieved by [1] on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.