Transcript of: 3D Object Recognition and Scene Understanding
EECS 442 – Computer Vision
3D Object Recognition and Scene Understanding
Object: Building, 8-10 meters away
Object: Car, ¾ view, 2-3 meters away
Interpreting the visual world
Object: Traffic light
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Joint 3D modeling and semantic reasoning
… Chen & Medioni '92; Debevec et al. '96; Pollefeys et al. '02; Nistér '04; Hartley & Zisserman '00; Levoy et al. '00; Brown & Lowe '04; Schindler et al. '08; Snavely et al. '08; Agarwal et al. '09; etc.
How can we achieve all of this?
… Weber et al. '00; Felzenszwalb & Huttenlocher '00; Leibe & Schiele '04; Kumar & Hebert '04; Fei-Fei & Perona '05; Sivic et al. '05; Shotton et al. '05; Grauman et al. '05; Ullman et al. '02; Fergus et al. '03; Torralba et al. '03; Lazebnik et al. '06; Maji & Malik '07; Vedaldi & Soatto '08; Zhu et al. '08; etc.
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Semantics from range data – disjoint 3D modeling and recognition
… Huber '01; Rusu et al. '08; Brostow et al. '08; Son & Kim '10; Tang et al. '10; Adan et al. '11; etc.
Courtesy of Adan et al., 2011
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Joint 3D modeling and semantic reasoning
Hoiem et al. '06-'10; Gould et al. '09; Hedau et al. '09; Gupta et al. '10; Ladický et al. '10; Bao, Sun, Savarese '10; Sun, Bao, Savarese '10; Bao & Savarese '11
• Semantics from range data – disjoint 3D modeling and recognition
Joint 3D modeling and recognition
• Given the scene layout, objects can be detected more robustly
• Objects and their geometrical attributes provide constraints for estimating the scene layout
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
Viewing sphere
• Detect objects under generic viewpoints
• Estimate object pose
• General: works for any object category
Azimuth, Zenith
3D Object Detectors
3D Object Detectors
• Detect objects under generic viewpoints
• Estimate object pose
• General: works for any object category
3D Object Categorization
Felzenszwalb & Huttenlocher '03; Fei-Fei et al. '04; Leibe et al. '04; Sudderth et al. '05; Torralba et al. '05; Lazebnik et al. '06; Todorovic et al. '06; Bosch et al. '07; Vedaldi & Soatto '08; Kumar & Hebert '04; Sivic et al. '05; Shotton et al. '05; Grauman et al. '05; Leung et al. '99; Weber et al. '00; Ullman et al. '02; Fergus et al. '03; Torralba et al. '03
Single view object categorization
Zhang et al. '95; Schmid & Mohr '96; Schiele & Crowley '96; Lowe '99; Jacobs & Basri '99; Rothganger et al. '04; Edelman et al. '91; Ullman & Basri '91; Rothwell '92; Lindeberg '94; Murase & Nayar '94; Ferrari et al. '05; Brown & Lowe '05; Snavely et al. '06; Yin & Collins '07; Ballard '81; Grimson & Lozano-Pérez '87; Lowe '87
Single 3D object recognition
3D Object Categorization
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
Single view model
Single view model
Mixture of 2D models • Weber et al. ’00 • Schneiderman et al. ’01 • Ullman et al. 02 • Fergus et al. ’03 • Torralba et al. ’03
• Felzenszwalb & Huttenlocher ‘03 • Leibe et al. ’04 • Shotton et al. ‘05 • Grauman et al. ’05
• Savarese et al, ‘06 •Todorovic et al. ’06 • Vedaldi & Soatto ’08 • Zhu et al 08 • Gu & Ren, ‘10
3D Category model
…
…
CONS:
• Single-view models are independent
• Not scalable to large numbers of categories/viewpoints
• Output is just bounding boxes
• Cannot estimate 3D pose or 3D layout
3D models: implicit 3D models, explicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07 …. • Xiang & Savarese ‘12
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08 • Sun et al. ’09 …
Implicit 3D models
…
A sparse set of interest points or object parts is linked across views by implicit 3D transformations (H, F)
…
… 3D Category model
Linking features or parts across views: perspective or affine transformation constraints
x' = H x

Linking features or parts across views: epipolar transformation constraints
l' = F^T x
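To make the two constraints concrete, here is a minimal numpy sketch (not code from the lecture): a homography H transfers a point from one view to the other, and a fundamental matrix F maps a point to its epipolar line in the other view; whether the line is F x or F^T x depends on the convention adopted for F.

```python
import numpy as np

def hom(p):
    """Append a 1 to a 2D point (homogeneous coordinates)."""
    return np.array([p[0], p[1], 1.0])

def transfer_by_homography(H, x):
    """x' = H x : transfer a point from view 1 to view 2 (up to scale)."""
    xp = H @ hom(x)
    return xp[:2] / xp[2]

def epipolar_line(F, x):
    """l' = F^T x with the slide's convention (x^T F x' = 0);
    it would be l' = F x under the convention x'^T F x = 0."""
    return F.T @ hom(x)

# Toy example: identity homography, skew-symmetric (rank-2) matrix as F.
H = np.eye(3)
F = np.array([[0., -1.,  0.],
              [1.,  0., -2.],
              [0.,  2.,  0.]])
x = (120.0, 80.0)
print(transfer_by_homography(H, x))   # -> [120.  80.]
print(epipolar_line(F, x))            # epipolar line coefficients (a, b, c)
```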
…
A sparse set of interest points or object parts is linked across views.
…
… Multi-view model
• Thomas et al. ’06 • Leibe et al. ‘04
Implicit 3D models by ISM representations
Courtesy of Thomas et al. '06
Set of region tracks connecting the model views. Each track is composed of the image regions of a single physical surface patch across the model views in which it is visible.
[Ferrari et al. ’04, ‘06]
Region tracks
Implicit 3D models by ISM representations
Results
Implicit 3D models by ISM representations
• Canonical parts capture view-invariant, diagnostic appearance information
Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Sun, et al, CVPR 2009, ICCV 09
• Parts and their relationships are modeled in a probabilistic fashion
• Parameters are learned so as to maximize detection accuracy
• 2½D structure linking parts via weak geometry
Implicit 3D models by graph-based representations
Parameterization on the viewing sphere
• Model the object as a collection of parts for any viewpoint (T, S) on the viewing sphere
Multi-view generative part-based model

[Graphical model figure: for each image taken from viewpoint (T, S) on the viewing sphere, the observed features Yn (codeword) and Xn (location) are generated from K parts, with per-part parameters for the part proportion prior, the part appearance (A), and the part location/shape (V).]
Multi-view generative part-based model
• Learning: estimate the latent variables and relevant parameters given the observations
• Variational EM can be used [Blei, ICML 2004]
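As a rough illustration of the E-step/M-step structure, here is a simplified sketch only: plain EM for a Gaussian mixture over 2D part locations, not the full variational EM over appearance, location, shape, and viewpoint used by the actual model.

```python
import numpy as np

def em_part_locations(X, K, n_iters=50, seed=0):
    """Toy EM: fit K isotropic Gaussian 'parts' to 2D feature locations X (N, 2).
    Stands in for the E-step/M-step structure mentioned on the slide."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, K, replace=False)]          # part centers
    var = np.full(K, np.var(X))                      # part spreads
    pi = np.full(K, 1.0 / K)                         # part proportion prior
    for _ in range(n_iters):
        # E-step: responsibility of each part for each feature
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)     # (N, K)
        logp = np.log(pi) - 0.5 * d2 / var - np.log(var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update proportions, centers, spreads
        Nk = r.sum(0) + 1e-9
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        var = (r * d2).sum(0) / (2 * Nk) + 1e-6
    return pi, mu, var

# Hypothetical usage: cluster synthetic feature locations into 3 parts.
X = np.random.default_rng(1).normal(size=(200, 2)) * [5, 3] + [100, 60]
print(em_part_locations(X, K=3)[1])   # learned part centers
```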
Incorporating geometrical constraints
• Within-triangle constraints: part positions m_i and m_j in neighboring model views are related by a linear transformation M_ji, i.e. m_i = M_ji m_j
• Encoded as a penalty term in variational EM
Incorporating geometrical constraints
• View morphing constraints, applied to part shape and part center [Seitz & Dyer, SIGGRAPH 96; Xiao & Shah, CVIU '04]
• Encoded as a penalty term in variational EM

S. M. Seitz and C. R. Dyer, Proc. SIGGRAPH 96, 1996, 21-30
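A minimal sketch of the view-morphing idea behind this constraint (Seitz & Dyer): once two views are rectified/prewarped to parallel cameras, part centers (and shapes) in an intermediate view are linear interpolations of the corresponding quantities in the two model views. This illustrates the geometric principle only, not the penalty term used in the model.

```python
import numpy as np

def morph(x_i, x_j, s):
    """View-morphing prediction for an intermediate view at parameter s in [0, 1]
    (assumes the two views have been rectified to parallel cameras):
    x_s = (1 - s) * x_i + s * x_j."""
    return (1.0 - s) * np.asarray(x_i, float) + s * np.asarray(x_j, float)

# Corresponding part centers in two neighboring model views -> halfway prediction.
centers_i = np.array([[100., 60.], [140., 62.]])
centers_j = np.array([[120., 58.], [165., 61.]])
print(morph(centers_i, centers_j, s=0.5))
```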
I^h: {P_1^h, P_2^h, P_3^h, O^h},   I^k: {P_1^k, P_2^k, P_3^k, O^k}
• Defining initial parts and part correspondences
• Sequential RANSAC / J-linkage [Toldo et al. '07]
Initializing the model
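A possible sketch of the sequential-RANSAC flavor of this initialization step (the actual pipeline uses sequential RANSAC / J-linkage [Toldo et al. '07]; the function and parameter names below are illustrative, and OpenCV's RANSAC homography fit stands in for the model fitting):

```python
import numpy as np
import cv2

def sequential_ransac_homographies(pts1, pts2, min_inliers=8, reproj_thresh=3.0):
    """Greedily extract multiple homographies from putative matches
    (pts1[i] <-> pts2[i]): fit one model with RANSAC, remove its inliers, repeat.
    Each recovered model groups matches belonging to one surface/part hypothesis."""
    pts1, pts2 = np.float32(pts1), np.float32(pts2)
    remaining = np.arange(len(pts1))
    models = []
    while len(remaining) >= min_inliers:
        H, mask = cv2.findHomography(pts1[remaining], pts2[remaining],
                                     cv2.RANSAC, reproj_thresh)
        if H is None:
            break
        inlier_mask = mask.ravel().astype(bool)
        inliers = remaining[inlier_mask]
        if len(inliers) < min_inliers:
            break
        models.append((H, inliers))          # one part/surface hypothesis
        remaining = remaining[~inlier_mask]  # continue on the leftover matches
    return models
```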
Semi-supervised
• Class label
• Object bounding box
• No need to observe the same object instance from multiple views
• No pose labels [unlike Sun CVPR 09]
[unlike Savarese & Fei-Fei, 07, 08]
• No part labels
Incremental learning
• Enables unorganized, online collection of training images
• Increases efficiency in learning (no need for large storage space)
Incremental learning
• Evidence from the training image is used to update the model parameters
• Assign the new training image to a triangle of the viewing sphere
• Re-estimate sufficient statistics in an iterative fashion
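One simple way to realize the incremental re-estimation of sufficient statistics is to keep running sums per part, so a new training image updates the model without revisiting old images. A toy sketch (the statistics tracked by the real model are richer):

```python
import numpy as np

class PartStats:
    """Incrementally maintained sufficient statistics (count, sum, uncentered
    second moment) for a part's 2D location within one viewing-sphere triangle.
    New training images update the statistics without storing old images."""
    def __init__(self, dim=2):
        self.n = 0
        self.s1 = np.zeros(dim)          # sum of x
        self.s2 = np.zeros((dim, dim))   # sum of x x^T

    def update(self, x):
        x = np.asarray(x, float)
        self.n += 1
        self.s1 += x
        self.s2 += np.outer(x, x)

    def mean_cov(self):
        mu = self.s1 / self.n
        cov = self.s2 / self.n - np.outer(mu, mu)
        return mu, cov

# Hypothetical usage: part detections from three new training images.
stats = PartStats()
for x in [(10., 5.), (12., 6.), (11., 4.)]:
    stats.update(x)
print(stats.mean_cov())
```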
Evolution of learnt parts
Car
Examples of learnt part-based models
Travel iron
Examples of learnt part-based models
Experimental results
• Object detection from any viewing angle
• Accurate estimation of the object pose
• PASCAL 2006 dataset
• 3D Object Dataset
Car
Travel Iron
[Detection results (ROC curves) on the 3D Object Dataset for the car and bicycle categories: our model vs. Savarese & Fei-Fei ICCV '07 and Sun et al. CVPR '09.]

[Viewpoint classification on the 3D Object Dataset: classification accuracy over the eight viewpoint bins V1–V8 (0º, 45º, 90º, 135º, 180º, 225º, 270º, 315º), our model vs. Savarese ICCV '07.]
Predicting object appearance from novel views
Viewing sphere
[For natural scenes, see Hoiem et al 07; Saxena et al 07]
Thomas et al. '08; Cremer et al. '09
Predicting object appearance from novel views
Affine transformation
Our model
Predicting object appearance from novel views
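A minimal OpenCV sketch of the "affine transformation" baseline shown on this slide: fit an affine map from part correspondences between a model view and the novel view, then warp the model-view appearance. The multi-view part model itself predicts appearance differently; the points and patch size below are hypothetical.

```python
import numpy as np
import cv2

def predict_patch_in_novel_view(model_patch, model_pts, novel_pts, out_size):
    """Warp a part's appearance from a model view into a novel view using an
    affine transform fit to corresponding part locations (affine baseline)."""
    A, _ = cv2.estimateAffine2D(np.float32(model_pts), np.float32(novel_pts))
    return cv2.warpAffine(model_patch, A, out_size)

# Hypothetical example: a 64x64 grayscale patch and three part correspondences.
patch = np.zeros((64, 64), np.uint8)
model_pts = [(10, 10), (50, 12), (30, 55)]
novel_pts = [(14, 8),  (58, 15), (36, 60)]
pred = predict_patch_in_novel_view(patch, model_pts, novel_pts, (64, 64))
```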
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
• Chiu et al. '07 • Hoiem et al. '07 • Yan et al. '07 • Xiang & Savarese '12
Explicit 3D Models
…
…
• Part configuration is modeled as a conditional random field with max-margin parameter estimation
• Enables 6-DOF object pose estimation
• 3D layout estimation of object parts
3D Category model
Hij
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
[3D object dataset, 07]
• Xiang & Savarese, CVPR 12
Explicit 3D Models
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Coherent probabilistic model captures the relationship between objects and supporting planes
• No assumptions on cameras
• Works both indoors and outdoors
3D scene understanding from a single image Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012
• Hoiem et al. 06-10 • Gould et al. 09 • Hedau et al. 09 •Lee et al. ‘09, 10 • Gupta et al, 10, 11 • Tsai et al. ‘11
• Coherent probabilistic model captures the relationship between objects and supporting planes
• No assumptions on cameras
• Works both indoors and outdoors
3D scene understanding from a single image
• Hoiem et al. 06-10 • Gould et al. 09 • Hedau et al. 09 •Gupta et al, 10, 11
Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012
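A hedged sketch of the kind of object/supporting-plane coupling such models exploit (in the spirit of "putting objects in perspective"; the exact formulation in Bao, Sun & Savarese differs): for a camera roughly parallel to the ground, an object's image height and the position of its footprint below the horizon are both tied to its depth, so detections constrain the supporting plane and camera, and vice versa.

```python
def object_image_geometry(f, cam_height, v_horizon, obj_height, depth):
    """For a camera looking roughly parallel to a ground plane:
    an object of real height `obj_height` standing on the ground at distance
    `depth` has image height ~ f * obj_height / depth, and its footprint sits
    ~ f * cam_height / depth pixels below the horizon row v_horizon.
    (Image rows assumed to increase downward.)"""
    img_height = f * obj_height / depth
    v_bottom = v_horizon + f * cam_height / depth
    return img_height, v_bottom

# Hypothetical numbers: f = 800 px, camera 1.5 m above ground, horizon at row 300.
print(object_image_geometry(f=800.0, cam_height=1.5, v_horizon=300.0,
                            obj_height=1.6, depth=10.0))
```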
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Measurements I: points (x, y, scale); objects (x, y, scale, pose); regions (x, y, pose)
• Model parameters: Q = 3D points; O = 3D objects; B = 3D regions; camera parameters K, R, T
Bao & Savarese, CVPR 2011; Bao, Bagra, Savarese, CORP – ICCV 2011; Bao, Bagra, Chao, Savarese, CVPR 2012; Bao, Xiang, Savarese, ECCV 2012
3D scene understanding from multiple images Semantic Structure from Motion (SSFM)
Huber '01; Rusu et al. '08; Brostow et al. '08; Son & Kim '10; Tang et al. '10; Adan et al. '11; etc.
Semantic Structure from Motion (SSFM)
Factor graph with compatibility terms Ψ_CO (cameras–objects), Ψ_CB (cameras–regions), Ψ_CQ (cameras–points)
SSFM: point-level compatibility
Ψ_CQ (cameras–points compatibility)
• Tomasi & Kanade ‘92 • Triggs et al ’99 • Soatto & Perona 99 • Hartley & Zisserman 00 • Dellaert et al. 00
Point re-projection error
SSFM: point-level compatibility
projection
observation
• Pollefeys & V. Gool 02 • Nister 04 • Brown & Lowe 07 • Snavely et al. 08
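Point-level compatibility reduces to the familiar reprojection error. A minimal numpy sketch (the intrinsics and pose values below are made up):

```python
import numpy as np

def project(K, R, t, Q):
    """Project a 3D point Q (world frame) into an image with intrinsics K
    and pose (R, t), returning pixel coordinates."""
    p = K @ (R @ Q + t)
    return p[:2] / p[2]

def point_reprojection_error(K, R, t, Q, x_obs):
    """Point-level compatibility: distance between the projection of the
    3D point Q and its observed image measurement x_obs."""
    return np.linalg.norm(project(K, R, t, np.asarray(Q, float))
                          - np.asarray(x_obs, float))

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
print(point_reprojection_error(K, R, t, Q=[0.1, -0.05, 4.0], x_obs=[340.2, 229.7]))
```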
SSFM: Object-level compatibility
Ψ_CO (cameras–objects compatibility)
Object “re-projection” error
Camera 1 Camera 2
• Agreement with measurements is computed using position, pose and scale
SSFM: Object-level compatibility
Class = “car” scale=1 pose=“back“
SSFM: Object-level compatibility
• Savarese, Fei-Fei, ICCV 07 • Savarese, Fei-Fei, ECCV 08
• Su et al, ICCV 2009 • Sun, et al, CVPR 2009 • Sun et al, ECCV 2010
• Yu & Savarese, CVPR 2012
• A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y
Class = “car” scale=3 pose=“3/4“
• Savarese, Fei-Fei, ICCV 07 • Savarese, Fei-Fei, ECCV 08
• Su et al, ICCV 2009 • Sun, et al, CVPR 2009 • Sun et al, ECCV 2010
• Yu & Savarese, CVPR 2012
SSFM: Object-level compatibility
• A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y
Camera 1 Camera 2
SSFM: Object-level compatibility
Class = “car” scale=1 pose=“back“
Class = “car” scale=1 pose=“3/4“
• Efficiently implemented using a parallel computing architecture
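A sketch of how the object-level term can be evaluated: project the hypothesized 3D object into each camera and read off the 3D detector's confidence at the predicted image location, scale, and pose. The `detector_scores` lookup and the object representation below are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def object_compatibility(cameras, obj, detector_scores):
    """Object-level compatibility (sketch): for each camera (K, R, t), project the
    hypothesized 3D object and accumulate the detector's log-confidence that an
    object of this class appears at the predicted location, scale, and pose.
    `detector_scores[cam_id](x, y, scale, pose_bin)` is a hypothetical lookup
    into confidence maps precomputed per image by a 3D object detector."""
    total = 0.0
    for cam_id, (K, R, t) in cameras.items():
        Xc = R @ obj["center"] + t             # object center in the camera frame
        x, y = (K @ Xc)[:2] / Xc[2]            # predicted image location
        scale = K[0, 0] * obj["size"] / Xc[2]  # predicted image size (pixels)
        total += np.log(detector_scores[cam_id](x, y, scale, obj["pose_bin"]) + 1e-9)
    return total
```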
SSFM: Region-level compatibility
Ψ_CB (cameras–regions compatibility)
Region “re-projection” error
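A sketch of one way a region "re-projection" error could be scored: project the planar 3D region into the image, rasterize it, and compare it with the observed segmentation mask (here via 1 - IoU). The inputs are hypothetical and the paper's exact measure may differ.

```python
import numpy as np
import cv2

def region_reprojection_error(K, R, t, region_polygon_3d, observed_mask):
    """Project a planar 3D region (polygon of 3D vertices) into the image,
    rasterize it, and score the mismatch with the observed region mask as 1 - IoU."""
    pts = []
    for Q in region_polygon_3d:
        p = K @ (R @ np.asarray(Q, float) + t)
        pts.append(p[:2] / p[2])
    poly = np.int32(np.round(pts)).reshape(-1, 1, 2)
    proj = np.zeros(observed_mask.shape, np.uint8)
    cv2.fillPoly(proj, [poly], 1)
    inter = np.logical_and(proj, observed_mask).sum()
    union = np.logical_or(proj, observed_mask).sum()
    return 1.0 - inter / max(union, 1)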
SSFM with interactions
Compatibility terms: Ψ_OB (objects–regions), Ψ_QB (points–regions), Ψ_QO (points–objects), Ψ_CO, Ψ_CB, Ψ_CQ
Bao, Bagra, Chao, Savarese CVPR 2012
SSFM with interactions
Object-Point Interactions:
Bao, Bagra, Chao, Savarese CVPR 2012
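A sketch of the intuition behind the object-point interaction term: 3D points whose image measurements fall inside an object's detection window should also lie near the object's hypothesized 3D location. The penalty below (mean distance outside a rough object radius) is illustrative, not the paper's exact potential.

```python
import numpy as np

def object_point_interaction(obj_center, obj_radius, points_3d, points_in_box_mask):
    """Object-point interaction (sketch): penalize 3D points that project inside
    the object's detection window but lie far from the object's 3D extent."""
    pts = np.asarray(points_3d, float)[np.asarray(points_in_box_mask, bool)]
    if len(pts) == 0:
        return 0.0
    d = np.linalg.norm(pts - np.asarray(obj_center, float), axis=1)
    return float(np.maximum(d - obj_radius, 0.0).mean())

# Hypothetical usage: one point consistent with the object, one far away.
print(object_point_interaction([0., 0., 5.], 1.0,
                               [[0.2, 0., 5.1], [3., 0., 9.]], [True, True]))
```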
SSFM with interactions
Point-Region Interactions:
SSFM with interactions
Object-Region Interactions:
SSFM with interactions
Object-Region Interactions:
Solving the SSFM problem
• Modified Markov Chain Monte Carlo (MCMC) sampling algorithm
• Initialization of the cameras, objects, and points is critical for the sampling
• Initialize the camera configuration using:
– SFM
– consistency of object/region properties across views
F. Dellaert, S. Seitz, S. Thrun, and C. Thorpe. Feature correspondence: A markov chain monte carlo approach. In NIPS, 2000
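A generic Metropolis-Hastings loop, to sketch the flavor of the sampling (the actual algorithm is a modified MCMC with data-driven proposals and the initialization described above; the quadratic toy target is only for demonstration):

```python
import numpy as np

def metropolis_hastings(log_posterior, theta0, n_steps=5000, step=0.05, seed=0):
    """Generic Metropolis-Hastings sampler: in an SSFM-style setting, `theta`
    could stack camera poses, object hypotheses, and point depths, and
    `log_posterior` would sum the point-, object-, and region-level
    compatibilities. Keeps the best sample found (MAP-style use)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, float)
    logp = log_posterior(theta)
    best, best_logp = theta.copy(), logp
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)  # random-walk proposal
        logp_prop = log_posterior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:             # accept/reject
            theta, logp = prop, logp_prop
            if logp > best_logp:
                best, best_logp = theta.copy(), logp
    return best, best_logp

# Toy target: quadratic log-posterior peaked at (1, 2, 3).
print(metropolis_hastings(lambda th: -np.sum((th - [1., 2., 3.]) ** 2), [0., 0., 0.]))
```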
Public Ford Campus Vision and LiDAR Dataset
• Object categories: cars
• Ground truth depth provided by LiDAR
[Pandey et al, International Journal of Robotics Research, 2011]
In-house Office dataset
• Object categories: mugs, mice, keyboards
• Ground truth depth provided by Kinect
In-house Street dataset
• Object categories: humans
• No ground truth depth available
Results
[Figure: per-view observations (detections and segmentation, view 1 … view N) and the resulting joint reconstruction & recognition.]
Results
[Figure: per-view observations (detections, view 1 … view N) and the resulting joint reconstruction & recognition.]
SSFM Source code available! http://www.eecs.umich.edu/vision/research.html
Results
Object detection results
[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2009
Average precision in detecting objects (cars) in the 2D image, FORD CAMPUS (cars):
  DPM [1]: 54.5%
  SSFM (2011), 2 views: 61.3%
  SSFM (2012), 2 views: 62.8%
  SSFM (2012), 4 views: 66.5%
Accuracy in localizing objects in 3D space (AP):
                                        Hoiem [2]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS – cars                      21.4%        32.7%         43.1%
  OFFICE – keyboards, mice, monitors      15.5%        20.2%         21.6%
[2] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.
Camera estimation results

Camera translation error:
                SFM [1]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS     26.5        19.9          12.1
  OFFICE           8.5         4.7           4.2
  STREET          27.1        17.6          11.4

Camera rotation error:
                SFM [1]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS      <1          <1            <1
  OFFICE           9.6         4.2           3.5
  STREET          21.1         3.1           3.0
[1] N. Snavely, S. M. Seitz, and R. S. Szeliski. Modeling the world from internet photo collections. IJCV, (2), Nov. 2008
Camera parameter reconstruction errors
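For reference, one common way such camera errors are computed (the exact metrics behind the table are not given on the slides, so treat this as an assumption): translation error as the norm of the difference between translation vectors, rotation error as the rotation angle of R_est^T R_gt.

```python
import numpy as np

def camera_pose_errors(R_est, t_est, R_gt, t_gt):
    """Score a camera estimate against ground truth: translation error as the
    norm of the translation difference, rotation error as the angle (degrees)
    of the relative rotation R_est^T R_gt."""
    trans_err = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    R_rel = R_est.T @ R_gt
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_theta))
    return trans_err, rot_err_deg

# Hypothetical check: identical rotations, translations 0.1 apart.
R = np.eye(3)
print(camera_pose_errors(R, np.zeros(3), R, np.array([0.1, 0.0, 0.0])))
```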
Results
Source code available!
• http://www.eecs.umich.edu/vision/research.html
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Choi, Shahid & Savarese, WMC 2010
• Choi & Savarese , ECCV 2010
• Wu et al. '07 • Breitenstein et al. '09 • Zhao et al. '04 • Ess et al. '09
• Monocular cameras
• Un-calibrated cameras
• Arbitrary motion
• Highly cluttered scenes
• Occlusion
• Background clutter
• Moving targets
Joint 3D modeling and recognition from videos
Joint tracking and camera estimation
Ω : set of state variables – interest points in 3D, camera parameters, target locations in 3D
Χ : set of observations – tracked interest points, pedestrian detections
• Easily add additional evidence: 3D depth, IMU, etc.
• 5 frames/second!
• Code available online soon!
Safe Driving Applications
Autonomous navigation
• Intelligent vision requires joint reconstruction and recognition
• Geometry provides critical contextual cues for robust recognition
• High level semantics help establish robust geometrical constraints for reconstruction
– Within a single view
– Across views
• High level semantics help scalability in reconstruction problems
– Fewer images are needed with wider baseline
Conclusions
EECS 442 – Computer Vision
• Hope you have enjoyed this class!
• Good luck with your projects & presentations!