Transcript of: 3D Object Recognition and Scene Understanding
EECS 442 – Computer Vision
3D Object Recognition and Scene Understanding
Object: Building, 8-10 meters away
Object: Car, ¾ view, 2-3 meters away
Interpreting the visual world
Object: Traffic light
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Joint 3D modeling and semantic reasoning
… Chen & Medioni '92; Debevec et al. '96; Pollefeys et al. '02; Nistér '04; Hartley & Zisserman '00; Levoy et al. '00; Brown & Lowe '04; Schindler et al. '08; Snavely et al. '08; Agarwal et al. '09; etc.
How can we achieve all of this?
… Weber et al. '00; Felzenszwalb & Huttenlocher '00; Leibe & Schiele '04; Kumar & Hebert '04; Fei-Fei & Perona '05; Sivic et al. '05; Shotton et al. '05; Grauman et al. '05; Ullman et al. '02; Fergus et al. '03; Torralba et al. '03; Lazebnik et al. '06; Maji & Malik '07; Vedaldi & Soatto '08; Zhu et al. '08; etc.
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Semantics from range data – disjoint 3D modeling and recognition
… Huber '01; Rusu et al. '08; Brostow et al. '08; Son & Kim '10; Tang et al. '10; Adan et al. '11; etc.
Courtesy of Adan et al., 2011
How can we achieve all of this?
• 3D modeling – no semantics
• Semantic reasoning – no 3D geometry
• Joint 3D modeling and semantic reasoning
Hoiem et al. '06-'10; Gould et al. '09; Hedau et al. '09; Gupta et al. '10; Ladický et al. '10; Bao, Sun, Savarese '10; Sun, Bao, Savarese '10; Bao & Savarese '11
• Semantics from range data – disjoint 3D modeling and recognition
Joint 3D modeling and recognition
• Given the scene layout, objects can be detected more robustly
• Objects and their geometrical attributes provide constraints for estimating the scene layout
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
Viewing sphere
• Detect objects under generic viewpoints
• Estimate object pose
• General: works for any object category
Azimuth, Zenith
3D Object Detectors
3D Object Detectors
• Detect objects under generic viewpoints
• Estimate object pose
• General: works for any object category
3D Object Categorization
Felzenszwalb & Huttenlocher '03; Fei-Fei et al. '04; Leibe et al. '04; Sudderth et al. '05; Torralba et al. '05; Lazebnik et al. '06; Todorovic et al. '06; Bosch et al. '07; Vedaldi & Soatto '08; Kumar & Hebert '04; Sivic et al. '05; Shotton et al. '05; Grauman et al. '05; Leung et al. '99; Weber et al. '00; Ullman et al. '02; Fergus et al. '03; Torralba et al. '03
Single view object categorization
Zhang et al. '95; Schmid & Mohr '96; Schiele & Crowley '96; Lowe '99; Jacobs & Basri '99; Rothganger et al. '04; Edelman et al. '91; Ullman & Basri '91; Rothwell '92; Lindeberg '94; Murase & Nayar '94; Ferrari et al. '05; Brown & Lowe '05; Snavely et al. '06; Yin & Collins '07; Ballard '81; Grimson & Lozano-Pérez '87; Lowe '87
Single 3D object recognition
3D Object Categorization
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
Single view model
Single view model
Mixture of 2D models • Weber et al. ’00 • Schneiderman et al. ’01 • Ullman et al. 02 • Fergus et al. ’03 • Torralba et al. ’03
• Felzenszwalb & Huttenlocher ‘03 • Leibe et al. ’04 • Shotton et al. ‘05 • Grauman et al. ’05
• Savarese et al, ‘06 •Todorovic et al. ’06 • Vedaldi & Soatto ’08 • Zhu et al 08 • Gu & Ren, ‘10
3D Category model
…
…
CONS:
• Single-view models are independent
• Not scalable to large numbers of categories/viewpoints
• Output is just bounding boxes
• Cannot estimate 3D pose or 3D layout
3D models: implicit 3D models, explicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07 …. • Xiang & Savarese ‘12
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08 • Sun et al. ’09 …
Implicit 3D models
…
A sparse set of interest points or object parts is linked across views by implicit 3D transformations (H, F)
…
… 3D Category model
Linking features or parts across views: perspective or affine transformation constraints
x' = H x

Linking features or parts across views: epipolar transformation constraints
l' = F^T x
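To make the two constraints concrete, here is a minimal numpy sketch (not code from the lecture): a homography H transfers a point from one view to the other, and a fundamental matrix F maps a point to its epipolar line in the other view; whether the line is F x or F^T x depends on the convention adopted for F.

```python
import numpy as np

def hom(p):
    """Append a 1 to a 2D point (homogeneous coordinates)."""
    return np.array([p[0], p[1], 1.0])

def transfer_by_homography(H, x):
    """x' = H x : transfer a point from view 1 to view 2 (up to scale)."""
    xp = H @ hom(x)
    return xp[:2] / xp[2]

def epipolar_line(F, x):
    """l' = F^T x with the slide's convention (x^T F x' = 0);
    it would be l' = F x under the convention x'^T F x = 0."""
    return F.T @ hom(x)

# Toy example: identity homography, skew-symmetric (rank-2) matrix as F.
H = np.eye(3)
F = np.array([[0., -1.,  0.],
              [1.,  0., -2.],
              [0.,  2.,  0.]])
x = (120.0, 80.0)
print(transfer_by_homography(H, x))   # -> [120.  80.]
print(epipolar_line(F, x))            # epipolar line coefficients (a, b, c)
```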
…
A sparse set of interest points or object parts is linked across views.
…
… Multi-view model
• Thomas et al. ’06 • Leibe et al. ‘04
Implicit 3D models by ISM representations
Courtesy of Thomas et al. '06
Set of region tracks connecting the model views. Each track is composed of the image regions of a single physical surface patch across the model views in which it is visible.
[Ferrari et al. ’04, ‘06]
Region tracks
Implicit 3D models by ISM representations
Results
Implicit 3D models by ISM representations
• Canonical parts capture view-invariant, diagnostic appearance information
Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Sun, et al, CVPR 2009, ICCV 09
• Parts and their relationships are modeled in a probabilistic fashion
• Parameters are learned so as to maximize detection accuracy
• 2½D structure linking parts via weak geometry
Implicit 3D models by graph-based representations
Parameterization on the viewing sphere
• Model the object as a collection of parts for any viewpoint (T, S) on the viewing sphere
Multi-view generative part-based model

[Graphical model figure: for each image taken from viewpoint (T, S) on the viewing sphere, the observed features Yn (codeword) and Xn (location) are generated from K parts, with per-part parameters for the part proportion prior, the part appearance (A), and the part location/shape (V).]
Multi-view generative part-based model
• Learning: estimate the latent variables and relevant parameters given the observations
• Variational EM can be used [Blei, ICML 2004]
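As a rough illustration of the E-step/M-step structure, here is a simplified sketch only: plain EM for a Gaussian mixture over 2D part locations, not the full variational EM over appearance, location, shape, and viewpoint used by the actual model.

```python
import numpy as np

def em_part_locations(X, K, n_iters=50, seed=0):
    """Toy EM: fit K isotropic Gaussian 'parts' to 2D feature locations X (N, 2).
    Stands in for the E-step/M-step structure mentioned on the slide."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, K, replace=False)]          # part centers
    var = np.full(K, np.var(X))                      # part spreads
    pi = np.full(K, 1.0 / K)                         # part proportion prior
    for _ in range(n_iters):
        # E-step: responsibility of each part for each feature
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)     # (N, K)
        logp = np.log(pi) - 0.5 * d2 / var - np.log(var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update proportions, centers, spreads
        Nk = r.sum(0) + 1e-9
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        var = (r * d2).sum(0) / (2 * Nk) + 1e-6
    return pi, mu, var

# Hypothetical usage: cluster synthetic feature locations into 3 parts.
X = np.random.default_rng(1).normal(size=(200, 2)) * [5, 3] + [100, 60]
print(em_part_locations(X, K=3)[1])   # learned part centers
```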
Incorporating geometrical constraints
• Within-triangle constraints: part positions m_i and m_j in neighboring model views are related by a linear transformation M_ji, i.e. m_i = M_ji m_j
• Encoded as a penalty term in variational EM
Incorporating geometrical constraints
• View morphing constraints, applied to part shape and part center [Seitz & Dyer, SIGGRAPH 96; Xiao & Shah, CVIU '04]
• Encoded as a penalty term in variational EM

S. M. Seitz and C. R. Dyer, Proc. SIGGRAPH 96, 1996, 21-30
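A minimal sketch of the view-morphing idea behind this constraint (Seitz & Dyer): once two views are rectified/prewarped to parallel cameras, part centers (and shapes) in an intermediate view are linear interpolations of the corresponding quantities in the two model views. This illustrates the geometric principle only, not the penalty term used in the model.

```python
import numpy as np

def morph(x_i, x_j, s):
    """View-morphing prediction for an intermediate view at parameter s in [0, 1]
    (assumes the two views have been rectified to parallel cameras):
    x_s = (1 - s) * x_i + s * x_j."""
    return (1.0 - s) * np.asarray(x_i, float) + s * np.asarray(x_j, float)

# Corresponding part centers in two neighboring model views -> halfway prediction.
centers_i = np.array([[100., 60.], [140., 62.]])
centers_j = np.array([[120., 58.], [165., 61.]])
print(morph(centers_i, centers_j, s=0.5))
```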
I^h: {P_1^h, P_2^h, P_3^h, O^h},   I^k: {P_1^k, P_2^k, P_3^k, O^k}
• Defining initial parts and part correspondences
• Sequential RANSAC / J-linkage [Toldo et al. '07]
Initializing the model
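A possible sketch of the sequential-RANSAC flavor of this initialization step (the actual pipeline uses sequential RANSAC / J-linkage [Toldo et al. '07]; the function and parameter names below are illustrative, and OpenCV's RANSAC homography fit stands in for the model fitting):

```python
import numpy as np
import cv2

def sequential_ransac_homographies(pts1, pts2, min_inliers=8, reproj_thresh=3.0):
    """Greedily extract multiple homographies from putative matches
    (pts1[i] <-> pts2[i]): fit one model with RANSAC, remove its inliers, repeat.
    Each recovered model groups matches belonging to one surface/part hypothesis."""
    pts1, pts2 = np.float32(pts1), np.float32(pts2)
    remaining = np.arange(len(pts1))
    models = []
    while len(remaining) >= min_inliers:
        H, mask = cv2.findHomography(pts1[remaining], pts2[remaining],
                                     cv2.RANSAC, reproj_thresh)
        if H is None:
            break
        inlier_mask = mask.ravel().astype(bool)
        inliers = remaining[inlier_mask]
        if len(inliers) < min_inliers:
            break
        models.append((H, inliers))          # one part/surface hypothesis
        remaining = remaining[~inlier_mask]  # continue on the leftover matches
    return models
```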
Semi-supervised
• Class label
• Object bounding box
• No need to observe the same object instance from multiple views
• No pose labels [unlike Sun CVPR 09]
[unlike Savarese & Fei-Fei, 07, 08]
• No part labels
Incremental learning
• Enables unorganized, online collection of training images
• Increases efficiency in learning (no need for large storage space)
Incremental learning
• Evidence from the training image is used to update the model parameters
• Assign the new training image to a triangle of the viewing sphere
• Re-estimate sufficient statistics in an iterative fashion
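One simple way to realize the incremental re-estimation of sufficient statistics is to keep running sums per part, so a new training image updates the model without revisiting old images. A toy sketch (the statistics tracked by the real model are richer):

```python
import numpy as np

class PartStats:
    """Incrementally maintained sufficient statistics (count, sum, uncentered
    second moment) for a part's 2D location within one viewing-sphere triangle.
    New training images update the statistics without storing old images."""
    def __init__(self, dim=2):
        self.n = 0
        self.s1 = np.zeros(dim)          # sum of x
        self.s2 = np.zeros((dim, dim))   # sum of x x^T

    def update(self, x):
        x = np.asarray(x, float)
        self.n += 1
        self.s1 += x
        self.s2 += np.outer(x, x)

    def mean_cov(self):
        mu = self.s1 / self.n
        cov = self.s2 / self.n - np.outer(mu, mu)
        return mu, cov

# Hypothetical usage: part detections from three new training images.
stats = PartStats()
for x in [(10., 5.), (12., 6.), (11., 4.)]:
    stats.update(x)
print(stats.mean_cov())
```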
Evolution of learnt parts
Car
Examples of learnt part-based models
Travel iron
Examples of learnt part-based models
Experimental results
• Object detection from any viewing angle
• Accurate estimation of the object pose
• PASCAL 2006 dataset
• 3D Object Dataset
Car
Travel Iron
[Detection results (ROC curves) on the 3D Object Dataset for the car and bicycle categories: our model vs. Savarese & Fei-Fei ICCV '07 and Sun et al. CVPR '09.]

[Viewpoint classification on the 3D Object Dataset: classification accuracy over the eight viewpoint bins V1–V8 (0º, 45º, 90º, 135º, 180º, 225º, 270º, 315º), our model vs. Savarese ICCV '07.]
Predicting object appearance from novel views
Viewing sphere
[For natural scenes, see Hoiem et al 07; Saxena et al 07]
Thomas et al. '08; Cremer et al. '09
Predicting object appearance from novel views
Affine transformation
Our model
Predicting object appearance from novel views
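A minimal OpenCV sketch of the "affine transformation" baseline shown on this slide: fit an affine map from part correspondences between a model view and the novel view, then warp the model-view appearance. The multi-view part model itself predicts appearance differently; the points and patch size below are hypothetical.

```python
import numpy as np
import cv2

def predict_patch_in_novel_view(model_patch, model_pts, novel_pts, out_size):
    """Warp a part's appearance from a model view into a novel view using an
    affine transform fit to corresponding part locations (affine baseline)."""
    A, _ = cv2.estimateAffine2D(np.float32(model_pts), np.float32(novel_pts))
    return cv2.warpAffine(model_patch, A, out_size)

# Hypothetical example: a 64x64 grayscale patch and three part correspondences.
patch = np.zeros((64, 64), np.uint8)
model_pts = [(10, 10), (50, 12), (30, 55)]
novel_pts = [(14, 8),  (58, 15), (36, 60)]
pred = predict_patch_in_novel_view(patch, model_pts, novel_pts, (64, 64))
```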
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
3D Object Categorization
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
• Chiu et al. '07 • Hoiem et al. '07 • Yan et al. '07 • Xiang & Savarese '12
Explicit 3D Models
…
…
• Part configuration is modeled as a conditional random field with max-margin parameter estimation
• Enables 6-DOF object pose estimation
• 3D layout estimation of object parts
3D Category model
Hij
3D models: explicit 3D models, implicit 3D models
• Chiu et al. ‘07 • Hoiem, et al., ’07 • Yan, et al. ’07
Mixture of 2D single view models
• Weber et al. ‘00 • Schneiderman et al. ’01 • Bart et al. ’04 • Gu & Ren, ‘10
•Thomas et al. ‘06 • Kushal, et al., ’07 • Savarese et al, 07, 08
[3D object dataset, 07]
• Xiang & Savarese, CVPR 12
Explicit 3D Models
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Coherent probabilistic model captures the relationship between objects and supporting planes
• No assumptions on cameras
• Works both indoors and outdoors
3D scene understanding from a single image Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012
• Hoiem et al. 06-10 • Gould et al. 09 • Hedau et al. 09 •Lee et al. ‘09, 10 • Gupta et al, 10, 11 • Tsai et al. ‘11
• Coherent probabilistic model captures the relationship between objects and supporting planes
• No assumptions on cameras
• Works both indoors and outdoors
3D scene understanding from a single image
• Hoiem et al. 06-10 • Gould et al. 09 • Hedau et al. 09 •Gupta et al, 10, 11
Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012
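A hedged sketch of the kind of object/supporting-plane coupling such models exploit (in the spirit of "putting objects in perspective"; the exact formulation in Bao, Sun & Savarese differs): for a camera roughly parallel to the ground, an object's image height and the position of its footprint below the horizon are both tied to its depth, so detections constrain the supporting plane and camera, and vice versa.

```python
def object_image_geometry(f, cam_height, v_horizon, obj_height, depth):
    """For a camera looking roughly parallel to a ground plane:
    an object of real height `obj_height` standing on the ground at distance
    `depth` has image height ~ f * obj_height / depth, and its footprint sits
    ~ f * cam_height / depth pixels below the horizon row v_horizon.
    (Image rows assumed to increase downward.)"""
    img_height = f * obj_height / depth
    v_bottom = v_horizon + f * cam_height / depth
    return img_height, v_bottom

# Hypothetical numbers: f = 800 px, camera 1.5 m above ground, horizon at row 300.
print(object_image_geometry(f=800.0, cam_height=1.5, v_horizon=300.0,
                            obj_height=1.6, depth=10.0))
```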
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Measurements I: points (x, y, scale); objects (x, y, scale, pose); regions (x, y, pose)
• Model parameters: Q = 3D points; O = 3D objects; B = 3D regions; camera parameters K, R, T
Bao & Savarese, CVPR 2011; Bao, Bagra, Savarese, CORP – ICCV 2011; Bao, Bagra, Chao, Savarese, CVPR 2012; Bao, Xiang, Savarese, ECCV 2012
3D scene understanding from multiple images Semantic Structure from Motion (SSFM)
Huber '01; Rusu et al. '08; Brostow et al. '08; Son & Kim '10; Tang et al. '10; Adan et al. '11; etc.
Semantic Structure from Motion (SSFM)
Factor graph with compatibility terms Ψ_CO (cameras–objects), Ψ_CB (cameras–regions), Ψ_CQ (cameras–points)
SSFM: point-level compatibility
Ψ_CQ (cameras–points compatibility)
• Tomasi & Kanade ‘92 • Triggs et al ’99 • Soatto & Perona 99 • Hartley & Zisserman 00 • Dellaert et al. 00
Point re-projection error
SSFM: point-level compatibility
projection
observation
• Pollefeys & V. Gool 02 • Nister 04 • Brown & Lowe 07 • Snavely et al. 08
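Point-level compatibility reduces to the familiar reprojection error. A minimal numpy sketch (the intrinsics and pose values below are made up):

```python
import numpy as np

def project(K, R, t, Q):
    """Project a 3D point Q (world frame) into an image with intrinsics K
    and pose (R, t), returning pixel coordinates."""
    p = K @ (R @ Q + t)
    return p[:2] / p[2]

def point_reprojection_error(K, R, t, Q, x_obs):
    """Point-level compatibility: distance between the projection of the
    3D point Q and its observed image measurement x_obs."""
    return np.linalg.norm(project(K, R, t, np.asarray(Q, float))
                          - np.asarray(x_obs, float))

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
print(point_reprojection_error(K, R, t, Q=[0.1, -0.05, 4.0], x_obs=[340.2, 229.7]))
```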
SSFM: Object-level compatibility
Ψ_CO (cameras–objects compatibility)
Object “re-projection” error
Camera 1 Camera 2
• Agreement with measurements is computed using position, pose and scale
SSFM: Object-level compatibility
Class = “car” scale=1 pose=“back“
SSFM: Object-level compatibility
• Savarese, Fei-Fei, ICCV 07 • Savarese, Fei-Fei, ECCV 08
• Su et al, ICCV 2009 • Sun, et al, CVPR 2009 • Sun et al, ECCV 2010
• Yu & Savarese, CVPR 2012
• A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y
Class = “car” scale=3 pose=“3/4“
• Savarese, Fei-Fei, ICCV 07 • Savarese, Fei-Fei, ECCV 08
• Su et al, ICCV 2009 • Sun, et al, CVPR 2009 • Sun et al, ECCV 2010
• Yu & Savarese, CVPR 2012
SSFM: Object-level compatibility
• A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y
Camera 1 Camera 2
SSFM: Object-level compatibility
Class = “car” scale=1 pose=“back“
Class = “car” scale=1 pose=“3/4“
• Efficiently implemented using a parallel computing architecture
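A sketch of how the object-level term can be evaluated: project the hypothesized 3D object into each camera and read off the 3D detector's confidence at the predicted image location, scale, and pose. The `detector_scores` lookup and the object representation below are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def object_compatibility(cameras, obj, detector_scores):
    """Object-level compatibility (sketch): for each camera (K, R, t), project the
    hypothesized 3D object and accumulate the detector's log-confidence that an
    object of this class appears at the predicted location, scale, and pose.
    `detector_scores[cam_id](x, y, scale, pose_bin)` is a hypothetical lookup
    into confidence maps precomputed per image by a 3D object detector."""
    total = 0.0
    for cam_id, (K, R, t) in cameras.items():
        Xc = R @ obj["center"] + t             # object center in the camera frame
        x, y = (K @ Xc)[:2] / Xc[2]            # predicted image location
        scale = K[0, 0] * obj["size"] / Xc[2]  # predicted image size (pixels)
        total += np.log(detector_scores[cam_id](x, y, scale, obj["pose_bin"]) + 1e-9)
    return total
```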
SSFM: Region-level compatibility
Ψ_CB (cameras–regions compatibility)
Region “re-projection” error
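A sketch of one way a region "re-projection" error could be scored: project the planar 3D region into the image, rasterize it, and compare it with the observed segmentation mask (here via 1 - IoU). The inputs are hypothetical and the paper's exact measure may differ.

```python
import numpy as np
import cv2

def region_reprojection_error(K, R, t, region_polygon_3d, observed_mask):
    """Project a planar 3D region (polygon of 3D vertices) into the image,
    rasterize it, and score the mismatch with the observed region mask as 1 - IoU."""
    pts = []
    for Q in region_polygon_3d:
        p = K @ (R @ np.asarray(Q, float) + t)
        pts.append(p[:2] / p[2])
    poly = np.int32(np.round(pts)).reshape(-1, 1, 2)
    proj = np.zeros(observed_mask.shape, np.uint8)
    cv2.fillPoly(proj, [poly], 1)
    inter = np.logical_and(proj, observed_mask).sum()
    union = np.logical_or(proj, observed_mask).sum()
    return 1.0 - inter / max(union, 1)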
SSFM with interactions
Compatibility terms: Ψ_OB (objects–regions), Ψ_QB (points–regions), Ψ_QO (points–objects), Ψ_CO, Ψ_CB, Ψ_CQ
Bao, Bagra, Chao, Savarese CVPR 2012
SSFM with interactions
Object-Point Interactions:
Bao, Bagra, Chao, Savarese CVPR 2012
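A sketch of the intuition behind the object-point interaction term: 3D points whose image measurements fall inside an object's detection window should also lie near the object's hypothesized 3D location. The penalty below (mean distance outside a rough object radius) is illustrative, not the paper's exact potential.

```python
import numpy as np

def object_point_interaction(obj_center, obj_radius, points_3d, points_in_box_mask):
    """Object-point interaction (sketch): penalize 3D points that project inside
    the object's detection window but lie far from the object's 3D extent."""
    pts = np.asarray(points_3d, float)[np.asarray(points_in_box_mask, bool)]
    if len(pts) == 0:
        return 0.0
    d = np.linalg.norm(pts - np.asarray(obj_center, float), axis=1)
    return float(np.maximum(d - obj_radius, 0.0).mean())

# Hypothetical usage: one point consistent with the object, one far away.
print(object_point_interaction([0., 0., 5.], 1.0,
                               [[0.2, 0., 5.1], [3., 0., 9.]], [True, True]))
```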
SSFM with interactions
Point-Region Interactions:
SSFM with interactions
Object-Region Interactions:
SSFM with interactions
Object-Region Interactions:
Solving the SSFM problem
• Modified Markov Chain Monte Carlo (MCMC) sampling algorithm
• Initialization of the cameras, objects, and points is critical for the sampling
• Initialize the camera configuration using:
– SFM
– consistency of object/region properties across views
F. Dellaert, S. Seitz, S. Thrun, and C. Thorpe. Feature correspondence: A markov chain monte carlo approach. In NIPS, 2000
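A generic Metropolis-Hastings loop, to sketch the flavor of the sampling (the actual algorithm is a modified MCMC with data-driven proposals and the initialization described above; the quadratic toy target is only for demonstration):

```python
import numpy as np

def metropolis_hastings(log_posterior, theta0, n_steps=5000, step=0.05, seed=0):
    """Generic Metropolis-Hastings sampler: in an SSFM-style setting, `theta`
    could stack camera poses, object hypotheses, and point depths, and
    `log_posterior` would sum the point-, object-, and region-level
    compatibilities. Keeps the best sample found (MAP-style use)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, float)
    logp = log_posterior(theta)
    best, best_logp = theta.copy(), logp
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)  # random-walk proposal
        logp_prop = log_posterior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:             # accept/reject
            theta, logp = prop, logp_prop
            if logp > best_logp:
                best, best_logp = theta.copy(), logp
    return best, best_logp

# Toy target: quadratic log-posterior peaked at (1, 2, 3).
print(metropolis_hastings(lambda th: -np.sum((th - [1., 2., 3.]) ** 2), [0., 0., 0.]))
```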
Public Ford Campus Vision and LiDAR Dataset
• Object categories: cars
• Ground truth depth provided by LiDAR
[Pandey et al, International Journal of Robotics Research, 2011]
In-house Office dataset
• Object categories: mugs, mice, keyboards
• Ground truth depth provided by Kinect
In-house Street dataset
• Object categories: humans
• No ground truth depth available
Results
[Figure: per-view observations (detections and segmentation, view 1 … view N) and the resulting joint reconstruction & recognition.]
Results
[Figure: per-view observations (detections, view 1 … view N) and the resulting joint reconstruction & recognition.]
SSFM Source code available! http://www.eecs.umich.edu/vision/research.html
Results
Object detection results
[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2009
Average precision in detecting objects (cars) in the 2D image, FORD CAMPUS (cars):
  DPM [1]: 54.5%
  SSFM (2011), 2 views: 61.3%
  SSFM (2012), 2 views: 62.8%
  SSFM (2012), 4 views: 66.5%
Accuracy in localizing objects in 3D space (AP):
                                        Hoiem [2]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS – cars                      21.4%        32.7%         43.1%
  OFFICE – keyboards, mice, monitors      15.5%        20.2%         21.6%
[2] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.
Camera estimation results

Camera translation error:
                SFM [1]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS     26.5        19.9          12.1
  OFFICE           8.5         4.7           4.2
  STREET          27.1        17.6          11.4

Camera rotation error:
                SFM [1]   SSFM (2011)   SSFM (2012)
  FORD CAMPUS      <1          <1            <1
  OFFICE           9.6         4.2           3.5
  STREET          21.1         3.1           3.0
[1] N. Snavely, S. M. Seitz, and R. S. Szeliski. Modeling the world from internet photo collections. IJCV, (2), Nov. 2008
Camera parameter reconstruction errors
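For reference, one common way such camera errors are computed (the exact metrics behind the table are not given on the slides, so treat this as an assumption): translation error as the norm of the difference between translation vectors, rotation error as the rotation angle of R_est^T R_gt.

```python
import numpy as np

def camera_pose_errors(R_est, t_est, R_gt, t_gt):
    """Score a camera estimate against ground truth: translation error as the
    norm of the translation difference, rotation error as the angle (degrees)
    of the relative rotation R_est^T R_gt."""
    trans_err = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    R_rel = R_est.T @ R_gt
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_theta))
    return trans_err, rot_err_deg

# Hypothetical check: identical rotations, translations 0.1 apart.
R = np.eye(3)
print(camera_pose_errors(R, np.zeros(3), R, np.array([0.1, 0.0, 0.0])))
```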
Results
Source code available!
• http://www.eecs.umich.edu/vision/research.html
• 3D Object detectors
– Robust to viewpoint transformations
– Allow estimating pose, scale, and 3D shape
• Methods for coherent object detection and scene layout estimation
– single view
– multi-view
– videos
In this lecture….
• Choi, Shahid & Savarese, WMC 2010
• Choi & Savarese , ECCV 2010
• Wu et al. '07 • Breitenstein et al. '09 • Zhao et al. '04 • Ess et al. '09
• Monocular cameras
• Un-calibrated cameras
• Arbitrary motion
• Highly cluttered scenes
• Occlusion
• Background clutter
• Moving targets
Joint 3D modeling and recognition from videos
Joint tracking and camera estimation
Ω : set of state variables – interest points in 3D, camera parameters, target locations in 3D
Χ : set of observations – tracked interest points, pedestrian detections
• Easily add additional evidence: 3D depth, IMU, etc.
• 5 frames/second!
• Code available online soon!
Safe Driving Applications
Autonomous navigation
• Intelligent vision requires joint reconstruction and recognition
• Geometry provides critical contextual cues for robust recognition
• High level semantics help establish robust geometrical constraints for reconstruction
– Within a single view
– Across views
• High level semantics help scalability in reconstruction problems
– Fewer images are needed with wider baseline
Conclusions
EECS 442 – Computer Vision
• Hope you have enjoyed this class!
• Good luck with your projects & presentations!