3D + Graphics
QI ZHU & JUHO KIM
Source slides: slazebni.cs.illinois.edu/spring17/lec14_3d.pdf

Transcript of "3D + Graphics" (85 pages)

Page 1

3D + Graphics
QI ZHU & JUHO KIM

Page 2

Outline

• Pose estimation (3D recovery from 2D images)

• Novel Image / View synthesis

• Reconstruction and generation of 3D

Page 3

Part 1: POSE ESTIMATION (3D RECOVERY FROM 2D IMAGES)

Page 4

Viewpoint Estimation

INPUT: RGB image

OUTPUT: Camera pose = rotation (yaw, pitch, roll) and translation

Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild. WACV 2014
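To make that output concrete, here is a minimal sketch of turning a (yaw, pitch, roll) triple into a rotation matrix. The Z-Y-X axis order and radian units are my assumptions, not something the benchmark specifies here.

```python
import numpy as np

def rotation_from_ypr(yaw, pitch, roll):
    """Compose a 3x3 rotation from yaw (Z), pitch (Y), roll (X), in radians.

    Z-Y-X is one common convention; the dataset may use another.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# Full camera pose = (R, t): a rotation plus a translation vector.
R = rotation_from_ypr(0.3, -0.1, 0.05)
t = np.array([0.0, 0.0, 2.5])
```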

Page 5

Render for CNN [ICCV15]

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 6

Render for CNN [ICCV15]

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 7

Render for CNN [ICCV15]

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015


Page 8

Render for CNN [ICCV15]

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 9

Generating synthesized images
Structure-preserving deformation (see the sketch after this list):

◦ Symmetry-preserving free-form deformation

◦ Embed object in uniform grid

◦ Represent every point in space as a weighted combination of the control points

◦ Draw i.i.d. random vectors for the control points

◦ Translate each control point by its vector

◦ Set the translations of symmetric control points to be equal

Scott Schaefer, Free-Form Deformation of Solid Geometric Models, TAMU
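The deformation described above can be sketched as follows. This is a simplified trilinear-lattice version, not the Bernstein-polynomial FFD of the cited notes; the lattice size, noise scale, and choice of x = 0.5 as the symmetry plane are illustrative assumptions.

```python
import numpy as np

def symmetric_ffd(points, n_ctrl=4, sigma=0.05, rng=None):
    """Symmetry-preserving free-form deformation (simplified sketch).

    points : (N, 3) array with coordinates normalized to [0, 1]^3.
    A uniform n_ctrl^3 lattice of control points is perturbed by i.i.d.
    Gaussian translations; control points mirrored across the x = 0.5 plane
    get mirrored translations, so bilateral symmetry is preserved.
    Each point moves by the trilinear blend of its cell's corner offsets.
    """
    rng = np.random.default_rng() if rng is None else rng
    offsets = rng.normal(scale=sigma, size=(n_ctrl, n_ctrl, n_ctrl, 3))
    # Enforce symmetry: the mirror partner of a control point (across x)
    # gets the same offset with the x-component negated.
    for i in range(n_ctrl):
        j = n_ctrl - 1 - i
        if i < j:
            offsets[j] = offsets[i] * np.array([-1.0, 1.0, 1.0])
        elif i == j:
            offsets[i, ..., 0] = 0.0  # the symmetry plane stays in place
    # Trilinear weights: locate each point's lattice cell, blend corner offsets.
    g = points * (n_ctrl - 1)
    i0 = np.clip(np.floor(g).astype(int), 0, n_ctrl - 2)
    t = g - i0
    disp = np.zeros_like(points)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, t[:, 0], 1 - t[:, 0])
                     * np.where(dy, t[:, 1], 1 - t[:, 1])
                     * np.where(dz, t[:, 2], 1 - t[:, 2]))
                disp += w[:, None] * offsets[i0[:, 0] + dx,
                                             i0[:, 1] + dy,
                                             i0[:, 2] + dz]
    return points + disp
```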

Page 10

Cont'd: Overfit-Resistant Image Synthesis
◦ Variation in rendering
  ◦ vary lighting conditions and camera pose
◦ Background synthesis
  ◦ alpha-composition blending (a minimal sketch follows below)
◦ Cropping
◦ Adding real annotated images

Image Compositing and Blending, CMU 15-463, Fall 2007
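For the alpha-composition step, a minimal sketch of blending a rendered view over a sampled background; the array shapes and value ranges are assumptions.

```python
import numpy as np

def alpha_composite(render_rgb, render_alpha, background_rgb):
    """Blend a rendered object over a background image.

    render_rgb, background_rgb : (H, W, 3) float arrays in [0, 1]
    render_alpha               : (H, W) coverage mask from the renderer
    """
    a = render_alpha[..., None]
    return a * render_rgb + (1.0 - a) * background_rgb
```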

Page 11

Final Training Dataset

Image Compositing and Blending, CMU 15-463, Fall 2007

Page 12

Problem formulation
Input: single RGB image
Viewpoint as a tuple (θ, φ, ψ) of camera rotation parameters
◦ Discretized into 360, 180, and 360 bins, respectively

Rotation reference: a predefined initial pose facing the camera

Output: probability of each viewpoint bin

Loss function:
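The loss itself was shown as an image on the slide; reconstructing it from the paper's description (treat the exact form as a paraphrase), the geometric structure-aware viewpoint loss is roughly

```latex
\mathcal{L}_{vp}(\{s\}) \;=\; -\sum_{s}\sum_{v \in V} e^{-d(v,\,v_s)/\sigma}\,\log P_v(s;\,c_s)
```

with d(·,·) a distance between viewpoint bins, v_s the ground-truth view of sample s, and σ a bandwidth parameter.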

where P_v(s; c_s) is the probability of view v for sample s from the soft-max viewpoint classifier of class c_s

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 13

Class-Dependent Network Architecture

• Based on AlexNet [NIPS12]

• Shared weights but a different fully connected output layer per class

• Large number of outputs! (360 + 180 + 360) x N

• Claim: separate output layers handle the large variance among different object categories

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015
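A minimal sketch of this class-dependent output design; the shared AlexNet trunk is replaced here by a generic feature input, and the routing and dimensions are my assumptions.

```python
import torch
import torch.nn as nn

NUM_BINS = 360 + 180 + 360  # azimuth, elevation, in-plane rotation bins

class ClassDependentViewpointHead(nn.Module):
    """Shared trunk features feed one fully connected layer per category,
    i.e. (360 + 180 + 360) x N outputs in total."""
    def __init__(self, num_classes, feat_dim=4096):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, NUM_BINS) for _ in range(num_classes)]
        )

    def forward(self, shared_feats, class_ids):
        # shared_feats: (B, feat_dim) from the shared convolutional trunk
        # class_ids:    length-B sequence of category indices
        return torch.stack(
            [self.heads[int(c)](f) for f, c in zip(shared_feats, class_ids)]
        )
```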

Page 14

Does synthetic training data help?

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 15

Several results

Su, Hao, et al. "Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views." ICCV2015

Page 16

Keypoint Estimation: 2D to 3D

Wei, Shih-En, et al. “Convolutional pose machines.” CVPR 2016; IKEA dataset [Lim et al., 2013]

Page 17

Single Image 3D Interpreter Network [ECCV16]

Simultaneously infer 2D keypoint heatmaps and 3D structure from a single image!

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 18

Challenges
Annotations?
◦ 2D keypoint labels – easy to get, e.g. via crowdsourcing
◦ 3D object annotations in real 2D images – hard to acquire

Synthetic training data?
◦ The statistics of real and synthesized images are different

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 19

Using both real and synthetic images

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 20

Using both real and synthetic images

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 21

3D Object Representation

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 22

3D-Skeleton Representation
Assumptions:

◦ Objects can only have constrained deformations

◦ The first base shape is the mean shape of all objects within the category

◦ 3D keypoint locations are a weighted sum of base shapes (see the formula below)

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016
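Writing the assumption out as a formula (my paraphrase of the paper's skeleton model, not a verbatim equation from the slide):

```latex
X \;=\; \sum_{k=1}^{K} \alpha_k B_k, \qquad x_{2D} \;\sim\; P\,(R\,X + T)
```

where the B_k are the category's base shapes (B_1 the mean shape), the α_k are per-instance weights, and (R, T, P) denote the camera rotation, translation, and projection.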

Page 23

Network Overview

Step 1: Estimate 2D keypoint heatmaps (a and b)

Step 2: Train the 3D interpreter on synthetic 3D data (c)

Step 3: Jointly train the projection layer from 2D annotations (d)

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 24

Zoom in: Keypoint Estimator

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 25

Zoom in: 3D Interpreter

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 26

Zoom In: End to End

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 27

Results

Wu, Jiajun, et al. “Single image 3d interpreter network.” ECCV 2016

Page 28

Part 2: NOVEL IMAGE / VIEW SYNTHESIS

Page 29

New scene synthesis?

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 30

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Given Images

Newly Generated Image

Note the change in viewpoint

Page 31

DeepStereo: Learning to Predict New Views from the World’s Imagery

Issues:

• Interpreting rotation and image reprojection

• Long-distance pixel correlation

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Given V1, V2, generate C

Page 32

DeepStereo: plane-sweep volumes

Solve two other problems before generating a new view! But why?

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 33

Plane-sweep volumes (Cont'd)

• Sweep a family of planes at different depths w.r.t. a reference camera

• For each depth, project each input image onto that plane

• This is equivalent to a homography warping each input image into the reference view (see the sketch below)

R. Collins. A space-sweep approach to true multi-image matching. CVPR 1996.
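A minimal sketch of building one slice of a plane-sweep volume with such a homography warp. The fronto-parallel plane, the (R, t) convention, and the use of OpenCV here are my assumptions, not details from the slides.

```python
import numpy as np
import cv2

def warp_to_reference(src_img, K_src, K_ref, R, t, depth):
    """Warp a source image onto the plane z = depth in the reference camera.

    (R, t) maps reference-camera coordinates to source-camera coordinates:
    X_src = R @ X_ref + t.  For points on the fronto-parallel plane
    n^T X_ref = depth with n = (0, 0, 1), the induced homography from the
    reference image to the source image is
        H = K_src (R + t n^T / depth) K_ref^{-1}.
    With WARP_INVERSE_MAP, each reference pixel looks up its source colour,
    which gives exactly one depth slice of a plane-sweep volume.
    """
    n = np.array([[0.0, 0.0, 1.0]])
    H = K_src @ (R + t.reshape(3, 1) @ n / depth) @ np.linalg.inv(K_ref)
    h, w = src_img.shape[:2]
    return cv2.warpPerspective(src_img, H, (w, h),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```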

Page 34

DeepStereo: overview

{S} Depth selection tower output

{C} Color tower output

Output is a masked sum over images selected at different depths

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 35

DeepStereo: Selection Tower

INPUT: Plane-sweep volumes

OUTPUT: A probability map (or selection map) for each depth indicating the likelihood of each pixel having that depth.

Weighted-sum image synthesis: can be interpreted as an expectation over depth (see the sketch below)

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016
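A minimal sketch of that weighted sum, combining the selection-tower and color-tower outputs; the tensor layouts are assumptions.

```python
import numpy as np

def blend_over_depth(selection, color):
    """Combine the two tower outputs into the final novel view.

    selection : (D, H, W)    per-depth selection probabilities (sum to 1 over D)
    color     : (D, H, W, 3) per-depth colour estimates
    The output colour is the expectation of colour over the depth distribution.
    """
    return np.einsum('dhw,dhwc->hwc', selection, color)
```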

Page 36

DeepStereo: Color Tower

Bonus: D input planes reduce the effect of occlusion

OUTPUT: 3D volume of nodes -> R, G, B channels for each pixel

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 37

DeepStereo: Training
Apply multi-resolution patches in the Color Tower

Feed 26 x 26 patches, produce 8 x 8 patches

Use 96 depth planes

Trained via Adagrad

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 38

DeepStereo: Results

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 39

Flynn, John, et al. “DeepStereo: Learning to predict new views from the world‘s imagery.” CVPR 2016

Page 40

Part 3: RECONSTRUCTION AND GENERATION OF 3D

Page 41

Part 3

3D structure reconstruction and generation

• DC-IGN model (Kulkarni et al. NIPS 2015)

• Perspective transformer networks (Yan et al. NIPS 2016)

• 3D-GAN (Wu et al. NIPS 2016)

Page 42

Deep Convolutional Inverse Graphics Network (DC-IGN)

• Motivation: Can a deep network learn to disentangle factors of image generation such as lighting, rotation, etc.?

• Can we learn a renderer?

• Recall: Conditional VAEs and constraints on latent space

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 43

DC-IGN
• Generate transformed images w.r.t. rotations, light variations, etc.

• Input: 2D image (150 x 150 pixels)

• Output: 2D image that has one different 3D property

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

(Figure: examples of rotation and light variation)

• Learn latent variables that represent complex transformations

• Use an encoder-decoder structure based on VAE

Link: http://willwhitney.github.io/dc-ign/www/

Page 44

DC-IGN
• Graphics code z encodes extrinsic properties:
  ◦ rotation of the object
  ◦ elevation of the object with respect to the camera
  ◦ variations of the light source

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 45

DC-IGN
• Model architecture
  - Deep convolution and de-convolution within a VAE formulation

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 46

DC-IGN
• Encoder output:

• Model parameters:

• Distribution parameters:

• Variational objective function:
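The formulas on this slide were images; the variational objective being referred to is the standard VAE evidence lower bound (notation mine):

```latex
\mathcal{L}(x) \;=\; \mathbb{E}_{Q(z \mid x)}\bigl[\log P(x \mid z)\bigr] \;-\; D_{KL}\bigl(Q(z \mid x)\,\|\,P(z)\bigr)
```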

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 47

DC-IGN
• Training on a minibatch in which only one extrinsic property changes, i.e. only a specific latent variable changes

• Key idea: force all other latent variables to be the same across examples – force all of them to be close to the minibatch mean (a simplified sketch follows below)

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015
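A simplified sketch of that clamping idea; this is my reading of the training trick, not the authors' exact forward/backward procedure.

```python
import torch

def clamp_latents(z, target_dim):
    """For a minibatch in which only one scene property (e.g. azimuth) varies,
    keep that property's latent dimension per-example and replace every other
    latent dimension with its minibatch mean, so the network is pushed to
    encode the varying factor only in `target_dim`."""
    z_clamped = z.mean(dim=0, keepdim=True).expand_as(z).clone()
    z_clamped[:, target_dim] = z[:, target_dim]
    return z_clamped
```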

Page 48

DC-IGN
• Generation w.r.t. manipulation of pose variables

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 49

DC-IGN
• Generation w.r.t. light directions

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 50

DC-IGN
• Manipulate rotation for a different dataset (chair dataset)

Kulkarni et al., Deep Convolutional Inverse Graphics Network, NIPS 2015

Page 51

Pixel to Voxel

• DC-IGN: 2D image (pixel) → transformed 2D image (pixel)

• Next models: 2D image (pixel) → 3D image (voxel)

Page 52

Perspective Transformer Nets

• Predict the underlying true 3D shape of an object given a single 2D image

• Learn 3D object reconstruction without 3D ground-truth data

• Use different 2D images from multiple viewpoints

• Define two loss functions to generate 3D structures

Yan et al., Perspective Transformer Nets, NIPS 2016

Page 53

Perspective Transformer Nets
• I^(k): 2D image from the k-th viewpoint α^(k), obtained by projection
  → I^(k) = P(X; α^(k))

Yan et al., Perspective Transformer Nets, NIPS 2016

Page 54

Perspective Transformer Nets
Case 1: we know the ground-truth 3D volume V

• Generate a 3D volume f(I^(k)) = g(h(I^(k)))
  where h(·) learns a viewpoint-invariant latent representation and g(·) is a volume generator

• Loss function (see below):
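The loss formula on the slide was an image; a plausible reconstruction from the surrounding description is a voxel-wise reconstruction loss such as

```latex
\mathcal{L}_{vol}\bigl(I^{(k)}\bigr) \;=\; \bigl\|\, f\bigl(I^{(k)}\bigr) - V \,\bigr\|_2^2
```

(the paper may use a per-voxel cross-entropy instead of the squared error shown here).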

• However, ground truth volume may not be available in practice

Yan et al., Perspective Transformer Nets, NIPS 2016

Page 55

Perspective Transformer Nets
Case 2: we do NOT know the ground-truth 3D volume V

• Use 2D silhouette images

• S^(j): ground-truth 2D silhouette image for the j-th viewpoint α^(j)

• Ŝ^(j): generated silhouette

• Loss function (see below):
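Again the formula itself was an image; from the definitions above, the silhouette-matching loss is roughly

```latex
\mathcal{L}_{proj}\bigl(I^{(k)}\bigr) \;=\; \frac{1}{n}\sum_{j=1}^{n} \bigl\|\, \hat{S}^{(j)} - S^{(j)} \,\bigr\|_2^2, \qquad \hat{S}^{(j)} = P\bigl(f(I^{(k)});\, \alpha^{(j)}\bigr)
```

(treat the exact norm and normalization as my assumptions).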

Page 56

Perspective Transformer Nets

Page 57

Perspective Transformer Nets

• Consider a combination of L_vol and L_proj

f(I^(k)) = g(h(I^(k)))

Yan et al., Perspective Transformer Nets, NIPS 2016

Reference for encoder: Yang et al., NIPS 2015

Page 58

Perspective Transformer Nets
• How to obtain the 2D silhouette Ŝ^(j): perspective projection

• Transformation matrix:
  where K: camera calibration matrix & (R, t): extrinsic parameters

• Perspective transformation (standard form sketched below):
  mapping 3D coordinates to screen coordinates

Yan et al., Perspective Transformer Nets, NIPS 2016
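The matrices on this slide were images; the standard pinhole form they refer to is

```latex
\tilde{x}_s \;\sim\; K\,[\,R \;\; t\,]\;\tilde{X}, \qquad \tilde{X} = (x, y, z, 1)^{\top}, \quad \tilde{x}_s = (x_s, y_s, 1)^{\top}
```

with the screen coordinates obtained after dividing by the last homogeneous component.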

Page 59

Perspective Transformer Nets
• Use a spatial transformer network (Jaderberg et al. NIPS 2015):

(1) Perform dense sampling from the input volume (3D world coordinates) into the output volume (screen coordinates)

(2) Flatten the 3D spatial output across the disparity dimension (a small sketch follows below)

Yan et al., Perspective Transformer Nets, NIPS 2016
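A minimal PyTorch sketch of those two steps; the sampling grid is assumed to come from the perspective transformation above, and whether the flattening is a max or another reduction over disparity is my assumption.

```python
import torch.nn.functional as F

def project_to_silhouette(volume, grid):
    # volume: (N, 1, D, H, W) predicted occupancy
    # grid:   (N, D', H', W', 3) sampling locations in [-1, 1] built from the
    #         perspective transformation
    screen_aligned = F.grid_sample(volume, grid, align_corners=True)
    return screen_aligned.amax(dim=2)  # flatten across the disparity axis
```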

Page 60

Perspective Transformer Nets
• Training on a single category

Yan et al., Perspective Transformer Nets, NIPS 2016

Page 61

Perspective Transformer Nets

Page 62

Perspective Transformer Nets
• Training on multiple categories

Page 63

3D-GAN
• Generate an object in 3D voxel space from a randomly sampled vector

• Use the Generative Adversarial Network (GAN)

• Map randomly sampled vector in a latent space to an object in 3D voxel space

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Random vector

Link: http://3dgan.csail.mit.edu/

Page 64

3D-GAN
• GAN = Generator + Discriminator

• Generator G: z → G(z)
  where z: latent vector (200-dimensional),
  G(z): 3D object in voxel space (64 x 64 x 64 cube)

• Discriminator D: outputs a confidence value for whether an input is real or synthetic

• Overall adversarial loss function:
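The loss on the slide was an image; it is the standard binary cross-entropy GAN objective (up to notation):

```latex
\mathcal{L}_{3D\text{-}GAN} \;=\; \log D(x) \;+\; \log\bigl(1 - D(G(z))\bigr)
```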

where x: a real object in a 64 x 64 x 64 space,

z: randomly sampled noise from p(z)

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 65

3D-GAN
• Network structure of the generator in 3D-GAN (a sketch follows below)

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016
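A minimal PyTorch sketch of a 3D-GAN-style generator; the channel widths, normalization, and kernel settings are approximations of the paper's architecture, not a verbatim reimplementation.

```python
import torch
import torch.nn as nn

class Generator3D(nn.Module):
    """200-d latent vector z -> 64x64x64 occupancy volume via
    volumetric transposed convolutions."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 512, kernel_size=4, stride=1),  # 1 -> 4
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),     # 4 -> 8
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),     # 8 -> 16
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),      # 16 -> 32
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1),        # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1, 1))
```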

Page 66

3D-VAE-GAN
• Extension of 3D-GAN

• Inspired by the VAE-GAN of Larsen et al. (ICML 2016)

• Takes a 2D image as input to generate a 3D object

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 67

3D-VAE-GAN
• New loss function (see below):
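Reconstructing the loss from the paper (the weighting terms and the exact reconstruction norm are as I recall them; treat them as approximate):

```latex
\mathcal{L} \;=\; \mathcal{L}_{3D\text{-}GAN} \;+\; \alpha_1\, D_{KL}\bigl(q(z \mid y)\,\|\,p(z)\bigr) \;+\; \alpha_2\, \bigl\|\, G(E(y)) - x \,\bigr\|_2
```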

where x: a 3D shape from the training set,

y: x's corresponding 2D image,

q(z|y): variational distribution of z,

p(z): multivariate Gaussian prior

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 68

3D-GAN
• 3D object generation

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 69

3D-GAN
• 3D object classification

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 70

3D-VAE-GAN
• Single-image 3D reconstruction

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 71

3D-VAE-GAN
• Single-image 3D reconstruction

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 72

3D-GAN
• Shape arithmetic for chairs

Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016

Page 73

Summary

• 3D vision and graphics based on deep learning

- Pose estimation (3D recovery from 2D images)

- Novel Image / View synthesis

- Reconstruction and generation of 3D

Page 74

References

• John Flynn, Ivan Neulander, James Philbin, Noah Snavely, DeepStereo: Learning to Predict New Views from the World’s Imagery, CVPR 2016.

• Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transformer Networks, NIPS 2015

• Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, ICCV 2015.

• Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum, Deep Convolutional Inverse Graphics Network, NIPS 2015.

• Danilo Jimenez Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess, Unsupervised Learning of 3D Structure from Images, NIPS 2016.

• Hao Su, Charles R. Qi, Yangyan Li, Leonidas J. Guibas, Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. ICCV 2015.

• Jiajun Wu, Tianfan Xue, Joseph J. Lim, Yuandong Tian, Joshua B. Tenenbaum, Antonio Torralba, and William T. Freeman, Single Image 3D Interpreter Network, ECCV 2016.

• Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, Joshua B. Tenenbaum, Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, NIPS 2016.

• Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, Honglak Lee, Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision, NIPS 2016.

• Jimei Yang, Scott Reed, Ming-Hsuan Yang, Honglak Lee, Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis, NIPS 2015.

Page 75

Backup slides
REZENDE ET AL., NIPS 2016

Page 76

Rezende et al. (NIPS 2016)
• Construct the underlying 3D structures from 2D observations

• Learn a generative model of 3D structures

• Recover the structure from 2D images via inference

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 77

Rezende et al. (NIPS 2016)
• Consider a conditional latent variable model

x: observed image or volume

c: observed contextual information (nothing, an object class label, or one or more views of the scene from different cameras)

z: low-dimensional codes on a latent manifold of object shapes

h: 3D representation (volume or mesh)

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 78

Rezende et al. (NIPS 2016)
• Proposed framework

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 79

Rezende et al. (NIPS 2016)
• Sequential generative process: refinement of hidden representations

• Each step generates an independent set of z_t

• f_read: task-dependent context encoder

• f_state: transition function (fully connected LSTM)

• f_write: volumetric spatial transformer (Jaderberg et al. NIPS 2015)

• Proj: projection operator from the latent 3D representation h_T to the training data’s domain

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 80

Rezende et al. (NIPS 2016)
• Volumetric spatial transformer (VST)

where the applied transformation is a simple affine transformation of a 3D grid of points that uniformly covers the input image

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 81

Rezende et al. (NIPS 2016)
• Projection operator (3 types)

- 3D → 3D (identity)

- 3D → 2D neural network projection (learned)

- 3D → 2D OpenGL projection (fixed)

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 82

Rezende et al. (NIPS 2016)
• Probabilistic volume completion (Necker cube, Primitives, and MNIST3D)

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 83

Rezende et al. (NIPS 2016)
• Class-conditional samples (one-hot encoding of the class as context)

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 84

Rezende et al. (NIPS 2016)
• Recover 3D structure from 2D images

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Page 85

Rezende et al. (NIPS 2016)
• Unsupervised learning of 3D structure (mesh representations)

Rezende et al., Unsupervised Learning of 3D structure from Images, NIPS 2016

Link: https://www.youtube.com/watch?v=stvDAGQwL5c