Image Domain Transfer - MIT Deep Learning...

Post on 04-Jun-2020

4 views 0 download

Transcript of Image Domain Transfer - MIT Deep Learning...

Jan Kautz, VP of Learning and Perception Research

Image Domain Transfer

Image Domain Transfer: enabling machines to have human-like imagination abilities

Input image Domain transferred image

F

This image is generated by our method.

Example use cases

Low-res to high-res Blurry to sharp

Noisy to clean

LDR to HDR Thermal to color

Image to painting

Day to night Summer to winter

Synthetic to real

Two Approaches

• Example-based§ Non-parametric model§ The transfer function F is defined by an example image.

• Learning-based§ Parametric model§ The transfer function is learned via fitting a training dataset.

Input image Domain transferred image

F

F ( |Training Dataset)

F ( |xexample)

Example-based Image Domain Transfer

F ( |xexample)

Example-based image domain transfer

Stylized content photoContent photoStyle photo

xoutput = F (xinput|xexample)

Often referred to as Style Transfer

Example-based image domain transfer• Artistic Style Transfer• Content: real photo; • Style: painting• Gatys et. al. , Johnson

et. al., Li et al., Huang et. al.

• Photo Style Transfer• Content: real photo; • Style: real photo• Luan et. al., Pitie et.

al., Reinhard et. al.

Style (painting) Content Output

Style Content Output

FastPhotoStyle

• ”A Closed-form Solution to Photorealistic Image Stylization” by Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, Jan Kautz,ECCV 2018• Code: https://github.com/NVIDIA/FastPhotoStyle

Fast Photo Style

Style

Content

Stylization step Photo smoothing step

We model photo style transfer as a close-form function mapping given by

Content image

Style image

Stylization Step

VGG19 up to conv4_1

Assumption: Covariance matrix of deep features encodes the style information.

HCHTC = EC⇤CE

TC

HSHTS = ES⇤SE

TS

Stylization StepDecoder

whitening coloring

Architecture is similar to Li et al., but uses unpooling.

When semantic label maps are availableContent Style

ContentStyle Output

Photo Smoothing

Assumption: If we can compute a new image where the image pixel values resemble those in the intermediate image but the similarities between neighboring pixels resemble those in the content image, then we have a photorealistic stylization outputs.

Style

Content

Stylization step Photo smoothing step

Intermediate image pixel values

Similarity between neighboring pixels(Gaussian/matting affinity)

R⇤ = (1� ↵)(I � ↵D� 12WD� 1

2 )�1YClose-form solution:

ContentStyle Output

ContentStyle Output

Comparison

Comparison

Quantitative results

Learning-based Image Domain Transfer

F ( |Training Dataset)

Generative Adversarial Networks (GANs)

Generator Discriminator

Discriminator

z

True

False

p(G(z)) ! p(X)

Supervised Unsupervised

Supervised vs Unsupervised

Unimodal vs Multimodal

F ( ) =

F ( ) =

Unimodal

Multimodal

p(Y |X) = �(F (X))

p(Y |X) = F (X,S)

Categorization

Supervised Unsupervised

Unimodal pix2pix, CRN, SRGAN UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN

Multimodal pix2pixHD, vid2vid, BiCycleGAN

MUNIT

Categorization

Supervised Unsupervised

Unimodal pix2pix, CRN, SRGAN UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN

Multimodal pix2pixHD, vid2vid, BiCycleGAN

MUNIT

pix2pixHD: Supervised and multimodalimage domain transfer• ”High-Resolution Image Synthesis and Semantic Manipulation with

Conditional GANs” by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro, CVPR 2018

• Code: https://github.com/NVIDIA/pix2pixHD

pix2pixHD

pix2pix

Generator Discriminator

True

False

Discriminator

p(G(S), S) ! p(X,S)

pix2pixHD

Generator Discriminator False

FeatureEncoder

Low-dimensional feature map

InstancePooling

Once we finish training, we extract the instance-pooled feature maps from all the training images and fit a mixture of Gaussian model to features belong to the same semantic class.

During inference, we sample from the mixture of Gaussian distributions for achieving multimodal image translation.

D3

D2

Multi-scale Discriminators Robust Objective(GAN + discriminator feature matching loss)

D1

Coarse-to-fine Generator

... Residual blocks

Residual blocks

...

2real synthesized

Di Di

real synthesized

match

match

real

pix2pixHD multimodal results

pix2pixHD label changes

Comparison

Categorization

Supervised Unsupervised

Unimodal pix2pix, CRN, SRGAN UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN

Multimodalpix2pixHD, vid2vid, BiCycleGAN

MUNIT

vid2vid: Video-to-Video Synthesis

• ”Video-to-Video Synthesis” by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro, NIPS 2018

• Code: https://github.com/NVIDIA/vid2vid

Motivation

Using pix2pixHD

...

Spatially progressive

Temporally progressive

Spatio-temporally Progressive Training

...

Residual blocks

Alternating training

T

T T

SS S

Image Discriminator Video Discriminator

D1

D2

D3

Final Image

Interm

ediate

Flow map

Sequential Generator

W

Warped

Semantic

maps

Past im

ages

Multi-scale Discriminators

Results

Results

AI Rendered Game

Categorization

Supervised Unsupervised

Unimodal pix2pix, CRN, SRGAN, … UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN

Multimodal pix2pixHD, vid2vid, BiCycleGAN

MUNIT

UNIT: Unsupervised and unimodal image domain transfer• ”Unsupervised Image-to-image Translation Networks” by

Ming-Yu Liu, Thomas Breuel, Jan Kautz, NIPS 2017• Code: https://github.com/mingyuliutw/UNIT

Supervised Unsupervised

Supervised vs Unsupervised

Generator 1 Discriminator 1

True

False

Discriminator 1

Generator 2 Discriminator 2

True

False

Discriminator 2

p(G1(X1)|X1) ! p(X2)

p(G2(X2)|X2) ! p(X1)

p(G1(X1)|X1) 9 p(X2|X1)But

p(G2(X2)|X2) 9 p(X1|X2)But

UNIT assumption: Shared Latent Space

Coupling the mapping

function via weight-sharing

Domain 2 GAN Discriminator

Encoder 1 Discriminator 1

True

False

Discriminator 1

Discriminator 2

True

False

Discriminator 2

Decoder 2

Encoder 2

Decoder 1

Shared latent space

Encoder 1

Decoder 2

Encoder 2

Decoder 1

Day to Night Translation

Input Translated Input Translated

Resolution640x480

Input Translated Input Translated

Snowy to Summery TranslationResolution640x480

Input Translated Input Translated

Sunny to Rainy TranslationResolution640x480

Categorization

Supervised Unsupervised

Unimodal pix2pix, CRN, SRGAN UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN

Multimodal pix2pixHD, BiCycleGAN MUNIT

MUNIT: Unsupervised and multimodalimage domain transfer• ”Multimodal Unsupervised Image-to-image Translation” by

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz, ECCV 2018

• Code: https://github.com/NVlabs/MUNIT

MUNIT assumption: Partially Shared Latent Space

Content Encoder

1Discriminator 1

True

False

Discriminator 1

Discriminator 2

True

False

Discriminator 2

Decoder 2

Content Encoder

2

Decoder 1

Shared contentlatent space

Style Encoder

2

Style Encoder

1

MUNIT network

MUNIT results

MUNIT Style TransferContent Encoder

1

Decoder 2

Style Encoder

2

Styl

e T

ransf

er

Com

par

ison

Conclusion• Example-based image domain transfer• Learning-based image domain transfer

Learning-based Unimodal MultimodalSupervised

Unsupervised

77

100 Best Companies to Work For— Fortune

Employees’ Choice: Highest Rated CEOs— Glassdoor

Most Innovative Companies— Fast Company

World’s Most Admired Companies— Fortune

50 Smartest Companies— MIT Tech Review

JOIN NVIDIA

YOUR LIFE’S WORK STARTS HERE

INTERESTED? Email: aijobs@nvidia.com