Image Domain Transfer - MIT Deep Learning...
Jan Kautz, VP of Learning and Perception Research
Image Domain Transfer
Image Domain Transfer: enabling machines to have human-like imagination abilities
[Figure: an input image is mapped by a transfer function F to a domain-transferred image; the transferred image shown is generated by our method]
Example use cases
• Low-res to high-res
• Blurry to sharp
• Noisy to clean
• LDR to HDR
• Thermal to color
• Image to painting
• Day to night
• Summer to winter
• Synthetic to real
Two Approaches
• Example-based: a non-parametric model; the transfer function F is defined by an example image.
• Learning-based: a parametric model; the transfer function is learned by fitting a training dataset.
[Figure: input image → F → domain-transferred image]
Learning-based: F(· | training dataset)
Example-based: F(· | x_example)
Example-based Image Domain Transfer
F(· | x_example)
[Figure: a style photo and a content photo produce a stylized content photo]
x_output = F(x_input | x_example)
Often referred to as Style Transfer
Example-based image domain transfer
• Artistic style transfer: content is a real photo, style is a painting (Gatys et al., Johnson et al., Li et al., Huang et al.)
• Photo style transfer: content is a real photo, style is a real photo (Luan et al., Pitie et al., Reinhard et al.)
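As a concrete instance of an example-based transfer function, artistic style transfer in the style of Gatys et al. matches feature statistics of the example image, commonly the Gram matrix of deep features. A minimal numpy sketch, with random arrays standing in for VGG feature maps (the function names are illustrative, not from any of the cited codebases):

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (C, H, W) feature map: a (C, C) matrix of channel
    correlations. In Gatys et al., this statistic is taken to encode style."""
    C, H, W = feats.shape
    F = feats.reshape(C, H * W)
    return F @ F.T / (H * W)

def style_loss(feats_a, feats_b):
    """Mean squared distance between the Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(feats_a) - gram_matrix(feats_b)) ** 2))

rng = np.random.default_rng(0)
fa = rng.standard_normal((8, 4, 4))  # stand-in for features of image A
fb = rng.standard_normal((8, 4, 4))  # stand-in for features of image B
```

In the full method this loss is minimized with respect to the output image (or a feed-forward network is trained to minimize it), alongside a content loss on the raw features.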
Style (painting) Content Output
Style Content Output
FastPhotoStyle
• "A Closed-form Solution to Photorealistic Image Stylization" by Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, Jan Kautz, ECCV 2018
• Code: https://github.com/NVIDIA/FastPhotoStyle
Fast Photo Style
Style
Content
Stylization step Photo smoothing step
We model photo style transfer as a closed-form function mapping.
Content image
Style image
Stylization Step
VGG19 up to conv4_1
Assumption: Covariance matrix of deep features encodes the style information.
H_C H_C^T = E_C Λ_C E_C^T
H_S H_S^T = E_S Λ_S E_S^T

Stylization step: VGG19 encoder → whitening → coloring → decoder
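Under the slide's assumption that the covariance of deep features encodes style, the whitening-coloring transform can be sketched in numpy. Random matrices stand in for the VGG conv4_1 feature matrices H_C and H_S; this is a sketch of the math, not the released FastPhotoStyle code:

```python
import numpy as np

def wct(Hc, Hs, eps=1e-5):
    """Whitening-coloring transform on (C, N) feature matrices:
    whiten the content features, then color them so their covariance
    matches that of the style features."""
    Hc = Hc - Hc.mean(axis=1, keepdims=True)
    mu_s = Hs.mean(axis=1, keepdims=True)
    Hs = Hs - mu_s

    # Content covariance eigendecomposition: Hc Hc^T = Ec Λc Ec^T
    Lc, Ec = np.linalg.eigh(Hc @ Hc.T / Hc.shape[1])
    whitened = Ec @ np.diag(1.0 / np.sqrt(Lc + eps)) @ Ec.T @ Hc

    # Style covariance eigendecomposition: Hs Hs^T = Es Λs Es^T
    Ls, Es = np.linalg.eigh(Hs @ Hs.T / Hs.shape[1])
    colored = Es @ np.diag(np.sqrt(Ls + eps)) @ Es.T @ whitened
    return colored + mu_s

rng = np.random.default_rng(1)
Hc = rng.standard_normal((4, 512))
Hs = np.array([[1.0], [2.0], [3.0], [4.0]]) * rng.standard_normal((4, 512))
out = wct(Hc, Hs)
```

After the transform, the covariance of `out` matches that of the style features, which is exactly the sense in which the style has been transferred.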
Architecture is similar to Li et al., but uses unpooling.
When semantic label maps are available
[Figure: content and style images with label maps, and the resulting output]
Photo Smoothing
Assumption: if we can compute a new image whose pixel values resemble those of the intermediate image, while the similarities between neighboring pixels resemble those of the content image, then we obtain a photorealistic stylization output.
Photo smoothing combines two terms: the pixel values of the intermediate image (Y) and the similarity between neighboring pixels of the content image (a Gaussian or matting affinity W).

Closed-form solution: R* = (1 − α)(I − α D^{−1/2} W D^{−1/2})^{−1} Y
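A toy numpy version of the closed-form smoothing step on a 4-pixel chain graph. The affinity matrix W and α value here are illustrative, not the Gaussian/matting affinities used in the paper:

```python
import numpy as np

def smooth(Y, W, alpha=0.99):
    """Closed-form photo smoothing:
    R* = (1 - alpha) (I - alpha D^{-1/2} W D^{-1/2})^{-1} Y
    Y: (N, C) pixel values of the intermediate (stylized) image,
    W: (N, N) affinity between neighboring pixels of the content image."""
    d = W.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = Dinv_sqrt @ W @ Dinv_sqrt  # normalized affinity
    N = W.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(N) - alpha * S, Y)

# 4 pixels in a chain: each pixel is affine to its neighbors.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
Y = np.array([[0.0], [1.0], [2.0], [3.0]])
R = smooth(Y, W, alpha=0.8)
```

Setting α = 0 ignores the affinity entirely and returns Y unchanged; larger α weights the neighborhood-similarity term more heavily.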
Style Content Output
Comparison
Quantitative results
Learning-based Image Domain Transfer
F ( |Training Dataset)
Generative Adversarial Networks (GANs)
[Diagram: a generator maps a noise code z to an image; a discriminator classifies images as true or false]
p(G(z)) → p(X)
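The adversarial objective that drives p(G(z)) toward p(X) can be written as a pair of losses. A minimal numpy sketch, using the common non-saturating generator loss (an assumption; the slide does not specify the variant):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: maximize log D(x) + log(1 - D(G(z))),
    i.e. minimize the negative of that sum."""
    return -float(np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def g_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z))."""
    return -float(np.mean(np.log(d_fake)))
```

At the equilibrium where the discriminator outputs 0.5 everywhere, the discriminator loss equals 2·log 2, the classic GAN fixed point.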
Supervised vs Unsupervised
Unimodal vs Multimodal
Unimodal: F maps an input to a single output; p(Y|X) = δ(F(X))
Multimodal: F also takes a style code S, so one input can map to many outputs; p(Y|X) = F(X, S)
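A toy illustration of the distinction, with hypothetical linear mappings standing in for learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def unimodal_F(x):
    """Unimodal: a deterministic mapping, p(Y|X) = δ(F(X)) —
    the same input always yields the same output."""
    return 2.0 * x + 1.0

def multimodal_F(x, s):
    """Multimodal: the output also depends on a style code s, p(Y|X) = F(X, S) —
    sampling different s gives different plausible outputs for one input."""
    return 2.0 * x + s

x = np.ones(3)
y1 = multimodal_F(x, rng.standard_normal(3))
y2 = multimodal_F(x, rng.standard_normal(3))
```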
Categorization
• Supervised + unimodal: pix2pix, CRN, SRGAN
• Unsupervised + unimodal: UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN
• Supervised + multimodal: pix2pixHD, vid2vid, BiCycleGAN
• Unsupervised + multimodal: MUNIT
pix2pixHD: supervised and multimodal image domain transfer
• "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs" by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro, CVPR 2018
• Code: https://github.com/NVIDIA/pix2pixHD
pix2pixHD
pix2pix
[Diagram: the generator maps a semantic map S to an image; the discriminator classifies (image, S) pairs as true or false]
p(G(S), S) → p(X, S)
pix2pixHD
[Diagram: pix2pixHD adds a feature encoder that produces a low-dimensional feature map, instance-pooled per object, which is fed to the generator together with the semantic map]
Once training finishes, we extract the instance-pooled feature maps from all training images and fit a mixture-of-Gaussians model to the features belonging to each semantic class. During inference, we sample from these mixture-of-Gaussians distributions to achieve multimodal image translation.
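A simplified numpy sketch of this procedure, fitting a single Gaussian per semantic class rather than a full mixture (the function names are illustrative, not from the pix2pixHD code):

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Fit one Gaussian (mean, covariance) per semantic class to
    instance-pooled feature vectors — a stand-in for the
    mixture-of-Gaussians fit described above."""
    params = {}
    for c in np.unique(labels):
        f = features[labels == c]
        params[c] = (f.mean(axis=0), np.cov(f, rowvar=False))
    return params

def sample_feature(params, c, rng):
    """At inference, sample a feature vector for class c to pick one mode."""
    mu, cov = params[c]
    return rng.multivariate_normal(mu, cov)

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (50, 3)),   # class 0 features
                   rng.normal(5.0, 1.0, (50, 3))])  # class 1 features
labels = np.array([0] * 50 + [1] * 50)
params = fit_class_gaussians(feats, labels)
sampled = sample_feature(params, 1, rng)
```

Feeding a different sampled feature to the generator produces a different plausible output for the same semantic map, which is what makes the method multimodal.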
Coarse-to-fine generator built from residual blocks.
Multi-scale discriminators (D1, D2, D3) and a robust objective: the GAN loss plus a discriminator feature-matching loss, which matches the intermediate discriminator features of real and synthesized images.
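The feature-matching term can be sketched in numpy; the lists below stand in for intermediate activations collected from a discriminator on a real and a synthesized image (names are illustrative):

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Discriminator feature-matching loss: L1 distance between the
    intermediate discriminator features of a real and a synthesized
    image, averaged over layers (and, in pix2pixHD, also over the
    multi-scale discriminators D1, D2, D3)."""
    return float(np.mean([np.mean(np.abs(r - f))
                          for r, f in zip(real_feats, fake_feats)]))

real = [np.ones((2, 2)), np.zeros(3)]   # stand-ins for two layers' features
fake = [np.ones((2, 2)), np.ones(3)]
```

Matching intermediate features, rather than only the discriminator's final score, gives the generator a denser and more stable training signal.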
pix2pixHD multimodal results
pix2pixHD label changes
Comparison
Categorization
• Supervised + unimodal: pix2pix, CRN, SRGAN
• Unsupervised + unimodal: UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN
• Supervised + multimodal: pix2pixHD, vid2vid, BiCycleGAN
• Unsupervised + multimodal: MUNIT
vid2vid: Video-to-Video Synthesis
• "Video-to-Video Synthesis" by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro, NIPS 2018
• Code: https://github.com/NVIDIA/vid2vid
Motivation
Using pix2pixHD
Spatio-temporally progressive training: the generator grows spatially (low to high resolution) and temporally (short to long sequences), with residual blocks added at each stage and training alternating between spatial (S) and temporal (T) steps.
[Architecture: a sequential generator takes the current semantic maps and past images, predicts a flow map to warp (W) the previous frame, and blends the warped frame with an intermediate synthesized image to produce the final image; multi-scale image and video discriminators (D1, D2, D3) judge the results]
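A toy numpy sketch of the warp-and-blend step of a sequential generator, using integer flow and a soft mask (function names and the blending form are illustrative simplifications):

```python
import numpy as np

def warp(prev_frame, flow):
    """Backward-warp the previous frame with an integer flow field:
    flow[y, x] = (dy, dx) says where output pixel (y, x) samples from."""
    H, W = prev_frame.shape
    out = np.zeros_like(prev_frame)
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y, x]
            sy = np.clip(y + dy, 0, H - 1)
            sx = np.clip(x + dx, 0, W - 1)
            out[y, x] = prev_frame[sy, sx]
    return out

def compose_frame(warped, hallucinated, mask):
    """Blend the flow-warped previous frame with the newly synthesized
    (intermediate) image using a soft occlusion mask."""
    return mask * warped + (1.0 - mask) * hallucinated

prev = np.arange(9.0).reshape(3, 3)
zero_flow = np.zeros((3, 3, 2), dtype=int)
```

Reusing the warped previous frame wherever the flow is reliable is what gives the generated video its temporal consistency; the hallucinated image fills only the occluded or newly revealed regions.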
Results
AI Rendered Game
Categorization
• Supervised + unimodal: pix2pix, CRN, SRGAN, …
• Unsupervised + unimodal: UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN
• Supervised + multimodal: pix2pixHD, vid2vid, BiCycleGAN
• Unsupervised + multimodal: MUNIT
UNIT: unsupervised and unimodal image domain transfer
• "Unsupervised Image-to-image Translation Networks" by Ming-Yu Liu, Thomas Breuel, Jan Kautz, NIPS 2017
• Code: https://github.com/mingyuliutw/UNIT
Supervised vs Unsupervised
[Diagram: one GAN per domain — generator 1 with discriminator 1 for domain 2 outputs, generator 2 with discriminator 2 for domain 1 outputs]

p(G1(X1)|X1) → p(X2)
p(G2(X2)|X2) → p(X1)

But:
p(G1(X1)|X1) ↛ p(X2|X1)
p(G2(X2)|X2) ↛ p(X1|X2)

Adversarial training alone matches each translated distribution to the target domain's marginal; it does not guarantee that the translation preserves the content of the input (the conditional).
UNIT assumption: Shared Latent Space
The two mapping functions are coupled via weight sharing.

[Diagram: encoder 1 and encoder 2 map images from the two domains into a shared latent space; decoder 1 and decoder 2 map latent codes back to each domain; discriminator 1 and discriminator 2 judge the generated images. Translation composes encoder 1 → decoder 2, or encoder 2 → decoder 1]
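A toy sketch of the shared-latent-space assumption with linear "encoders" and "decoders"; the scalars A1, b1, A2, b2 are invented for illustration and stand in for learned networks:

```python
import numpy as np

# UNIT assumption: corresponding images in the two domains map to the
# SAME latent code z, so translation is x1 -> E1 -> z -> D2 -> x2.
A1, b1 = 2.0, 1.0    # domain 1 = z scaled/shifted by (A1, b1)
A2, b2 = 0.5, -3.0   # domain 2 = z scaled/shifted by (A2, b2)

def E1(x1):
    """Encoder 1: domain 1 -> shared latent space."""
    return (x1 - b1) / A1

def D2(z):
    """Decoder 2: shared latent space -> domain 2."""
    return A2 * z + b2

def translate_1_to_2(x1):
    return D2(E1(x1))

z = np.linspace(-1.0, 1.0, 5)
x1 = A1 * z + b1   # images in domain 1
x2 = A2 * z + b2   # the corresponding images in domain 2
```

Because both domains are generated from the same z, translating domain-1 images through the shared latent space recovers their domain-2 counterparts exactly in this toy setting.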
Day to Night Translation
Input Translated Input Translated
Resolution: 640×480
Input Translated Input Translated
Snowy to Summery Translation (resolution 640×480)
Input Translated Input Translated
Sunny to Rainy Translation (resolution 640×480)
Categorization
• Supervised + unimodal: pix2pix, CRN, SRGAN
• Unsupervised + unimodal: UNIT, Coupled GAN, DTN, DiscoGAN, CycleGAN, DualGAN, StarGAN
• Supervised + multimodal: pix2pixHD, BiCycleGAN
• Unsupervised + multimodal: MUNIT
MUNIT: unsupervised and multimodal image domain transfer
• "Multimodal Unsupervised Image-to-image Translation" by Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz, ECCV 2018
• Code: https://github.com/NVlabs/MUNIT
MUNIT assumption: Partially Shared Latent Space
[Diagram: content encoder 1 and content encoder 2 map both domains into a shared content latent space, while style encoder 1 and style encoder 2 produce domain-specific style codes; decoder 1 and decoder 2 reconstruct images from (content, style) pairs, judged by discriminator 1 and discriminator 2]
MUNIT network
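A toy sketch of the partially shared latent space: a representation splits into a content code (shared across domains) and a style code (domain-specific), and translation recombines content from one image with a style code of the other domain. Dimensions and function names are illustrative:

```python
import numpy as np

def encode(x, content_dim):
    """Split a feature vector into (content, style) codes."""
    return x[:content_dim], x[content_dim:]

def decode(content, style):
    """Reassemble a representation from content + style codes."""
    return np.concatenate([content, style])

def translate(x1, x2_style, content_dim):
    """Keep the content of x1, adopt a style code from domain 2."""
    c1, _ = encode(x1, content_dim)
    return decode(c1, x2_style)

x1 = np.arange(6.0)            # stand-in for an encoded domain-1 image
style2 = np.array([9.0, 9.0])  # stand-in for a sampled domain-2 style code
out = translate(x1, style2, content_dim=4)
```

Sampling different domain-2 style codes for the same content produces many translations of one input, which is what makes MUNIT multimodal as well as unsupervised.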
MUNIT results
MUNIT style transfer: encode the content with content encoder 1, take the style code from style encoder 2, and decode with decoder 2.

Style Transfer Comparison
Conclusion
• Example-based image domain transfer
• Learning-based image domain transfer, categorized as supervised vs. unsupervised and unimodal vs. multimodal
100 Best Companies to Work For— Fortune
Employees’ Choice: Highest Rated CEOs— Glassdoor
Most Innovative Companies— Fast Company
World’s Most Admired Companies— Fortune
50 Smartest Companies— MIT Tech Review
JOIN NVIDIA
YOUR LIFE’S WORK STARTS HERE
INTERESTED? Email: [email protected]