Transcript of slides: Semantic Image Synthesis with Spatially-Adaptive Normalization (ECS 269, Fall 2019)
Semantic Image Synthesis with Spatially-Adaptive Normalization
Presented by: Ayushi Bansal, Bhargav Sundararajan, Mridula Gupta
Paper authors: Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu
Problem - Semantic Image Synthesis
Generate a photorealistic image from a given input semantic layout.
Normalization
● Adjusting values measured on different scales to a notionally common scale.
● Done by subtracting the mean and dividing by the standard deviation.
● Usually a data preprocessing step.
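As a quick illustration, this kind of standardization takes only a few lines of plain Python (a minimal sketch using the standard library; `z_score` is an illustrative helper name):

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

data = [2.0, 4.0, 6.0, 8.0]
normalized = z_score(data)
print([round(v, 3) for v in normalized])  # → [-1.342, -0.447, 0.447, 1.342]
```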
Normalization
In deep learning, it is used to adjust and scale the activations inside a neural network.
http://gradientscience.org/batchnorm/
Normalization (Batch Norm)
Batch Norm normalizes each activation x using the mini-batch mean μ_B and variance σ_B², then applies trainable scale and shift parameters (γ, β):

y = γ · (x − μ_B) / sqrt(σ_B² + ε) + β
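A minimal pure-Python sketch of this computation (illustrative helper names; real implementations additionally track running statistics for use at inference time):

```python
from statistics import mean, pstdev

def batch_norm(batch, gamma, beta, eps=1e-5):
    """batch: list of samples, each a list of per-channel activations.
    Normalizes each channel over the whole batch, then applies the
    trainable scale (gamma) and shift (beta)."""
    num_channels = len(batch[0])
    out = [[0.0] * num_channels for _ in batch]
    for c in range(num_channels):
        column = [sample[c] for sample in batch]
        mu = mean(column)
        var = pstdev(column) ** 2
        for n, sample in enumerate(batch):
            x_hat = (sample[c] - mu) / (var + eps) ** 0.5
            out[n][c] = gamma[c] * x_hat + beta[c]
    return out

acts = [[1.0, 10.0], [3.0, 30.0]]  # N=2 samples, C=2 channels
out = batch_norm(acts, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(out)  # each channel now has zero mean and (near-)unit variance
```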
Why Normalization between layers?
It ensures that the output statistics of the layer stay fixed.
Unconditional Normalization
● No dependence on external data during the normalization process
● Types
○ Batch Normalization
○ Instance Normalization
○ Layer Normalization
○ Group Normalization
○ Weight Normalization
http://luoping.me/post/family-normalization/
Unconditional Normalization
[Figure: axis diagrams of the normalization schemes over the (N, C, H, W) activation tensor]
● Batch Norm: single channel over multiple training instances
● Layer Norm: multiple channels over a single training instance
● Instance Norm: single channel over a single training instance
Legend: H = height of image, W = width of image, C = channels, N = number of instances in a batch
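The difference between these schemes is only which axes the mean and variance are pooled over; a small sketch with toy activations stored as nested lists indexed [N][C][pixel] (illustrative names):

```python
def stats(values):
    """Return (mean, variance) of a flat list of values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# Toy activations: N=2 instances, C=2 channels, H*W=3 pixels (flattened).
x = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
     [[4.0, 5.0, 6.0], [40.0, 50.0, 60.0]]]

# Batch Norm: one (mu, var) per channel, pooled over N, H, W.
bn = [stats([p for n in range(2) for p in x[n][c]]) for c in range(2)]

# Layer Norm: one (mu, var) per instance, pooled over C, H, W.
ln = [stats([p for c in range(2) for p in x[n][c]]) for n in range(2)]

# Instance Norm: one (mu, var) per (instance, channel) pair, over H, W.
inorm = [[stats(x[n][c]) for c in range(2)] for n in range(2)]

print(bn[0], ln[0], inorm[0][0])
```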
Conditional Normalization
● External data is used to condition the normalization.
● Used in style transfer and visual question answering.
● Examples: Conditional Instance Normalization and Adaptive Instance Normalization.
SPADE Model - Conditional Normalization
Unlike prior conditional normalization methods, γ and β are not vectors but tensors with spatial dimensions; hence the name Spatially-Adaptive Normalization.
SPADE Model - Conditional Normalization
Learned modulation parameters γ and β are applied to the normalized previous activation h at each site (n ∈ N, c ∈ C_i, y ∈ H_i, x ∈ W_i):

γ_{c,y,x}(m) · (h_{n,c,y,x} − μ_c) / σ_c + β_{c,y,x}(m)

where μ_c and σ_c are the mean and standard deviation of the activations in channel c, and m is the input segmentation mask.
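The SPADE modulation can be sketched in pure Python for a single channel. Note that in the real model γ and β are produced by a small convolutional network over the mask; here a per-class lookup table stands in for it, and all names are illustrative:

```python
def spade(h, mask, gamma_table, beta_table, eps=1e-5):
    """Spatially-adaptive normalization for one channel.
    h: [N][P] activations (P = flattened H*W spatial sites)
    mask: [N][P] integer class label at each spatial site
    gamma_table/beta_table: per-class modulation values (a stand-in
    for the conv net over the mask used in the real model)."""
    flat = [v for sample in h for v in sample]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    out = []
    for sample, labels in zip(h, mask):
        row = []
        for v, label in zip(sample, labels):
            x_hat = (v - mu) / (var + eps) ** 0.5
            # gamma and beta depend on the spatial site via its label
            row.append(gamma_table[label] * x_hat + beta_table[label])
        out.append(row)
    return out

h = [[1.0, 2.0, 3.0, 4.0]]
mask = [[0, 0, 1, 1]]  # two semantic classes across the spatial sites
out = spade(h, mask, gamma_table=[1.0, 2.0], beta_table=[0.0, 0.5])
print(out)
```

The key point: because γ and β vary with spatial position through the mask, the normalization no longer washes out the label information.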
GANs (Generative Adversarial Networks)
https://developers.google.com/machine-learning/gan/gan_structure
GANs (Generative Adversarial Networks)
● Generator: learns to generate plausible images
● Discriminator: learns to distinguish the generator's fake images from real images
● Training
○ Generator and discriminator are trained alternately (1 epoch each)
○ Training continues until the results converge
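The alternating training schedule can be sketched structurally (stub update functions; illustrative only, with no actual learning):

```python
def train_gan(num_epochs, d_step, g_step):
    """Alternate discriminator and generator updates, one epoch each.
    d_step/g_step are placeholder update functions standing in for
    the real gradient steps."""
    for epoch in range(num_epochs):
        d_step(epoch)  # discriminator learns real vs. fake
        g_step(epoch)  # generator learns to fool the discriminator

schedule = []
train_gan(3,
          d_step=lambda e: schedule.append("D"),
          g_step=lambda e: schedule.append("G"))
print(schedule)  # → ['D', 'G', 'D', 'G', 'D', 'G']
```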
Conditional GANs
In addition to random noise (for the generator) and an input image (for the discriminator), a one-hot encoded label vector Y is given as input.
https://medium.com/@connorshorten300/conditional-gans-639aa3785122
Pix2PixHD Model
Training
● Train the global generator G1 separately first
● Fine-tune the whole network with G1 and G2
Global Generator G1
Local Enhancer Generator G2
Pix2PixHD Model
Two main contributions:
1. Instance-level object segmentation information is used, which can separate different object instances within the same category.
2. Diverse results can be generated from the same input label map, allowing the user to edit the appearance of the same object interactively.
Motivation - Pix2PixHD
Normalization "washes away" semantic information when applied to a uniform or flat segmentation mask.
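This can be seen directly: instance-normalizing two different but uniform masks yields identical all-zero outputs, so the label information is lost (a minimal sketch):

```python
def instance_norm(pixels, eps=1e-5):
    """Normalize one channel of one instance over its spatial sites."""
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    return [(p - mu) / (var + eps) ** 0.5 for p in pixels]

# Two different flat segmentation masks: all "sky" (label 3) vs. all
# "grass" (label 7). After normalization, both become all zeros, so the
# downstream layers can no longer tell which label was given.
sky = instance_norm([3.0] * 4)
grass = instance_norm([7.0] * 4)
print(sky, grass)  # → [0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0]
```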
SPADE resolves this Pix2PixHD problem.
The Model Architecture
● Image Encoder: encodes a real image into mean and variance vectors
● Generator: generates images from the sampled noise distribution
● Discriminator: distinguishes synthesized images from ground truth and propagates the loss
SPADE Model - Residual Block
SPADE Model - Generator
Discriminator
Loss Function
L_FM compares the feature maps of every intermediate layer of the discriminator:

L_FM(G, D) = E_{(s,x)} Σ_{i=1..T} (1/N_i) · || D^(i)(s, x) − D^(i)(s, G(s)) ||_1

where T is the number of intermediate layers and N_i the number of elements in layer i.
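A sketch of such a feature-matching loss in plain Python, assuming each layer's feature map is flattened to a list (the full method additionally averages over multiple discriminators, omitted here; names are illustrative):

```python
def fm_loss(real_feats, fake_feats):
    """Feature-matching loss: sum over layers of the mean L1 distance
    between the discriminator's features for the real image and for
    the synthesized image.
    real_feats/fake_feats: list of T layers, each a flat feature list."""
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    return total

real = [[1.0, 2.0], [0.5, 0.5, 0.5]]  # features from T=2 layers
fake = [[1.0, 4.0], [0.0, 1.0, 0.5]]
print(fm_loss(real, fake))  # → 1.3333333333333333
```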
Loss - Hinge Loss
For a real image, the discriminator loss term is:

L_D^real = −E_{(s,x)} [ min(0, −1 + D(s, x)) ]
Loss - Hinge Loss
For a generated image, the discriminator loss term is:

L_D^fake = −E_{s,z} [ min(0, −1 − D(s, G(s, z))) ]
Loss - Hinge Loss
The loss of the generator is:

L_G = −E_{s,z} [ D(s, G(s, z)) ]
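The standard GAN hinge losses can be sketched as follows (plain Python, batch means over raw discriminator scores; names are illustrative):

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: penalize real scores below +1 and
    fake scores above -1 (scores are raw discriminator outputs)."""
    real_term = sum(-min(0.0, -1.0 + s) for s in real_scores) / len(real_scores)
    fake_term = sum(-min(0.0, -1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on fakes."""
    return -sum(fake_scores) / len(fake_scores)

print(d_hinge_loss(real_scores=[2.0, 0.5], fake_scores=[-2.0, 0.0]))  # → 0.75
print(g_hinge_loss([-2.0, 0.0]))  # → 1.0
```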
Image Encoder - Style Transfer
Datasets
● COCO-Stuff: derived from COCO; 182 classes; 118k training / 5k validation images
● ADE20K: 150 classes; 20.2k training / 2k validation images
● ADE20K-outdoor: subset of ADE20K containing only outdoor images
● Cityscapes: street scenes in German cities; 3k training / 500 validation images
● Flickr Landscapes: 41k training / 1k validation images
Experimental Setup
Hyperparameters:
- Learning rates (generator / discriminator): 0.0001 and 0.0004
- ADAM optimizer: β1 = 0, β2 = 0.999
System setup: NVIDIA DGX-1 with 8 V100 GPUs
Baseline models:
1. pix2pixHD: state-of-the-art GAN-based model
2. CRN (Cascaded Refinement Network): a deep network that refines the output from low to high resolution
3. SIMS (Semi-parametric IMage Synthesis): composites real segments from the training set and refines the boundaries
Results - Qualitative on ADE20K outdoor & Cityscapes
Results - Multimodal
Results - Semantic and Style Control
Experiment - Metrics
Segmentation label comparison:
1. mIoU (mean Intersection over Union)
   IoU = area of overlap / area of union; mIoU = mean IoU over all classes
2. Accu (pixel accuracy): % of pixels classified correctly
Image comparison:
- FID (Fréchet Inception Distance): distance between the distribution of synthesized results and the distribution of real images
For mIoU and pixel accuracy, higher is better. For FID, lower is better.
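The two segmentation metrics are easy to compute from per-pixel labels (a minimal sketch; the real evaluation runs a pretrained segmentation network on the synthesized images first):

```python
def miou_and_accuracy(pred, truth, num_classes):
    """pred, truth: flat lists of per-pixel class labels.
    Returns (mIoU over classes present, pixel accuracy)."""
    correct = sum(p == t for p, t in zip(pred, truth))
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious), correct / len(pred)

pred  = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
miou, accu = miou_and_accuracy(pred, truth, num_classes=2)
print(miou, accu)  # → 0.5833333333333333 0.75
```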
Result - Quantitative
Outperforms current leading methods in semantic segmentation scores
Result - Human Evaluation
Users preferred the results of the proposed method over the competing methods
500 questions were randomly generated for each dataset, and each question was answered by 5 different human evaluators.
Result - Ablation Study
● pix2pixHD++: pix2pixHD plus all the new techniques except SPADE
● pix2pixHD++ w/ Concat: concatenates the segmentation mask input at all intermediate layers
● pix2pixHD++ w/ SPADE: the strong baseline with SPADE
● Models with different-capacity generators are also compared
Comparison is done with respect to the number of parameters and mIoU scores.
Result - SPADE Generator Variations
● Two inputs: segmentation map and random noise
● Varying kernel size, capacity, and type of normalization
All scores are mIoU; the bold entries mark the standard configuration used.
Contributions
● The semantics of the image are captured by adding SPatially-Adaptive DEnormalization (SPADE) to the pix2pixHD architecture.
● Users can control both semantics and style for image synthesis.
Strengths
● Diverse results for the same input, depending on the user's interaction with objects (style and semantic control)
● The SPADE ResBlk can be integrated into any existing architecture
● Fewer trainable parameters and improved efficiency compared to the previous state of the art, pix2pixHD
● A semantic image synthesis interface (live demo) lets users interact and paint on a canvas
Weaknesses
● Does not train well for fine-grained details such as faces
● Limited number of classes in the demo
● Can easily be forced to generate unnatural shapes
Future Work
● Can be extended to video synthesis
● The technique can also be used for the image super-pixelation task
● An in-the-wild technique could be integrated to constrain the shapes of the generated objects
Questions?