Joint Caption Detection and Inpainting using Generative Network
Anubha Pandey, Vismay Patel
Indian Institute of Technology Madras
19th September 2018
Overview
1 Problem Statement
2 Introduction
3 Proposed Solution
4 Network Architecture
5 Training
6 Results
7 Conclusion
8 Future Work
Problem Statement
Chalearn LAP Inpainting Competition Track2 - Video Decaptioning
Objective: To develop algorithms that can inpaint video frames containing text overlays of varying size, background, color, and location.
Dataset:
- Consists of (X, Y) pairs, where X is a 5-second video clip and Y is the corresponding target (decaptioned) clip.
- 150 hours of diverse video clips at 128x128 RGB pixels, containing both captioned and decaptioned versions, taken from YouTube.
- 70,000 training samples and 5,000 samples in the validation and test sets.
Introduction
Video decaptioning involves two tasks: caption detection and general video inpainting.
Existing patch-based video inpainting methods search for complete spatio-temporal patches to copy into the missing area.
Despite recent advances in machine learning, fast (real-time) and accurate automatic text removal in video sequences remains challenging.
Related Works
More recently, Globally and Locally Consistent Image Completion [1] (SIGGRAPH 2017) has shown promising results for the task of image inpainting. It improves results by introducing local and global discriminators. In addition, it uses dilated convolutions to increase the receptive field, replacing the fully connected layers adopted in context encoders.
Proposed Solution: Frame-Level Inpainting and Caption Detection
We propose a generative CNN that performs joint caption detection and decaptioning in an end-to-end fashion. Our network is inspired by Globally and Locally Consistent Image Completion [1].
The network has two branches, one for the image generation task and one for the mask generation task.
Both branches share parameters up to the first three convolution layers; the layers thereafter are trained independently.
Inputs:
- Frames from the captioned videos.
- The caption masks, extracted by taking the difference between the corresponding frames of the ground-truth decaptioned videos and the input captioned videos (a sketch of this extraction follows below).
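As an illustration of the mask-extraction step, here is a minimal NumPy sketch. The frame-difference idea comes from the slides; the uint8 frame representation and the noise threshold `thresh` are assumptions made for the example.

```python
import numpy as np

def extract_caption_mask(captioned: np.ndarray, decaptioned: np.ndarray,
                         thresh: int = 10) -> np.ndarray:
    """Binary caption mask from a (captioned, ground-truth decaptioned) frame pair.

    Frames are HxWx3 uint8 arrays; `thresh` absorbs compression noise and is
    an assumed value, not one given in the slides.
    """
    diff = np.abs(captioned.astype(np.int16) - decaptioned.astype(np.int16))
    return (diff.max(axis=-1) > thresh).astype(np.uint8)  # 1 where the overlay differs
```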
Network Architecture
Figure: Architecture of the discriminator module of the inpainting network. Each building block is described in the building-blocks figure.
Figure: Building blocks of the network.
Figure: Architecture of the generator module of the inpainting network. Building blocks are shown in the building-blocks figure.
Figure: Architecture of the discriminator module of the inpainting network. Building blocks are shown in the building-blocks figure.
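To make the shared-stem, two-branch design concrete, here is a minimal PyTorch sketch. Only the structure (parameters shared up to the first three convolution layers, then independent image and mask branches) comes from the slides; all channel widths, kernel sizes, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchGenerator(nn.Module):
    """Shared encoder (first three conv layers) feeding two heads:
    one reconstructs the decaptioned frame, one predicts the caption mask.
    All layer sizes below are assumptions, not the authors' configuration."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(              # parameters shared by both tasks
            nn.Conv2d(3, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.image_branch = nn.Sequential(        # trained independently after the stem
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )
        self.mask_branch = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.shared(x)
        return self.image_branch(h), self.mask_branch(h)
```

In this layout, gradients from both the frame-reconstruction and the mask-prediction losses flow into the shared stem, while each head specializes after the split.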
Loss Functions
The following loss functions are used to train the network:
Reconstruction Loss [2]:
$$L_r = \frac{1}{K}\sum_{i=1}^{K}\left|I^i_y - I^i_{\mathrm{imitation}}\right| + \alpha \cdot \frac{1}{K}\sum_{i=1}^{K}\left(I^i_{\mathrm{Mask}} - O^i_{\mathrm{Mask}}\right)^2$$
where $K$ is the batch size and $\alpha = 10^{-6}$.

Adversarial Loss [2]:
$$L_{\mathrm{real}} = -\log(p), \qquad L_{\mathrm{fake}} = -\log(1-p), \qquad L_d = L_{\mathrm{real}} + \beta \cdot L_{\mathrm{fake}}$$
where $p$ is the output probability of the discriminator module and $\beta = 0.01$ (a hyperparameter).

Perceptual Loss [3]:
$$L_p = \frac{1}{K}\sum_{i=1}^{K}\left(\phi(I^i_y) - \phi(I^i_{\mathrm{imitation}})\right)^2$$
where $\phi$ denotes features from a VGG16 network pretrained on the Microsoft COCO dataset.
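These losses translate almost directly into code. Below is a minimal PyTorch sketch; the helper names and the `eps` guard against log(0) are ours, while the weights α and β come from the slides.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 1e-6, 0.01  # weights from the slides

def reconstruction_loss(img_out, img_gt, mask_out, mask_gt):
    # L1 on the inpainted frame plus weighted MSE on the predicted mask
    return F.l1_loss(img_out, img_gt) + ALPHA * F.mse_loss(mask_out, mask_gt)

def adversarial_loss(p_real, p_fake):
    # L_d = -log(p_real) + beta * (-log(1 - p_fake)); eps avoids log(0)
    eps = 1e-8
    return (-torch.log(p_real + eps).mean()
            + BETA * (-torch.log(1 - p_fake + eps)).mean())

def perceptual_loss(phi, img_out, img_gt):
    # phi: VGG16 feature extractor (the slides use MS-COCO-pretrained weights)
    return F.mse_loss(phi(img_out), phi(img_gt))
```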
Training
The network is trained using the Adam optimizer with learning rate 0.006 and batch size 20.
For the first 8 epochs, only the generator module of the network is trained, minimizing the reconstruction and perceptual losses.
For the next 12 epochs, the entire GAN [2] is trained end-to-end, minimizing all three losses: reconstruction, adversarial, and perceptual. A sketch of this schedule follows below.
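The two-phase schedule could be sketched as follows, assuming a `generator` (the two-branch sketch above), a hypothetical `discriminator` returning probabilities, a data `loader`, the feature extractor `phi`, and the loss helpers from the earlier sketch; only the optimizer, learning rate, batch size, and the 8 + 12 epoch split come from the slides.

```python
import torch

# `generator`, `discriminator`, `loader`, `phi`, and the loss helpers are
# assumed from the earlier sketches; this is not the authors' exact loop.
g_opt = torch.optim.Adam(generator.parameters(), lr=0.006)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=0.006)

for epoch in range(20):
    for frames, targets, masks_gt in loader:  # batch size 20
        # Discriminator step, only during the GAN phase (epochs 8..19).
        if epoch >= 8:
            with torch.no_grad():
                fake, _ = generator(frames)
            d_loss = adversarial_loss(discriminator(targets), discriminator(fake))
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()

        # Generator step: reconstruction + perceptual, plus an adversarial
        # term (fool the discriminator) once the GAN phase starts.
        img_out, mask_out = generator(frames)
        g_loss = (reconstruction_loss(img_out, targets, mask_out, masks_gt)
                  + perceptual_loss(phi, img_out, targets))
        if epoch >= 8:
            g_loss = g_loss - torch.log(discriminator(img_out) + 1e-8).mean()
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```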
Results
With our proposed solution, we secured 3rd position in the competition.
To evaluate the reconstruction quality, the metrics specified on the competition website are used for pairwise frame comparison.
| Evaluation Metric | Training Phase | Testing Phase |
| --- | --- | --- |
| PSNR | 30.5311 | 32.0021 |
| MSE | 0.0016 | 0.0012 |
| DSSIM | 0.0610 | 0.0499 |
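For reference, the two non-trivial metrics can be computed as follows. This is a sketch assuming frames scaled to [0, 1] and DSSIM defined as (1 − SSIM)/2, the usual convention, though the slides do not state the challenge's exact definition.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred: np.ndarray, target: np.ndarray) -> float:
    """PSNR in dB for frames scaled to [0, 1]."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(1.0 / mse)

def dssim(pred: np.ndarray, target: np.ndarray) -> float:
    """Structural dissimilarity, DSSIM = (1 - SSIM) / 2."""
    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    return (1.0 - ssim) / 2.0
```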
Conclusion
We have proposed an end-to-end network for decaptioning that simultaneously performs frame-level caption detection and inpainting.
However, this method operates on individual frames of a clip and therefore lacks the temporal context required to produce the desired result.
Future Work
In future work, we aim to improve performance by exploiting the temporal information of the video clips.
We aim to explore models that use both temporal and semantic information.
Techniques used in intermediate frame prediction can be employed to make the network temporally aware.
Thank You
References
[1] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[3] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.