Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016

@DocXavi

Deep Learning for Computer VisionVideo Analytics 5 May 2016

Xavier Giró-i-Nieto

Master en Creació Multimedia

https://imatge.upc.edu/web/people/xavier-giro


Outline

1. Recognition2. Optical Flow3. Object Tracking

2

Recognition

Demo: Clarifai

MIT Technology Review : “A start-up’s Neural Network Can Understand Video” (3/2/2015)3

http://www.clarifai.com/

http://www.technologyreview.com/news/534631/a-startups-neural-network-can-understand-video/

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

4

Recognition

http://cs.stanford.edu/people/karpathy/deepvideo/



5

Recognition

Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015

http://vlg.cs.dartmouth.edu/c3d/



6

Recognition


Previous lectures




7

Recognition





Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.

Slides extracted from ReadCV seminar by Victor Campos 8

Recognition: DeepVideo




https://docs.google.com/presentation/d/1neslYvs4I35_4iXHfpkWOTn1Zw3xBs2PLki63zdMHYQ/edit?usp=sharing

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 9

Recognition: DeepVideo: Demo




http://www.youtube.com/watch?v=qrzQ_AB1DZk


Recognition: DeepVideo: Architectures





Unsupervised learning [Le at al’11] Supervised learning [Karpathy et al’14]

Recognition: DeepVideo: Features




http://ai.stanford.edu/~quocle/LeZouYeungNg11.pdf


Recognition: DeepVideo: Multiscale





Recognition: DeepVideo: Results




14

Recognition





15

Recognition: C3D





16Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015

Recognition: C3D: Demo


http://www.youtube.com/watch?v=LslfD4V_bMs

17K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition ICLR 2015.

Recognition: C3D: Spatial dimensionSpatial dimensions (XY) of the used kernels are fixed to 3x3, following Symonian & Zisserman (ICLR 2015).

http://arxiv.org/pdf/1409.1556

http://www.robots.ox.ac.uk/~vgg/research/very_deep/


Recognition: C3D: Temporal dimension3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets

Temporal depth

2D ConvNets



A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets

Recognition: C3D: Temporal dimension



No gain when varying the temporal depth across layers.

Recognition: C3D: Temporal dimension



Recognition: C3D: Architecture

Featurevector



Recognition: C3D: Feature vector

Video sequence

16 frames-long clips

8 frames-long overlap



Recognition: C3D: Feature vector

16-frame clip

16-frame clip

16-frame clip

16-frame clip

...

Average

4096

-dim

vid

eo d

escr

ipto

r

4096

-dim

vid

eo d

escr

ipto

r

L2 norm



Recognition: C3D: VisualizationBased on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.


http://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

https://docs.google.com/presentation/d/18XRQvw193uPnMyPl-0LQqGyDqwnfFtzcn7BJsmoCMLo/edit?usp=sharing


Recognition: C3D: Compactness



Convolutional 3D(C3D) combined with a simple linear classifier outperforms state-of-the-art methods on 4 different benchmarks and are comparable with state of the art methods on other 2 benchmarks

Recognition: C3D: Performance



Recognition: C3D: SoftwareImplementation by Michael Gygli (GitHub)


https://github.com/Lasagne/Recipes/blob/master/examples/Video%20features%20with%20C3D.ipynb

https://github.com/Lasagne/Recipes/blob/master/examples/Video%20features%20with%20C3D.ipynb

28Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694-4702. 2015.

Recognition:

http://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Ng_Beyond_Short_Snippets_2015_CVPR_paper.html



29

Recognition: ImageNet Video

[ILSVRC 2015 Slides and videos]

http://image-net.org/challenges/ilsvrc+mscoco2015


30





31





32


Kai Kang et al, Object Detection in Videos with TubeLets and Multi-Context Cues (ILSVRC 2015) [video] [poster]

http://www.ee.cuhk.edu.hk/~kkang/

https://youtu.be/XuR-Kabh1AY

http://www.ee.cuhk.edu.hk/~kkang/download/cuvideo_poster_image2015.pdf


33







34







35







Optical Flow

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 36

http://www.cv-foundation.org/openaccess/content_iccv_2015/html/Dosovitskiy_FlowNet_Learning_Optical_ICCV_2015_paper.html



Optical Flow: FlowNet





http://www.youtube.com/watch?v=g-peWXaQnQc



End to end supervised learning of optical flow.




Optical Flow: FlowNet (contracting)


Option A: Stack both input images together and feed them through a generic network.






Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.






Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.

Correlation layer: Convolution of data patches from the layers to combine.




Optical Flow: FlowNet (expanding)


Upconvolutional layers: Unpooling features maps + convolution.Upconvolutioned feature maps are concatenated with the corresponding map from the contractive part.





Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 43

Since existing ground truth datasets are not sufficiently large to train a Convnet, a synthetic Flying Dataset is generated… and augmented (translation, rotation, scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).

Convnets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.

Data augmentation




Object tracking: MDNet

45Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)

http://cvlab.postech.ac.kr/research/mdnet/

Object tracking: MDNet



http://www.youtube.com/watch?v=zYM7G5qd090

Object tracking: MDNet: Architecture


Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.


Object tracking: MDNet: Online update


MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.


http://dx.doi.org/10.1109/34.655648

Object tracking: FCNT

49Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]

http://www.cv-foundation.org/openaccess/content_iccv_2015/html/Wang_Visual_Tracking_With_ICCV_2015_paper.html

https://github.com/scott89/FCNT

Object tracking: FCNT

50Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code]

Focus on conv4-3 and conv5-3 of VGG-16 network pre-trained for ImageNet image classification.

conv4-3 conv5-3



Object tracking: FCNT: Specialization


Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.



Object tracking: FCNT: Localization


Although trained for image classification, feature maps in conv5-3 enable object localization…...but is not discriminative enough to different objects of the same category.



Object tracking: Localization

53Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015.

Other works have shown how features maps in convolutional layers allow object localization.

http://arxiv.org/abs/1412.6856

Object tracking: FCNT: Localization


On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…

conv4-3 conv5-3



Object tracking: FCNT: Architecture


SNet=Specific Network (online update)

GNet=General Network (fixed)



Object tracking: FCNT: Results




Object tracking: DeepTracking

57P. Ondruska and I. Posner, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” in The Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, Arizona USA, 2016. [code]

http://www.robots.ox.ac.uk/~mobile/Papers/2016AAAI_ondruska.pdf

https://github.com/pondruska/DeepTracking



http://www.youtube.com/watch?v=cdeWCpfUGWc



Object tracking: Compact features

61Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013 [Project page with code]

http://papers.nips.cc/paper/5192-learning-a-deep-compact-image-representation-for-visual-tracking

http://winsty.net/dlt.html

62

Thank you !


https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

[email protected]

Xavier Giró-i-Nieto









Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016

Technology

Transcript of Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016