
PhD Topic Proposal UDOPIA 2020 Paris-Saclay University

Host organization: Laboratoire Systèmes et Applications des Technologies de l’Information et de l’Energie (SATIE), Université Paris-Saclay, Bat. 660, rue Noetzlin, Gif-sur-Yvette
Supervision: Emanuel Aldea (SATIE), Gianni Franchi (U2IS, ENSTA Paris)
[email protected], [email protected]

Deep active learning for multiple object tracking

Keywords Multiple object tracking, crowd analysis, active learning, epistemic uncertainty, recurrent neural networks

Context The computer vision community relies on deep learning algorithms that are trained for various tasks and learn the necessary concepts from large amounts of data. Targeting a new task, however, requires building an appropriate dataset, which is costly because annotation is expensive. For video data this is particularly true of multiple object tracking, which requires fine-grained annotations in both the spatial and temporal domains, in contrast with object detection and semantic segmentation, for example, which require labeling only in the spatial dimension. The aim of active learning is to assist the costly human annotation while at the same time continuously improving task performance.

Application-wise, the study follows the work performed in the ANR project MOHICANS (2015-2019; see footnote 1), which aimed to extract high-quality observations from dense crowds in order to facilitate the understanding of these complex, hyper-connected systems and to prevent dangerous jamming or turbulence phenomena.

We propose that the prospective student work on a deep learning algorithm that improves the state of the art in dense pedestrian tracking.

Objectives The PhD project will tackle two coupled methodological challenges that must be overcome for tracking to reach levels of performance similar to those of simpler tasks such as classification or detection.

The first objective is to adapt the active learning paradigm to track inference, by relying jointly on uncertainty estimations for label proposals in the spatial domain (detections) as well as in the temporal domain (detection-to-track associations).

The second objective is to propose a solution able to perform tracking in the presence of the long-term occlusions that occur frequently in high-density scenes, using a recurrent neural network that keeps a hidden state in memory.

Methods for active learning in tracking The concept of active learning [1] has been widely used in supervised learning to accelerate progress on a given task and to greatly reduce the amount of labeled data required for training, by iteratively asking an oracle to label the samples considered most informative for the task at hand. Active learning paradigms can be grouped into two types: pool-based methods [2, 3] and annotation-generation-based methods [4]. Pool-based methods start from a set of unlabeled data and use an acquisition function to find the samples that are hard to learn, which are then passed to an annotator. Generation-based methods synthesize training instances to query, in order to increase learning speed. We will focus on pool-based methods.
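The pool-based selection step described above can be illustrated with a minimal sketch. It assumes that the acquisition function is the predictive entropy of the class probabilities, which is only one possible choice; the function names, the `budget` parameter and the example probabilities are hypothetical and are not taken from the cited works.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs, budget):
    """Return the indices of the `budget` most uncertain pool samples.

    pool_probs: one class-probability vector per unlabeled sample.
    """
    scores = [predictive_entropy(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples (two classes).
pool_probs = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]

# Indices of the two samples that would be sent to the annotator.
queries = select_queries(pool_probs, budget=2)
```

In a full active learning loop, the selected samples would be labeled by the oracle, moved from the pool to the training set, and the model retrained before the next round of selection.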

The core concept for querying such samples effectively is the epistemic uncertainty of the classifier. The community has devoted a great amount of effort to uncertainty estimation, as deep networks are known to suffer from calibration issues [5]; the most widely known approaches were initially MC dropout [6] and, more recently, Deep Ensembles [7]. On this topic, our group works on tracking the weight distribution during training as an indicator of the variance of the learning process [8].
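As an illustration of how an epistemic score might be derived from several stochastic predictions, the sketch below uses a common decomposition: the entropy of the averaged prediction minus the average entropy of the individual predictions (a mutual-information score). This is an assumption made for illustration, not the method of [6], [7] or [8]; the function names and example values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def epistemic_uncertainty(member_probs):
    """Mutual information between the prediction and the model.

    member_probs: one class-probability vector per ensemble member,
    or per forward pass with dropout kept active at test time.
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n for c in range(n_classes)]
    total = entropy(mean)                                 # entropy of the mean
    expected = sum(entropy(m) for m in member_probs) / n  # mean of entropies
    return total - expected


# Members agree on an ambiguous input: the uncertainty is not epistemic.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Members disagree: the model itself is uncertain.
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
```

Both examples have the same averaged prediction, yet only the second yields a clearly positive score, which is why such a score can separate model disagreement from intrinsic ambiguity in the data.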

Most of the above-mentioned approaches focus on image classification, image segmentation or object detection. None of them deals with tracking annotation.

Methods for multiple object tracking with deep architectures Tracking is commonly divided into two steps: object detection and data association. All the objects are detected in each frame of the sequence, then an association algorithm builds the tracks. Tracking faces several challenges:

- missed detections (false negatives), which may be due to the detection algorithm or to long-term occlusions;
- false alarms (measurements considered as nuisance);
- similar appearance;
- handling the beginning and the end of a track.
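The two-step pipeline and the challenges listed above can be made concrete with a minimal sketch of the association step. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and a greedy matching on intersection over union (IoU); the `min_iou` threshold and all names are hypothetical, and this is not the association scheme of any cited work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match track boxes to detections by decreasing IoU.

    Returns (matches, unmatched_tracks, unmatched_detections).
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be matched
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


# Last known boxes of two tracks and two detections in the new frame.
tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
detections = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches, lost_tracks, new_detections = associate(tracks, detections)
```

In this sketch an unmatched track corresponds to a missed detection, an occlusion or the end of a track, while an unmatched detection is either a false alarm or the beginning of a new track. Appearance similarity is ignored entirely, which is one reason such a purely geometric rule degrades in dense scenes.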

1. http://hebergement.u-psud.fr/emi/MOHICANS/


FIGURE 1 – (a) Our results on the CVPR MOT 2019 challenge [9], ranked fourth; (b) detections in a high-density crowd (work submitted to ICIP 2020).

The challenges raised in tracking by the presence of multiple, interacting targets are well known and have generally been tackled by data association algorithms that are either approximate or exact (but intractable for large problems). These approaches are not trainable end-to-end and do not scale when analyzing persistent occlusions and/or hundreds or thousands of objects simultaneously. State-of-the-art algorithms [10] still perform greedy incremental track augmentation, which has the advantage of speed and of good performance in reasonably sparse scenes.

Recurrent neural networks (RNNs) are a natural candidate for these problems because they process sequential data, but they handle long-term dependencies poorly. Attention mechanisms [11], a technique originally used in NLP to relieve the bottleneck of the encoder-decoder architecture, offer an alternative.
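To give an idea of how attention can reach far back in a sequence while skipping unobserved time steps, here is a minimal sketch of scaled dot-product attention for a single query, following the formulation of [11]. The `observed` mask used to exclude occluded frames, and the toy keys and values, are assumptions made for illustration only.

```python
import math


def attention(query, keys, values, observed):
    """Scaled dot-product attention of one query over a sequence.

    observed[t] is False when time step t has no measurement (for
    example during an occlusion); such steps receive zero weight.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    valid = [s for s, seen in zip(scores, observed) if seen]
    if not valid:
        raise ValueError("at least one time step must be observed")
    top = max(valid)  # subtract the maximum for numerical stability
    exps = [
        math.exp(s - top) if seen else 0.0
        for s, seen in zip(scores, observed)
    ]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    output = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)
    ]
    return output, weights


# Three past time steps; the middle one is occluded.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
output, weights = attention(
    [1.0, 0.0], keys, values, observed=[True, False, True]
)
```

Unlike a recurrent hidden state, the weights connect the query directly to every observed time step, so the gap left by the occluded frame does not have to be carried through intermediate updates.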

The student will explore multiple avenues, ranging from adapting existing work on track extension and multi-frame data association [12] to adapting self-attention approaches [11], which are starting to be adopted in vision as they are well suited to sequential processing with missing data [13, 14]. To this end, we intend to start a collaboration with the NLP group at LIMSI, which has strong expertise on this problem.

Required skills
- Programming (Python, C++/C)
- Familiarity with deep learning and machine learning
- PyTorch, TensorFlow
- Good knowledge of the fundamentals of statistics
- Knowledge of image processing or computer vision, NLP, or pedestrian interaction models will be appreciated

Supervision The PhD project will be directed by Emanuel Aldea and co-supervised by Gianni Franchi.

Prospective collaborations We intend to collaborate with François Yvon (LIMSI) on the adaptation of sequence transduction models robust to persistent occlusion. Some joint work on the extracted crowd parameters with physicists interested in crowd phenomena (LPT, LPP, Jülich Research Centre) may be possible as well.

Dissemination and expected outcomes We intend to publish the results of this study in the main computer vision (ECCV, ICCV, CVPR), machine learning (NeurIPS, ICML) and application-specific (AVSS) events.

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” Journal of Artificial Intelligence Research, vol. 4, pp. 129–145, 1996.

[2] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 399–407.

[3] D. Yoo and I. S. Kweon, “Learning loss for active learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 93–102.

[4] J.-J. Zhu and J. Bento, “Generative adversarial active learning,” arXiv preprint arXiv:1702.07956, 2017.

[5] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1321–1330.

[6] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, 2016, pp. 1050–1059.

[7] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Advances in Neural Information Processing Systems, 2017, pp. 6402–6413.

[8] G. Franchi, A. Bursuc, E. Aldea, S. Dubuisson, and I. Bloch, “TRADI: Tracking deep neural network weight distributions,” arXiv preprint arXiv:1912.11316, 2019.

[9] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixe, “CVPR19 tracking and detection challenge: How crowded can it get?” arXiv preprint arXiv:1906.04567, 2019.

[10] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 941–951.

[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[12] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1806–1819, 2011.

[13] F. B. Fuchs, A. R. Kosiorek, L. Sun, O. P. Jones, and I. Posner, “End-to-end recurrent multi-object tracking and trajectory prediction with relational reasoning,” arXiv preprint arXiv:1907.12887, 2019.

[14] F. Giuliari, I. Hasan, M. Cristani, and F. Galasso, “Transformer networks for trajectory forecasting,” arXiv preprint arXiv:2003.08111, 2020.
