T-CNN, Object Detection from Video

Kang, Kai and Ouyang, Wanli and Li, Hongsheng and Wang, Xiaogang

CVPR 2016

arXiv: http://arxiv.org/abs/1604.02532
Code: https://github.com/myfavouritekk/T-CNN

Slides by Andrea Ferri, Computer Vision Reading Group @ UPC BarcelonaTech (Spring 2016)

Reading group: https://github.com/imatge-upc/readcv/blob/master/README.md

Summary:

• Introduction;
• Architecture:
    I. Still-Image Detection;
    II. MCS & MGP;
    III. Tubelet Re-Scoring;
• Experiments.

Introduction:

The DET (detection in still images) and VID (detection in video) challenges are very different.

Applying DET detectors frame by frame to VID:
→ Produces large temporal fluctuations in the detection results;
→ Generates false positives.

T-CNN stands for Tubelets with Convolutional Neural Networks, where tubelets are bounding box sequences that carry:
• Temporal information;
• Contextual information.

Architecture:

T-CNN combines current state-of-the-art components:
• Still-image object detection;
• An object tracking algorithm;
• A lot of cool tricks.

I. Still-Image Detection

The detectors used are:
• DeepID-Net (an improvement of R-CNN);
• CRAFT (an extension of Fast R-CNN).

The two detectors use different region proposals, pre-trained models, and training strategies.

II. MCS & MGP

Multi-Context Suppression

→ Sort the detection scores of all proposals in a video in descending order;

→ The classes of the highest-ranked detections are treated as the confident classes;

→ The scores of low-ranked classes are suppressed, while the scores of confident classes remain unchanged.
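
The three MCS steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name, the `(class_id, score)` input layout, and the `penalty` subtracted from suppressed scores are assumptions. The default `top_ratio` of 0.0003 is the value listed under Experiments.

```python
def multi_context_suppression(detections, top_ratio=0.0003, penalty=0.4):
    """Suppress scores of classes that never rank highly in a video.

    detections: list of (class_id, score), one entry per box in the video.
    penalty:    amount subtracted from suppressed scores (assumed value).
    """
    if not detections:
        return []
    # Step 1: sort all detection scores in descending order.
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    # Step 2: classes seen among the top-ranked boxes are "confident".
    # Keep at least one box so short videos still yield a confident class.
    n_top = max(1, int(len(ranked) * top_ratio))
    confident = {cls for cls, _ in ranked[:n_top]}
    # Step 3: leave confident classes unchanged, suppress the others.
    return [(cls, score if cls in confident else score - penalty)
            for cls, score in detections]
```

The output keeps the input order, so the adjusted scores can be written back to the original boxes by index.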

Motion-Guided Propagation

→ In each frame, some objects are missed by the detector. However, detections on adjacent frames are complementary to each other;

→ Detections are propagated to adjacent frames, with optical flow guiding the propagation;

→ Propagation produces redundant boxes, which are easily handled by non-maximum suppression (NMS).
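
A rough Python sketch of MGP follows, assuming the dense optical flow is already available as a grid `flow[y][x] = (dx, dy)`. The function names, the use of the mean flow inside a box, and the NMS threshold are illustrative assumptions. Only forward propagation is shown; propagating backward would need the reverse flow fields.

```python
def mean_flow(flow, box):
    """Average (dx, dy) of a dense flow grid inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    h, w = len(flow), len(flow[0])
    vectors = [flow[y][x]
               for y in range(max(0, y1), min(h, y2 + 1))
               for x in range(max(0, x1), min(w, x2 + 1))]
    if not vectors:  # box lies entirely outside the frame
        return 0.0, 0.0
    n = len(vectors)
    return sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n

def propagate(box, score, flows):
    """Shift one detection through consecutive flow fields (one per frame)."""
    out = []
    for flow in flows:
        dx, dy = mean_flow(flow, box)
        box = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
        out.append((box, score))
    return out

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(dets, thresh=0.5):
    """Greedy NMS over (box, score) pairs within a single frame."""
    keep = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= thresh for kept_box, _ in keep):
            keep.append((box, score))
    return keep
```

In use, the propagated boxes for a frame would be merged with that frame's own detections before calling `nms`, so duplicates of an already detected object are removed while missed objects are recovered.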

III. Tubelet Re-Scoring

1. High Confidence Tracking;

2. Spatial Max Pooling;

3. Temporal Re-Scoring.

High Confidence Tracking

1 → Obtain detection results from still-image detectors;

2 → Choose high-confidence detections as starting points (anchors) for tracking;

3 → Obtain tubelets, which are bounding box sequences generated from tracking algorithms.
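
The anchor-selection loop can be sketched as below. The tracker itself is outside the scope of the sketch: `track` is a hypothetical callable standing in for whatever tracking algorithm is used, and both thresholds are assumed values. Dropping detections already covered by an existing tubelet is also an assumption, added so the same object is not tracked twice.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def build_tubelets(detections, track, score_thresh=0.0, overlap_thresh=0.3):
    """Generate tubelets from high-confidence anchors.

    detections: list of (frame, box, score) from the still-image detectors.
    track:      hypothetical tracker, (frame, box) -> {frame: box}.
    """
    # Most confident detections are tried first as anchors.
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    tubelets = []
    while remaining and remaining[0][2] >= score_thresh:
        frame, box, _ = remaining.pop(0)
        tubelet = track(frame, box)
        tubelets.append(tubelet)
        # Assumed step: skip detections this tubelet already explains.
        remaining = [d for d in remaining
                     if not (d[0] in tubelet
                             and iou(d[1], tubelet[d[0]]) > overlap_thresh)]
    return tubelets
```

Each returned tubelet is a mapping from frame index to box, which is the bounding box sequence described in step 3.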

Spatial Max Pooling

- For each tubelet, the still-image detections that overlap strongly with the tubelet boxes are selected;

- For each tubelet box, only the detection with the maximum score is kept after spatial max-pooling;

A Kalman filter is then used to smooth the bounding box locations.
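
A minimal sketch of both steps, under stated assumptions: the overlap threshold, the data layout, and the function names are illustrative, and the smoother is a plain scalar Kalman filter with a random-walk motion model applied to one box coordinate at a time, which may differ from the filter used by the authors.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def spatial_max_pool(tubelet, detections, overlap_thresh=0.5):
    """Replace each tubelet box by the best overlapping still-image detection.

    tubelet:    {frame: box}
    detections: {frame: [(box, score), ...]}
    Frames without any overlapping detection are left out of the result.
    """
    pooled = {}
    for frame, tbox in tubelet.items():
        near = [d for d in detections.get(frame, [])
                if iou(d[0], tbox) >= overlap_thresh]
        if near:
            pooled[frame] = max(near, key=lambda d: d[1])
    return pooled

def kalman_smooth(values, process_var=1.0, measure_var=4.0):
    """Forward scalar Kalman filter (random-walk model) over one coordinate."""
    est, var = values[0], measure_var
    out = [est]
    for z in values[1:]:
        var += process_var               # predict: uncertainty grows
        gain = var / (var + measure_var)
        est += gain * (z - est)          # update with the new measurement
        var *= 1 - gain
        out.append(est)
    return out
```

To smooth a whole tubelet, `kalman_smooth` would be called once per coordinate (x1, y1, x2, y2) on the sequence of pooled boxes.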

Temporal Re-Scoring

• Tubelet classification. Tubelets are classified as positive or negative based on statistics of their detection scores (mean, median, top-k). A linear classifier is learned from these statistics;

• Tubelet re-scoring. Detection scores of positive tubelets are mapped to [0.5, 1], and those of negative tubelets to [0, 0.5].

A Bayesian classifier was used.
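
The two re-scoring steps might look like the sketch below. It is not the authors' implementation: the top-k statistic is taken to be the mean of the k highest scores, the mapping is assumed to be a min-max rescaling within each tubelet, and the `weights` and `bias` of the linear classifier are placeholders that would have to be learned from labelled tubelets.

```python
from statistics import mean, median

def tubelet_stats(scores, k=3):
    """Feature vector for one tubelet: mean, median, mean of top-k scores."""
    top_k = sorted(scores, reverse=True)[:k]
    return [mean(scores), median(scores), mean(top_k)]

def is_positive(stats, weights, bias):
    """Linear decision on the statistics (weights and bias are hypothetical)."""
    return sum(w * x for w, x in zip(weights, stats)) + bias > 0

def rescore(scores, positive):
    """Min-max map scores to [0.5, 1] (positive) or [0, 0.5] (negative)."""
    lo, hi = min(scores), max(scores)
    base = 0.5 if positive else 0.0
    if hi == lo:  # constant scores: place them mid-range
        return [base + 0.25] * len(scores)
    return [base + 0.5 * (s - lo) / (hi - lo) for s in scores]
```

With this mapping, every box of a positive tubelet outranks every box of a negative one, while the relative order of scores inside each tubelet is preserved.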

Experiments:

• Careful construction of the training set (data ratio DET:VID = 2:1);
• Main parameters:
    • MGP: 7 frames;
    • MCS: classes in the top 0.0003 of boxes are treated as confident.

Results:

Reference:

• Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, and Wanli Ouyang. T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos.
